From 3b3c4268ed71abeddd8d4a268776399fa7b165dc Mon Sep 17 00:00:00 2001 From: AlongWY Date: Fri, 7 Jun 2024 05:24:38 +0000 Subject: [PATCH] deploy: 72066be21ad467c8ffc76b74c152b38decf3f0ac --- .nojekyll | 0 cache.json | 1 + favicon.ico | Bin 0 -> 15086 bytes index.css | 355 + index.html | 105183 +++++++++++++++++++++++++++++++++++++++++++++++++ index.js | 39 + 6 files changed, 105578 insertions(+) create mode 100644 .nojekyll create mode 100644 cache.json create mode 100644 favicon.ico create mode 100644 index.css create mode 100644 index.html create mode 100644 index.js diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..010c8ed0 --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2024-05-30T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2405.20341v1","updated":"2024-05-30T17:59:51Z","published":"2024-05-30T17:59:51Z","title":"From Zero to Hero: Cold-Start Anomaly Detection","summary":" When first deploying an anomaly detection system, e.g., to detect\nout-of-scope queries in chatbots, there are no observed data, making\ndata-driven approaches ineffective. Zero-shot anomaly detection methods offer a\nsolution to such \"cold-start\" cases, but unfortunately they are often not\naccurate enough. This paper studies the realistic but underexplored cold-start\nsetting where an anomaly detection model is initialized using zero-shot\nguidance, but subsequently receives a small number of contaminated observations\n(namely, that may include anomalies). The goal is to make efficient use of both\nthe zero-shot guidance and the observations. We propose ColdFusion, a method\nthat effectively adapts the zero-shot anomaly detector to contaminated\nobservations. To support future development of this new setting, we propose an\nevaluation suite consisting of evaluation protocols and metrics.\n","authors":["Tal Reiss","George Kour","Naama Zwerdling","Ateret Anaby-Tavor","Yedid Hoshen"],"pdf_url":"https://arxiv.org/pdf/2405.20341v1.pdf","comment":"ACL 2024. Our code is available at\n https://github.com/talreiss/ColdFusion"},{"id":"http://arxiv.org/abs/2405.20335v1","updated":"2024-05-30T17:59:31Z","published":"2024-05-30T17:59:31Z","title":"Xwin-LM: Strong and Scalable Alignment Practice for LLMs","summary":" In this work, we present Xwin-LM, a comprehensive suite of alignment\nmethodologies for large language models (LLMs). This suite encompasses several\nkey techniques, including supervised finetuning (SFT), reward modeling (RM),\nrejection sampling finetuning (RS), and direct preference optimization (DPO).\nThe key components are as follows: (1) Xwin-LM-SFT, models initially finetuned\nwith high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn\npreference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward\nmodels trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B\nparameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt\nis linked to 64 unique responses generated by Xwin-LM-SFT and scored by\nXwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses\nfrom Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the\nDPO algorithm. Our evaluations on AlpacaEval and MT-bench demonstrate\nconsistent and significant improvements across the pipeline, demonstrating the\nstrength and scalability of Xwin-LM. The repository\nhttps://github.com/Xwin-LM/Xwin-LM will be continually updated to foster\ncommunity research.\n","authors":["Bolin Ni","JingCheng Hu","Yixuan Wei","Houwen Peng","Zheng Zhang","Gaofeng Meng","Han Hu"],"pdf_url":"https://arxiv.org/pdf/2405.20335v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20318v1","updated":"2024-05-30T17:55:28Z","published":"2024-05-30T17:55:28Z","title":"CausalQuest: Collecting Natural Causal Questions for AI Agents","summary":" Humans have an innate drive to seek out causality. Whether fuelled by\ncuriosity or specific goals, we constantly question why things happen, how they\nare interconnected, and many other related phenomena. To develop AI agents\ncapable of addressing this natural human quest for causality, we urgently need\na comprehensive dataset of natural causal questions. Unfortunately, existing\ndatasets either contain only artificially-crafted questions that do not reflect\nreal AI usage scenarios or have limited coverage of questions from specific\nsources. To address this gap, we present CausalQuest, a dataset of 13,500\nnaturally occurring questions sourced from social networks, search engines, and\nAI assistants. We formalize the definition of causal questions and establish a\ntaxonomy for finer-grained classification. Through a combined effort of human\nannotators and large language models (LLMs), we carefully label the dataset. We\nfind that 42% of the questions humans ask are indeed causal, with the majority\nseeking to understand the causes behind given effects. Using this dataset, we\ntrain efficient classifiers (up to 2.85B parameters) for the binary task of\nidentifying causal questions, achieving high performance with F1 scores of up\nto 0.877. We conclude with a rich set of future research directions that can\nbuild upon our data and models.\n","authors":["Roberto Ceraolo","Dmitrii Kharlapenko","Amélie Reymond","Rada Mihalcea","Mrinmaya Sachan","Bernhard Schölkopf","Zhijing Jin"],"pdf_url":"https://arxiv.org/pdf/2405.20318v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09919v3","updated":"2024-05-30T17:55:19Z","published":"2024-03-14T23:40:56Z","title":"Recurrent Drafter for Fast Speculative Decoding in Large Language Models","summary":" In this paper, we introduce an improved approach of speculative decoding\naimed at enhancing the efficiency of serving large language models. Our method\ncapitalizes on the strengths of two established techniques: the classic\ntwo-model speculative decoding approach, and the more recent single-model\napproach, Medusa. Drawing inspiration from Medusa, our approach adopts a\nsingle-model strategy for speculative decoding. However, our method\ndistinguishes itself by employing a single, lightweight draft head with a\nrecurrent dependency design, akin in essence to the small, draft model uses in\nclassic speculative decoding, but without the complexities of the full\ntransformer architecture. And because of the recurrent dependency, we can use\nbeam search to swiftly filter out undesired candidates with the draft head. The\noutcome is a method that combines the simplicity of single-model design and\navoids the need to create a data-dependent tree attention structure only for\ninference in Medusa. We empirically demonstrate the effectiveness of the\nproposed method on several popular open source language models, along with a\ncomprehensive analysis of the trade-offs involved in adopting this approach.\n","authors":["Aonan Zhang","Chong Wang","Yi Wang","Xuanyu Zhang","Yunfei Cheng"],"pdf_url":"https://arxiv.org/pdf/2403.09919v3.pdf","comment":"11 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.20315v1","updated":"2024-05-30T17:54:40Z","published":"2024-05-30T17:54:40Z","title":"ANAH: Analytical Annotation of Hallucinations in Large Language Models","summary":" Reducing the `$\\textit{hallucination}$' problem of Large Language Models\n(LLMs) is crucial for their wide applications. A comprehensive and fine-grained\nmeasurement of the hallucination is the first key step for the governance of\nthis issue but is under-explored in the community. Thus, we present\n$\\textbf{ANAH}$, a bilingual dataset that offers $\\textbf{AN}$alytical\n$\\textbf{A}$nnotation of $\\textbf{H}$allucinations in LLMs within Generative\nQuestion Answering. Each answer sentence in our dataset undergoes rigorous\nannotation, involving the retrieval of a reference fragment, the judgment of\nthe hallucination type, and the correction of hallucinated content. ANAH\nconsists of ~12k sentence-level annotations for ~4.3k LLM responses covering\nover 700 topics, constructed by a human-in-the-loop pipeline. Thanks to the\nfine granularity of the hallucination annotations, we can quantitatively\nconfirm that the hallucinations of LLMs progressively accumulate in the answer\nand use ANAH to train and evaluate hallucination annotators. We conduct\nextensive experiments on studying generative and discriminative annotators and\nshow that, although current open-source LLMs have difficulties in fine-grained\nhallucination annotation, the generative annotator trained with ANAH can\nsurpass all open-source LLMs and GPT-3.5, obtain performance competitive with\nGPT-4, and exhibits better generalization ability on unseen questions.\n","authors":["Ziwei Ji","Yuzhe Gu","Wenwei Zhang","Chengqi Lyu","Dahua Lin","Kai Chen"],"pdf_url":"https://arxiv.org/pdf/2405.20315v1.pdf","comment":"Accepted by ACL 2024"},{"id":"http://arxiv.org/abs/2405.20314v1","updated":"2024-05-30T17:54:35Z","published":"2024-05-30T17:54:35Z","title":"S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for\n Low-Memory GPUs","summary":" Speculative decoding (SD) has attracted a significant amount of research\nattention due to the substantial speedup it can achieve for LLM inference.\nHowever, despite the high speedups they offer, speculative decoding methods\noften achieve optimal performance on high-end devices or with a substantial GPU\nmemory overhead. Given limited memory and the necessity of quantization, a\nhigh-performing model on a high-end GPU can slow down by up to 7 times. To this\nend, we propose Skippy Simultaneous Speculative Decoding (or S3D), a\ncost-effective self-speculative SD method based on simultaneous multi-token\ndecoding and mid-layer skipping. When compared against recent effective\nopen-source SD systems, our method has achieved one of the top\nperformance-memory ratios while requiring minimal architecture changes and\ntraining data. Leveraging our memory efficiency, we created a smaller yet more\neffective SD model based on Phi-3. It is 1.4 to 2 times faster than the\nquantized EAGLE model and operates in half-precision while using less VRAM.\n","authors":["Wei Zhong","Manasa Bharadwaj"],"pdf_url":"https://arxiv.org/pdf/2405.20314v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20309v1","updated":"2024-05-30T17:52:36Z","published":"2024-05-30T17:52:36Z","title":"Large Language Models Can Self-Improve At Web Agent Tasks","summary":" Training models to act as agents that can effectively navigate and perform\nactions in a complex environment, such as a web browser, has typically been\nchallenging due to lack of training data. Large language models (LLMs) have\nrecently demonstrated some capability to navigate novel environments as agents\nin a zero-shot or few-shot fashion, purely guided by natural language\ninstructions as prompts. Recent research has also demonstrated LLMs have the\ncapability to exceed their base performance through self-improvement, i.e.\nfine-tuning on data generated by the model itself. In this work, we explore the\nextent to which LLMs can self-improve their performance as agents in\nlong-horizon tasks in a complex environment using the WebArena benchmark. In\nWebArena, an agent must autonomously navigate and perform actions on web pages\nto achieve a specified objective. We explore fine-tuning on three distinct\nsynthetic training data mixtures and achieve a 31\\% improvement in task\ncompletion rate over the base model on the WebArena benchmark through a\nself-improvement procedure. We additionally contribute novel evaluation metrics\nfor assessing the performance, robustness, capabilities, and quality of\ntrajectories of our fine-tuned agent models to a greater degree than simple,\naggregate-level benchmark scores currently used to measure self-improvement.\n","authors":["Ajay Patel","Markus Hofmarcher","Claudiu Leoveanu-Condrei","Marius-Constantin Dinu","Chris Callison-Burch","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2405.20309v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18719v2","updated":"2024-05-30T17:51:53Z","published":"2024-05-29T02:57:15Z","title":"Contextual Position Encoding: Learning to Count What's Important","summary":" The attention mechanism is a critical component of Large Language Models\n(LLMs) that allows tokens in a sequence to interact with each other, but is\norder-invariant. Incorporating position encoding (PE) makes it possible to\naddress by position, such as attending to the i-th token. However, current PE\nmethods use token counts to derive position, and thus cannot generalize to\nhigher levels of abstraction, such as attending to the i-th sentence. In this\npaper, we propose a new position encoding method, Contextual Position Encoding\n(CoPE), that allows positions to be conditioned on context by incrementing\nposition only on certain tokens determined by the model. This allows more\ngeneral position addressing such as attending to the $i$-th particular word,\nnoun, or sentence. We show that CoPE can solve the selective copy, counting and\nFlip-Flop tasks where popular position embeddings fail, and improves perplexity\non language modeling and coding tasks.\n","authors":["Olga Golovneva","Tianlu Wang","Jason Weston","Sainbayar Sukhbaatar"],"pdf_url":"https://arxiv.org/pdf/2405.18719v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20304v1","updated":"2024-05-30T17:50:04Z","published":"2024-05-30T17:50:04Z","title":"Group Robust Preference Optimization in Reward-free RLHF","summary":" Adapting large language models (LLMs) for specific tasks usually involves\nfine-tuning through reinforcement learning with human feedback (RLHF) on\npreference data. While these data often come from diverse labelers' groups\n(e.g., different demographics, ethnicities, company teams, etc.), traditional\nRLHF approaches adopt a \"one-size-fits-all\" approach, i.e., they\nindiscriminately assume and optimize a single preference model, thus not being\nrobust to unique characteristics and needs of the various groups. To address\nthis limitation, we propose a novel Group Robust Preference Optimization (GRPO)\nmethod to align LLMs to individual groups' preferences robustly. Our approach\nbuilds upon reward-free direct preference optimization methods, but unlike\nprevious approaches, it seeks a robust policy which maximizes the worst-case\ngroup performance. To achieve this, GRPO adaptively and sequentially weights\nthe importance of different groups, prioritizing groups with worse cumulative\nloss. We theoretically study the feasibility of GRPO and analyze its\nconvergence for the log-linear policy class. By fine-tuning LLMs with GRPO\nusing diverse group-based global opinion data, we significantly improved\nperformance for the worst-performing groups, reduced loss imbalances across\ngroups, and improved probability accuracies compared to non-robust baselines.\n","authors":["Shyam Sundhar Ramesh","Yifan Hu","Iason Chaimalas","Viraj Mehta","Pier Giuseppe Sessa","Haitham Bou Ammar","Ilija Bogunovic"],"pdf_url":"https://arxiv.org/pdf/2405.20304v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2403.01643v2","updated":"2024-05-30T17:46:22Z","published":"2024-03-03T23:40:35Z","title":"You Need to Pay Better Attention: Rethinking the Mathematics of\n Attention Mechanism","summary":" Scaled Dot Product Attention (SDPA) is the backbone of many modern\ndeep-learning models. It is so versatile that it has been used in natural\nlanguage, vision, and multi-modal domains with very little change compared to\nits original formulation. This paper discusses why the current formulation is\ninefficient by delving into the mathematical details of the attention\nmechanism. We propose three improvements to mitigate these inefficiencies,\nthereby, introducing three enhanced attention mechanisms: Optimised, Efficient,\nand Super Attention. Optimised and Efficient Attention have one and two matrix\nmultiplications fewer per head, respectively, and 25% and 50% fewer parameters,\nrespectively, than standard SDPA, but perform similarly to standard SDPA in\nboth vision and natural language tasks. They can be used in all applications\nwhere SDPA is used while offering smaller model sizes and faster training and\ninference without noticeable loss in performance. Super Attention introduces a\nnew linear transformation on the values, transforming them from the left. It\noutperforms standard SPDA on vision and natural language tasks by up to 17%\nwhile having one fewer matrix multiplication per head and 25% fewer parameters\nthan standard SDPA. Consequently, it is also faster than standard SDPA. Super\nAttention is ideal in applications where the attention layer's context length\nis fixed, such as Vision Transformers. In addition to providing mathematical\nreasoning, we evaluate the presented attention mechanisms on several datasets\nincluding MNIST, CIFAR100, ImageNet, IMDB Movie Reviews, and Amazon Reviews\ndatasets, as well as combined Europarl and Anki English-Spanish datasets for\nneural machine translation.\n","authors":["Mehran Hosseini","Peyman Hosseini"],"pdf_url":"https://arxiv.org/pdf/2403.01643v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20285v1","updated":"2024-05-30T17:38:44Z","published":"2024-05-30T17:38:44Z","title":"Who Writes the Review, Human or AI?","summary":" With the increasing use of Artificial Intelligence in Natural Language\nProcessing, concerns have been raised regarding the detection of AI-generated\ntext in various domains. This study aims to investigate this issue by proposing\na methodology to accurately distinguish AI-generated and human-written book\nreviews. Our approach utilizes transfer learning, enabling the model to\nidentify generated text across different topics while improving its ability to\ndetect variations in writing style and vocabulary. To evaluate the\neffectiveness of the proposed methodology, we developed a dataset consisting of\nreal book reviews and AI-generated reviews using the recently proposed Vicuna\nopen-source language model. The experimental results demonstrate that it is\nfeasible to detect the original source of text, achieving an accuracy rate of\n96.86%. Our efforts are oriented toward the exploration of the capabilities and\nlimitations of Large Language Models in the context of text identification.\nExpanding our knowledge in these aspects will be valuable for effectively\nnavigating similar models in the future and ensuring the integrity and\nauthenticity of human-generated content.\n","authors":["Panagiotis C. Theocharopoulos","Spiros V. Georgakopoulos","Sotiris K. Tasoulis","Vassilis P. Plagianakos"],"pdf_url":"https://arxiv.org/pdf/2405.20285v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03893v3","updated":"2024-05-30T17:37:11Z","published":"2024-03-06T17:51:43Z","title":"From One to Many: Expanding the Scope of Toxicity Mitigation in Language\n Models","summary":" To date, toxicity mitigation in language models has almost entirely been\nfocused on single-language settings. As language models embrace multilingual\ncapabilities, it's crucial our safety measures keep pace. Recognizing this\nresearch gap, our approach expands the scope of conventional toxicity\nmitigation to address the complexities presented by multiple languages. In the\nabsence of sufficient annotated datasets across languages, we employ translated\ndata to evaluate and enhance our mitigation techniques. We also compare\nfinetuning mitigation approaches against retrieval-augmented techniques under\nboth static and continual toxicity mitigation scenarios. This allows us to\nexamine the effects of translation quality and the cross-lingual transfer on\ntoxicity mitigation. We also explore how model size and data quantity affect\nthe success of these mitigation efforts. Covering nine languages, our study\nrepresents a broad array of linguistic families and levels of resource\navailability, ranging from high to mid-resource languages. Through\ncomprehensive experiments, we provide insights into the complexities of\nmultilingual toxicity mitigation, offering valuable insights and paving the way\nfor future research in this increasingly important field. Code and data are\navailable at https://github.com/for-ai/goodtriever.\n","authors":["Luiza Pozzobon","Patrick Lewis","Sara Hooker","Beyza Ermis"],"pdf_url":"https://arxiv.org/pdf/2403.03893v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.05140v2","updated":"2024-05-30T17:37:06Z","published":"2024-02-06T20:11:54Z","title":"Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains","summary":" Large Language Models (LLMs) have demonstrated remarkable proficiency in\nunderstanding and generating natural language. However, their capabilities wane\nin highly specialized domains underrepresented in the pretraining corpus, such\nas physical and biomedical sciences. This work explores how to repurpose\ngeneral LLMs into effective task solvers for specialized domains. We introduce\na novel, model-agnostic framework for learning custom input tags, which are\nparameterized as continuous vectors appended to the LLM's embedding layer, to\ncondition the LLM. We design two types of input tags: domain tags are used to\ndelimit specialized representations (e.g., chemical formulas) and provide\ndomain-relevant context; function tags are used to represent specific functions\n(e.g., predicting molecular properties) and compress function-solving\ninstructions. We develop a three-stage protocol to learn these tags using\nauxiliary data and domain knowledge. By explicitly disentangling task domains\nfrom task functions, our method enables zero-shot generalization to unseen\nproblems through diverse combinations of the input tags. It also boosts LLM's\nperformance in various specialized domains, such as predicting protein or\nchemical properties and modeling drug-target interactions, outperforming expert\nmodels tailored to these tasks.\n","authors":["Junhong Shen","Neil Tenenholtz","James Brian Hall","David Alvarez-Melis","Nicolo Fusi"],"pdf_url":"https://arxiv.org/pdf/2402.05140v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.14700v3","updated":"2024-05-30T17:31:46Z","published":"2024-02-22T16:56:13Z","title":"Unveiling Linguistic Regions in Large Language Models","summary":" Large Language Models (LLMs) have demonstrated considerable cross-lingual\nalignment and generalization ability. Current research primarily focuses on\nimproving LLMs' cross-lingual generalization capabilities. However, there is\nstill a lack of research on the intrinsic mechanisms of how LLMs achieve\ncross-lingual alignment. From the perspective of region partitioning, this\npaper conducts several investigations on the linguistic competence of LLMs. We\ndiscover a core region in LLMs that corresponds to linguistic competence,\naccounting for approximately 1% of the total model parameters. Removing this\ncore region by setting parameters to zero results in a significant performance\ndecrease across 30 different languages. Furthermore, this core region exhibits\nsignificant dimensional dependence, perturbations to even a single parameter on\nspecific dimensions leading to a loss of linguistic competence. Moreover, we\ndiscover that distinct monolingual regions exist for different languages, and\ndisruption to these specific regions substantially reduces the LLMs'\nproficiency in those corresponding languages. Our research also indicates that\nfreezing the core linguistic region during further pre-training can mitigate\nthe issue of catastrophic forgetting (CF), a common phenomenon observed during\nfurther pre-training of LLMs. Overall, exploring the LLMs' functional regions\nprovides insights into the foundation of their intelligence.\n","authors":["Zhihao Zhang","Jun Zhao","Qi Zhang","Tao Gui","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2402.14700v3.pdf","comment":"Accepted by ACL 2024. Camera-Ready Version"},{"id":"http://arxiv.org/abs/2405.20274v1","updated":"2024-05-30T17:29:15Z","published":"2024-05-30T17:29:15Z","title":"ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection","summary":" Aspect-Based Sentiment Analysis (ABSA) has experienced tremendous expansion\nand diversity due to various shared tasks spanning several languages and fields\nand organized via SemEval workshops and Germeval. Nonetheless, a few\nshortcomings still need to be addressed, such as the lack of low-resource\nlanguage evaluations and the emphasis on sentence-level analysis. To thoroughly\nassess ABSA techniques in the context of complete reviews, this research\npresents a novel task, Review-Level Opinion Aspect Sentiment Target (ROAST).\nROAST seeks to close the gap between sentence-level and text-level ABSA by\nidentifying every ABSA constituent at the review level. We extend the available\ndatasets to enable ROAST, addressing the drawbacks noted in previous research\nby incorporating low-resource languages, numerous languages, and a variety of\ntopics. Through this effort, ABSA research will be able to cover more ground\nand get a deeper comprehension of the task and its practical application in a\nvariety of languages and domains (https://github.com/RiTUAL-UH/ROAST-ABSA).\n","authors":["Siva Uday Sampreeth Chebolu","Franck Dernoncourt","Nedim Lipka","Thamar Solorio"],"pdf_url":"https://arxiv.org/pdf/2405.20274v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2309.13297"},{"id":"http://arxiv.org/abs/2405.20271v1","updated":"2024-05-30T17:26:02Z","published":"2024-05-30T17:26:02Z","title":"ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane\n Reflections","summary":" Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt\nfoundation models to downstream task requirements while retaining their\ngeneralization ability. However, the amount of additionally introduced\nparameters and compute for successful adaptation and hyperparameter searches\ncan explode quickly, especially when deployed at scale to serve numerous\nindividual requests. To ensure effective, parameter-efficient, and\nhyperparameter-robust adaptation, we propose the ETHER transformation family,\nwhich performs Efficient fineTuning via HypErplane Reflections. By design,\nETHER transformations require a minimal number of parameters, are less likely\nto deteriorate model performance, and exhibit robustness to hyperparameter and\nlearning rate choices. In particular, we introduce ETHER and its relaxation\nETHER+, which match or outperform existing PEFT methods with significantly\nfewer parameters ($\\sim$$10$-$100$ times lower than LoRA or OFT) across\nmultiple image synthesis and natural language tasks without exhaustive\nhyperparameter tuning. Finally, we investigate the recent emphasis on\nHyperspherical Energy retention for adaptation and raise questions on its\npractical utility. The code is available at https://github.com/mwbini/ether.\n","authors":["Massimo Bini","Karsten Roth","Zeynep Akata","Anna Khoreva"],"pdf_url":"https://arxiv.org/pdf/2405.20271v1.pdf","comment":"Accepted to ICML 2024. Code available at\n https://github.com/mwbini/ether"},{"id":"http://arxiv.org/abs/2405.20269v1","updated":"2024-05-30T17:21:15Z","published":"2024-05-30T17:21:15Z","title":"IsraParlTweet: The Israeli Parliamentary and Twitter Resource","summary":" We introduce IsraParlTweet, a new linked corpus of Hebrew-language\nparliamentary discussions from the Knesset (Israeli Parliament) between the\nyears 1992-2023 and Twitter posts made by Members of the Knesset between the\nyears 2008-2023, containing a total of 294.5 million Hebrew tokens. In addition\nto raw text, the corpus contains comprehensive metadata on speakers and Knesset\nsessions as well as several linguistic annotations. As a result, IsraParlTweet\ncan be used to conduct a wide variety of quantitative and qualitative analyses\nand provide valuable insights into political discourse in Israel.\n","authors":["Guy Mor-Lan","Effi Levi","Tamir Sheafer","Shaul R. Shenhav"],"pdf_url":"https://arxiv.org/pdf/2405.20269v1.pdf","comment":"Presented at LREC-COLING 2024"},{"id":"http://arxiv.org/abs/2405.20267v1","updated":"2024-05-30T17:19:19Z","published":"2024-05-30T17:19:19Z","title":"Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles\n and Committee Discussions","summary":" As LLMs evolve on a daily basis, there is an urgent need for a trustworthy\nevaluation method that can provide robust evaluation results in a timely\nfashion. Currently, as static benchmarks are prone to contamination concerns,\nusers tend to trust human voting platforms, such as Chatbot Arena. However,\nhuman annotations require extensive manual efforts. To provide an automatic,\nrobust, and trustworthy evaluation framework, we innovatively propose the\nAuto-Arena of LLMs, which automates the entire evaluation process with LLM\nagents. Firstly, an examiner LLM devises queries. Then, a pair of candidate\nLLMs engage in a multi-round peer-battle around the query, during which the\nLLM's true performance gaps become visible. Finally, a committee of LLM judges\ncollectively discuss and determine the winner, which alleviates bias and\npromotes fairness. In our extensive experiment on the 17 newest LLMs,\nAuto-Arena shows the highest correlation with human preferences, providing a\npromising alternative to human evaluation platforms.\n","authors":["Ruochen Zhao","Wenxuan Zhang","Yew Ken Chia","Deli Zhao","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2405.20267v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19327v2","updated":"2024-05-30T17:17:21Z","published":"2024-05-29T17:57:16Z","title":"MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model\n Series","summary":" Large Language Models (LLMs) have made great strides in recent years to\nachieve unprecedented performance across different tasks. However, due to\ncommercial interest, the most competitive models like GPT, Gemini, and Claude\nhave been gated behind proprietary interfaces without disclosing the training\ndetails. Recently, many institutions have open-sourced several strong LLMs like\nLLaMA-3, comparable to existing closed-source LLMs. However, only the model's\nweights are provided with most details (e.g., intermediate checkpoints,\npre-training corpus, and training code, etc.) being undisclosed. To improve the\ntransparency of LLMs, the research community has formed to open-source truly\nopen LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training\ncorpus and training code) are being provided. These models have greatly\nadvanced the scientific study of these large models including their strengths,\nweaknesses, biases and risks. However, we observe that the existing truly open\nLLMs on reasoning, knowledge, and coding tasks are still inferior to existing\nstate-of-the-art LLMs with similar model sizes. To this end, we open-source\nMAP-Neo, a highly capable and transparent bilingual language model with 7B\nparameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the\nfirst fully open-sourced bilingual LLM with comparable performance compared to\nexisting state-of-the-art LLMs. Moreover, we open-source all details to\nreproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning\npipeline, checkpoints, and well-optimized training/evaluation framework are\nprovided. Finally, we hope our MAP-Neo will enhance and strengthen the open\nresearch community and inspire more innovations and creativities to facilitate\nthe further improvements of LLMs.\n","authors":["Ge Zhang","Scott Qu","Jiaheng Liu","Chenchen Zhang","Chenghua Lin","Chou Leuang Yu","Danny Pan","Esther Cheng","Jie Liu","Qunshu Lin","Raven Yuan","Tuney Zheng","Wei Pang","Xinrun Du","Yiming Liang","Yinghao Ma","Yizhi Li","Ziyang Ma","Bill Lin","Emmanouil Benetos","Huan Yang","Junting Zhou","Kaijing Ma","Minghao Liu","Morry Niu","Noah Wang","Quehry Que","Ruibo Liu","Sine Liu","Shawn Guo","Soren Gao","Wangchunshu Zhou","Xinyue Zhang","Yizhi Zhou","Yubo Wang","Yuelin Bai","Yuhan Zhang","Yuxiang Zhang","Zenith Wang","Zhenzhu Yang","Zijian Zhao","Jiajun Zhang","Wanli Ouyang","Wenhao Huang","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2405.19327v2.pdf","comment":"https://map-neo.github.io/"},{"id":"http://arxiv.org/abs/2310.12815v2","updated":"2024-05-30T17:09:56Z","published":"2023-10-19T15:12:09Z","title":"Formalizing and Benchmarking Prompt Injection Attacks and Defenses","summary":" A prompt injection attack aims to inject malicious instruction/data into the\ninput of an LLM-Integrated Application such that it produces results as an\nattacker desires. Existing works are limited to case studies. As a result, the\nliterature lacks a systematic understanding of prompt injection attacks and\ntheir defenses. We aim to bridge the gap in this work. In particular, we\npropose a framework to formalize prompt injection attacks. Existing attacks are\nspecial cases in our framework. Moreover, based on our framework, we design a\nnew attack by combining existing ones. Using our framework, we conduct a\nsystematic evaluation on 5 prompt injection attacks and 10 defenses with 10\nLLMs and 7 tasks. Our work provides a common benchmark for quantitatively\nevaluating future prompt injection attacks and defenses. To facilitate research\non this topic, we make our platform public at\nhttps://github.com/liu00222/Open-Prompt-Injection.\n","authors":["Yupei Liu","Yuqi Jia","Runpeng Geng","Jinyuan Jia","Neil Zhenqiang Gong"],"pdf_url":"https://arxiv.org/pdf/2310.12815v2.pdf","comment":"To appear in USENIX Security Symposium 2024"},{"id":"http://arxiv.org/abs/2405.20253v1","updated":"2024-05-30T17:06:03Z","published":"2024-05-30T17:06:03Z","title":"Evaluating Large Language Model Biases in Persona-Steered Generation","summary":" The task of persona-steered text generation requires large language models\n(LLMs) to generate text that reflects the distribution of views that an\nindividual fitting a persona could have. People have multifaceted personas, but\nprior work on bias in LLM-generated opinions has only explored multiple-choice\nsettings or one-dimensional personas. We define an incongruous persona as a\npersona with multiple traits where one trait makes its other traits less likely\nin human survey data, e.g. political liberals who support increased military\nspending. We find that LLMs are 9.7% less steerable towards incongruous\npersonas than congruous ones, sometimes generating the stereotypical stance\nassociated with its demographic rather than the target stance. Models that we\nevaluate that are fine-tuned with Reinforcement Learning from Human Feedback\n(RLHF) are more steerable, especially towards stances associated with political\nliberals and women, but present significantly less diverse views of personas.\nWe also find variance in LLM steerability that cannot be predicted from\nmultiple-choice opinion evaluation. Our results show the importance of\nevaluating models in open-ended text generation, as it can surface new LLM\nopinion biases. Moreover, such a setup can shed light on our ability to steer\nmodels toward a richer and more diverse range of viewpoints.\n","authors":["Andy Liu","Mona Diab","Daniel Fried"],"pdf_url":"https://arxiv.org/pdf/2405.20253v1.pdf","comment":"Accepted to Findings of ACL 2024. Code and data available at\n https://github.com/andyjliu/persona-steered-generation-bias"},{"id":"http://arxiv.org/abs/2405.20252v1","updated":"2024-05-30T17:05:45Z","published":"2024-05-30T17:05:45Z","title":"Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt\n Optimization","summary":" Large language models (LLMs) have shown great progress in responding to user\nquestions, allowing for a multitude of diverse applications. Yet, the quality\nof LLM outputs heavily depends on the prompt design, where a good prompt might\nenable the LLM to answer a very challenging question correctly. Therefore,\nrecent works have developed many strategies for improving the prompt, including\nboth manual crafting and in-domain optimization. However, their efficacy in\nunrestricted scenarios remains questionable, as the former depends on human\ndesign for specific questions and the latter usually generalizes poorly to\nunseen scenarios. To address these problems, we give LLMs the freedom to design\nthe best prompts according to themselves. Specifically, we include a hierarchy\nof LLMs, first constructing a prompt with precise instructions and accurate\nwording in a hierarchical manner, and then using this prompt to generate the\nfinal answer to the user query. We term this pipeline Hierarchical Multi-Agent\nWorkflow, or HMAW. In contrast with prior works, HMAW imposes no human\nrestriction and requires no training, and is completely task-agnostic while\ncapable of adjusting to the nuances of the underlying task. Through both\nquantitative and qualitative experiments across multiple benchmarks, we verify\nthat despite its simplicity, the proposed approach can create detailed and\nsuitable prompts, further boosting the performance of current LLMs.\n","authors":["Yuchi Liu","Jaskirat Singh","Gaowen Liu","Ali Payani","Liang Zheng"],"pdf_url":"https://arxiv.org/pdf/2405.20252v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.12715v2","updated":"2024-05-30T16:59:56Z","published":"2024-04-19T08:52:22Z","title":"Ensemble Learning for Heterogeneous Large Language Models with Deep\n Parallel Collaboration","summary":" Large language models (LLMs) exhibit complementary strengths in various\ntasks, motivating the research of LLM ensembling. However, existing work\nfocuses on training an extra reward model or fusion model to select or combine\nall candidate answers, posing a great challenge to the generalization on unseen\ndata distributions. Besides, prior methods use textual responses as\ncommunication media, ignoring the valuable information in the internal\nrepresentations. In this work, we propose a training-free ensemble framework\nDeePEn, fusing the informative probability distributions yielded by different\nLLMs at each decoding step. Unfortunately, the vocabulary discrepancy between\nheterogeneous LLMs directly makes averaging the distributions unfeasible due to\nthe token misalignment. To address this challenge, DeePEn maps the probability\ndistribution of each model from its own probability space to a universal\nrelative space based on the relative representation theory, and performs\naggregation. Next, we devise a search-based inverse transformation to transform\nthe aggregated result back to the probability space of one of the ensembling\nLLMs (main model), in order to determine the next token. We conduct extensive\nexperiments on ensembles of different number of LLMs, ensembles of LLMs with\ndifferent architectures, and ensembles between the LLM and the specialist\nmodel. Experimental results show that (i) DeePEn achieves consistent\nimprovements across six benchmarks covering subject examination, reasoning, and\nknowledge, (ii) a well-performing specialist model can benefit from a less\neffective LLM through distribution fusion, and (iii) DeePEn has complementary\nstrengths with other ensemble methods such as voting.\n","authors":["Yichong Huang","Xiaocheng Feng","Baohang Li","Yang Xiang","Hui Wang","Bing Qin","Ting Liu"],"pdf_url":"https://arxiv.org/pdf/2404.12715v2.pdf","comment":"16 pages, 9 figures, 9 tables"},{"id":"http://arxiv.org/abs/2405.20245v1","updated":"2024-05-30T16:54:42Z","published":"2024-05-30T16:54:42Z","title":"Retrieval Augmented Structured Generation: Business Document Information\n Extraction As Tool Use","summary":" Business Document Information Extraction (BDIE) is the problem of\ntransforming a blob of unstructured information (raw text, scanned documents,\netc.) into a structured format that downstream systems can parse and use. It\nhas two main tasks: Key-Information Extraction (KIE) and Line Items Recognition\n(LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem,\nwhere the tools are these downstream systems. We then present Retrieval\nAugmented Structured Generation (RASG), a novel general framework for BDIE that\nachieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE\nbenchmarks.\n The contributions of this paper are threefold: (1) We show, with ablation\nbenchmarks, that Large Language Models (LLMs) with RASG are already competitive\nwith or surpasses current SOTA Large Multimodal Models (LMMs) without RASG on\nBDIE benchmarks. (2) We propose a new metric class for Line Items Recognition,\nGeneral Line Items Recognition Metric (GLIRM), that is more aligned with\npractical BDIE use cases compared to existing metrics, such as ANLS*, DocILE,\nand GriTS. (3) We provide a heuristic algorithm for backcalculating bounding\nboxes of predicted line items and tables without the need for vision encoders.\nFinally, we claim that, while LMMs might sometimes offer marginal performance\nbenefits, LLMs + RASG is oftentimes superior given real-world applications and\nconstraints of BDIE.\n","authors":["Franz Louis Cesista","Rui Aguiar","Jason Kim","Paolo Acilo"],"pdf_url":"https://arxiv.org/pdf/2405.20245v1.pdf","comment":"Accepted by IEEE 7th International Conference on Multimedia\n Information Processing and Retrieval (MIPR), 2024"},{"id":"http://arxiv.org/abs/2205.15744v2","updated":"2024-05-30T16:40:52Z","published":"2022-05-31T12:29:25Z","title":"EMS: Efficient and Effective Massively Multilingual Sentence Embedding\n Learning","summary":" Massively multilingual sentence representation models, e.g., LASER,\nSBERT-distill, and LaBSE, help significantly improve cross-lingual downstream\ntasks. However, the use of a large amount of data or inefficient model\narchitectures results in heavy computation to train a new model according to\nour preferred languages and domains. To resolve this issue, we introduce\nefficient and effective massively multilingual sentence embedding (EMS), using\ncross-lingual token-level reconstruction (XTR) and sentence-level contrastive\nlearning as training objectives. Compared with related studies, the proposed\nmodel can be efficiently trained using significantly fewer parallel sentences\nand GPU computation resources. Empirical results showed that the proposed model\nsignificantly yields better or comparable results with regard to cross-lingual\nsentence retrieval, zero-shot cross-lingual genre classification, and sentiment\nclassification. Ablative analyses demonstrated the efficiency and effectiveness\nof each component of the proposed model. We release the codes for model\ntraining and the EMS pre-trained sentence embedding model, which supports 62\nlanguages ( https://github.com/Mao-KU/EMS ).\n","authors":["Zhuoyuan Mao","Chenhui Chu","Sadao Kurohashi"],"pdf_url":"https://arxiv.org/pdf/2205.15744v2.pdf","comment":"This work is a multilingual extension of arXiv:2105.13856. This work\n has been accepted by IEEE/ACM Transactions on Audio, Speech, and Language\n Processing (DOI: 10.1109/TASLP.2024.3402064). Copyright has been transferred"},{"id":"http://arxiv.org/abs/2402.14800v2","updated":"2024-05-30T16:24:16Z","published":"2024-02-22T18:56:07Z","title":"Not All Experts are Equal: Efficient Expert Pruning and Skipping for\n Mixture-of-Experts Large Language Models","summary":" A pivotal advancement in the progress of large language models (LLMs) is the\nemergence of the Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs,\nMoE LLMs can achieve higher performance with fewer parameters, but it is still\nhard to deploy them due to their immense parameter sizes. Different from\nprevious weight pruning methods that rely on specifically designed hardware,\nthis paper mainly aims to enhance the deployment efficiency of MoE LLMs by\nintroducing plug-and-play expert-level sparsification techniques. Specifically,\nwe propose, for the first time to our best knowledge, post-training approaches\nfor task-agnostic and task-specific expert pruning and skipping of MoE LLMs,\ntailored to improve deployment efficiency while maintaining model performance\nacross a wide range of tasks. Extensive experiments show that our proposed\nmethods can simultaneously reduce model sizes and increase the inference speed,\nwhile maintaining satisfactory performance. Data and code will be available at\nhttps://github.com/Lucky-Lance/Expert_Sparsity.\n","authors":["Xudong Lu","Qi Liu","Yuhui Xu","Aojun Zhou","Siyuan Huang","Bo Zhang","Junchi Yan","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2402.14800v2.pdf","comment":"Mixture-of-Experts Large Language Models, ACL2024"},{"id":"http://arxiv.org/abs/2405.20215v1","updated":"2024-05-30T16:17:40Z","published":"2024-05-30T16:17:40Z","title":"TS-Align: A Teacher-Student Collaborative Framework for Scalable\n Iterative Finetuning of Large Language Models","summary":" Mainstream approaches to aligning large language models (LLMs) heavily rely\non human preference data, particularly when models require periodic updates.\nThe standard process for iterative alignment of LLMs involves collecting new\nhuman feedback for each update. However, the data collection process is costly\nand challenging to scale. To address this issue, we introduce the \"TS-Align\"\nframework, which fine-tunes a policy model using pairwise feedback data\nautomatically mined from its outputs. This automatic mining process is\nefficiently accomplished through the collaboration between a large-scale\nteacher model and a small-scale student model. The policy fine-tuning process\ncan be iteratively repeated using on-policy generations within our proposed\nteacher-student collaborative framework. Through extensive experiments, we\ndemonstrate that our final aligned policy outperforms the base policy model\nwith an average win rate of 69.7% across seven conversational or\ninstruction-following datasets. Furthermore, we show that the ranking\ncapability of the teacher is effectively distilled into the student through our\npipeline, resulting in a small-scale yet effective reward model for policy\nmodel alignment.\n","authors":["Chen Zhang","Chengguang Tang","Dading Chong","Ke Shi","Guohua Tang","Feng Jiang","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2405.20215v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20213v1","updated":"2024-05-30T16:16:25Z","published":"2024-05-30T16:16:25Z","title":"PostDoc: Generating Poster from a Long Multimodal Document Using Deep\n Submodular Optimization","summary":" A poster from a long input document can be considered as a one-page\neasy-to-read multimodal (text and images) summary presented on a nice template\nwith good design elements. Automatic transformation of a long document into a\nposter is a very less studied but challenging task. It involves content\nsummarization of the input document followed by template generation and\nharmonization. In this work, we propose a novel deep submodular function which\ncan be trained on ground truth summaries to extract multimodal content from the\ndocument and explicitly ensures good coverage, diversity and alignment of text\nand images. Then, we use an LLM based paraphraser and propose to generate a\ntemplate with various design aspects conditioned on the input content. We show\nthe merits of our approach through extensive automated and human evaluations.\n","authors":["Vijay Jaisankar","Sambaran Bandyopadhyay","Kalp Vyas","Varre Chaitanya","Shwetha Somasundaram"],"pdf_url":"https://arxiv.org/pdf/2405.20213v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20204v1","updated":"2024-05-30T16:07:54Z","published":"2024-05-30T16:07:54Z","title":"Jina CLIP: Your CLIP Model Is Also Your Text Retriever","summary":" Contrastive Language-Image Pretraining (CLIP) is widely used to train models\nto align images and texts in a common embedding space by mapping them to\nfixed-sized vectors. These models are key to multimodal information retrieval\nand related tasks. However, CLIP models generally underperform in text-only\ntasks compared to specialized text models. This creates inefficiencies for\ninformation retrieval systems that keep separate embeddings and models for\ntext-only and multimodal tasks. We propose a novel, multi-task contrastive\ntraining method to address this issue, which we use to train the jina-clip-v1\nmodel to achieve the state-of-the-art performance on both text-image and\ntext-text retrieval tasks.\n","authors":["Andreas Koukounas","Georgios Mastrapas","Michael Günther","Bo Wang","Scott Martens","Isabelle Mohr","Saba Sturua","Mohammad Kalim Akram","Joan Fontanals Martínez","Saahil Ognawala","Susana Guzman","Maximilian Werk","Nan Wang","Han Xiao"],"pdf_url":"https://arxiv.org/pdf/2405.20204v1.pdf","comment":"4 pages, ICML2024 workshop submission"},{"id":"http://arxiv.org/abs/2405.20192v1","updated":"2024-05-30T15:57:19Z","published":"2024-05-30T15:57:19Z","title":"TAIA: Large Language Models are Out-of-Distribution Data Learners","summary":" Fine-tuning on task-specific question-answer pairs is a predominant method\nfor enhancing the performance of instruction-tuned large language models (LLMs)\non downstream tasks. However, in certain specialized domains, such as\nhealthcare or harmless content generation, it is nearly impossible to obtain a\nlarge volume of high-quality data that matches the downstream distribution. To\nimprove the performance of LLMs in data-scarce domains with domain-mismatched\ndata, we re-evaluated the Transformer architecture and discovered that not all\nparameter updates during fine-tuning contribute positively to downstream\nperformance. Our analysis reveals that within the self-attention and\nfeed-forward networks, only the fine-tuned attention parameters are\nparticularly beneficial when the training set's distribution does not fully\nalign with the test set. Based on this insight, we propose an effective\ninference-time intervention method: \\uline{T}raining \\uline{A}ll parameters but\n\\uline{I}nferring with only \\uline{A}ttention (\\trainallInfAttn). We\nempirically validate \\trainallInfAttn using two general instruction-tuning\ndatasets and evaluate it on seven downstream tasks involving math, reasoning,\nand knowledge understanding across LLMs of different parameter sizes and\nfine-tuning techniques. Our comprehensive experiments demonstrate that\n\\trainallInfAttn achieves superior improvements compared to both the fully\nfine-tuned model and the base model in most scenarios, with significant\nperformance gains. The high tolerance of \\trainallInfAttn to data mismatches\nmakes it resistant to jailbreaking tuning and enhances specialized tasks using\ngeneral data.\n","authors":["Shuyang Jiang","Yusheng Liao","Ya Zhang","Yu Wang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20192v1.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2405.20179v1","updated":"2024-05-30T15:47:54Z","published":"2024-05-30T15:47:54Z","title":"Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning\n CodeLLMs","summary":" Large language models (LLMs) have shown great promise at generating robot\nprograms from natural language given domain-specific robot application\nprogramming interfaces (APIs). However, the performance gap between proprietary\nLLMs and smaller open-weight LLMs remains wide. This raises a question: Can we\nfine-tune smaller open-weight LLMs for generating domain-specific robot\nprograms to close the performance gap with proprietary LLMs? While\nSelf-Instruct is a promising solution by generating a diverse set of training\ndata, it cannot verify the correctness of these programs. In contrast, a robot\nsimulator with a well-defined world can identify execution errors but limits\nthe diversity of programs that it can verify. In this work, we introduce\nRobo-Instruct, which brings the best of both worlds -- it promotes the\ndiversity of Self-Instruct while providing the correctness of simulator-based\nchecking. Robo-Instruct introduces RoboSim to synthesize a consistent world\nstate on the fly by inferring properties relevant to the program being checked,\nand simulating actions accordingly. Furthermore, the instructions and programs\ngenerated by Self-Instruct may be subtly inconsistent -- such as the program\nmissing a step implied by the instruction. Robo-Instruct further addresses this\nwith InstAlign, an instruction-program alignment procedure that revises the\ntask instruction to reflect the actual results of the generated program. Given\na few seed task descriptions and the robot APIs, Robo-Instruct is capable of\ngenerating a training dataset using only a small open-weight model. This\ndataset can then be used to fine-tune small open-weight language models,\nenabling them to match or even exceed the performance of several proprietary\nLLMs, such as GPT-3.5-Turbo and Gemini-Pro.\n","authors":["Zichao Hu","Junyi Jessy Li","Arjun Guha","Joydeep Biswas"],"pdf_url":"https://arxiv.org/pdf/2405.20179v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.11505v2","updated":"2024-05-30T15:46:10Z","published":"2024-02-18T08:32:59Z","title":"Federated Fine-tuning of Large Language Models under Heterogeneous Tasks\n and Client Resources","summary":" Federated Learning (FL) has recently been applied to the parameter-efficient\nfine-tuning of Large Language Models (LLMs). While promising, it raises\nsignificant challenges due to the heterogeneous resources and data\ndistributions of clients. This study introduces FlexLoRA, a simple yet\neffective aggregation scheme for LLM fine-tuning, which mitigates the ``bucket\neffect'' in traditional FL that restricts the potential of clients with ample\nresources by tying them to the capabilities of the least-resourced\nparticipants. FlexLoRA allows for dynamic adjustment of local LoRA ranks,\nfostering the development of a global model imbued with broader, less\ntask-specific knowledge. By synthesizing a full-size LoRA weight from\nindividual client contributions and employing Singular Value Decomposition\n(SVD) for weight redistribution, FlexLoRA fully leverages heterogeneous client\nresources. Involving thousands of clients performing heterogeneous NLP tasks\nand client resources, our experiments validate the efficacy of FlexLoRA, with\nthe federated global model achieving consistently better improvement over SOTA\nFL methods in downstream NLP task performance across various heterogeneous\ndistributions. FlexLoRA's practicality is further underscored by our\ntheoretical analysis and its seamless integration with existing LoRA-based FL\nmethods, offering a path toward cross-device, privacy-preserving federated\ntuning for LLMs.\n","authors":["Jiamu Bai","Daoyuan Chen","Bingchen Qian","Liuyi Yao","Yaliang Li"],"pdf_url":"https://arxiv.org/pdf/2402.11505v2.pdf","comment":"19 pages, 13 tables, 9 figures"},{"id":"http://arxiv.org/abs/2405.20175v1","updated":"2024-05-30T15:45:13Z","published":"2024-05-30T15:45:13Z","title":"InstructionCP: A fast approach to transfer Large Language Models into\n target language","summary":" The rapid development of large language models (LLMs) in recent years has\nlargely focused on English, resulting in models that respond exclusively in\nEnglish. To adapt these models to other languages, continual pre-training (CP)\nis often employed, followed by supervised fine-tuning (SFT) to maintain\nconversational abilities. However, CP and SFT can reduce a model's ability to\nfilter harmful content. We propose Instruction Continual Pre-training (InsCP),\nwhich integrates instruction tags into the CP process to prevent loss of\nconversational proficiency while acquiring new languages. Our experiments\ndemonstrate that InsCP retains conversational and Reinforcement Learning from\nHuman Feedback (RLHF) abilities. Empirical evaluations on language alignment,\nreliability, and knowledge benchmarks confirm the efficacy of InsCP. Notably,\nthis approach requires only 0.1 billion tokens of high-quality\ninstruction-following data, thereby reducing resource consumption.\n","authors":["Kuang-Ming Chen","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20175v1.pdf","comment":"10 pages, 1 figure"},{"id":"http://arxiv.org/abs/2402.15159v3","updated":"2024-05-30T15:44:51Z","published":"2024-02-23T07:43:26Z","title":"Machine Unlearning of Pre-trained Large Language Models","summary":" This study investigates the concept of the `right to be forgotten' within the\ncontext of large language models (LLMs). We explore machine unlearning as a\npivotal solution, with a focus on pre-trained models--a notably\nunder-researched area. Our research delineates a comprehensive framework for\nmachine unlearning in pre-trained LLMs, encompassing a critical analysis of\nseven diverse unlearning methods. Through rigorous evaluation using curated\ndatasets from arXiv, books, and GitHub, we establish a robust benchmark for\nunlearning performance, demonstrating that these methods are over $10^5$ times\nmore computationally efficient than retraining. Our results show that\nintegrating gradient ascent with gradient descent on in-distribution data\nimproves hyperparameter robustness. We also provide detailed guidelines for\nefficient hyperparameter tuning in the unlearning process. Our findings advance\nthe discourse on ethical AI practices, offering substantive insights into the\nmechanics of machine unlearning for pre-trained LLMs and underscoring the\npotential for responsible AI development.\n","authors":["Jin Yao","Eli Chien","Minxin Du","Xinyao Niu","Tianhao Wang","Zezhou Cheng","Xiang Yue"],"pdf_url":"https://arxiv.org/pdf/2402.15159v3.pdf","comment":"ACL 2024 main. Code and data at\n https://github.com/yaojin17/Unlearning_LLM"},{"id":"http://arxiv.org/abs/2405.20172v1","updated":"2024-05-30T15:44:27Z","published":"2024-05-30T15:44:27Z","title":"Iterative Feature Boosting for Explainable Speech Emotion Recognition","summary":" In speech emotion recognition (SER), using predefined features without\nconsidering their practical importance may lead to high dimensional datasets,\nincluding redundant and irrelevant information. Consequently, high-dimensional\nlearning often results in decreasing model accuracy while increasing\ncomputational complexity. Our work underlines the importance of carefully\nconsidering and analyzing features in order to build efficient SER systems. We\npresent a new supervised SER method based on an efficient feature engineering\napproach. We pay particular attention to the explainability of results to\nevaluate feature relevance and refine feature sets. This is performed\niteratively through feature evaluation loop, using Shapley values to boost\nfeature selection and improve overall framework performance. Our approach\nallows thus to balance the benefits between model performance and transparency.\nThe proposed method outperforms human-level performance (HLP) and\nstate-of-the-art machine learning methods in emotion recognition on the TESS\ndataset.\n","authors":["Alaa Nfissi","Wassim Bouachir","Nizar Bouguila","Brian Mishara"],"pdf_url":"https://arxiv.org/pdf/2405.20172v1.pdf","comment":"Published in: 2023 International Conference on Machine Learning and\n Applications (ICMLA)"},{"id":"http://arxiv.org/abs/2405.20163v1","updated":"2024-05-30T15:38:54Z","published":"2024-05-30T15:38:54Z","title":"Reasoning about concepts with LLMs: Inconsistencies abound","summary":" The ability to summarize and organize knowledge into abstract concepts is key\nto learning and reasoning. Many industrial applications rely on the consistent\nand systematic use of concepts, especially when dealing with decision-critical\nknowledge. However, we demonstrate that, when methodically questioned, large\nlanguage models (LLMs) often display and demonstrate significant\ninconsistencies in their knowledge. Computationally, the basic aspects of the\nconceptualization of a given domain can be represented as Is-A hierarchies in a\nknowledge graph (KG) or ontology, together with a few properties or axioms that\nenable straightforward reasoning. We show that even simple ontologies can be\nused to reveal conceptual inconsistencies across several LLMs. We also propose\nstrategies that domain experts can use to evaluate and improve the coverage of\nkey domain concepts in LLMs of various sizes. In particular, we have been able\nto significantly enhance the performance of LLMs of various sizes with openly\navailable weights using simple knowledge-graph (KG) based prompting strategies.\n","authors":["Rosario Uceda-Sosa","Karthikeyan Natesan Ramamurthy","Maria Chang","Moninder Singh"],"pdf_url":"https://arxiv.org/pdf/2405.20163v1.pdf","comment":"15 pages, 5 figures, 3 tables"},{"id":"http://arxiv.org/abs/2405.09983v2","updated":"2024-05-30T15:34:10Z","published":"2024-05-16T11:01:09Z","title":"Zero-Shot Hierarchical Classification on the Common Procurement\n Vocabulary Taxonomy","summary":" Classifying public tenders is a useful task for both companies that are\ninvited to participate and for inspecting fraudulent activities. To facilitate\nthe task for both participants and public administrations, the European Union\npresented a common taxonomy (Common Procurement Vocabulary, CPV) which is\nmandatory for tenders of certain importance; however, the contracts in which a\nCPV label is mandatory are the minority compared to all the Public\nAdministrations activities. Classifying over a real-world taxonomy introduces\nsome difficulties that can not be ignored. First of all, some fine-grained\nclasses have an insufficient (if any) number of observations in the training\nset, while other classes are far more frequent (even thousands of times) than\nthe average. To overcome those difficulties, we present a zero-shot approach,\nbased on a pre-trained language model that relies only on label description and\nrespects the label taxonomy. To train our proposed model, we used industrial\ndata, which comes from contrattipubblici.org, a service by SpazioDati s.r.l.\nthat collects public contracts stipulated in Italy in the last 25 years.\nResults show that the proposed model achieves better performance in classifying\nlow-frequent classes compared to three different baselines, and is also able to\npredict never-seen classes.\n","authors":["Federico Moiraghi","Matteo Palmonari","Davide Allavena","Federico Morando"],"pdf_url":"https://arxiv.org/pdf/2405.09983v2.pdf","comment":"Full-length version of the short paper accepted at COMPSAC 2024"},{"id":"http://arxiv.org/abs/2405.20145v1","updated":"2024-05-30T15:23:34Z","published":"2024-05-30T15:23:34Z","title":"Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource\n Language Analysis With Character-Aware Hierarchical Transformers","summary":" Historical languages present unique challenges to the NLP community, with one\nprominent hurdle being the limited resources available in their closed corpora.\nThis work describes our submission to the constrained subtask of the SIGTYP\n2024 shared task, focusing on PoS tagging, morphological tagging, and\nlemmatization for 13 historical languages. For PoS and morphological tagging we\nadapt a hierarchical tokenization method from Sun et al. (2023) and combine it\nwith the advantages of the DeBERTa-V3 architecture, enabling our models to\nefficiently learn from every character in the training data. We also\ndemonstrate the effectiveness of character-level T5 models on the lemmatization\ntask. Pre-trained from scratch with limited data, our models achieved first\nplace in the constrained subtask, nearly reaching the performance levels of the\nunconstrained task's winner. Our code is available at\nhttps://github.com/bowphs/SIGTYP-2024-hierarchical-transformers\n","authors":["Frederick Riemenschneider","Kevin Krahn"],"pdf_url":"https://arxiv.org/pdf/2405.20145v1.pdf","comment":"Accepted for publication at the 6th Workshop on Research in\n Computational Linguistic Typology and Multilingual NLP (SIGTYP-WS) 2024; 11\n pages, 1 figure, 9 tables"},{"id":"http://arxiv.org/abs/2405.17503v2","updated":"2024-05-30T15:20:19Z","published":"2024-05-26T04:00:30Z","title":"Code Repair with LLMs gives an Exploration-Exploitation Tradeoff","summary":" Iteratively improving and repairing source code with large language models\n(LLMs), known as refinement, has emerged as a popular way of generating\nprograms that would be too complex to construct in one shot. Given a bank of\ntest cases, together with a candidate program, an LLM can improve that program\nby being prompted with failed test cases. But it remains an open question how\nto best iteratively refine code, with prior work employing simple greedy or\nbreadth-first strategies. We show here that refinement exposes an\nexplore-exploit tradeoff: exploit by refining the program that passes the most\ntest cases, or explore by refining a lesser considered program. We frame this\nas an arm-acquiring bandit problem, which we solve with Thompson Sampling. The\nresulting LLM-based program synthesis algorithm is broadly applicable: Across\nloop invariant synthesis, visual reasoning puzzles, and competition programming\nproblems, we find that our new method can solve more problems using fewer\nlanguage model calls.\n","authors":["Hao Tang","Keya Hu","Jin Peng Zhou","Sicheng Zhong","Wei-Long Zheng","Xujie Si","Kevin Ellis"],"pdf_url":"https://arxiv.org/pdf/2405.17503v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.15112v3","updated":"2024-05-30T15:17:55Z","published":"2024-03-22T11:08:48Z","title":"Text clustering with LLM embeddings","summary":" Text clustering is an important approach for organising the growing amount of\ndigital content, helping to structure and find hidden patterns in uncategorised\ndata. However, the effectiveness of text clustering heavily relies on the\nchoice of textual embeddings and clustering algorithms. We argue that recent\nadvances in large language models (LLMs) can potentially improve this task. In\nthis research, we investigated how different textual embeddings -- particularly\nthose used in LLMs -- and clustering algorithms affect how text datasets are\nclustered. A series of experiments were conducted to assess how embeddings\ninfluence clustering results, the role played by dimensionality reduction\nthrough summarisation, and model size adjustment. Findings reveal that LLM\nembeddings excel at capturing subtleties in structured language, while BERT\nleads the lightweight options in performance. In addition, we observe that\nincreasing model dimensionality and employing summarization techniques do not\nconsistently lead to improvements in clustering efficiency, suggesting that\nthese strategies require careful analysis to use in real-life models. These\nresults highlight a complex balance between the need for refined text\nrepresentation and computational feasibility in text clustering applications.\nThis study extends traditional text clustering frameworks by incorporating\nembeddings from LLMs, providing a path for improved methodologies, while\ninforming new avenues for future research in various types of textual analysis.\n","authors":["Alina Petukhova","João P. Matos-Carvalho","Nuno Fachada"],"pdf_url":"https://arxiv.org/pdf/2403.15112v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20139v1","updated":"2024-05-30T15:14:24Z","published":"2024-05-30T15:14:24Z","title":"GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning","summary":" Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form\nof triplets (head, relation, tail), which collectively form a graph. Question\nAnswering over KGs (KGQA) is the task of answering natural questions grounding\nthe reasoning to the information provided by the KG. Large Language Models\n(LLMs) are the state-of-the-art models for QA tasks due to their remarkable\nability to understand natural language. On the other hand, Graph Neural\nNetworks (GNNs) have been widely used for KGQA as they can handle the complex\ngraph information stored in the KG. In this work, we introduce GNN-RAG, a novel\nmethod for combining language understanding abilities of LLMs with the\nreasoning abilities of GNNs in a retrieval-augmented generation (RAG) style.\nFirst, a GNN reasons over a dense KG subgraph to retrieve answer candidates for\na given question. Second, the shortest paths in the KG that connect question\nentities and answer candidates are extracted to represent KG reasoning paths.\nThe extracted paths are verbalized and given as input for LLM reasoning with\nRAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to\nextract useful graph information, while the LLM leverages its natural language\nprocessing ability for ultimate KGQA. Furthermore, we develop a retrieval\naugmentation (RA) technique to further boost KGQA performance with GNN-RAG.\nExperimental results show that GNN-RAG achieves state-of-the-art performance in\ntwo widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching\nGPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop\nand multi-entity questions outperforming competing approaches by 8.9--15.5%\npoints at answer F1.\n","authors":["Costas Mavromatis","George Karypis"],"pdf_url":"https://arxiv.org/pdf/2405.20139v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20131v1","updated":"2024-05-30T15:10:37Z","published":"2024-05-30T15:10:37Z","title":"Language Models Need Inductive Biases to Count Inductively","summary":" Counting is a fundamental example of generalization, whether viewed through\nthe mathematical lens of Peano's axioms defining the natural numbers or the\ncognitive science literature for children learning to count. The argument holds\nfor both cases that learning to count means learning to count infinitely. While\nfew papers have tried to distill transformer \"reasoning\" to the simplest case\nof counting, investigating length generalization does occur throughout the\nliterature. In the \"train short, test long\" paradigm of NLP, length refers to\nthe training sentence length. In formal language recognition, length refers to\nthe input sequence length, or the maximum stack size induced by a pushdown\nautomata. In general problem solving, length refers to the number of hops in a\ndeductive reasoning chain or the recursion depth. For all cases, counting is\ncentral to task success. And crucially, generalizing counting inductively is\ncentral to success on OOD instances. This work provides extensive empirical\nresults on training language models to count. We experiment with architectures\nranging from RNNs, Transformers, State-Space Models and RWKV. We present\ncarefully-designed task formats, auxiliary tasks and positional embeddings to\navoid limitations in generalization with OOD-position and OOD-vocabulary. We\nfind that while traditional RNNs trivially achieve inductive counting,\nTransformers have to rely on positional embeddings to count out-of-domain. As\ncounting is the basis for many arguments concerning the expressivity of\nTransformers, our finding calls for the community to reexamine the application\nscope of primitive functions defined in formal characterizations. Finally,\nmodern RNNs also largely underperform traditional RNNs in generalizing counting\ninductively. We discuss how design choices that enable parallelized training of\nmodern RNNs cause them to lose merits of a recurrent nature.\n","authors":["Yingshan Chang","Yonatan Bisk"],"pdf_url":"https://arxiv.org/pdf/2405.20131v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12942v5","updated":"2024-05-30T14:49:25Z","published":"2023-10-19T17:39:47Z","title":"On the Representational Capacity of Recurrent Neural Language Models","summary":" This work investigates the computational expressivity of language models\n(LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992)\nfamously showed that RNNs with rational weights and hidden states and unbounded\ncomputation time are Turing complete. However, LMs define weightings over\nstrings in addition to just (unweighted) language membership and the analysis\nof the computational power of RNN LMs (RLMs) should reflect this. We extend the\nTuring completeness result to the probabilistic case, showing how a rationally\nweighted RLM with unbounded computation time can simulate any deterministic\nprobabilistic Turing machine (PTM) with rationally weighted transitions. Since,\nin practice, RLMs work in real-time, processing a symbol at every time step, we\ntreat the above result as an upper bound on the expressivity of RLMs. We also\nprovide a lower bound by showing that under the restriction to real-time\ncomputation, such models can simulate deterministic real-time rational PTMs.\n","authors":["Franz Nowak","Anej Svete","Li Du","Ryan Cotterell"],"pdf_url":"https://arxiv.org/pdf/2310.12942v5.pdf","comment":"Added requirement for non-negative probabilities to definitions 2.3\n and 3.1, fixed typos"},{"id":"http://arxiv.org/abs/2404.04530v2","updated":"2024-05-30T14:44:10Z","published":"2024-04-06T07:10:47Z","title":"A Morphology-Based Investigation of Positional Encodings","summary":" Contemporary deep learning models effectively handle languages with diverse\nmorphology despite not being directly integrated into them. Morphology and word\norder are closely linked, with the latter incorporated into transformer-based\nmodels through positional encodings. This prompts a fundamental inquiry: Is\nthere a correlation between the morphological complexity of a language and the\nutilization of positional encoding in pre-trained language models? In pursuit\nof an answer, we present the first study addressing this question, encompassing\n22 languages and 5 downstream tasks. Our findings reveal that the importance of\npositional encoding diminishes with increasing morphological complexity in\nlanguages. Our study motivates the need for a deeper understanding of\npositional encoding, augmenting them to better reflect the different languages\nunder consideration.\n","authors":["Poulami Ghosh","Shikhar Vashishth","Raj Dabre","Pushpak Bhattacharyya"],"pdf_url":"https://arxiv.org/pdf/2404.04530v2.pdf","comment":"Work in Progress"},{"id":"http://arxiv.org/abs/2405.20101v1","updated":"2024-05-30T14:41:39Z","published":"2024-05-30T14:41:39Z","title":"Fill in the Gap! Combining Self-supervised Representation Learning with\n Neural Audio Synthesis for Speech Inpainting","summary":" Most speech self-supervised learning (SSL) models are trained with a pretext\ntask which consists in predicting missing parts of the input signal, either\nfuture segments (causal prediction) or segments masked anywhere within the\ninput (non-causal prediction). Learned speech representations can then be\nefficiently transferred to downstream tasks (e.g., automatic speech or speaker\nrecognition). In the present study, we investigate the use of a speech SSL\nmodel for speech inpainting, that is reconstructing a missing portion of a\nspeech signal from its surrounding context, i.e., fulfilling a downstream task\nthat is very similar to the pretext task. To that purpose, we combine an SSL\nencoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role\nof a decoder. In particular, we propose two solutions to match the HuBERT\noutput with the HiFiGAN input, by freezing one and fine-tuning the other, and\nvice versa. Performance of both approaches was assessed in single- and\nmulti-speaker settings, for both informed and blind inpainting configurations\n(i.e., the position of the mask is known or unknown, respectively), with\ndifferent objective metrics and a perceptual evaluation. Performances show that\nif both solutions allow to correctly reconstruct signal portions up to the size\nof 200ms (and even 400ms in some cases), fine-tuning the SSL encoder provides a\nmore accurate signal reconstruction in the single-speaker setting case, while\nfreezing it (and training the neural vocoder instead) is a better strategy when\ndealing with multi-speaker data.\n","authors":["Ihab Asaad","Maxime Jacquelin","Olivier Perrotin","Laurent Girin","Thomas Hueber"],"pdf_url":"https://arxiv.org/pdf/2405.20101v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20092v1","updated":"2024-05-30T14:31:33Z","published":"2024-05-30T14:31:33Z","title":"Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in\n Code Generation","summary":" Despite recent progress made by large language models in code generation,\nthey still struggle with programs that meet complex requirements. Recent work\nutilizes plan-and-solve decomposition to decrease the complexity and leverage\nself-tests to refine the generated program. Yet, planning deep-inside\nrequirements in advance can be challenging, and the tests need to be accurate\nto accomplish self-improvement. To this end, we propose FunCoder, a code\ngeneration framework incorporating the divide-and-conquer strategy with\nfunctional consensus. Specifically, FunCoder recursively branches off\nsub-functions as smaller goals during code generation, represented by a tree\nhierarchy. These sub-functions are then composited to attain more complex\nobjectives. Additionally, we designate functions via a consensus formed by\nidentifying similarities in program behavior, mitigating error propagation.\nFunCoder outperforms state-of-the-art methods by +9.8% on average in HumanEval,\nMBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method\ndemonstrates superiority on smaller models: With FunCoder, StableCode-3b\nsurpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4's performance on\nHumanEval. Further analysis reveals that our proposed dynamic function\ndecomposition is capable of handling complex requirements, and the functional\nconsensus prevails over self-testing in correctness evaluation.\n","authors":["Jingchang Chen","Hongxuan Tang","Zheng Chu","Qianglong Chen","Zekun Wang","Ming Liu","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2405.20092v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.01032v2","updated":"2024-05-30T14:27:21Z","published":"2022-12-02T08:56:53Z","title":"Systematic Analysis for Pretrained Language Model Priming for\n Parameter-Efficient Fine-tuning","summary":" Parameter-efficient (PE) methods (like Prompts or Adapters) for adapting\npre-trained language models (PLM) to downstream tasks have been popular\nrecently. However, hindrances still prevent these methods from reaching their\nfull potential. For example, two significant challenges are few-shot adaptation\nand cross-task generalization. To tackle these issues, we propose a general PE\npriming framework to enhance and explore the few-shot adaptation and\ngeneralization ability of PE methods. In this framework, PLMs are primed with\nPE methods for rapidly adapting to various target tasks. To evaluate the\ngeneralization ability of these PE methods, we conduct experiments on a\nfew-shot cross-domain benchmark containing 160 diverse NLP tasks. Our\nexperiment not only reveals the best priming strategy but also verifies that\npriming facilitates the adaptation to target tasks.\n","authors":["Shih-Cheng Huang","Shih-Heng Wang","Min-Han Shih","Saurav Sahay","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2212.01032v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20089v1","updated":"2024-05-30T14:25:56Z","published":"2024-05-30T14:25:56Z","title":"The Fine-Tuning Paradox: Boosting Translation Quality Without\n Sacrificing LLM Abilities","summary":" Fine-tuning large language models (LLMs) for machine translation has shown\nimprovements in overall translation quality. However, it is unclear what is the\nimpact of fine-tuning on desirable LLM behaviors that are not present in neural\nmachine translation models, such as steerability, inherent document-level\ntranslation abilities, and the ability to produce less literal translations. We\nperform an extensive translation evaluation on the LLaMA and Falcon family of\nmodels with model size ranging from 7 billion up to 65 billion parameters. Our\nresults show that while fine-tuning improves the general translation quality of\nLLMs, several abilities degrade. In particular, we observe a decline in the\nability to perform formality steering, to produce technical translations\nthrough few-shot examples, and to perform document-level translation. On the\nother hand, we observe that the model produces less literal translations after\nfine-tuning on parallel data. We show that by including monolingual data as\npart of the fine-tuning data we can maintain the abilities while simultaneously\nenhancing overall translation quality. Our findings emphasize the need for\nfine-tuning strategies that preserve the benefits of LLMs for machine\ntranslation.\n","authors":["David Stap","Eva Hasler","Bill Byrne","Christof Monz","Ke Tran"],"pdf_url":"https://arxiv.org/pdf/2405.20089v1.pdf","comment":"Accepted to ACL 2024 (long, main)"},{"id":"http://arxiv.org/abs/2405.20079v1","updated":"2024-05-30T14:09:43Z","published":"2024-05-30T14:09:43Z","title":"Student Answer Forecasting: Transformer-Driven Answer Choice Prediction\n for Language Learning","summary":" Intelligent Tutoring Systems (ITS) enhance personalized learning by\npredicting student answers to provide immediate and customized instruction.\nHowever, recent research has primarily focused on the correctness of the answer\nrather than the student's performance on specific answer choices, limiting\ninsights into students' thought processes and potential misconceptions. To\naddress this gap, we present MCQStudentBert, an answer forecasting model that\nleverages the capabilities of Large Language Models (LLMs) to integrate\ncontextual understanding of students' answering history along with the text of\nthe questions and answers. By predicting the specific answer choices students\nare likely to make, practitioners can easily extend the model to new answer\nchoices or remove answer choices for the same multiple-choice question (MCQ)\nwithout retraining the model. In particular, we compare MLP, LSTM, BERT, and\nMistral 7B architectures to generate embeddings from students' past\ninteractions, which are then incorporated into a finetuned BERT's\nanswer-forecasting mechanism. We apply our pipeline to a dataset of language\nlearning MCQ, gathered from an ITS with over 10,000 students to explore the\npredictive accuracy of MCQStudentBert, which incorporates student interaction\npatterns, in comparison to correct answer prediction and traditional\nmastery-learning feature-based approaches. This work opens the door to more\npersonalized content, modularization, and granular support.\n","authors":["Elena Grazia Gado","Tommaso Martorella","Luca Zunino","Paola Mejia-Domenzain","Vinitra Swamy","Jibril Frej","Tanja Käser"],"pdf_url":"https://arxiv.org/pdf/2405.20079v1.pdf","comment":"Accepted as a poster paper at EDM 2024: 17th International Conference\n on Educational Data Mining in Atlanta, USA"},{"id":"http://arxiv.org/abs/2402.03271v2","updated":"2024-05-30T14:03:35Z","published":"2024-02-05T18:28:44Z","title":"Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information\n Seeking in Large Language Models","summary":" In the face of uncertainty, the ability to *seek information* is of\nfundamental importance. In many practical applications, such as medical\ndiagnosis and troubleshooting, the information needed to solve the task is not\ninitially given and has to be actively sought by asking follow-up questions\n(for example, a doctor asking a patient for more details about their symptoms).\nIn this work, we introduce Uncertainty of Thoughts (UoT), an algorithm to\naugment large language models with the ability to actively seek information by\nasking effective questions. UoT combines 1) an *uncertainty-aware simulation\napproach* which enables the model to simulate possible future scenarios and how\nlikely they are to occur, 2) *uncertainty-based rewards* motivated by\ninformation gain which incentivizes the model to seek information, and 3) a\n*reward propagation scheme* to select the optimal question to ask in a way that\nmaximizes the expected reward. In experiments on medical diagnosis,\ntroubleshooting, and the `20 Questions` game, UoT achieves an average\nperformance improvement of 38.1% in the rate of successful task completion\nacross multiple LLMs compared with direct prompting and also improves\nefficiency (i.e., the number of questions needed to complete the task). Our\ncode has been released [here](https://github.com/zhiyuanhubj/UoT)\n","authors":["Zhiyuan Hu","Chumin Liu","Xidong Feng","Yilun Zhao","See-Kiong Ng","Anh Tuan Luu","Junxian He","Pang Wei Koh","Bryan Hooi"],"pdf_url":"https://arxiv.org/pdf/2402.03271v2.pdf","comment":"Update Results"},{"id":"http://arxiv.org/abs/2402.06782v3","updated":"2024-05-30T13:59:34Z","published":"2024-02-09T21:05:01Z","title":"Debating with More Persuasive LLMs Leads to More Truthful Answers","summary":" Common methods for aligning large language models (LLMs) with desired\nbehaviour heavily rely on human-labelled data. However, as models grow\nincreasingly sophisticated, they will surpass human expertise, and the role of\nhuman evaluation will evolve into non-experts overseeing experts. In\nanticipation of this, we ask: can weaker models assess the correctness of\nstronger models? We investigate this question in an analogous setting, where\nstronger models (experts) possess the necessary information to answer questions\nand weaker models (non-experts) lack this information. The method we evaluate\nis debate, where two LLM experts each argue for a different answer, and a\nnon-expert selects the answer. We find that debate consistently helps both\nnon-expert models and humans answer questions, achieving 76% and 88% accuracy\nrespectively (naive baselines obtain 48% and 60%). Furthermore, optimising\nexpert debaters for persuasiveness in an unsupervised manner improves\nnon-expert ability to identify the truth in debates. Our results provide\nencouraging empirical evidence for the viability of aligning models with debate\nin the absence of ground truth.\n","authors":["Akbir Khan","John Hughes","Dan Valentine","Laura Ruis","Kshitij Sachan","Ansh Radhakrishnan","Edward Grefenstette","Samuel R. Bowman","Tim Rocktäschel","Ethan Perez"],"pdf_url":"https://arxiv.org/pdf/2402.06782v3.pdf","comment":"For code please check: https://github.com/ucl-dark/llm_debate"},{"id":"http://arxiv.org/abs/2309.08952v2","updated":"2024-05-30T13:49:47Z","published":"2023-09-16T11:07:52Z","title":"Cross-Lingual Knowledge Editing in Large Language Models","summary":" Knowledge editing aims to change language models' performance on several\nspecial cases (i.e., editing scope) by infusing the corresponding expected\nknowledge into them. With the recent advancements in large language models\n(LLMs), knowledge editing has been shown as a promising technique to adapt LLMs\nto new knowledge without retraining from scratch. However, most of the previous\nstudies neglect the multi-lingual nature of some main-stream LLMs (e.g., LLaMA,\nChatGPT and GPT-4), and typically focus on monolingual scenarios, where LLMs\nare edited and evaluated in the same language. As a result, it is still unknown\nthe effect of source language editing on a different target language. In this\npaper, we aim to figure out this cross-lingual effect in knowledge editing.\nSpecifically, we first collect a large-scale cross-lingual synthetic dataset by\ntranslating ZsRE from English to Chinese. Then, we conduct English editing on\nvarious knowledge editing methods covering different paradigms, and evaluate\ntheir performance in Chinese, and vice versa. To give deeper analyses of the\ncross-lingual effect, the evaluation includes four aspects, i.e., reliability,\ngenerality, locality and portability. Furthermore, we analyze the inconsistent\nbehaviors of the edited models and discuss their specific challenges. Data and\ncodes are available at https://github.com/krystalan/Bi_ZsRE\n","authors":["Jiaan Wang","Yunlong Liang","Zengkui Sun","Yuxuan Cao","Jiarong Xu","Fandong Meng"],"pdf_url":"https://arxiv.org/pdf/2309.08952v2.pdf","comment":"Accepted to ACL 2024 main conference"},{"id":"http://arxiv.org/abs/2306.16092v2","updated":"2024-05-30T13:46:00Z","published":"2023-06-28T10:48:34Z","title":"Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge\n Graph Enhanced Mixture-of-Experts Large Language Model","summary":" AI legal assistants based on Large Language Models (LLMs) can provide\naccessible legal consulting services, but the hallucination problem poses\npotential legal risks. This paper presents Chatlaw, an innovative legal\nassistant utilizing a Mixture-of-Experts (MoE) model and a multi-agent system\nto enhance the reliability and accuracy of AI-driven legal services. By\nintegrating knowledge graphs with artificial screening, we construct a\nhigh-quality legal dataset to train the MoE model. This model utilizes\ndifferent experts to address various legal issues, optimizing the accuracy of\nlegal responses. Additionally, Standardized Operating Procedures (SOP), modeled\nafter real law firm workflows, significantly reduce errors and hallucinations\nin legal services. Our MoE model outperforms GPT-4 in the Lawbench and Unified\nQualification Exam for Legal Professionals by 7.73% in accuracy and 11 points,\nrespectively, and also surpasses other models in multiple dimensions during\nreal-case consultations, demonstrating our robust capability for legal\nconsultation.\n","authors":["Jiaxi Cui","Munan Ning","Zongjian Li","Bohua Chen","Yang Yan","Hao Li","Bin Ling","Yonghong Tian","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2306.16092v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20053v1","updated":"2024-05-30T13:38:52Z","published":"2024-05-30T13:38:52Z","title":"Would I Lie To You? Inference Time Alignment of Language Models using\n Direct Preference Heads","summary":" Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context\nlearning capabilities; however, their behaviors are often difficult to control.\nBy utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible\nto fine-tune unsupervised LMs to follow instructions and produce outputs that\nreflect human preferences. Despite its benefits, RLHF has been shown to\npotentially harm a language model's reasoning capabilities and introduce\nartifacts such as hallucinations where the model may fabricate facts. To\naddress this issue we introduce Direct Preference Heads (DPH), a fine-tuning\nframework that enables LMs to learn human preference signals through an\nauxiliary reward head without directly affecting the output distribution of the\nlanguage modeling head. We perform a theoretical analysis of our objective\nfunction and find strong ties to Conservative Direct Preference Optimization\n(cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All\nevaluation suite and demonstrate that our method produces models which achieve\nhigher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct\nPreference Optimization (DPO) alone.\n","authors":["Avelina Asada Hadji-Kyriacou","Ognjen Arandjelovic"],"pdf_url":"https://arxiv.org/pdf/2405.20053v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.14016v3","updated":"2024-05-30T13:26:38Z","published":"2024-01-25T08:48:21Z","title":"Towards Uncertainty-Aware Language Agent","summary":" While Language Agents have achieved promising success by placing Large\nLanguage Models at the core of a more versatile design that dynamically\ninteracts with the external world, the existing approaches neglect the notion\nof uncertainty during these interactions. We present the Uncertainty-Aware\nLanguage Agent (UALA), a framework that orchestrates the interaction between\nthe agent and the external world using uncertainty quantification. Compared\nwith other well-known counterparts like ReAct, our extensive experiments across\n3 representative tasks (HotpotQA, StrategyQA, MMLU) and various LLM sizes\ndemonstrate that UALA brings a significant improvement of performance, while\nhaving a substantially lower reliance on the external world (i.e., reduced\nnumber of tool calls and tokens). Our analyses provide various insights\nincluding the great potential of UALA compared with agent fine-tuning, and\nunderscore the unreliability of verbalised confidence of LLMs as a proxy for\nuncertainty.\n","authors":["Jiuzhou Han","Wray Buntine","Ehsan Shareghi"],"pdf_url":"https://arxiv.org/pdf/2401.14016v3.pdf","comment":"Our code and data are at https://uala-agent.github.io. (accepted to\n ACL 2024 Findings). arXiv admin note: text overlap with arXiv:2310.05915"},{"id":"http://arxiv.org/abs/2305.12392v3","updated":"2024-05-30T13:23:24Z","published":"2023-05-21T08:11:24Z","title":"PiVe: Prompting with Iterative Verification Improving Graph-based\n Generative Capability of LLMs","summary":" Large language models (LLMs) have shown great abilities of solving various\nnatural language tasks in different domains. Due to the training objective of\nLLMs and their pre-training data, LLMs are not very well equipped for tasks\ninvolving structured data generation. We propose a framework, Prompting with\nIterative Verification (PiVe), to improve graph-based generative capability of\nLLMs. We show how a small language model could be trained to act as a verifier\nmodule for the output of an LLM~(i.e., ChatGPT, GPT-4), and to iteratively\nimprove its performance via fine-grained corrective instructions. We also show\nhow the verifier module could apply iterative corrections offline for a more\ncost-effective solution to the text-to-graph generation task. Experiments on\nthree graph-based datasets show consistent improvement gained via PiVe.\nAdditionally, we create GenWiki-HIQ and highlight that the verifier module can\nbe used as a data augmentation tool to help improve the quality of\nautomatically generated parallel text-graph datasets.\n","authors":["Jiuzhou Han","Nigel Collier","Wray Buntine","Ehsan Shareghi"],"pdf_url":"https://arxiv.org/pdf/2305.12392v3.pdf","comment":"Our code and data are at https://github.com/Jiuzhouh/PiVe (accepted\n to ACL 2024 Findings)"},{"id":"http://arxiv.org/abs/2402.07865v2","updated":"2024-05-30T13:08:48Z","published":"2024-02-12T18:21:14Z","title":"Prismatic VLMs: Investigating the Design Space of Visually-Conditioned\n Language Models","summary":" Visually-conditioned language models (VLMs) have seen growing adoption in\napplications such as visual dialogue, scene understanding, and robotic task\nplanning; adoption that has fueled a wealth of new models such as LLaVa,\nInstructBLIP, and PaLI-3. Despite the volume of new releases, key design\ndecisions around image preprocessing, architecture, and optimization are\nunder-explored, making it challenging to understand what factors account for\nmodel performance $-$ a challenge further complicated by the lack of objective,\nconsistent evaluations. To address these gaps, we first compile a suite of\nstandardized evaluations spanning visual question answering, object\nlocalization, and challenge sets that probe properties such as hallucination;\nevaluations that provide fine-grained insight VLM capabilities. Second, we\nrigorously investigate VLMs along key design axes, including pretrained visual\nrepresentations and training from base vs. instruct-tuned language models,\namongst others. We couple our analysis with three resource contributions: (1) a\nunified framework for evaluating VLMs, (2) optimized, flexible training code,\nand (3) checkpoints for all models, including a family of VLMs at the 7-13B\nscale that strictly outperform InstructBLIP and LLaVa v1.5, the\nstate-of-the-art in open VLMs.\n","authors":["Siddharth Karamcheti","Suraj Nair","Ashwin Balakrishna","Percy Liang","Thomas Kollar","Dorsa Sadigh"],"pdf_url":"https://arxiv.org/pdf/2402.07865v2.pdf","comment":"Published at ICML 2024. 22 pages, 11 figures. Training code and\n models: https://github.com/TRI-ML/prismatic-vlms. Evaluation code:\n https://github.com/TRI-ML/vlm-evaluation"},{"id":"http://arxiv.org/abs/2402.18496v3","updated":"2024-05-30T12:43:01Z","published":"2024-02-28T17:25:59Z","title":"Language Models Represent Beliefs of Self and Others","summary":" Understanding and attributing mental states, known as Theory of Mind (ToM),\nemerges as a fundamental capability for human social reasoning. While Large\nLanguage Models (LLMs) appear to possess certain ToM abilities, the mechanisms\nunderlying these capabilities remain elusive. In this study, we discover that\nit is possible to linearly decode the belief status from the perspectives of\nvarious agents through neural activations of language models, indicating the\nexistence of internal representations of self and others' beliefs. By\nmanipulating these representations, we observe dramatic changes in the models'\nToM performance, underscoring their pivotal role in the social reasoning\nprocess. Additionally, our findings extend to diverse social reasoning tasks\nthat involve different causal inference patterns, suggesting the potential\ngeneralizability of these representations.\n","authors":["Wentao Zhu","Zhining Zhang","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2402.18496v3.pdf","comment":"project page: https://walter0807.github.io/RepBelief/"},{"id":"http://arxiv.org/abs/2405.20003v1","updated":"2024-05-30T12:42:05Z","published":"2024-05-30T12:42:05Z","title":"Kernel Language Entropy: Fine-grained Uncertainty Quantification for\n LLMs from Semantic Similarities","summary":" Uncertainty quantification in Large Language Models (LLMs) is crucial for\napplications where safety and reliability are important. In particular,\nuncertainty can be used to improve the trustworthiness of LLMs by detecting\nfactually incorrect model responses, commonly called hallucinations.\nCritically, one should seek to capture the model's semantic uncertainty, i.e.,\nthe uncertainty over the meanings of LLM outputs, rather than uncertainty over\nlexical or syntactic variations that do not affect answer correctness. To\naddress this problem, we propose Kernel Language Entropy (KLE), a novel method\nfor uncertainty estimation in white- and black-box LLMs. KLE defines positive\nsemidefinite unit trace kernels to encode the semantic similarities of LLM\noutputs and quantifies uncertainty using the von Neumann entropy. It considers\npairwise semantic dependencies between answers (or semantic clusters),\nproviding more fine-grained uncertainty estimates than previous methods based\non hard clustering of answers. We theoretically prove that KLE generalizes the\nprevious state-of-the-art method called semantic entropy and empirically\ndemonstrate that it improves uncertainty quantification performance across\nmultiple natural language generation datasets and LLM architectures.\n","authors":["Alexander Nikitin","Jannik Kossen","Yarin Gal","Pekka Marttinen"],"pdf_url":"https://arxiv.org/pdf/2405.20003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16510v3","updated":"2024-05-30T12:40:06Z","published":"2024-05-26T10:33:17Z","title":"Meta-Task Planning for Language Agents","summary":" The rapid advancement of neural language models has sparked a new surge of\nintelligent agent research. Unlike traditional agents, large language\nmodel-based agents (LLM agents) have emerged as a promising paradigm for\nachieving artificial general intelligence (AGI) due to their superior reasoning\nand generalization capabilities. Effective planning is crucial for the success\nof LLM agents in real-world tasks, making it a highly pursued topic in the\ncommunity. Current planning methods typically translate tasks into executable\naction sequences. However, determining a feasible or optimal sequence for\ncomplex tasks at fine granularity, which often requires compositing long chains\nof heterogeneous actions, remains challenging. This paper introduces Meta-Task\nPlanning (MTP), a zero-shot methodology for collaborative LLM-based multi-agent\nsystems that simplifies complex task planning by decomposing it into a\nhierarchy of subordinate tasks, or meta-tasks. Each meta-task is then mapped\ninto executable actions. MTP was assessed on two rigorous benchmarks,\nTravelPlanner and API-Bank. Notably, MTP achieved an average $\\sim40\\%$ success\nrate on TravelPlanner, significantly higher than the state-of-the-art (SOTA)\nbaseline ($2.92\\%$), and outperforming $LLM_{api}$-4 with ReAct on API-Bank by\n$\\sim14\\%$, showing the immense potential of integrating LLM with multi-agent\nsystems.\n","authors":["Cong Zhang","Derrick Goh Xin Deik","Dexun Li","Hao Zhang","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2405.16510v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08975v2","updated":"2024-05-30T12:39:51Z","published":"2023-10-13T09:45:14Z","title":"ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question\n Answering with Fine-tuned Large Language Models","summary":" Knowledge Base Question Answering (KBQA) aims to answer natural language\nquestions over large-scale knowledge bases (KBs), which can be summarized into\ntwo crucial steps: knowledge retrieval and semantic parsing. However, three\ncore challenges remain: inefficient knowledge retrieval, mistakes of retrieval\nadversely impacting semantic parsing, and the complexity of previous KBQA\nmethods. To tackle these challenges, we introduce ChatKBQA, a novel and simple\ngenerate-then-retrieve KBQA framework, which proposes first generating the\nlogical form with fine-tuned LLMs, then retrieving and replacing entities and\nrelations with an unsupervised retrieval method, to improve both generation and\nretrieval more directly. Experimental results show that ChatKBQA achieves new\nstate-of-the-art performance on standard KBQA datasets, WebQSP, and CWQ. This\nwork can also be regarded as a new paradigm for combining LLMs with knowledge\ngraphs (KGs) for interpretable and knowledge-required question answering. Our\ncode is publicly available.\n","authors":["Haoran Luo","Haihong E","Zichen Tang","Shiyao Peng","Yikai Guo","Wentai Zhang","Chenghao Ma","Guanting Dong","Meina Song","Wei Lin","Yifan Zhu","Luu Anh Tuan"],"pdf_url":"https://arxiv.org/pdf/2310.08975v2.pdf","comment":"Accepted by Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2403.04280v2","updated":"2024-05-30T12:17:51Z","published":"2024-03-07T07:24:32Z","title":"A New Benchmark for Evaluating Automatic Speech Recognition in the\n Arabic Call Domain","summary":" This work is an attempt to introduce a comprehensive benchmark for Arabic\nspeech recognition, specifically tailored to address the challenges of\ntelephone conversations in Arabic language. Arabic, characterized by its rich\ndialectal diversity and phonetic complexity, presents a number of unique\nchallenges for automatic speech recognition (ASR) systems. These challenges are\nfurther amplified in the domain of telephone calls, where audio quality,\nbackground noise, and conversational speech styles negatively affect\nrecognition accuracy. Our work aims to establish a robust benchmark that not\nonly encompasses the broad spectrum of Arabic dialects but also emulates the\nreal-world conditions of call-based communications. By incorporating diverse\ndialectical expressions and accounting for the variable quality of call\nrecordings, this benchmark seeks to provide a rigorous testing ground for the\ndevelopment and evaluation of ASR systems capable of navigating the\ncomplexities of Arabic speech in telephonic contexts. This work also attempts\nto establish a baseline performance evaluation using state-of-the-art ASR\ntechnologies.\n","authors":["Qusai Abo Obaidah","Muhy Eddin Za'ter","Adnan Jaljuli","Ali Mahboub","Asma Hakouz","Bashar Al-Rfooh","Yazan Estaitia"],"pdf_url":"https://arxiv.org/pdf/2403.04280v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.18350v2","updated":"2024-05-30T12:16:39Z","published":"2024-03-27T08:42:31Z","title":"Evaluation of Semantic Search and its Role in\n Retrieved-Augmented-Generation (RAG) for Arabic Language","summary":" The latest advancements in machine learning and deep learning have brought\nforth the concept of semantic similarity, which has proven immensely beneficial\nin multiple applications and has largely replaced keyword search. However,\nevaluating semantic similarity and conducting searches for a specific query\nacross various documents continue to be a complicated task. This complexity is\ndue to the multifaceted nature of the task, the lack of standard benchmarks,\nwhereas these challenges are further amplified for Arabic language. This paper\nendeavors to establish a straightforward yet potent benchmark for semantic\nsearch in Arabic. Moreover, to precisely evaluate the effectiveness of these\nmetrics and the dataset, we conduct our assessment of semantic search within\nthe framework of retrieval augmented generation (RAG).\n","authors":["Ali Mahboub","Muhy Eddin Za'ter","Bashar Al-Rfooh","Yazan Estaitia","Adnan Jaljuli","Asma Hakouz"],"pdf_url":"https://arxiv.org/pdf/2403.18350v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.12052v3","updated":"2024-05-30T12:03:51Z","published":"2024-02-19T11:11:08Z","title":"Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When\n and What to Retrieve for LLMs","summary":" The integration of large language models (LLMs) and search engines represents\na significant evolution in knowledge acquisition methodologies. However,\ndetermining the knowledge that an LLM already possesses and the knowledge that\nrequires the help of a search engine remains an unresolved issue. Most existing\nmethods solve this problem through the results of preliminary answers or\nreasoning done by the LLM itself, but this incurs excessively high\ncomputational costs. This paper introduces a novel collaborative approach,\nnamely SlimPLM, that detects missing knowledge in LLMs with a slim proxy model,\nto enhance the LLM's knowledge acquisition process. We employ a proxy model\nwhich has far fewer parameters, and take its answers as heuristic answers.\nHeuristic answers are then utilized to predict the knowledge required to answer\nthe user question, as well as the known and unknown knowledge within the LLM.\nWe only conduct retrieval for the missing knowledge in questions that the LLM\ndoes not know. Extensive experimental results on five datasets with two LLMs\ndemonstrate a notable improvement in the end-to-end performance of LLMs in\nquestion-answering tasks, achieving or surpassing current state-of-the-art\nmodels with lower LLM inference costs.\n","authors":["Jiejun Tan","Zhicheng Dou","Yutao Zhu","Peidong Guo","Kun Fang","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2402.12052v3.pdf","comment":"Accepted by ACL 2024 main conference. Repo:\n https://github.com/plageon/SlimPLM"},{"id":"http://arxiv.org/abs/2405.19967v1","updated":"2024-05-30T11:46:42Z","published":"2024-05-30T11:46:42Z","title":"Improved Out-of-Scope Intent Classification with Dual Encoding and\n Threshold-based Re-Classification","summary":" Detecting out-of-scope user utterances is essential for task-oriented\ndialogues and intent classification. Current methodologies face difficulties\nwith the unpredictable distribution of outliers and often rely on assumptions\nabout data distributions. We present the Dual Encoder for Threshold-Based\nRe-Classification (DETER) to address these challenges. This end-to-end\nframework efficiently detects out-of-scope intents without requiring\nassumptions on data distributions or additional post-processing steps. The core\nof DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and\nthe Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance\nembeddings, which are classified through a branched neural architecture.\nFurther, DETER generates synthetic outliers using self-supervision and\nincorporates out-of-scope phrases from open-domain datasets. This approach\nensures a comprehensive training set for out-of-scope detection. Additionally,\na threshold-based re-classification mechanism refines the model's initial\npredictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77\ndatasets demonstrate DETER's efficacy. Our model outperforms previous\nbenchmarks, increasing up to 13% and 5% in F1 score for known and unknown\nintents on CLINC-150 and Stackoverflow, and 16% for known and 24% % for unknown\nintents on Banking77. The source code has been released at\nhttps://github.com/Hossam-Mohammed-tech/Intent\\_Classification\\_OOS.\n","authors":["Hossam M. Zawbaa","Wael Rashwan","Sourav Dutta","Haytham Assem"],"pdf_url":"https://arxiv.org/pdf/2405.19967v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17732v2","updated":"2024-05-30T11:32:05Z","published":"2024-05-28T01:23:58Z","title":"C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark\n for Large Language Models","summary":" Classical Chinese Understanding (CCU) holds significant value in preserving\nand exploration of the outstanding traditional Chinese culture. Recently,\nresearchers have attempted to leverage the potential of Large Language Models\n(LLMs) for CCU by capitalizing on their remarkable comprehension and semantic\ncapabilities. However, no comprehensive benchmark is available to assess the\nCCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$bench,\na Comprehensive Classical Chinese understanding benchmark, which comprises\n50,000 text pairs for five primary CCU tasks, including classification,\nretrieval, named entity recognition, punctuation, and translation. Furthermore,\nthe data in C$^{3}$bench originates from ten different domains, covering most\nof the categories in classical Chinese. Leveraging the proposed C$^{3}$bench,\nwe extensively evaluate the quantitative performance of 15 representative LLMs\non all five CCU tasks. Our results not only establish a public leaderboard of\nLLMs' CCU capabilities but also gain some findings. Specifically, existing LLMs\nare struggle with CCU tasks and still inferior to supervised models.\nAdditionally, the results indicate that CCU is a task that requires special\nattention. We believe this study could provide a standard benchmark,\ncomprehensive baselines, and valuable insights for the future advancement of\nLLM-based CCU research. The evaluation pipeline and dataset are available at\n\\url{https://github.com/SCUT-DLVCLab/C3bench}.\n","authors":["Jiahuan Cao","Yongxin Shi","Dezhi Peng","Yang Liu","Lianwen Jin"],"pdf_url":"https://arxiv.org/pdf/2405.17732v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.12174v2","updated":"2024-05-30T11:26:58Z","published":"2024-02-19T14:28:31Z","title":"BIDER: Bridging Knowledge Inconsistency for Efficient\n Retrieval-Augmented LLMs via Key Supporting Evidence","summary":" Retrieval-augmented large language models (LLMs) have demonstrated efficacy\nin knowledge-intensive tasks such as open-domain QA, addressing inherent\nchallenges in knowledge update and factual inadequacy. However, inconsistencies\nbetween retrieval knowledge and the necessary knowledge for LLMs, leading to a\ndecline in LLM's answer quality. This paper introduces BIDER, an approach that\nrefines retrieval documents into Key Supporting Evidence (KSE) through\nknowledge synthesis, supervised fine-tuning (SFT), and preference alignment. We\ntrain BIDER by learning from crafting KSE, while maximizing its output to align\nwith LLM's information acquisition preferences through reinforcement learning.\nEvaluations across five datasets show BIDER boosts LLMs' answer quality by 7%\nwhile reducing input content length in retrieval documents by 80%,\noutperforming existing methods. The proposed KSE simulation effectively equips\nLLMs with essential information for accurate question answering.\n","authors":["Jiajie Jin","Yutao Zhu","Yujia Zhou","Zhicheng Dou"],"pdf_url":"https://arxiv.org/pdf/2402.12174v2.pdf","comment":"Accepted by ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2405.19958v1","updated":"2024-05-30T11:25:42Z","published":"2024-05-30T11:25:42Z","title":"Multi-Aspect Controllable Text Generation with Disentangled\n Counterfactual Augmentation","summary":" Multi-aspect controllable text generation aims to control the generated texts\nin attributes from multiple aspects (e.g., \"positive\" from sentiment and\n\"sport\" from topic). For ease of obtaining training samples, existing works\nneglect attribute correlations formed by the intertwining of different\nattributes. Particularly, the stereotype formed by imbalanced attribute\ncorrelations significantly affects multi-aspect control. In this paper, we\npropose MAGIC, a new multi-aspect controllable text generation method with\ndisentangled counterfactual augmentation. We alleviate the issue of imbalanced\nattribute correlations during training using counterfactual feature vectors in\nthe attribute latent space by disentanglement. During inference, we enhance\nattribute correlations by target-guided counterfactual augmentation to further\nimprove multi-aspect control. Experiments show that MAGIC outperforms\nstate-of-the-art baselines in both imbalanced and balanced attribute\ncorrelation scenarios. Our source code and data are available at\nhttps://github.com/nju-websoft/MAGIC.\n","authors":["Yi Liu","Xiangyu Liu","Xiangrong Zhu","Wei Hu"],"pdf_url":"https://arxiv.org/pdf/2405.19958v1.pdf","comment":"Accepted in the 62nd Annual Meeting of the Association for\n Computational Linguistics (ACL 2024)"},{"id":"http://arxiv.org/abs/2405.19954v1","updated":"2024-05-30T11:18:52Z","published":"2024-05-30T11:18:52Z","title":"GenKubeSec: LLM-Based Kubernetes Misconfiguration Detection,\n Localization, Reasoning, and Remediation","summary":" A key challenge associated with Kubernetes configuration files (KCFs) is that\nthey are often highly complex and error-prone, leading to security\nvulnerabilities and operational setbacks. Rule-based (RB) tools for KCF\nmisconfiguration detection rely on static rule sets, making them inherently\nlimited and unable to detect newly-discovered misconfigurations. RB tools also\nsuffer from misdetection, since mistakes are likely when coding the detection\nrules. Recent methods for detecting and remediating KCF misconfigurations are\nlimited in terms of their scalability and detection coverage, or due to the\nfact that they have high expertise requirements and do not offer automated\nremediation along with misconfiguration detection. Novel approaches that employ\nLLMs in their pipeline rely on API-based, general-purpose, and mainly\ncommercial models. Thus, they pose security challenges, have inconsistent\nclassification performance, and can be costly. In this paper, we propose\nGenKubeSec, a comprehensive and adaptive, LLM-based method, which, in addition\nto detecting a wide variety of KCF misconfigurations, also identifies the exact\nlocation of the misconfigurations and provides detailed reasoning about them,\nalong with suggested remediation. When empirically compared with three\nindustry-standard RB tools, GenKubeSec achieved equivalent precision (0.990)\nand superior recall (0.999). When a random sample of KCFs was examined by a\nKubernetes security expert, GenKubeSec's explanations as to misconfiguration\nlocalization, reasoning and remediation were 100% correct, informative and\nuseful. To facilitate further advancements in this domain, we share the unique\ndataset we collected, a unified misconfiguration index we developed for label\nstandardization, our experimentation code, and GenKubeSec itself as an\nopen-source tool.\n","authors":["Ehud Malul","Yair Meidan","Dudu Mimran","Yuval Elovici","Asaf Shabtai"],"pdf_url":"https://arxiv.org/pdf/2405.19954v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17935v2","updated":"2024-05-30T11:01:10Z","published":"2024-05-28T08:01:26Z","title":"Tool Learning with Large Language Models: A Survey","summary":" Recently, tool learning with large language models (LLMs) has emerged as a\npromising paradigm for augmenting the capabilities of LLMs to tackle highly\ncomplex problems. Despite growing attention and rapid advancements in this\nfield, the existing literature remains fragmented and lacks systematic\norganization, posing barriers to entry for newcomers. This gap motivates us to\nconduct a comprehensive survey of existing works on tool learning with LLMs. In\nthis survey, we focus on reviewing existing literature from the two primary\naspects (1) why tool learning is beneficial and (2) how tool learning is\nimplemented, enabling a comprehensive understanding of tool learning with LLMs.\nWe first explore the \"why\" by reviewing both the benefits of tool integration\nand the inherent benefits of the tool learning paradigm from six specific\naspects. In terms of \"how\", we systematically review the literature according\nto a taxonomy of four key stages in the tool learning workflow: task planning,\ntool selection, tool calling, and response generation. Additionally, we provide\na detailed summary of existing benchmarks and evaluation methods, categorizing\nthem according to their relevance to different stages. Finally, we discuss\ncurrent challenges and outline potential future directions, aiming to inspire\nboth researchers and industrial developers to further explore this emerging and\npromising area. We also maintain a GitHub repository to continually keep track\nof the relevant papers and resources in this rising area at\n\\url{https://github.com/quchangle1/LLM-Tool-Survey}.\n","authors":["Changle Qu","Sunhao Dai","Xiaochi Wei","Hengyi Cai","Shuaiqiang Wang","Dawei Yin","Jun Xu","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2405.17935v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13067v2","updated":"2024-05-30T10:00:14Z","published":"2023-05-22T14:37:05Z","title":"Distilling Robustness into Natural Language Inference Models with\n Domain-Targeted Augmentation","summary":" Knowledge distillation optimises a smaller student model to behave similarly\nto a larger teacher model, retaining some of the performance benefits. While\nthis method can improve results on in-distribution examples, it does not\nnecessarily generalise to out-of-distribution (OOD) settings. We investigate\ntwo complementary methods for improving the robustness of the resulting student\nmodels on OOD domains. The first approach augments the distillation with\ngenerated unlabelled examples that match the target distribution. The second\nmethod upsamples data points among the training set that are similar to the\ntarget distribution. When applied on the task of natural language inference\n(NLI), our experiments on MNLI show that distillation with these modifications\noutperforms previous robustness solutions. We also find that these methods\nimprove performance on OOD domains even beyond the target domain.\n","authors":["Joe Stacey","Marek Rei"],"pdf_url":"https://arxiv.org/pdf/2305.13067v2.pdf","comment":"Accepted at ACL Findings 2024"},{"id":"http://arxiv.org/abs/2402.02446v3","updated":"2024-05-30T09:49:47Z","published":"2024-02-04T10:59:52Z","title":"LQER: Low-Rank Quantization Error Reconstruction for LLMs","summary":" Post-training quantization of Large Language Models (LLMs) is challenging. In\nthis work, we introduce Low-rank Quantization Error Reduction (LQER), which\ncombines quantization and low-rank approximation to recover the model\ncapability. LQER leverages an activation-induced scale matrix to drive the\nsingular value distribution of quantization error towards a desirable\ndistribution, which enables nearly-lossless W4A8 quantization on various LLMs\nand downstream tasks without the need for knowledge distillation, grid search,\nor gradient-base iterative optimization. Unlike existing methods, the\ncomputation pattern of LQER eliminates the need for specialized Scatter and\nGather processes to collect high-precision weights from irregular memory\nlocations. Our W4A8 LLMs achieve near-lossless performance on six popular\ndownstream tasks, while using 1.36$\\times$ fewer hardware resources than the\nleading state-of-the-art method. We open-source our framework at\nhttps://github.com/ChengZhang-98/lqer\n","authors":["Cheng Zhang","Jianyi Cheng","George A. Constantinides","Yiren Zhao"],"pdf_url":"https://arxiv.org/pdf/2402.02446v3.pdf","comment":"Accepted at ICML2024"},{"id":"http://arxiv.org/abs/2405.19877v1","updated":"2024-05-30T09:32:14Z","published":"2024-05-30T09:32:14Z","title":"KNOW: A Real-World Ontology for Knowledge Capture with Large Language\n Models","summary":" We present KNOW--the Knowledge Navigator Ontology for the World--the first\nontology designed to capture everyday knowledge to augment large language\nmodels (LLMs) in real-world generative AI use cases such as personal AI\nassistants. Our domain is human life, both its everyday concerns and its major\nmilestones. We have limited the initial scope of the modeled concepts to only\nestablished human universals: spacetime (places, events) plus social (people,\ngroups, organizations). The inclusion criteria for modeled concepts are\npragmatic, beginning with universality and utility. We compare and contrast\nprevious work such as Schema.org and Cyc--as well as attempts at a synthesis of\nknowledge graphs and language models--noting how LLMs already encode internally\nmuch of the commonsense tacit knowledge that took decades to capture in the Cyc\nproject. We also make available code-generated software libraries for the 12\nmost popular programming languages, enabling the direct use of ontology\nconcepts in software engineering. We emphasize simplicity and developer\nexperience in promoting AI interoperability.\n","authors":["Arto Bendiken"],"pdf_url":"https://arxiv.org/pdf/2405.19877v1.pdf","comment":"5 pages, 1 figure"},{"id":"http://arxiv.org/abs/2405.19874v1","updated":"2024-05-30T09:28:56Z","published":"2024-05-30T09:28:56Z","title":"Is In-Context Learning Sufficient for Instruction Following in LLMs?","summary":" In-context learning (ICL) allows LLMs to learn from examples without changing\ntheir weights, which is a particularly promising capability for long-context\nLLMs that can potentially learn from many examples. Recently, Lin et al. (2024)\nproposed URIAL, a method using only three in-context examples to align base\nLLMs, achieving non-trivial instruction following performance. In this work, we\nshow that, while effective, ICL alignment with URIAL still underperforms\ncompared to instruction fine-tuning on established benchmarks such as MT-Bench\nand AlpacaEval 2.0 (LC), especially with more capable base LMs. Unlike for\ntasks such as classification, translation, or summarization, adding more ICL\ndemonstrations for long-context LLMs does not systematically improve\ninstruction following performance. To address this limitation, we derive a\ngreedy selection approach for ICL examples that noticeably improves\nperformance, yet without bridging the gap to instruction fine-tuning. Finally,\nwe provide a series of ablation studies to better understand the reasons behind\nthe remaining gap, and we show how some aspects of ICL depart from the existing\nknowledge and are specific to the instruction tuning setting. Overall, our work\nadvances the understanding of ICL as an alignment technique. We provide our\ncode at https://github.com/tml-epfl/icl-alignment.\n","authors":["Hao Zhao","Maksym Andriushchenko","Francesco Croce","Nicolas Flammarion"],"pdf_url":"https://arxiv.org/pdf/2405.19874v1.pdf","comment":"Preprint. Code at https://github.com/tml-epfl/icl-alignment"},{"id":"http://arxiv.org/abs/2402.12786v2","updated":"2024-05-30T09:06:34Z","published":"2024-02-20T07:51:43Z","title":"Advancing Large Language Models to Capture Varied Speaking Styles and\n Respond Properly in Spoken Conversations","summary":" In spoken dialogue, even if two current turns are the same sentence, their\nresponses might still differ when they are spoken in different styles. The\nspoken styles, containing paralinguistic and prosodic information, mark the\nmost significant difference between text and speech modality. When using\ntext-only LLMs to model spoken dialogue, text-only LLMs cannot give different\nresponses based on the speaking style of the current turn. In this paper, we\nfocus on enabling LLMs to listen to the speaking styles and respond properly.\nOur goal is to teach the LLM that \"even if the sentences are identical if they\nare spoken in different styles, their corresponding responses might be\ndifferent\". Since there is no suitable dataset for achieving this goal, we\ncollect a speech-to-speech dataset, StyleTalk, with the following desired\ncharacteristics: when two current speeches have the same content but are spoken\nin different styles, their responses will be different. To teach LLMs to\nunderstand and respond properly to the speaking styles, we propose the\nSpoken-LLM framework that can model the linguistic content and the speaking\nstyles. We train Spoken-LLM using the StyleTalk dataset and devise a two-stage\ntraining pipeline to help the Spoken-LLM better learn the speaking styles.\nBased on extensive experiments, we show that Spoken-LLM outperforms text-only\nbaselines and prior speech LLMs methods.\n","authors":["Guan-Ting Lin","Cheng-Han Chiang","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2402.12786v2.pdf","comment":"Accepted by ACL 2024"},{"id":"http://arxiv.org/abs/2405.19856v1","updated":"2024-05-30T09:03:42Z","published":"2024-05-30T09:03:42Z","title":"DevEval: A Manually-Annotated Code Generation Benchmark Aligned with\n Real-World Code Repositories","summary":" How to evaluate the coding abilities of Large Language Models (LLMs) remains\nan open question. We find that existing benchmarks are poorly aligned with\nreal-world code repositories and are insufficient to evaluate the coding\nabilities of LLMs.\n To address the knowledge gap, we propose a new benchmark named DevEval, which\nhas three advances. (1) DevEval aligns with real-world repositories in multiple\ndimensions, e.g., code distributions and dependency distributions. (2) DevEval\nis annotated by 13 developers and contains comprehensive annotations (e.g.,\nrequirements, original repositories, reference code, and reference\ndependencies). (3) DevEval comprises 1,874 testing samples from 117\nrepositories, covering 10 popular domains (e.g., Internet, Database). Based on\nDevEval, we propose repository-level code generation and evaluate 8 popular\nLLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa).\nOur experiments reveal these LLMs' coding abilities in real-world code\nrepositories. For example, in our experiments, the highest Pass@1 of\ngpt-4-turbo is only 53.04%. We also analyze LLMs' failed cases and summarize\ntheir shortcomings. We hope DevEval can facilitate the development of LLMs in\nreal code repositories. DevEval, prompts, and LLMs' predictions have been\nreleased.\n","authors":["Jia Li","Ge Li","Yunfei Zhao","Yongmin Li","Huanyu Liu","Hao Zhu","Lecheng Wang","Kaibo Liu","Zheng Fang","Lanshen Wang","Jiazheng Ding","Xuanming Zhang","Yuqi Zhu","Yihong Dong","Zhi Jin","Binhua Li","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2405.19856v1.pdf","comment":"Accepted by the 62nd Annual Meeting of the Association for\n Computational Linguistics (ACL 2024). arXiv admin note: substantial text\n overlap with arXiv:2404.00599, arXiv:2401.06401"},{"id":"http://arxiv.org/abs/2404.07972v2","updated":"2024-05-30T08:55:12Z","published":"2024-04-11T17:56:05Z","title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real\n Computer Environments","summary":" Autonomous agents that accomplish complex computer tasks with minimal human\ninterventions have the potential to transform human-computer interaction,\nsignificantly enhancing accessibility and productivity. However, existing\nbenchmarks either lack an interactive environment or are limited to\nenvironments specific to certain applications or domains, failing to reflect\nthe diverse and complex nature of real-world computer use, thereby limiting the\nscope of tasks and agent scalability. To address this issue, we introduce\nOSWorld, the first-of-its-kind scalable, real computer environment for\nmultimodal agents, supporting task setup, execution-based evaluation, and\ninteractive learning across various operating systems such as Ubuntu, Windows,\nand macOS. OSWorld can serve as a unified, integrated computer environment for\nassessing open-ended computer tasks that involve arbitrary applications.\nBuilding upon OSWorld, we create a benchmark of 369 computer tasks involving\nreal web and desktop apps in open domains, OS file I/O, and workflows spanning\nmultiple applications. Each task example is derived from real-world computer\nuse cases and includes a detailed initial state setup configuration and a\ncustom execution-based evaluation script for reliable, reproducible evaluation.\nExtensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld\nreveals significant deficiencies in their ability to serve as computer\nassistants. While humans can accomplish over 72.36% of the tasks, the best\nmodel achieves only 12.24% success, primarily struggling with GUI grounding and\noperational knowledge. Comprehensive analysis using OSWorld provides valuable\ninsights for developing multimodal generalist agents that were not possible\nwith previous benchmarks. Our code, environment, baseline models, and data are\npublicly available at https://os-world.github.io.\n","authors":["Tianbao Xie","Danyang Zhang","Jixuan Chen","Xiaochuan Li","Siheng Zhao","Ruisheng Cao","Toh Jing Hua","Zhoujun Cheng","Dongchan Shin","Fangyu Lei","Yitao Liu","Yiheng Xu","Shuyan Zhou","Silvio Savarese","Caiming Xiong","Victor Zhong","Tao Yu"],"pdf_url":"https://arxiv.org/pdf/2404.07972v2.pdf","comment":"51 pages, 21 figures"},{"id":"http://arxiv.org/abs/2405.19846v1","updated":"2024-05-30T08:50:55Z","published":"2024-05-30T08:50:55Z","title":"Quest: Query-centric Data Synthesis Approach for Long-context Scaling of\n Large Language Model","summary":" Large language models, initially pre-trained with a limited context length,\ncan better handle longer texts by continuing training on a corpus with extended\ncontexts. However, obtaining effective long-context data is challenging due to\nthe scarcity and uneven distribution of long documents across different\ndomains. To address this issue, we propose a Query-centric data synthesis\nmethod, abbreviated as Quest. Quest is an interpretable method based on the\nobservation that documents retrieved by similar queries are relevant but\nlow-redundant, thus well-suited for synthesizing long-context data. The method\nis also scalable and capable of constructing large amounts of long-context\ndata. Using Quest, we synthesize a long-context dataset up to 128k context\nlength, significantly outperforming other data synthesis methods on multiple\nlong-context benchmark datasets. In addition, we further verify that the Quest\nmethod is predictable through scaling law experiments, making it a reliable\nsolution for advancing long-context models.\n","authors":["Chaochen Gao","Xing Wu","Qi Fu","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2405.19846v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19842v1","updated":"2024-05-30T08:49:34Z","published":"2024-05-30T08:49:34Z","title":"Improve Student's Reasoning Generalizability through Cascading\n Decomposed CoTs Distillation","summary":" Large language models (LLMs) exhibit enhanced reasoning at larger scales,\ndriving efforts to distill these capabilities into smaller models via\nteacher-student learning. Previous works simply fine-tune student models on\nteachers' generated Chain-of-Thoughts (CoTs) data. Although these methods\nenhance in-domain (IND) reasoning performance, they struggle to generalize to\nout-of-domain (OOD) tasks. We believe that the widespread spurious correlations\nbetween questions and answers may lead the model to preset a specific answer\nwhich restricts the diversity and generalizability of its reasoning process. In\nthis paper, we propose Cascading Decomposed CoTs Distillation (CasCoD) to\naddress these issues by decomposing the traditional single-step learning\nprocess into two cascaded learning steps. Specifically, by restructuring the\ntraining objectives -- removing the answer from outputs and concatenating the\nquestion with the rationale as input -- CasCoD's two-step learning process\nensures that students focus on learning rationales without interference from\nthe preset answers, thus improving reasoning generalizability. Extensive\nexperiments demonstrate the effectiveness of CasCoD on both IND and OOD\nbenchmark reasoning datasets. Code can be found at\nhttps://github.com/C-W-D/CasCoD.\n","authors":["Chengwei Dai","Kun Li","Wei Zhou","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2405.19842v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19831v1","updated":"2024-05-30T08:41:33Z","published":"2024-05-30T08:41:33Z","title":"Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic\n Similarity and Privacy Preservation of Differentially Private Rewritten Text","summary":" The study of Differential Privacy (DP) in Natural Language Processing often\nviews the task of text privatization as a $\\textit{rewriting}$ task, in which\nsensitive input texts are rewritten to hide explicit or implicit private\ninformation. In order to evaluate the privacy-preserving capabilities of a DP\ntext rewriting mechanism, $\\textit{empirical privacy}$ tests are frequently\nemployed. In these tests, an adversary is modeled, who aims to infer sensitive\ninformation (e.g., gender) about the author behind a (privatized) text. Looking\nto improve the empirical protections provided by DP rewriting methods, we\npropose a simple post-processing method based on the goal of aligning rewritten\ntexts with their original counterparts, where DP rewritten texts are rewritten\n$\\textit{again}$. Our results shown that such an approach not only produces\noutputs that are more semantically reminiscent of the original inputs, but also\ntexts which score on average better in empirical privacy evaluations.\nTherefore, our approach raises the bar for DP rewriting methods in their\nempirical privacy evaluations, providing an extra layer of protection against\nmalicious adversaries.\n","authors":["Stephen Meisenbacher","Florian Matthes"],"pdf_url":"https://arxiv.org/pdf/2405.19831v1.pdf","comment":"10 pages, 2 figures, 2 tables. Accepted to ARES 2024 (IWAPS)"},{"id":"http://arxiv.org/abs/2403.11904v3","updated":"2024-05-30T08:37:45Z","published":"2024-03-18T16:04:55Z","title":"CICLe: Conformal In-Context Learning for Largescale Multi-Class Food\n Risk Classification","summary":" Contaminated or adulterated food poses a substantial risk to human health.\nGiven sets of labeled web texts for training, Machine Learning and Natural\nLanguage Processing can be applied to automatically detect such risks. We\npublish a dataset of 7,546 short texts describing public food recall\nannouncements. Each text is manually labeled, on two granularity levels (coarse\nand fine), for food products and hazards that the recall corresponds to. We\ndescribe the dataset and benchmark naive, traditional, and Transformer models.\nBased on our analysis, Logistic Regression based on a tf-idf representation\noutperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss\ndifferent prompting strategies and present an LLM-in-the-loop framework, based\non Conformal Prediction, which boosts the performance of the base classifier\nwhile reducing energy consumption compared to normal prompting.\n","authors":["Korbinian Randl","John Pavlopoulos","Aron Henriksson","Tony Lindgren"],"pdf_url":"https://arxiv.org/pdf/2403.11904v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19799v1","updated":"2024-05-30T08:10:50Z","published":"2024-05-30T08:10:50Z","title":"Unsupervised Mutual Learning of Dialogue Discourse Parsing and Topic\n Segmentation","summary":" The advancement of large language models (LLMs) has propelled the development\nof dialogue systems. Unlike the popular ChatGPT-like assistant model, which\nonly satisfies the user's preferences, task-oriented dialogue systems have also\nfaced new requirements and challenges in the broader business field. They are\nexpected to provide correct responses at each dialogue turn, at the same time,\nachieve the overall goal defined by the task. By understanding rhetorical\nstructures and topic structures via topic segmentation and discourse parsing, a\ndialogue system may do a better planning to achieve both objectives. However,\nwhile both structures belong to discourse structure in linguistics, rhetorical\nstructure and topic structure are mostly modeled separately or with one\nassisting the other in the prior work. The interaction between these two\nstructures has not been considered for joint modeling and mutual learning.\nFurthermore, unsupervised learning techniques to achieve the above are not well\nexplored. To fill this gap, we propose an unsupervised mutual learning\nframework of two structures leveraging the global and local connections between\nthem. We extend the topic modeling between non-adjacent discourse units to\nensure global structural relevance with rhetorical structures. We also\nincorporate rhetorical structures into the topic structure through a graph\nneural network model to ensure local coherence consistency. Finally, we utilize\nthe similarity between the two fused structures for mutual learning. The\nexperimental results demonstrate that our methods outperform all strong\nbaselines on two dialogue rhetorical datasets (STAC and Molweni), as well as\ndialogue topic datasets (Doc2Dial and TIAGE).\n","authors":["Jiahui Xu","Feng Jiang","Anningzhe Gao","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2405.19799v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19795v1","updated":"2024-05-30T08:03:15Z","published":"2024-05-30T08:03:15Z","title":"SLM as Guardian: Pioneering AI Safety with Small Language Models","summary":" Most prior safety research of large language models (LLMs) has focused on\nenhancing the alignment of LLMs to better suit the safety requirements of\nhumans. However, internalizing such safeguard features into larger models\nbrought challenges of higher training cost and unintended degradation of\nhelpfulness. To overcome such challenges, a modular approach employing a\nsmaller LLM to detect harmful user queries is regarded as a convenient solution\nin designing LLM-based system with safety requirements.\n In this paper, we leverage a smaller LLM for both harmful query detection and\nsafeguard response generation. We introduce our safety requirements and the\ntaxonomy of harmfulness categories, and then propose a multi-task learning\nmechanism fusing the two tasks into a single model. We demonstrate the\neffectiveness of our approach, providing on par or surpassing harmful query\ndetection and safeguard response performance compared to the publicly available\nLLMs.\n","authors":["Ohjoon Kwon","Donghyeon Jeon","Nayoung Choi","Gyu-Hwung Cho","Changbong Kim","Hyunwoo Lee","Inho Kang","Sun Kim","Taiwoo Park"],"pdf_url":"https://arxiv.org/pdf/2405.19795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19793v1","updated":"2024-05-30T08:01:20Z","published":"2024-05-30T08:01:20Z","title":"PDDLEGO: Iterative Planning in Textual Environments","summary":" Planning in textual environments have been shown to be a long-standing\nchallenge even for current models. A recent, promising line of work uses LLMs\nto generate a formal representation of the environment that can be solved by a\nsymbolic planner. However, existing methods rely on a fully-observed\nenvironment where all entity states are initially known, so a one-off\nrepresentation can be constructed, leading to a complete plan. In contrast, we\ntackle partially-observed environments where there is initially no sufficient\ninformation to plan for the end-goal. We propose PDDLEGO that iteratively\nconstruct a planning representation that can lead to a partial plan for a given\nsub-goal. By accomplishing the sub-goal, more information is acquired to\naugment the representation, eventually achieving the end-goal. We show that\nplans produced by few-shot PDDLEGO are 43% more efficient than generating plans\nend-to-end on the Coin Collector simulation, with strong performance (98%) on\nthe more complex Cooking World simulation where end-to-end LLMs fail to\ngenerate coherent plans (4%).\n","authors":["Li Zhang","Peter Jansen","Tianyi Zhang","Peter Clark","Chris Callison-Burch","Niket Tandon"],"pdf_url":"https://arxiv.org/pdf/2405.19793v1.pdf","comment":"In *SEM 2024"},{"id":"http://arxiv.org/abs/2405.19787v1","updated":"2024-05-30T07:54:07Z","published":"2024-05-30T07:54:07Z","title":"From Symbolic Tasks to Code Generation: Diversification Yields Better\n Task Performers","summary":" Instruction tuning -- tuning large language models on instruction-output\npairs -- is a promising technique for making models better adapted to the real\nworld. Yet, the key factors driving the model's capability to understand and\nfollow instructions not seen during training remain under-explored. Our\ninvestigation begins with a series of synthetic experiments within the\ntheoretical framework of a Turing-complete algorithm called Markov algorithm,\nwhich allows fine-grained control over the instruction-tuning data.\nGeneralization and robustness with respect to the training distribution emerge\nonce a diverse enough set of tasks is provided, even though very few examples\nare provided for each task. We extend these initial results to a real-world\napplication scenario of code generation and find that a more diverse\ninstruction set, extending beyond code-related tasks, improves the performance\nof code generation. Our observations suggest that a more diverse semantic space\nfor instruction-tuning sets greatly improves the model's ability to follow\ninstructions and perform tasks.\n","authors":["Dylan Zhang","Justin Wang","Francois Charton"],"pdf_url":"https://arxiv.org/pdf/2405.19787v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19782v1","updated":"2024-05-30T07:48:00Z","published":"2024-05-30T07:48:00Z","title":"Dataflow-Guided Retrieval Augmentation for Repository-Level Code\n Completion","summary":" Recent years have witnessed the deployment of code language models (LMs) in\nvarious code intelligence tasks such as code completion. Yet, it is challenging\nfor pre-trained LMs to generate correct completions in private repositories.\nPrevious studies retrieve cross-file context based on import relations or text\nsimilarity, which is insufficiently relevant to completion targets. In this\npaper, we propose a dataflow-guided retrieval augmentation approach, called\nDraCo, for repository-level code completion. DraCo parses a private repository\ninto code entities and establishes their relations through an extended dataflow\nanalysis, forming a repo-specific context graph. Whenever triggering code\ncompletion, DraCo precisely retrieves relevant background knowledge from the\nrepo-specific context graph and generates well-formed prompts to query code\nLMs. Furthermore, we construct a large Python dataset, ReccEval, with more\ndiverse completion targets. Our experiments demonstrate the superior accuracy\nand applicable efficiency of DraCo, improving code exact match by 3.43% and\nidentifier F1-score by 3.27% on average compared to the state-of-the-art\napproach.\n","authors":["Wei Cheng","Yuhan Wu","Wei Hu"],"pdf_url":"https://arxiv.org/pdf/2405.19782v1.pdf","comment":"Accepted in the 62nd Annual Meeting of the Association for\n Computational Linguistics (ACL 2024)"},{"id":"http://arxiv.org/abs/2405.19778v1","updated":"2024-05-30T07:44:16Z","published":"2024-05-30T07:44:16Z","title":"Enhancing Consistency and Role-Specific Knowledge Capturing by\n Rebuilding Fictional Character's Persona","summary":" With the recent introduction of Assistants API, it is expected that\ndocument-based language models will be actively used in various domains,\nespecially Role-playing. However, a key challenge lies in utilizing\nprotagonist's persona: Assistants API often fails to achieve with its search\nbecause the information extraction part is different each time and it often\nomits important information such as protagonist's backstory or relationships.\nIt is hard to maintain a consistent persona simply by using the persona\ndocument as input to the Assistants API. To address the challenge of achieving\nstable persona consistency, we propose CharacterGPT, a novel persona\nreconstruction framework to alleviate the shortcomings of the Assistants API.\nOur method involves Character Persona Training (CPT), an effective persona\nrebuilding process that updates the character persona by extracting the\ncharacter's traits from given summary of the novel for each character as if the\nstory in a novel progresses. In our experiments, we ask each character to take\nthe Big Five Inventory personality test in various settings and analyze the\nresults. To assess whether it can think outside the box, we let each character\ngenerate short novels. Extensive experiments and human evaluation demonstrate\nthat CharacterGPT presents new possibilities for role-playing agent research.\n","authors":["Jeiyoon Park","Chanjun Park","Heuiseok Lim"],"pdf_url":"https://arxiv.org/pdf/2405.19778v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2405.19763v1","updated":"2024-05-30T07:19:31Z","published":"2024-05-30T07:19:31Z","title":"Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural\n Language Understanding","summary":" Recent strides in large language models (LLMs) have yielded remarkable\nperformance, leveraging reinforcement learning from human feedback (RLHF) to\nsignificantly enhance generation and alignment capabilities. However, RLHF\nencounters numerous challenges, including the objective mismatch issue, leading\nto suboptimal performance in Natural Language Understanding (NLU) tasks. To\naddress this limitation, we propose a novel Reinforcement Learning framework\nenhanced with Label-sensitive Reward (RLLR) to amplify the performance of LLMs\nin NLU tasks. By incorporating label-sensitive pairs into reinforcement\nlearning, our method aims to adeptly capture nuanced label-sensitive semantic\nfeatures during RL, thereby enhancing natural language understanding.\nExperiments conducted on five diverse foundation models across eight tasks\nshowcase promising results. In comparison to Supervised Fine-tuning models\n(SFT), RLLR demonstrates an average performance improvement of 1.54%. Compared\nwith RLHF models, the improvement averages at 0.69%. These results reveal the\neffectiveness of our method for LLMs in NLU tasks. Code and data available at:\nhttps://github.com/MagiaSN/ACL2024_RLLR.\n","authors":["Kuo Liao","Shuang Li","Meng Zhao","Liqun Liu","Mengge Xue","Zhenyu Hu","Honglin Han","Chengguo Yin"],"pdf_url":"https://arxiv.org/pdf/2405.19763v1.pdf","comment":"Accept at ACL2024 Main"},{"id":"http://arxiv.org/abs/2311.04044v2","updated":"2024-05-30T06:56:56Z","published":"2023-11-07T14:55:52Z","title":"PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language\n Models","summary":" The rapid development of language models (LMs) brings unprecedented\naccessibility and usage for both models and users. On the one hand, powerful\nLMs achieve state-of-the-art performance over numerous downstream NLP tasks. On\nthe other hand, more and more attention is paid to unrestricted model accesses\nthat may bring malicious privacy risks of data leakage. To address these\nissues, many recent works propose privacy-preserving language models (PPLMs)\nwith differential privacy (DP). Unfortunately, different DP implementations\nmake it challenging for a fair comparison among existing PPLMs. In this paper,\nwe present PrivLM-Bench, a multi-perspective privacy evaluation benchmark to\nempirically and intuitively quantify the privacy leakage of LMs. Instead of\nonly reporting DP parameters, PrivLM-Bench sheds light on the neglected\ninference data privacy during actual usage. PrivLM-Bench first clearly defines\nmulti-faceted privacy objectives. Then, PrivLM-Bench constructs a unified\npipeline to perform private fine-tuning. Lastly, PrivLM-Bench performs existing\nprivacy attacks on LMs with pre-defined privacy objectives as the empirical\nevaluation results. The empirical attack results are used to fairly and\nintuitively evaluate the privacy leakage of various PPLMs. We conduct extensive\nexperiments on three datasets of GLUE for mainstream LMs.\n","authors":["Haoran Li","Dadi Guo","Donghao Li","Wei Fan","Qi Hu","Xin Liu","Chunkit Chan","Duanyi Yao","Yuan Yao","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2311.04044v2.pdf","comment":"To appear at ACL 2024"},{"id":"http://arxiv.org/abs/2405.15143v2","updated":"2024-05-30T06:48:44Z","published":"2024-05-24T01:45:27Z","title":"Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation\n Models","summary":" Go-Explore is a powerful family of algorithms designed to solve\nhard-exploration problems, built on the principle of archiving discovered\nstates, and iteratively returning to and exploring from the most promising\nstates. This approach has led to superhuman performance across a wide variety\nof challenging problems including Atari games and robotic control, but requires\nmanually designing heuristics to guide exploration, which is time-consuming and\ninfeasible in general. To resolve this, we propose Intelligent Go-Explore (IGE)\nwhich greatly extends the scope of the original Go-Explore by replacing these\nheuristics with the intelligence and internalized human notions of\ninterestingness captured by giant foundation models (FMs). This provides IGE\nwith a human-like ability to instinctively identify how interesting or\npromising any new state is (e.g. discovering new objects, locations, or\nbehaviors), even in complex environments where heuristics are hard to define.\nMoreover, IGE offers the exciting and previously impossible opportunity to\nrecognize and capitalize on serendipitous discoveries that cannot be predicted\nahead of time. We evaluate IGE on a range of language-based tasks that require\nsearch and exploration. In Game of 24, a multistep mathematical reasoning\nproblem, IGE reaches 100% success rate 70.8% faster than the best classic graph\nsearch baseline. Next, in BabyAI-Text, a challenging partially observable\ngridworld, IGE exceeds the previous SOTA with orders of magnitude fewer online\nsamples. Finally, in TextWorld, we show the unique ability of IGE to succeed in\nsettings requiring long-horizon exploration where prior SOTA FM agents like\nReflexion completely fail. Overall, IGE combines the tremendous strengths of\nFMs and the powerful Go-Explore algorithm, opening up a new frontier of\nresearch into creating more generally capable agents with impressive\nexploration capabilities.\n","authors":["Cong Lu","Shengran Hu","Jeff Clune"],"pdf_url":"https://arxiv.org/pdf/2405.15143v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19744v1","updated":"2024-05-30T06:45:23Z","published":"2024-05-30T06:45:23Z","title":"X-Instruction: Aligning Language Model in Low-resource Languages with\n Self-curated Cross-lingual Instructions","summary":" Large language models respond well in high-resource languages like English\nbut struggle in low-resource languages. It may arise from the lack of\nhigh-quality instruction following data in these languages. Directly\ntranslating English samples into these languages can be a solution but\nunreliable, leading to responses with translation errors and lacking\nlanguage-specific or cultural knowledge. To address this issue, we propose a\nnovel method to construct cross-lingual instruction following samples with\ninstruction in English and response in low-resource languages. Specifically,\nthe language model first learns to generate appropriate English instructions\naccording to the natural web texts in other languages as responses. The\ncandidate cross-lingual instruction tuning samples are further refined and\ndiversified. We have employed this method to build a large-scale cross-lingual\ninstruction tuning dataset on 10 languages, namely X-Instruction. The\ninstruction data built using our method incorporate more language-specific\nknowledge compared with the naive translation method. Experimental results have\nshown that the response quality of the model tuned on X-Instruction greatly\nexceeds the model distilled from a powerful teacher model, reaching or even\nsurpassing the ones of ChatGPT. In addition, we find that models tuned on\ncross-lingual instruction following samples can follow the instruction in the\noutput language without further tuning.\n","authors":["Chong Li","Wen Yang","Jiajun Zhang","Jinliang Lu","Shaonan Wang","Chengqing Zong"],"pdf_url":"https://arxiv.org/pdf/2405.19744v1.pdf","comment":"ACL 2024. Our codes, data and model weights are available at\n https://github.com/ZNLP/X-Instruction"},{"id":"http://arxiv.org/abs/2405.19740v1","updated":"2024-05-30T06:38:32Z","published":"2024-05-30T06:38:32Z","title":"PertEval: Unveiling Real Knowledge Capacity of LLMs with\n Knowledge-Invariant Perturbations","summary":" Expert-designed close-ended benchmarks serve as vital tools in assessing the\nknowledge capacity of large language models (LLMs). Despite their widespread\nuse, concerns have mounted regarding their reliability due to limited test\nscenarios and an unavoidable risk of data contamination. To rectify this, we\npresent PertEval, a toolkit devised for in-depth probing of LLMs' knowledge\ncapacity through knowledge-invariant perturbations. These perturbations employ\nhuman-like restatement techniques to generate on-the-fly test samples from\nstatic benchmarks, meticulously retaining knowledge-critical content while\naltering irrelevant details. Our toolkit further includes a suite of transition\nanalyses that compare performance on raw vs. perturbed test sets to precisely\nassess LLMs' genuine knowledge capacity. Six state-of-the-art LLMs are\nre-evaluated using PertEval. Results reveal significantly inflated performance\nof the LLMs on raw benchmarks, including an absolute 21% overestimation for\nGPT-4. Additionally, through a nuanced response pattern analysis, we discover\nthat PertEval retains LLMs' uncertainty to specious knowledge, potentially\nbeing resolved through rote memorization and leading to inflated performance.\nWe also find that the detailed transition analyses by PertEval could illuminate\nweaknesses in existing LLMs' knowledge mastery and guide the development of\nrefinement. Given these insights, we posit that PertEval can act as an\nessential tool that, when applied alongside any close-ended benchmark, unveils\nthe true knowledge capacity of LLMs, marking a significant step toward more\ntrustworthy LLM evaluation.\n","authors":["Jiatong Li","Renjun Hu","Kunzhe Huang","Yan Zhuang","Qi Liu","Mengxiao Zhu","Xing Shi","Wei Lin"],"pdf_url":"https://arxiv.org/pdf/2405.19740v1.pdf","comment":"23 pages, 12 figures, 10 tables"},{"id":"http://arxiv.org/abs/2405.19737v1","updated":"2024-05-30T06:32:11Z","published":"2024-05-30T06:32:11Z","title":"Beyond Imitation: Learning Key Reasoning Steps from Dual\n Chain-of-Thoughts in Reasoning Distillation","summary":" As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts\n(CoTs) reasoning abilities, practical resource constraints drive efforts to\ndistill these capabilities into more compact Smaller Language Models (SLMs). We\nfind that CoTs consist mainly of simple reasoning forms, with a small\nproportion ($\\approx 4.7\\%$) of key reasoning steps that truly impact\nconclusions. However, previous distillation methods typically involve\nsupervised fine-tuning student SLMs only on correct CoTs data produced by\nteacher LLMs, resulting in students struggling to learn the key reasoning\nsteps, instead imitating the teacher's reasoning forms and making errors or\nomissions on these steps. To address these issues, drawing an analogy to human\nlearning, where analyzing mistakes according to correct solutions often reveals\nthe crucial steps leading to successes or failures, we propose\nmistak\\textbf{E}-\\textbf{D}riven key reason\\textbf{I}ng step\ndistilla\\textbf{T}ion (\\textbf{EDIT}), a novel method that further aids SLMs\nlearning key reasoning steps rather than mere simple fine-tuning. Firstly, to\nexpose these crucial steps in CoTs, we design specific prompts to generate dual\nCoTs data with similar reasoning paths but divergent conclusions. Then, we\napply the minimum edit distance algorithm on the dual CoTs data to locate these\nkey steps and optimize the likelihood of these steps. Extensive experiments\nvalidate the effectiveness of EDIT across both in-domain and out-of-domain\nbenchmark reasoning datasets. Further analysis shows that EDIT can generate\nhigh-quality CoTs with more correct key reasoning steps. Notably, we also\nexplore how different mistake patterns affect performance and find that EDIT\nbenefits more from logical errors than from knowledge or mathematical\ncalculation errors in dual CoTs\\footnote{Code can be found at\n\\url{https://github.com/C-W-D/EDIT}}.\n","authors":["Chengwei Dai","Kun Li","Wei Zhou","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2405.19737v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19732v1","updated":"2024-05-30T06:24:14Z","published":"2024-05-30T06:24:14Z","title":"Two Optimizers Are Better Than One: LLM Catalyst for Enhancing\n Gradient-Based Optimization","summary":" Learning a skill generally relies on both practical experience by doer and\ninsightful high-level guidance by instructor. Will this strategy also work well\nfor solving complex non-convex optimization problems? Here, a common\ngradient-based optimizer acts like a disciplined doer, making locally optimal\nupdate at each step. Recent methods utilize large language models (LLMs) to\noptimize solutions for concrete problems by inferring from natural language\ninstructions, akin to a high-level instructor. In this paper, we show that\nthese two optimizers are complementary to each other, suggesting a\ncollaborative optimization approach. The gradient-based optimizer and LLM-based\noptimizer are combined in an interleaved manner. We instruct LLMs using task\ndescriptions and timely optimization trajectories recorded during\ngradient-based optimization. Inferred results from LLMs are used as restarting\npoints for the next stage of gradient optimization. By leveraging both the\nlocally rigorous gradient-based optimizer and the high-level deductive\nLLM-based optimizer, our combined optimization method consistently yields\nimprovements over competitive baseline prompt tuning methods. Our results\ndemonstrate the synergistic effect of conventional gradient-based optimization\nand the inference ability of LLMs. The code is released at\nhttps://github.com/guozix/LLM-catalyst.\n","authors":["Zixian Guo","Ming Liu","Zhilong Ji","Jinfeng Bai","Yiwen Guo","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2405.19732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.15316v2","updated":"2024-05-30T06:18:20Z","published":"2023-11-26T14:35:23Z","title":"Sibyl: Sensible Empathetic Dialogue Generation with Visionary\n Commonsense Knowledge","summary":" Recently, there has been a heightened interest in building chatbots based on\nLarge Language Models (LLMs) to emulate human-like qualities in dialogues,\nincluding expressing empathy and offering emotional support. Despite having\naccess to commonsense knowledge to better understand the psychological aspects\nand causality of dialogue context, even these powerful LLMs struggle to achieve\nthe goals of empathy and emotional support. As current approaches do not\nadequately anticipate dialogue future, they may mislead language models to\nignore complex dialogue goals of empathy and emotional support, resulting in\nunsupportive responses lacking empathy. To address this issue, we present an\ninnovative framework named Sensible Empathetic Dialogue Generation with\nVisionary Commonsense Knowledge (Sibyl). Designed to concentrate on the\nimminent dialogue future, this paradigm directs LLMs toward the implicit\nrequirements of the conversation, aiming to provide more sensible responses.\nExperimental results demonstrate that incorporating our paradigm for acquiring\ncommonsense knowledge into LLMs comprehensively enhances the quality of their\nresponses.\n","authors":["Lanrui Wang","Jiangnan Li","Chenxu Yang","Zheng Lin","Hongyin Tang","Huan Liu","Xiaolei Huang","Yanan Cao","Jingang Wang","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2311.15316v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2405.19716v1","updated":"2024-05-30T05:53:49Z","published":"2024-05-30T05:53:49Z","title":"Enhancing Large Vision Language Models with Self-Training on Image\n Comprehension","summary":" Large vision language models (LVLMs) integrate large language models (LLMs)\nwith pre-trained vision encoders, thereby activating the perception capability\nof the model to understand image inputs for different queries and conduct\nsubsequent reasoning. Improving this capability requires high-quality\nvision-language data, which is costly and labor-intensive to acquire.\nSelf-training approaches have been effective in single-modal settings to\nalleviate the need for labeled data by leveraging model's own generation.\nHowever, effective self-training remains a challenge regarding the unique\nvisual perception and reasoning capability of LVLMs. To address this, we\nintroduce Self-Training on Image Comprehension (STIC), which emphasizes a\nself-training approach specifically for image comprehension. First, the model\nself-constructs a preference dataset for image descriptions using unlabeled\nimages. Preferred responses are generated through a step-by-step prompt, while\ndis-preferred responses are generated from either corrupted images or\nmisleading prompts. To further self-improve reasoning on the extracted visual\ninformation, we let the model reuse a small portion of existing\ninstruction-tuning data and append its self-generated image descriptions to the\nprompts. We validate the effectiveness of STIC across seven different\nbenchmarks, demonstrating substantial performance gains of 4.0% on average\nwhile using 70% less supervised fine-tuning data than the current method.\nFurther studies investigate various components of STIC and highlight its\npotential to leverage vast quantities of unlabeled images for self-training.\nCode and data are made publicly available.\n","authors":["Yihe Deng","Pan Lu","Fan Yin","Ziniu Hu","Sheng Shen","James Zou","Kai-Wei Chang","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2405.19716v1.pdf","comment":"19 pages, 14 figures, 6 tables"},{"id":"http://arxiv.org/abs/2405.19715v1","updated":"2024-05-30T05:49:38Z","published":"2024-05-30T05:49:38Z","title":"SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths","summary":" Speculative decoding reduces the inference latency of a target large language\nmodel via utilizing a smaller and faster draft model. Its performance depends\non a hyperparameter K -- the candidate length, i.e., the number of candidate\ntokens for the target model to verify in each round. However, previous methods\noften use simple heuristics to choose K, which may result in sub-optimal\nperformance. We study the choice of the candidate length K and formulate it as\na Markov Decision Process. We theoretically show that the optimal policy of\nthis Markov decision process takes the form of a threshold policy, i.e., the\ncurrent speculation should stop and be verified when the probability of getting\na rejection exceeds a threshold value. Motivated by this theory, we propose\nSpecDec++, an enhanced version of speculative decoding that adaptively\ndetermines the candidate length on the fly. We augment the draft model with a\ntrained acceptance prediction head to predict the conditional acceptance\nprobability of the candidate tokens. SpecDec++ will stop the current\nspeculation when the predicted probability that at least one token gets\nrejected exceeds a threshold. We implement SpecDec++ and apply it to the\nllama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup\non the Alpaca dataset (an additional 7.2% improvement over the baseline\nspeculative decoding). On the GSM8K and HumanEval datasets, our method achieves\na 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement),\nrespectively.\n","authors":["Kaixuan Huang","Xudong Guo","Mengdi Wang"],"pdf_url":"https://arxiv.org/pdf/2405.19715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10160v2","updated":"2024-05-30T05:27:35Z","published":"2023-12-15T19:16:21Z","title":"Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in\n Chart Captioning","summary":" Recent advancements in large vision-language models (LVLMs) have led to\nsignificant progress in generating natural language descriptions for visual\ncontent and thus enhancing various applications. One issue with these powerful\nmodels is that they sometimes produce texts that are factually inconsistent\nwith the visual input. While there has been some effort to mitigate such\ninconsistencies in natural image captioning, the factuality of generated\ncaptions for structured document images, such as charts, has not received as\nmuch scrutiny, posing a potential threat to information reliability in critical\napplications. This work delves into the factuality aspect by introducing a\ncomprehensive typology of factual errors in generated chart captions. A\nlarge-scale human annotation effort provides insight into the error patterns\nand frequencies in captions crafted by various chart captioning models,\nultimately forming the foundation of a novel dataset, CHOCOLATE. Our analysis\nreveals that even state-of-the-art models, including GPT-4V, frequently produce\ncaptions laced with factual inaccuracies. In response to this challenge, we\nestablish the new task of Chart Caption Factual Error Correction and introduce\nCHARTVE, a model for visual entailment that outperforms proprietary and\nopen-source LVLMs in evaluating factual consistency. Furthermore, we propose\nC2TFEC, an interpretable two-stage framework that excels at correcting factual\nerrors. This work inaugurates a new domain in factual error correction for\nchart captions, presenting a novel evaluation mechanism, and demonstrating an\neffective approach to ensuring the factuality of generated chart captions. The\ncode and data as well as the continuously updated benchmark can be found at:\nhttps://khuangaf.github.io/CHOCOLATE/.\n","authors":["Kung-Hsiang Huang","Mingyang Zhou","Hou Pong Chan","Yi R. Fung","Zhenhailong Wang","Lingyu Zhang","Shih-Fu Chang","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2312.10160v2.pdf","comment":"ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2405.19701v1","updated":"2024-05-30T05:26:57Z","published":"2024-05-30T05:26:57Z","title":"Significance of Chain of Thought in Gender Bias Mitigation for\n English-Dravidian Machine Translation","summary":" Gender bias in machine translation (MT) systems poses a significant challenge\nto achieving accurate and inclusive translations. This paper examines gender\nbias in machine translation systems for languages such as Telugu and Kannada\nfrom the Dravidian family, analyzing how gender inflections affect translation\naccuracy and neutrality using Google Translate and ChatGPT. It finds that while\nplural forms can reduce bias, individual-centric sentences often maintain the\nbias due to historical stereotypes. The study evaluates the Chain of Thought\nprocessing, noting significant bias mitigation from 80% to 4% in Telugu and\nfrom 40% to 0% in Kannada. It also compares Telugu and Kannada translations,\nemphasizing the need for language specific strategies to address these\nchallenges and suggesting directions for future research to enhance fairness in\nboth data preparation and prompts during inference.\n","authors":["Lavanya Prahallad","Radhika Mamidi"],"pdf_url":"https://arxiv.org/pdf/2405.19701v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.07088v3","updated":"2024-05-30T05:21:23Z","published":"2024-03-11T18:26:02Z","title":"SPA: Towards A Computational Friendly Cloud-Base and On-Devices\n Collaboration Seq2seq Personalized Generation","summary":" Large language models(LLMs) have shown its outperforming ability on various\ntasks and question answering. However, LLMs require substantial memory storage\non low-resource devices. More critically, the computational speed on these\ndevices is also severely limited. In this paper, we propose SPA(Side Plugin\nAdaption), a lightweight architecture for fast on-devices inference on the\nconstraints of strict on-devices computation and memory constraints. Compared\nwith other on-devices seq2seq generation, SPA could make a fast and stable\ninference on low-resource constraints, allowing it to obtain cost effiency. Our\nmethod establish an interaction between a pretrained LLMs on-cloud and additive\nparameters on-devices, which could provide the knowledge on both pretrained\nLLMs and featured personal feature. Further more, SPA provides a framework to\nkeep feature-base parameters on low computational devices while leave the\nparameters containing general information on the high computational devices.\n","authors":["Yanming Liu","Xinyue Peng","Jiannan Cao","Le Dai","Xingzu Liu","Weihao Liu","Mingbang Wang"],"pdf_url":"https://arxiv.org/pdf/2403.07088v3.pdf","comment":"15 pages, second version of SPA(Side Plugin Adaption)"},{"id":"http://arxiv.org/abs/2405.11577v3","updated":"2024-05-30T05:13:19Z","published":"2024-05-19T15:00:50Z","title":"A Multi-Perspective Analysis of Memorization in Large Language Models","summary":" Large Language Models (LLMs), trained on massive corpora with billions of\nparameters, show unprecedented performance in various fields. Though surprised\nby their excellent performances, researchers also noticed some special\nbehaviors of those LLMs. One of those behaviors is memorization, in which LLMs\ncan generate the same content used to train them. Though previous research has\ndiscussed memorization, the memorization of LLMs still lacks explanation,\nespecially the cause of memorization and the dynamics of generating them. In\nthis research, we comprehensively discussed memorization from various\nperspectives and extended the discussion scope to not only just the memorized\ncontent but also less and unmemorized content. Through various studies, we\nfound that: (1) Through experiments, we revealed the relation of memorization\nbetween model size, continuation size, and context size. Further, we showed how\nunmemorized sentences transition to memorized sentences. (2) Through embedding\nanalysis, we showed the distribution and decoding dynamics across model size in\nembedding space for sentences with different memorization scores. The n-gram\nstatistics analysis presents d (3) An analysis over n-gram and entropy decoding\ndynamics discovered a boundary effect when the model starts to generate\nmemorized sentences or unmemorized sentences. (4)We trained a Transformer model\nto predict the memorization of different models, showing that it is possible to\npredict memorizations by context.\n","authors":["Bowen Chen","Namgi Han","Yusuke Miyao"],"pdf_url":"https://arxiv.org/pdf/2405.11577v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.14808v3","updated":"2024-05-30T05:09:25Z","published":"2024-02-22T18:58:28Z","title":"RelayAttention for Efficient Large Language Model Serving with Long\n System Prompts","summary":" A practical large language model (LLM) service may involve a long system\nprompt, which specifies the instructions, examples, and knowledge documents of\nthe task and is reused across requests. However, the long system prompt causes\nthroughput/latency bottlenecks as the cost of generating the next token grows\nw.r.t. the sequence length. This paper aims to improve the efficiency of LLM\nservices that involve long system prompts. Our key observation is that handling\nthese system prompts requires heavily redundant memory accesses in existing\ncausal attention computation algorithms. Specifically, for batched requests,\nthe cached hidden states (\\ie, key-value pairs) of system prompts are\ntransferred from off-chip DRAM to on-chip SRAM multiple times, each\ncorresponding to an individual request. To eliminate such a redundancy, we\npropose RelayAttention, an attention algorithm that allows reading these hidden\nstates from DRAM exactly once for a batch of input tokens. RelayAttention is a\nfree lunch: it maintains the generation quality while requiring no model\nretraining, as it is based on a mathematical reformulation of causal attention.\nWe have observed significant performance improvements to a production-level\nsystem, vLLM, through integration with RelayAttention. The improvements are\neven more profound with longer system prompts.\n","authors":["Lei Zhu","Xinjiang Wang","Wayne Zhang","Rynson W. H. Lau"],"pdf_url":"https://arxiv.org/pdf/2402.14808v3.pdf","comment":"accepted by the ACL 2024 main conference"},{"id":"http://arxiv.org/abs/2402.06967v2","updated":"2024-05-30T04:57:36Z","published":"2024-02-10T14:52:52Z","title":"Instruct Once, Chat Consistently in Multiple Rounds: An Efficient Tuning\n Framework for Dialogue","summary":" Tuning language models for dialogue generation has been a prevalent paradigm\nfor building capable dialogue agents. Yet, traditional tuning narrowly views\ndialogue generation as resembling other language generation tasks, ignoring the\nrole disparities between two speakers and the multi-round interactive process\nthat dialogues ought to be. Such a manner often leads to unsatisfactory chat\nconsistency for the built agent. In this work, we emphasize the interactive,\ncommunicative nature of dialogue and argue that it is more feasible to model\nthe speaker roles of agent and user separately, enabling the agent to adhere to\nits role consistently. With this in mind, we propose an efficient Multi-round\nInteractive Dialogue Tuning (Midi-Tuning) framework. It models the agent and\nuser individually with two adapters built upon large language models. The\nadapters make use of respective utterances round by round in alternating order\nand they are tuned via a round-level memory caching mechanism. Extensive\nexperiments demonstrate that, our framework performs superior to traditional\nfine-tuning and harbors the tremendous potential for improving dialogue\nconsistency.\n","authors":["Jian Wang","Chak Tou Leong","Jiashuo Wang","Dongding Lin","Wenjie Li","Xiao-Yong Wei"],"pdf_url":"https://arxiv.org/pdf/2402.06967v2.pdf","comment":"Accepted by ACL 2024"},{"id":"http://arxiv.org/abs/2401.02415v2","updated":"2024-05-30T04:45:34Z","published":"2024-01-04T18:59:12Z","title":"LLaMA Pro: Progressive LLaMA with Block Expansion","summary":" Humans generally acquire new skills without compromising the old; however,\nthe opposite holds for Large Language Models (LLMs), e.g., from LLaMA to\nCodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with\nan expansion of Transformer blocks. We tune the expanded blocks using only new\ncorpus, efficiently and effectively improving the model's knowledge without\ncatastrophic forgetting. In this paper, we experiment on the corpus of code and\nmath, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from\nLLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro\nand its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced\nperformance among various benchmarks, demonstrating superiority over existing\nopen models in the LLaMA family and the immense potential of reasoning and\naddressing diverse tasks as an intelligent agent. Our findings provide valuable\ninsights into integrating natural and programming languages, laying a solid\nfoundation for developing advanced language agents that operate effectively in\nvarious environments.\n","authors":["Chengyue Wu","Yukang Gan","Yixiao Ge","Zeyu Lu","Jiahao Wang","Ye Feng","Ying Shan","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2401.02415v2.pdf","comment":"Accepted by ACL 2024, Main Conference"},{"id":"http://arxiv.org/abs/2402.10466v4","updated":"2024-05-30T04:19:54Z","published":"2024-02-16T06:13:18Z","title":"Large Language Models as Zero-shot Dialogue State Tracker through\n Function Calling","summary":" Large language models (LLMs) are increasingly prevalent in conversational\nsystems due to their advanced understanding and generative capabilities in\ngeneral contexts. However, their effectiveness in task-oriented dialogues\n(TOD), which requires not only response generation but also effective dialogue\nstate tracking (DST) within specific tasks and domains, remains less\nsatisfying. In this work, we propose a novel approach FnCTOD for solving DST\nwith LLMs through function calling. This method improves zero-shot DST,\nallowing adaptation to diverse domains without extensive data collection or\nmodel tuning. Our experimental results demonstrate that our approach achieves\nexceptional performance with both modestly sized open-source and also\nproprietary LLMs: with in-context prompting it enables various 7B or 13B\nparameter models to surpass the previous state-of-the-art (SOTA) achieved by\nChatGPT, and improves ChatGPT's performance beating the SOTA by 5.6% average\njoint goal accuracy (JGA). Individual model results for GPT-3.5 and GPT-4 are\nboosted by 4.8% and 14%, respectively. We also show that by fine-tuning on a\nsmall collection of diverse task-oriented dialogues, we can equip modestly\nsized models, specifically a 13B parameter LLaMA2-Chat model, with\nfunction-calling capabilities and DST performance comparable to ChatGPT while\nmaintaining their chat capabilities. We have made the code publicly available\nat https://github.com/facebookresearch/FnCTOD\n","authors":["Zekun Li","Zhiyu Zoey Chen","Mike Ross","Patrick Huber","Seungwhan Moon","Zhaojiang Lin","Xin Luna Dong","Adithya Sagar","Xifeng Yan","Paul A. Crook"],"pdf_url":"https://arxiv.org/pdf/2402.10466v4.pdf","comment":"ACL 2024 Main. Code available at:\n https://github.com/facebookresearch/FnCTOD"},{"id":"http://arxiv.org/abs/2402.01869v2","updated":"2024-05-30T04:18:03Z","published":"2024-02-02T19:47:57Z","title":"InferCept: Efficient Intercept Support for Augmented Large Language\n Model Inference","summary":" Large language models are increasingly integrated with external environments,\ntools, and agents like ChatGPT plugins to extend their capability beyond\nlanguage-centric tasks. However, today's LLM inference systems are designed for\nstandalone LLMs. They treat each external interaction as the end of LLM\ngeneration and form a new request when the interaction finishes, causing\nunnecessary recomputation of already computed contexts, which accounts for\n37-40% of total model forwarding time. This paper presents InferCept, the first\nLLM inference framework targeting augmented LLMs and supporting the efficient\ninterception of LLM generation. InferCept minimizes the GPU resource waste\ncaused by LLM interceptions and dedicates saved memory for serving more\nrequests. InferCept improves the overall serving throughput by 1.6x-2x and\ncompletes 2x more requests per second compared to the state-of-the-art LLM\ninference systems.\n","authors":["Reyna Abhyankar","Zijian He","Vikranth Srivatsa","Hao Zhang","Yiying Zhang"],"pdf_url":"https://arxiv.org/pdf/2402.01869v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19670v1","updated":"2024-05-30T03:44:54Z","published":"2024-05-30T03:44:54Z","title":"One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for\n Retrieval-Augmented Large Language Models","summary":" Retrieval-augmented generation (RAG) is a promising way to improve large\nlanguage models (LLMs) for generating more factual, accurate, and up-to-date\ncontent. Existing methods either optimize prompts to guide LLMs in leveraging\nretrieved information or directly fine-tune the LLMs to adapt to RAG scenarios.\nAlthough fine-tuning can yield better performance, it often compromises the\nLLMs' general generation capabilities by modifying their parameters. This\nlimitation poses challenges in practical applications, especially when LLMs are\nalready deployed, as parameter adjustments may affect their original\nfunctionality. To address this, we propose a novel method that involves\nlearning scalable and pluggable virtual tokens for RAG. By maintaining the\nLLMs' original parameters and fine-tuning only the embeddings of these\npluggable tokens, our approach not only enhances LLMs' performance but also\npreserves their general generation capacities. Furthermore, we design several\ntraining strategies to improve the scalability, flexibility, and\ngeneralizability of our method. Comprehensive experiments across nine\nquestion-answering tasks demonstrate the superiority of our approach.\n","authors":["Yutao Zhu","Zhaoheng Huang","Zhicheng Dou","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2405.19670v1.pdf","comment":"working in progress, repo: https://github.com/DaoD/SPRING/"},{"id":"http://arxiv.org/abs/2405.19660v1","updated":"2024-05-30T03:20:56Z","published":"2024-05-30T03:20:56Z","title":"PATIENT-Ψ: Using Large Language Models to Simulate Patients for\n Training Mental Health Professionals","summary":" Mental illness remains one of the most critical public health issues, with a\nsignificant gap between the available mental health support and patient needs.\nMany mental health professionals highlight a disconnect between their training\nand real-world patient interactions, leaving some trainees feeling unprepared\nand potentially affecting their early career success. In this paper, we propose\nPATIENT-{\\Psi}, a novel patient simulation framework for cognitive behavior\ntherapy (CBT) training. To build PATIENT-{\\Psi}, we constructed diverse patient\nprofiles and their corresponding cognitive models based on CBT principles, and\nthen used large language models (LLMs) programmed with the patient cognitive\nmodels to act as a simulated therapy patient. We propose an interactive\ntraining scheme, PATIENT-{\\Psi}-TRAINER, for mental health trainees to practice\na key skill in CBT -- formulating the cognitive model of the patient -- through\nrole-playing a therapy session with PATIENT-{\\Psi}. To evaluate PATIENT-{\\Psi},\nwe conducted a user study of 4 mental health trainees and 10 experts. The\nresults demonstrate that practice using PATIENT-{\\Psi}-TRAINER greatly enhances\nthe perceived skill acquisition and confidence of the trainees beyond existing\nforms of training such as textbooks, videos, and role-play with non-patients.\nBased on the experts' perceptions, PATIENT-{\\Psi} is perceived to be closer to\nreal patient interactions than GPT-4, and PATIENT-{\\Psi}-TRAINER holds strong\npromise to improve trainee competencies. Our pioneering patient simulation\ntraining framework, using LLMs, holds great potential to enhance and advance\nmental health training, ultimately leading to improved patient care and\noutcomes. We will release all our data, code, and the training platform.\n","authors":["Ruiyi Wang","Stephanie Milani","Jamie C. Chiu","Shaun M. Eack","Travis Labrum","Samuel M. Murphy","Nev Jones","Kate Hardy","Hong Shen","Fei Fang","Zhiyu Zoey Chen"],"pdf_url":"https://arxiv.org/pdf/2405.19660v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2405.19648v1","updated":"2024-05-30T03:00:47Z","published":"2024-05-30T03:00:47Z","title":"Detecting Hallucinations in Large Language Model Generation: A Token\n Probability Approach","summary":" Concerns regarding the propensity of Large Language Models (LLMs) to produce\ninaccurate outputs, also known as hallucinations, have escalated. Detecting\nthem is vital for ensuring the reliability of applications relying on\nLLM-generated content. Current methods often demand substantial resources and\nrely on extensive LLMs or employ supervised learning with multidimensional\nfeatures or intricate linguistic and semantic analyses difficult to reproduce\nand largely depend on using the same LLM that hallucinated. This paper\nintroduces a supervised learning approach employing two simple classifiers\nutilizing only four numerical features derived from tokens and vocabulary\nprobabilities obtained from other LLM evaluators, which are not necessarily the\nsame. The method yields promising results, surpassing state-of-the-art outcomes\nin multiple tasks across three different benchmarks. Additionally, we provide a\ncomprehensive examination of the strengths and weaknesses of our approach,\nhighlighting the significance of the features utilized and the LLM employed as\nan evaluator. We have released our code publicly at\nhttps://github.com/Baylor-AI/HalluDetect.\n","authors":["Ernesto Quevedo","Jorge Yero","Rachel Koerner","Pablo Rivas","Tomas Cerny"],"pdf_url":"https://arxiv.org/pdf/2405.19648v1.pdf","comment":"ICAI'24 - The 26th Int'l Conf on Artificial Intelligence"},{"id":"http://arxiv.org/abs/2401.06102v3","updated":"2024-05-30T02:52:08Z","published":"2024-01-11T18:33:48Z","title":"Patchscopes: A Unifying Framework for Inspecting Hidden Representations\n of Language Models","summary":" Understanding the internal representations of large language models (LLMs)\ncan help explain models' behavior and verify their alignment with human values.\nGiven the capabilities of LLMs in generating human-understandable text, we\npropose leveraging the model itself to explain its internal representations in\nnatural language. We introduce a framework called Patchscopes and show how it\ncan be used to answer a wide range of questions about an LLM's computation. We\nshow that many prior interpretability methods based on projecting\nrepresentations into the vocabulary space and intervening on the LLM\ncomputation can be viewed as instances of this framework. Moreover, several of\ntheir shortcomings such as failure in inspecting early layers or lack of\nexpressivity can be mitigated by Patchscopes. Beyond unifying prior inspection\ntechniques, Patchscopes also opens up new possibilities such as using a more\ncapable model to explain the representations of a smaller model, and multihop\nreasoning error correction.\n","authors":["Asma Ghandeharioun","Avi Caciularu","Adam Pearce","Lucas Dixon","Mor Geva"],"pdf_url":"https://arxiv.org/pdf/2401.06102v3.pdf","comment":"ICML 2024 (to appear)"},{"id":"http://arxiv.org/abs/2405.12107v2","updated":"2024-05-30T02:47:10Z","published":"2024-05-20T15:23:19Z","title":"Imp: Highly Capable Large Multimodal Models for Mobile Devices","summary":" By harnessing the capabilities of large language models (LLMs), recent large\nmultimodal models (LMMs) have shown remarkable versatility in open-world\nmultimodal understanding. Nevertheless, they are usually parameter-heavy and\ncomputation-intensive, thus hindering their applicability in\nresource-constrained scenarios. To this end, several lightweight LMMs have been\nproposed successively to maximize the capabilities under constrained scale\n(e.g., 3B). Despite the encouraging results achieved by these methods, most of\nthem only focus on one or two aspects of the design space, and the key design\nchoices that influence model capability have not yet been thoroughly\ninvestigated. In this paper, we conduct a systematic study for lightweight LMMs\nfrom the aspects of model architecture, training strategy, and training data.\nBased on our findings, we obtain Imp -- a family of highly capable LMMs at the\n2B-4B scales. Notably, our Imp-3B model steadily outperforms all the existing\nlightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs\nat the 13B scale. With low-bit quantization and resolution reduction\ntechniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile\nchip with a high inference speed of about 13 tokens/s.\n","authors":["Zhenwei Shao","Zhou Yu","Jun Yu","Xuecheng Ouyang","Lihao Zheng","Zhenbiao Gai","Mingyang Wang","Jiajun Ding"],"pdf_url":"https://arxiv.org/pdf/2405.12107v2.pdf","comment":"fix some typos and correct a few number in the tables"},{"id":"http://arxiv.org/abs/2405.19635v1","updated":"2024-05-30T02:37:35Z","published":"2024-05-30T02:37:35Z","title":"GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient\n Cloud-edge Collaboration LLM Deployment","summary":" The burgeoning size of Large Language Models (LLMs) has led to enhanced\ncapabilities in generating responses, albeit at the expense of increased\ninference times and elevated resource demands. Existing methods of\nacceleration, predominantly hinged on knowledge distillation, generally\nnecessitate fine-tuning of considerably large models, such as Llama-7B, posing\na challenge for average users. Furthermore, present techniques for expediting\ninference and reducing costs operate independently. To address these issues, we\nintroduce a novel and intuitive Guidance-based Knowledge Transfer (GKT)\nframework. This approach leverages a larger LLM as a ''teacher'' to create\nguidance prompts, paired with a smaller ''student'' model to finalize\nresponses. Remarkably, GKT requires no fine-tuning and doesn't necessitate the\nteacher and student models to have the same vocabulary, allowing for extensive\nbatch generation to accelerate the process while ensuring user customization.\nGKT can be seamlessly integrated into cloud-edge collaboration architectures,\nand is versatile enough for plug-and-play application across various models. It\nexcels in both efficiency and affordability, epitomizing a ''cheap and\ncheerful'' solution. GKT achieves a maximum accuracy improvement of 14.18%,\nalong with a 10.72 times speed-up on GSM8K and an accuracy improvement of 14.00\n% along with a 7.73 times speed-up in CSQA. When utilizing ChatGPT as teacher\nmodel and Llama2-70B as the student model, we can achieve 95.00% of ChatGPT's\nperformance at 52% of the cost. The results highlight substantial enhancements\nin accuracy and processing speed on the GSM8K and CSQA datasets, surpassing the\nperformance of using either the student or teacher models in isolation.\n","authors":["Yao Yao","Zuchao Li","Hai Zhao"],"pdf_url":"https://arxiv.org/pdf/2405.19635v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15924v3","updated":"2024-05-30T02:13:56Z","published":"2024-05-24T20:32:49Z","title":"SLIDE: A Framework Integrating Small and Large Language Models for\n Open-Domain Dialogues Evaluation","summary":" The long-standing one-to-many problem of gold standard responses in\nopen-domain dialogue systems presents challenges for automatic evaluation\nmetrics. Though prior works have demonstrated some success by applying powerful\nLarge Language Models (LLMs), existing approaches still struggle with the\none-to-many problem, and exhibit subpar performance in domain-specific\nscenarios. We assume the commonsense reasoning biases within LLMs may hinder\ntheir performance in domainspecific evaluations. To address both issues, we\npropose a novel framework SLIDE (Small and Large Integrated for Dialogue\nEvaluation), that leverages both a small, specialised model (SLM), and LLMs for\nthe evaluation of open domain dialogues. Our approach introduces several\ntechniques: (1) Contrastive learning to differentiate between robust and\nnon-robust response embeddings; (2) A novel metric for semantic sensitivity\nthat combines embedding cosine distances with similarity learned through neural\nnetworks, and (3) a strategy for incorporating the evaluation results from both\nthe SLM and LLMs. Our empirical results demonstrate that our approach achieves\nstate-of-the-art performance in both the classification and evaluation tasks,\nand additionally the SLIDE evaluator exhibits better correlation with human\njudgements. Our code is available at https://\ngithub.com/hegehongcha/SLIDE-ACL2024.\n","authors":["Kun Zhao","Bohao Yang","Chen Tang","Chenghua Lin","Liang Zhan"],"pdf_url":"https://arxiv.org/pdf/2405.15924v3.pdf","comment":"Accepted by ACL2024 Findings"},{"id":"http://arxiv.org/abs/2405.17743v2","updated":"2024-05-30T02:12:05Z","published":"2024-05-28T01:55:35Z","title":"ORLM: Training Large Language Models for Optimization Modeling","summary":" Large Language Models (LLMs) have emerged as powerful tools for tackling\ncomplex Operations Research (OR) problem by providing the capacity in\nautomating optimization modeling. However, current methodologies heavily rely\non prompt engineering (e.g., multi-agent cooperation) with proprietary LLMs,\nraising data privacy concerns that could be prohibitive in industry\napplications. To tackle this issue, we propose training open-source LLMs for\noptimization modeling. We identify four critical requirements for the training\ndataset of OR LLMs, design and implement OR-Instruct, a semi-automated process\nfor creating synthetic data tailored to specific requirements. We also\nintroduce the IndustryOR benchmark, the first industrial benchmark for testing\nLLMs on solving real-world OR problems. We apply the data from OR-Instruct to\nvarious open-source LLMs of 7b size (termed as ORLMs), resulting in a\nsignificantly improved capability for optimization modeling. Our\nbest-performing ORLM achieves state-of-the-art performance on the NL4OPT, MAMO,\nand IndustryOR benchmarks. Our code and data are available at\n\\url{https://github.com/Cardinal-Operations/ORLM}.\n","authors":["Zhengyang Tang","Chenyu Huang","Xin Zheng","Shixi Hu","Zizhuo Wang","Dongdong Ge","Benyou Wang"],"pdf_url":"https://arxiv.org/pdf/2405.17743v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2405.19616v1","updated":"2024-05-30T02:09:51Z","published":"2024-05-30T02:09:51Z","title":"Easy Problems That LLMs Get Wrong","summary":" We introduce a comprehensive Linguistic Benchmark designed to evaluate the\nlimitations of Large Language Models (LLMs) in domains such as logical\nreasoning, spatial intelligence, and linguistic understanding, among others.\nThrough a series of straightforward questions, it uncovers the significant\nlimitations of well-regarded models to perform tasks that humans manage with\nease. It also highlights the potential of prompt engineering to mitigate some\nerrors and underscores the necessity for better training methodologies. Our\nfindings stress the importance of grounding LLMs with human reasoning and\ncommon sense, emphasising the need for human-in-the-loop for enterprise\napplications. We hope this work paves the way for future research to enhance\nthe usefulness and reliability of new models.\n","authors":["Sean Williams","James Huckle"],"pdf_url":"https://arxiv.org/pdf/2405.19616v1.pdf","comment":"AutogenAI Ltd. Associated code at\n https://github.com/autogenai/easy-problems-that-llms-get-wrong"},{"id":"http://arxiv.org/abs/2402.01349v2","updated":"2024-05-30T01:57:14Z","published":"2024-02-02T12:07:00Z","title":"Beyond the Answers: Reviewing the Rationality of Multiple Choice\n Question Answering for the Evaluation of Large Language Models","summary":" In the field of natural language processing (NLP), Large Language Models\n(LLMs) have precipitated a paradigm shift, markedly enhancing performance in\nnatural language generation tasks. Despite these advancements, the\ncomprehensive evaluation of LLMs remains an inevitable challenge for the\ncommunity. Recently, the utilization of Multiple Choice Question Answering\n(MCQA) as a benchmark for LLMs has gained considerable traction. This study\nfirst investigates the limitations of MCQA as an evaluation method for LLMs and\nthen analyzes the fundamental reason for the limitations of MCQA, that while\nLLMs may select the correct answers, it is possible that they also recognize\nother wrong options as correct. Finally, we propose a dataset augmenting method\nfor Multiple-Choice Questions (MCQs), MCQA+, that can more accurately reflect\nthe performance of the model, which underscores the need for more robust\nevaluation mechanisms in assessing the performance of LLMs.\n","authors":["Haochun Wang","Sendong Zhao","Zewen Qiang","Nuwa Xi","Bing Qin","Ting Liu"],"pdf_url":"https://arxiv.org/pdf/2402.01349v2.pdf","comment":"17 pages, 8 figures"},{"id":"http://arxiv.org/abs/2312.17267v4","updated":"2024-05-30T01:56:51Z","published":"2023-12-26T14:16:16Z","title":"Enhancing Low-Resource Relation Representations through Multi-View\n Decoupling","summary":" Recently, prompt-tuning with pre-trained language models (PLMs) has\ndemonstrated the significantly enhancing ability of relation extraction (RE)\ntasks. However, in low-resource scenarios, where the available training data is\nscarce, previous prompt-based methods may still perform poorly for prompt-based\nrepresentation learning due to a superficial understanding of the relation. To\nthis end, we highlight the importance of learning high-quality relation\nrepresentation in low-resource scenarios for RE, and propose a novel\nprompt-based relation representation method, named MVRE\n(\\underline{M}ulti-\\underline{V}iew \\underline{R}elation\n\\underline{E}xtraction), to better leverage the capacity of PLMs to improve the\nperformance of RE within the low-resource prompt-tuning paradigm. Specifically,\nMVRE decouples each relation into different perspectives to encompass\nmulti-view relation representations for maximizing the likelihood during\nrelation inference. Furthermore, we also design a Global-Local loss and a\nDynamic-Initialization method for better alignment of the multi-view\nrelation-representing virtual words, containing the semantics of relation\nlabels during the optimization learning process and initialization. Extensive\nexperiments on three benchmark datasets show that our method can achieve\nstate-of-the-art in low-resource settings.\n","authors":["Chenghao Fan","Wei Wei","Xiaoye Qu","Zhenyi Lu","Wenfeng Xie","Yu Cheng","Dangyang Chen"],"pdf_url":"https://arxiv.org/pdf/2312.17267v4.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2405.19597v1","updated":"2024-05-30T01:27:43Z","published":"2024-05-30T01:27:43Z","title":"SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors","summary":" Popular parameter-efficient fine-tuning (PEFT) methods, such as LoRA and its\nvariants, freeze pre-trained model weights \\(W\\) and inject learnable matrices\n\\(\\Delta W\\). These \\(\\Delta W\\) matrices are structured for efficient\nparameterization, often using techniques like low-rank approximations or\nscaling vectors. However, these methods typically show a performance gap\ncompared to full fine-tuning. Although recent PEFT methods have narrowed this\ngap, they do so at the cost of additional learnable parameters. We propose\nSVFT, a simple approach that fundamentally differs from existing methods: the\nstructure imposed on \\(\\Delta W\\) depends on the specific weight matrix \\(W\\).\nSpecifically, SVFT updates \\(W\\) as a sparse combination of outer products of\nits singular vectors, training only the coefficients (scales) of these sparse\ncombinations. This approach allows fine-grained control over expressivity\nthrough the number of coefficients. Extensive experiments on language and\nvision benchmarks show that SVFT recovers up to 96% of full fine-tuning\nperformance while training only 0.006 to 0.25% of parameters, outperforming\nexisting methods that only recover up to 85% performance using 0.03 to 0.8% of\nthe trainable parameter budget.\n","authors":["Vijay Lingam","Atula Tejaswi","Aditya Vavre","Aneesh Shetty","Gautham Krishna Gudur","Joydeep Ghosh","Alex Dimakis","Eunsol Choi","Aleksandar Bojchevski","Sujay Sanghavi"],"pdf_url":"https://arxiv.org/pdf/2405.19597v1.pdf","comment":"17 pages, 5 figures, 14 tables"},{"id":"http://arxiv.org/abs/2405.19592v1","updated":"2024-05-30T01:11:35Z","published":"2024-05-30T01:11:35Z","title":"Why Larger Language Models Do In-context Learning Differently?","summary":" Large language models (LLM) have emerged as a powerful tool for AI, with the\nkey ability of in-context learning (ICL), where they can perform well on unseen\ntasks based on a brief series of task examples without necessitating any\nadjustments to the model parameters. One recent interesting mysterious\nobservation is that models of different scales may have different ICL\nbehaviors: larger models tend to be more sensitive to noise in the test\ncontext. This work studies this observation theoretically aiming to improve the\nunderstanding of LLM and ICL. We analyze two stylized settings: (1) linear\nregression with one-layer single-head linear transformers and (2) parity\nclassification with two-layer multiple attention heads transformers (non-linear\ndata and non-linear model). In both settings, we give closed-form optimal\nsolutions and find that smaller models emphasize important hidden features\nwhile larger ones cover more hidden features; thus, smaller models are more\nrobust to noise while larger ones are more easily distracted, leading to\ndifferent ICL behaviors. This sheds light on where transformers pay attention\nto and how that affects ICL. Preliminary experimental results on large base and\nchat models provide positive support for our analysis.\n","authors":["Zhenmei Shi","Junyi Wei","Zhuoyan Xu","Yingyu Liang"],"pdf_url":"https://arxiv.org/pdf/2405.19592v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2112.07783v2","updated":"2024-05-30T00:26:06Z","published":"2021-12-14T23:06:21Z","title":"Online antisemitism across platforms","summary":" We created a fine-grained AI system for the detection of antisemitism. This\nExplainable AI will identify English and German anti-Semitic expressions of\ndehumanization, verbal aggression and conspiracies in online social media\nmessages across platforms, to support high-level decision making.\n","authors":["Tom De Smedt"],"pdf_url":"https://arxiv.org/pdf/2112.07783v2.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2405.15525v2","updated":"2024-05-30T00:08:51Z","published":"2024-05-24T13:12:14Z","title":"Sparse Matrix in Large Language Model Fine-tuning","summary":" LoRA and its variants have become popular parameter-efficient fine-tuning\n(PEFT) methods due to their ability to avoid excessive computational costs.\nHowever, an accuracy gap often exists between PEFT methods and full fine-tuning\n(FT), and this gap has yet to be systematically studied. In this work, we\nintroduce a method for selecting sparse sub-matrices that aim to minimize the\nperformance gap between PEFT vs. full fine-tuning (FT) while also reducing both\nfine-tuning computational cost and memory cost. Our Sparse Matrix Tuning (SMT)\nmethod begins by identifying the most significant sub-matrices in the gradient\nupdate, updating only these blocks during the fine-tuning process. In our\nexperiments, we demonstrate that SMT consistently surpasses other PEFT baseline\n(e.g. LoRA and DoRA) in fine-tuning popular large language models such as LLaMA\nacross a broad spectrum of tasks, while reducing the GPU memory footprint by\n67% compared to FT. We also examine how the performance of LoRA and DoRA tends\nto plateau and decline as the number of trainable parameters increases, in\ncontrast, our SMT method does not suffer from such issue.\n","authors":["Haoze He","Juncheng Billy Li","Xuan Jiang","Heather Miller"],"pdf_url":"https://arxiv.org/pdf/2405.15525v2.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2405.20331v1","updated":"2024-05-30T17:59:04Z","published":"2024-05-30T17:59:04Z","title":"CoSy: Evaluating Textual Explanations of Neurons","summary":" A crucial aspect of understanding the complex nature of Deep Neural Networks\n(DNNs) is the ability to explain learned concepts within their latent\nrepresentations. While various methods exist to connect neurons to textual\ndescriptions of human-understandable concepts, evaluating the quality of these\nexplanation methods presents a major challenge in the field due to a lack of\nunified, general-purpose quantitative evaluation. In this work, we introduce\nCoSy (Concept Synthesis) -- a novel, architecture-agnostic framework to\nevaluate the quality of textual explanations for latent neurons. Given textual\nexplanations, our proposed framework leverages a generative model conditioned\non textual input to create data points representing the textual explanation.\nThen, the neuron's response to these explanation data points is compared with\nthe response to control data points, providing a quality estimate of the given\nexplanation. We ensure the reliability of our proposed framework in a series of\nmeta-evaluation experiments and demonstrate practical value through insights\nfrom benchmarking various concept-based textual explanation methods for\nComputer Vision tasks, showing that tested explanation methods significantly\ndiffer in quality.\n","authors":["Laura Kopf","Philine Lou Bommer","Anna Hedström","Sebastian Lapuschkin","Marina M. -C. Höhne","Kirill Bykov"],"pdf_url":"https://arxiv.org/pdf/2405.20331v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2405.20218v1","updated":"2024-05-30T16:19:02Z","published":"2024-05-30T16:19:02Z","title":"ESG-FTSE: A corpus of news articles with ESG relevance labels and use\n cases","summary":" We present ESG-FTSE, the first corpus comprised of news articles with\nEnvironmental, Social and Governance (ESG) relevance annotations. In recent\nyears, investors and regulators have pushed ESG investing to the mainstream due\nto the urgency of climate change. This has led to the rise of ESG scores to\nevaluate an investment's credentials as socially responsible. While demand for\nESG scores is high, their quality varies wildly. Quantitative techniques can be\napplied to improve ESG scores, thus, responsible investing. To contribute to\nresource building for ESG and financial text mining, we pioneer the ESG-FTSE\ncorpus. We further present the first of its kind ESG annotation schema. It has\nthree levels: a binary classification (relevant versus irrelevant news\narticles), ESG classification (ESG-related news articles), and target company.\nBoth supervised and unsupervised learning experiments for ESG relevance\ndetection were conducted to demonstrate that the corpus can be used in\ndifferent settings to derive accurate ESG predictions. Keywords: corpus\nannotation, ESG labels, annotation schema, news article, natural language\nprocessing\n","authors":["Mariya Pavlova","Bernard Casey","Miaosen Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20218v1.pdf","comment":"The corpus is available at\n https://github.com/mariavpavlova/ESG-FTSE-Corpus.\n https://aclanthology.org/2024.finnlp-1.14/"},{"id":"http://arxiv.org/abs/2405.20018v1","updated":"2024-05-30T12:57:35Z","published":"2024-05-30T12:57:35Z","title":"Safe Multi-agent Reinforcement Learning with Natural Language\n Constraints","summary":" The role of natural language constraints in Safe Multi-agent Reinforcement\nLearning (MARL) is crucial, yet often overlooked. While Safe MARL has vast\npotential, especially in fields like robotics and autonomous vehicles, its full\npotential is limited by the need to define constraints in pre-designed\nmathematical terms, which requires extensive domain expertise and reinforcement\nlearning knowledge, hindering its broader adoption. To address this limitation\nand make Safe MARL more accessible and adaptable, we propose a novel approach\nnamed Safe Multi-agent Reinforcement Learning with Natural Language constraints\n(SMALL). Our method leverages fine-tuned language models to interpret and\nprocess free-form textual constraints, converting them into semantic embeddings\nthat capture the essence of prohibited states and behaviours. These embeddings\nare then integrated into the multi-agent policy learning process, enabling\nagents to learn policies that minimize constraint violations while optimizing\nrewards. To evaluate the effectiveness of SMALL, we introduce the LaMaSafe, a\nmulti-task benchmark designed to assess the performance of multiple agents in\nadhering to natural language constraints. Empirical evaluations across various\nenvironments demonstrate that SMALL achieves comparable rewards and\nsignificantly fewer constraint violations, highlighting its effectiveness in\nunderstanding and enforcing natural language constraints.\n","authors":["Ziyan Wang","Meng Fang","Tristan Tomilin","Fei Fang","Yali Du"],"pdf_url":"https://arxiv.org/pdf/2405.20018v1.pdf","comment":"23 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.20015v1","updated":"2024-05-30T12:50:32Z","published":"2024-05-30T12:50:32Z","title":"Efficient LLM-Jailbreaking by Introducing Visual Modality","summary":" This paper focuses on jailbreaking attacks against large language models\n(LLMs), eliciting them to generate objectionable content in response to harmful\nuser queries. Unlike previous LLM-jailbreaks that directly orient to LLMs, our\napproach begins by constructing a multimodal large language model (MLLM)\nthrough the incorporation of a visual module into the target LLM. Subsequently,\nwe conduct an efficient MLLM-jailbreak to generate jailbreaking embeddings\nembJS. Finally, we convert the embJS into text space to facilitate the\njailbreaking of the target LLM. Compared to direct LLM-jailbreaking, our\napproach is more efficient, as MLLMs are more vulnerable to jailbreaking than\npure LLM. Additionally, to improve the attack success rate (ASR) of\njailbreaking, we propose an image-text semantic matching scheme to identify a\nsuitable initial input. Extensive experiments demonstrate that our approach\nsurpasses current state-of-the-art methods in terms of both efficiency and\neffectiveness. Moreover, our approach exhibits superior cross-class\njailbreaking capabilities.\n","authors":["Zhenxing Niu","Yuyao Sun","Haodong Ren","Haoxuan Ji","Quan Wang","Xiaoke Ma","Gang Hua","Rong Jin"],"pdf_url":"https://arxiv.org/pdf/2405.20015v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19988v1","updated":"2024-05-30T12:18:06Z","published":"2024-05-30T12:18:06Z","title":"Video-Language Critic: Transferable Reward Functions for\n Language-Conditioned Robotics","summary":" Natural language is often the easiest and most convenient modality for humans\nto specify tasks for robots. However, learning to ground language to behavior\ntypically requires impractical amounts of diverse, language-annotated\ndemonstrations collected on each target robot. In this work, we aim to separate\nthe problem of what to accomplish from how to accomplish it, as the former can\nbenefit from substantial amounts of external observation-only data, and only\nthe latter depends on a specific robot embodiment. To this end, we propose\nVideo-Language Critic, a reward model that can be trained on readily available\ncross-embodiment data using contrastive learning and a temporal ranking\nobjective, and use it to score behavior traces from a separate reinforcement\nlearning actor. When trained on Open X-Embodiment data, our reward model\nenables 2x more sample-efficient policy training on Meta-World tasks than a\nsparse reward only, despite a significant domain gap. Using in-domain data but\nin a challenging task generalization setting on Meta-World, we further\ndemonstrate more sample-efficient training than is possible with prior\nlanguage-conditioned reward models that are either trained with binary\nclassification, use static images, or do not leverage the temporal information\npresent in video data.\n","authors":["Minttu Alakuijala","Reginald McLean","Isaac Woungang","Nariman Farsad","Samuel Kaski","Pekka Marttinen","Kai Yuan"],"pdf_url":"https://arxiv.org/pdf/2405.19988v1.pdf","comment":"10 pages in the main text, 16 pages including references and\n supplementary materials. 4 figures and 3 tables in the main text, 1 table in\n supplementary materials"},{"id":"http://arxiv.org/abs/2405.19893v1","updated":"2024-05-30T09:50:38Z","published":"2024-05-30T09:50:38Z","title":"Similarity is Not All You Need: Endowing Retrieval Augmented Generation\n with Multi Layered Thoughts","summary":" In recent years, large language models (LLMs) have made remarkable\nachievements in various domains. However, the untimeliness and cost of\nknowledge updates coupled with hallucination issues of LLMs have curtailed\ntheir applications in knowledge intensive tasks, where retrieval augmented\ngeneration (RAG) can be of help. Nevertheless, existing retrieval augmented\nmodels typically use similarity as a bridge between queries and documents and\nfollow a retrieve then read procedure. In this work, we argue that similarity\nis not always the panacea and totally relying on similarity would sometimes\ndegrade the performance of retrieval augmented generation. To this end, we\npropose MetRag, a Multi layEred Thoughts enhanced Retrieval Augmented\nGeneration framework. To begin with, beyond existing similarity oriented\nthought, we embrace a small scale utility model that draws supervision from an\nLLM for utility oriented thought and further come up with a smarter model by\ncomprehensively combining the similarity and utility oriented thoughts.\nFurthermore, given the fact that the retrieved document set tends to be huge\nand using them in isolation makes it difficult to capture the commonalities and\ncharacteristics among them, we propose to make an LLM as a task adaptive\nsummarizer to endow retrieval augmented generation with compactness-oriented\nthought. Finally, with multi layered thoughts from the precedent stages, an LLM\nis called for knowledge augmented generation. Extensive experiments on\nknowledge-intensive tasks have demonstrated the superiority of MetRag.\n","authors":["Chunjing Gan","Dan Yang","Binbin Hu","Hanxiao Zhang","Siyuan Li","Ziqi Liu","Yue Shen","Lin Ju","Zhiqiang Zhang","Jinjie Gu","Lei Liang","Jun Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.19893v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2405.19883v1","updated":"2024-05-30T09:42:54Z","published":"2024-05-30T09:42:54Z","title":"From Words to Actions: Unveiling the Theoretical Underpinnings of\n LLM-Driven Autonomous Systems","summary":" In this work, from a theoretical lens, we aim to understand why large\nlanguage model (LLM) empowered agents are able to solve decision-making\nproblems in the physical world. To this end, consider a hierarchical\nreinforcement learning (RL) model where the LLM Planner and the Actor perform\nhigh-level task planning and low-level execution, respectively. Under this\nmodel, the LLM Planner navigates a partially observable Markov decision process\n(POMDP) by iteratively generating language-based subgoals via prompting. Under\nproper assumptions on the pretraining data, we prove that the pretrained LLM\nPlanner effectively performs Bayesian aggregated imitation learning (BAIL)\nthrough in-context learning. Additionally, we highlight the necessity for\nexploration beyond the subgoals derived from BAIL by proving that naively\nexecuting the subgoals returned by LLM leads to a linear regret. As a remedy,\nwe introduce an $\\epsilon$-greedy exploration strategy to BAIL, which is proven\nto incur sublinear regret when the pretraining error is small. Finally, we\nextend our theoretical framework to include scenarios where the LLM Planner\nserves as a world model for inferring the transition model of the environment\nand to multi-agent settings, enabling coordination among multiple Actors.\n","authors":["Jianliang He","Siyu Chen","Fengzhuo Zhang","Zhuoran Yang"],"pdf_url":"https://arxiv.org/pdf/2405.19883v1.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2405.19653v1","updated":"2024-05-30T03:12:04Z","published":"2024-05-30T03:12:04Z","title":"SysCaps: Language Interfaces for Simulation Surrogates of Complex\n Systems","summary":" Data-driven simulation surrogates help computational scientists study complex\nsystems. They can also help inform impactful policy decisions. We introduce a\nlearning framework for surrogate modeling where language is used to interface\nwith the underlying system being simulated. We call a language description of a\nsystem a \"system caption\", or SysCap. To address the lack of datasets of paired\nnatural language SysCaps and simulation runs, we use large language models\n(LLMs) to synthesize high-quality captions. Using our framework, we train\nmultimodal text and timeseries regression models for two real-world simulators\nof complex energy systems. Our experiments demonstrate the feasibility of\ndesigning language interfaces for real-world surrogate models at comparable\naccuracy to standard baselines. We qualitatively and quantitatively show that\nSysCaps unlock text-prompt-style surrogate modeling and new generalization\nabilities beyond what was previously possible. We will release the generated\nSysCaps datasets and our code to support follow-on studies.\n","authors":["Patrick Emami","Zhaonan Li","Saumya Sinha","Truc Nguyen"],"pdf_url":"https://arxiv.org/pdf/2405.19653v1.pdf","comment":"17 pages. Under review"},{"id":"http://arxiv.org/abs/2405.20541v1","updated":"2024-05-30T23:50:20Z","published":"2024-05-30T23:50:20Z","title":"Perplexed by Perplexity: Perplexity-Based Data Pruning With Small\n Reference Models","summary":" In this work, we investigate whether small language models can determine\nhigh-quality subsets of large-scale text datasets that improve the performance\nof larger language models. While existing work has shown that pruning based on\nthe perplexity of a larger model can yield high-quality data, we investigate\nwhether smaller models can be used for perplexity-based pruning and how pruning\nis affected by the domain composition of the data being pruned. We demonstrate\nthat for multiple dataset compositions, perplexity-based pruning of pretraining\ndata can \\emph{significantly} improve downstream task performance: pruning\nbased on perplexities computed with a 125 million parameter model improves the\naverage performance on downstream tasks of a 3 billion parameter model by up to\n2.04 and achieves up to a $1.45\\times$ reduction in pretraining steps to reach\ncommensurate baseline performance. Furthermore, we demonstrate that such\nperplexity-based data pruning also yields downstream performance gains in the\nover-trained and data-constrained regimes.\n","authors":["Zachary Ankner","Cody Blakeney","Kartik Sreenivasan","Max Marion","Matthew L. Leavitt","Mansheej Paul"],"pdf_url":"https://arxiv.org/pdf/2405.20541v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20535v1","updated":"2024-05-30T23:20:25Z","published":"2024-05-30T23:20:25Z","title":"Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large\n Language Models Reasoning","summary":" Instruction Fine-Tuning (IFT) significantly enhances the zero-shot\ncapabilities of pretrained Large Language Models (LLMs). While coding data is\nknown to boost reasoning abilities during LLM pretraining, its role in\nactivating internal reasoning capacities during IFT remains understudied. This\npaper investigates a key question: How does coding data impact LLMs' reasoning\ncapacities during the IFT stage? To explore this, we thoroughly examine the\nimpact of coding data across different coding data proportions, model families,\nsizes, and reasoning domains, from various perspectives. Specifically, we\ncreate three IFT datasets with increasing coding data proportions, fine-tune\nsix LLM backbones across different families and scales on these datasets,\nevaluate the tuned models' performance across twelve tasks in three reasoning\ndomains, and analyze the outcomes from three broad-to-granular perspectives:\noverall, domain-level, and task-specific. Our holistic analysis provides\nvaluable insights in each perspective. First, coding data tuning enhances the\noverall reasoning capabilities of LLMs across different model families and\nscales. Moreover, the effect of coding data varies among different domains but\nshows consistent trends across model families and scales within each domain.\nAdditionally, coding data generally yields comparable task-specific benefits\nacross different model families, with the optimal coding data proportions in\nIFT datasets being task-specific.\n","authors":["Xinlu Zhang","Zhiyu Zoey Chen","Xi Ye","Xianjun Yang","Lichang Chen","William Yang Wang","Linda Ruth Petzold"],"pdf_url":"https://arxiv.org/pdf/2405.20535v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05660v2","updated":"2024-05-30T23:10:00Z","published":"2023-09-11T17:56:57Z","title":"Hypothesis Search: Inductive Reasoning with Language Models","summary":" Inductive reasoning is a core problem-solving capacity: humans can identify\nunderlying principles from a few examples, which robustly generalize to novel\nscenarios. Recent work evaluates large language models (LLMs) on inductive\nreasoning tasks by directly prompting them yielding \"in context learning.\" This\nworks well for straightforward inductive tasks but performs poorly on complex\ntasks such as the Abstraction and Reasoning Corpus (ARC). In this work, we\npropose to improve the inductive reasoning ability of LLMs by generating\nexplicit hypotheses at multiple levels of abstraction: we prompt the LLM to\npropose multiple abstract hypotheses about the problem, in natural language,\nthen implement the natural language hypotheses as concrete Python programs.\nThese programs can be verified by running on observed examples and generalized\nto novel inputs. To reduce the hypothesis search space, we explore steps to\nfilter the set of hypotheses to implement: we either ask the LLM to summarize\nthem into a smaller set of hypotheses or ask human annotators to select a\nsubset. We verify our pipeline's effectiveness on the ARC visual inductive\nreasoning benchmark, its variant 1D-ARC, string transformation dataset SyGuS,\nand list transformation dataset List Functions. On a random 100-problem subset\nof ARC, our automated pipeline using LLM summaries achieves 30% accuracy,\noutperforming the direct prompting baseline (accuracy of 17%). With the minimal\nhuman input of selecting from LLM-generated candidates, performance is boosted\nto 33%. Our ablations show that both abstract hypothesis generation and\nconcrete program representations benefit LLMs on inductive reasoning tasks.\n","authors":["Ruocheng Wang","Eric Zelikman","Gabriel Poesia","Yewen Pu","Nick Haber","Noah D. Goodman"],"pdf_url":"https://arxiv.org/pdf/2309.05660v2.pdf","comment":"ICLR 2024. The first two authors contributed equally. Code:\n https://github.com/Relento/hypothesis_search"},{"id":"http://arxiv.org/abs/2405.20529v1","updated":"2024-05-30T23:04:53Z","published":"2024-05-30T23:04:53Z","title":"An Automatic Question Usability Evaluation Toolkit","summary":" Evaluating multiple-choice questions (MCQs) involves either labor intensive\nhuman assessments or automated methods that prioritize readability, often\noverlooking deeper question design flaws. To address this issue, we introduce\nthe Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), an\nopen-source tool that leverages the Item-Writing Flaws (IWF) rubric for a\ncomprehensive and automated quality evaluation of MCQs. By harnessing the\nlatest in large language models such as GPT-4, advanced word embeddings, and\nTransformers designed to analyze textual complexity, SAQUET effectively\npinpoints and assesses a wide array of flaws in MCQs. We first demonstrate the\ndiscrepancy between commonly used automated evaluation metrics and the human\nassessment of MCQ quality. Then we evaluate SAQUET on a diverse dataset of MCQs\nacross the five domains of Chemistry, Statistics, Computer Science, Humanities,\nand Healthcare, showing how it effectively distinguishes between flawed and\nflawless questions, providing a level of analysis beyond what is achievable\nwith traditional metrics. With an accuracy rate of over 94% in detecting the\npresence of flaws identified by human evaluators, our findings emphasize the\nlimitations of existing evaluation methods and showcase potential in improving\nthe quality of educational assessments.\n","authors":["Steven Moore","Eamon Costello","Huy A. Nguyen","John Stamper"],"pdf_url":"https://arxiv.org/pdf/2405.20529v1.pdf","comment":"Artificial Intelligence in Education 2024"},{"id":"http://arxiv.org/abs/2405.20527v1","updated":"2024-05-30T23:01:10Z","published":"2024-05-30T23:01:10Z","title":"Towards Ontology-Enhanced Representation Learning for Large Language\n Models","summary":" Taking advantage of the widespread use of ontologies to organise and\nharmonize knowledge across several distinct domains, this paper proposes a\nnovel approach to improve an embedding-Large Language Model (embedding-LLM) of\ninterest by infusing the knowledge formalized by a reference ontology:\nontological knowledge infusion aims at boosting the ability of the considered\nLLM to effectively model the knowledge domain described by the infused\nontology. The linguistic information (i.e. concept synonyms and descriptions)\nand structural information (i.e. is-a relations) formalized by the ontology are\nutilized to compile a comprehensive set of concept definitions, with the\nassistance of a powerful generative LLM (i.e. GPT-3.5-turbo). These concept\ndefinitions are then employed to fine-tune the target embedding-LLM using a\ncontrastive learning framework. To demonstrate and evaluate the proposed\napproach, we utilize the biomedical disease ontology MONDO. The results show\nthat embedding-LLMs enhanced by ontological disease knowledge exhibit an\nimproved capability to effectively evaluate the similarity of in-domain\nsentences from biomedical documents mentioning diseases, without compromising\ntheir out-of-domain performance.\n","authors":["Francesco Ronzano","Jay Nanavati"],"pdf_url":"https://arxiv.org/pdf/2405.20527v1.pdf","comment":"14 pages, 1 figure"},{"id":"http://arxiv.org/abs/2405.20526v1","updated":"2024-05-30T22:57:49Z","published":"2024-05-30T22:57:49Z","title":"Automated Generation and Tagging of Knowledge Components from\n Multiple-Choice Questions","summary":" Knowledge Components (KCs) linked to assessments enhance the measurement of\nstudent learning, enrich analytics, and facilitate adaptivity. However,\ngenerating and linking KCs to assessment items requires significant effort and\ndomain-specific knowledge. To streamline this process for higher-education\ncourses, we employed GPT-4 to generate KCs for multiple-choice questions (MCQs)\nin Chemistry and E-Learning. We analyzed discrepancies between the KCs\ngenerated by the Large Language Model (LLM) and those made by humans through\nevaluation from three domain experts in each subject area. This evaluation\naimed to determine whether, in instances of non-matching KCs, evaluators showed\na preference for the LLM-generated KCs over their human-created counterparts.\nWe also developed an ontology induction algorithm to cluster questions that\nassess similar KCs based on their content. Our most effective LLM strategy\naccurately matched KCs for 56% of Chemistry and 35% of E-Learning MCQs, with\neven higher success when considering the top five KC suggestions. Human\nevaluators favored LLM-generated KCs, choosing them over human-assigned ones\napproximately two-thirds of the time, a preference that was statistically\nsignificant across both domains. Our clustering algorithm successfully grouped\nquestions by their underlying KCs without needing explicit labels or contextual\ninformation. This research advances the automation of KC generation and\nclassification for assessment items, alleviating the need for student data or\npredefined KC labels.\n","authors":["Steven Moore","Robin Schmucker","Tom Mitchell","John Stamper"],"pdf_url":"https://arxiv.org/pdf/2405.20526v1.pdf","comment":"Learning @ Scale 2024"},{"id":"http://arxiv.org/abs/2405.07960v3","updated":"2024-05-30T22:56:17Z","published":"2024-05-13T17:38:53Z","title":"AgentClinic: a multimodal agent benchmark to evaluate AI in simulated\n clinical environments","summary":" Diagnosing and managing a patient is a complex, sequential decision making\nprocess that requires physicians to obtain information -- such as which tests\nto perform -- and to act upon it. Recent advances in artificial intelligence\n(AI) and large language models (LLMs) promise to profoundly impact clinical\ncare. However, current evaluation schemes overrely on static medical\nquestion-answering benchmarks, falling short on interactive decision-making\nthat is required in real-life clinical work. Here, we present AgentClinic: a\nmultimodal benchmark to evaluate LLMs in their ability to operate as agents in\nsimulated clinical environments. In our benchmark, the doctor agent must\nuncover the patient's diagnosis through dialogue and active data collection. We\npresent two open medical agent benchmarks: a multimodal image and dialogue\nenvironment, AgentClinic-NEJM, and a dialogue-only environment,\nAgentClinic-MedQA. We embed cognitive and implicit biases both in patient and\ndoctor agents to emulate realistic interactions between biased agents. We find\nthat introducing bias leads to large reductions in diagnostic accuracy of the\ndoctor agents, as well as reduced compliance, confidence, and follow-up\nconsultation willingness in patient agents. Evaluating a suite of\nstate-of-the-art LLMs, we find that several models that excel in benchmarks\nlike MedQA are performing poorly in AgentClinic-MedQA. We find that the LLM\nused in the patient agent is an important factor for performance in the\nAgentClinic benchmark. We show that both having limited interactions as well as\ntoo many interaction reduces diagnostic accuracy in doctor agents. The code and\ndata for this work is publicly available at https://AgentClinic.github.io.\n","authors":["Samuel Schmidgall","Rojin Ziaei","Carl Harris","Eduardo Reis","Jeffrey Jopling","Michael Moor"],"pdf_url":"https://arxiv.org/pdf/2405.07960v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20512v1","updated":"2024-05-30T22:08:20Z","published":"2024-05-30T22:08:20Z","title":"How Multilingual Are Large Language Models Fine-Tuned for Translation?","summary":" A new paradigm for machine translation has recently emerged: fine-tuning\nlarge language models (LLM) on parallel text has been shown to outperform\ndedicated translation systems trained in a supervised fashion on much larger\namounts of parallel data (Xu et al., 2024a; Alves et al., 2024). However, it\nremains unclear whether this paradigm can enable massively multilingual machine\ntranslation or whether it requires fine-tuning dedicated models for a small\nnumber of language pairs. How does translation fine-tuning impact the MT\ncapabilities of LLMs for zero-shot languages, zero-shot language pairs, and\ntranslation tasks that do not involve English? To address these questions, we\nconduct an extensive empirical evaluation of the translation quality of the\nTOWER family of language models (Alves et al., 2024) on 132 translation tasks\nfrom the multi-parallel FLORES-200 data. We find that translation fine-tuning\nimproves translation quality even for zero-shot languages on average, but that\nthe impact is uneven depending on the language pairs involved. These results\ncall for further research to effectively enable massively multilingual\ntranslation with LLMs.\n","authors":["Aquia Richburg","Marine Carpuat"],"pdf_url":"https://arxiv.org/pdf/2405.20512v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20505v1","updated":"2024-05-30T21:51:01Z","published":"2024-05-30T21:51:01Z","title":"SPOT: Text Source Prediction from Originality Score Thresholding","summary":" The wide acceptance of large language models (LLMs) has unlocked new\napplications and social risks. Popular countermeasures aim at detecting\nmisinformation, usually involve domain specific models trained to recognize the\nrelevance of any information. Instead of evaluating the validity of the\ninformation, we propose to investigate LLM generated text from the perspective\nof trust. In this study, we define trust as the ability to know if an input\ntext was generated by a LLM or a human. To do so, we design SPOT, an efficient\nmethod, that classifies the source of any, standalone, text input based on\noriginality score. This score is derived from the prediction of a given LLM to\ndetect other LLMs. We empirically demonstrate the robustness of the method to\nthe architecture, training data, evaluation data, task and compression of\nmodern LLMs.\n","authors":["Edouard Yvinec","Gabriel Kasser"],"pdf_url":"https://arxiv.org/pdf/2405.20505v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20495v1","updated":"2024-05-30T21:36:12Z","published":"2024-05-30T21:36:12Z","title":"Transfer Q Star: Principled Decoding for LLM Alignment","summary":" Aligning foundation models is essential for their safe and trustworthy\ndeployment. However, traditional fine-tuning methods are computationally\nintensive and require updating billions of model parameters. A promising\nalternative, alignment via decoding, adjusts the response distribution directly\nwithout model updates to maximize a target reward $r$, thus providing a\nlightweight and adaptable framework for alignment. However, principled decoding\nmethods rely on oracle access to an optimal Q-function ($Q^*$), which is often\nunavailable in practice. Hence, prior SoTA methods either approximate this\n$Q^*$ using $Q^{\\pi_{\\texttt{sft}}}$ (derived from the reference $\\texttt{SFT}$\nmodel) or rely on short-term rewards, resulting in sub-optimal decoding\nperformance. In this work, we propose Transfer $Q^*$, which implicitly\nestimates the optimal value function for a target reward $r$ through a baseline\nmodel $\\rho_{\\texttt{BL}}$ aligned with a baseline reward $\\rho_{\\texttt{BL}}$\n(which can be different from the target reward $r$). Theoretical analyses of\nTransfer $Q^*$ provide a rigorous characterization of its optimality, deriving\nan upper bound on the sub-optimality gap and identifying a hyperparameter to\ncontrol the deviation from the pre-trained reference $\\texttt{SFT}$ model based\non user needs. Our approach significantly reduces the sub-optimality gap\nobserved in prior SoTA methods and demonstrates superior empirical performance\nacross key metrics such as coherence, diversity, and quality in extensive tests\non several synthetic and real datasets.\n","authors":["Souradip Chakraborty","Soumya Suvra Ghosal","Ming Yin","Dinesh Manocha","Mengdi Wang","Amrit Singh Bedi","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2405.20495v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20485v1","updated":"2024-05-30T21:19:24Z","published":"2024-05-30T21:19:24Z","title":"Phantom: General Trigger Attacks on Retrieval Augmented Language\n Generation","summary":" Retrieval Augmented Generation (RAG) expands the capabilities of modern large\nlanguage models (LLMs) in chatbot applications, enabling developers to adapt\nand personalize the LLM output without expensive training or fine-tuning. RAG\nsystems use an external knowledge database to retrieve the most relevant\ndocuments for a given query, providing this context to the LLM generator. While\nRAG achieves impressive utility in many applications, its adoption to enable\npersonalized generative models introduces new security risks. In this work, we\npropose new attack surfaces for an adversary to compromise a victim's RAG\nsystem, by injecting a single malicious document in its knowledge database. We\ndesign Phantom, general two-step attack framework against RAG augmented LLMs.\nThe first step involves crafting a poisoned document designed to be retrieved\nby the RAG system within the top-k results only when an adversarial trigger, a\nspecific sequence of words acting as backdoor, is present in the victim's\nqueries. In the second step, a specially crafted adversarial string within the\npoisoned document triggers various adversarial attacks in the LLM generator,\nincluding denial of service, reputation damage, privacy violations, and harmful\nbehaviors. We demonstrate our attacks on multiple LLM architectures, including\nGemma, Vicuna, and Llama.\n","authors":["Harsh Chaudhari","Giorgio Severi","John Abascal","Matthew Jagielski","Christopher A. Choquette-Choo","Milad Nasr","Cristina Nita-Rotaru","Alina Oprea"],"pdf_url":"https://arxiv.org/pdf/2405.20485v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.02691v3","updated":"2024-05-30T21:16:29Z","published":"2023-09-06T03:54:57Z","title":"A Joint Study of Phrase Grounding and Task Performance in Vision and\n Language Models","summary":" Key to tasks that require reasoning about natural language in visual contexts\nis grounding words and phrases to image regions. However, observing this\ngrounding in contemporary models is complex, even if it is generally expected\nto take place if the task is addressed in a way that is conductive to\ngeneralization. We propose a framework to jointly study task performance and\nphrase grounding, and propose three benchmarks to study the relation between\nthe two. Our results show that contemporary models demonstrate inconsistency\nbetween their ability to ground phrases and solve tasks. We show how this can\nbe addressed through brute-force training on ground phrasing annotations, and\nanalyze the dynamics it creates. Code and at available at\nhttps://github.com/lil-lab/phrase_grounding.\n","authors":["Noriyuki Kojima","Hadar Averbuch-Elor","Yoav Artzi"],"pdf_url":"https://arxiv.org/pdf/2309.02691v3.pdf","comment":"This was published in TMLR in 2024, on January 24th"},{"id":"http://arxiv.org/abs/2402.03299v4","updated":"2024-05-30T21:14:26Z","published":"2024-02-05T18:54:43Z","title":"GUARD: Role-playing to Generate Natural-language Jailbreakings to Test\n Guideline Adherence of Large Language Models","summary":" The discovery of \"jailbreaks\" to bypass safety filters of Large Language\nModels (LLMs) and harmful responses have encouraged the community to implement\nsafety measures. One major safety measure is to proactively test the LLMs with\njailbreaks prior to the release. Therefore, such testing will require a method\nthat can generate jailbreaks massively and efficiently. In this paper, we\nfollow a novel yet intuitive strategy to generate jailbreaks in the style of\nthe human generation. We propose a role-playing system that assigns four\ndifferent roles to the user LLMs to collaborate on new jailbreaks. Furthermore,\nwe collect existing jailbreaks and split them into different independent\ncharacteristics using clustering frequency and semantic patterns sentence by\nsentence. We organize these characteristics into a knowledge graph, making them\nmore accessible and easier to retrieve. Our system of different roles will\nleverage this knowledge graph to generate new jailbreaks, which have proved\neffective in inducing LLMs to generate unethical or guideline-violating\nresponses. In addition, we also pioneer a setting in our system that will\nautomatically follow the government-issued guidelines to generate jailbreaks to\ntest whether LLMs follow the guidelines accordingly. We refer to our system as\nGUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We have\nempirically validated the effectiveness of GUARD on three cutting-edge\nopen-sourced LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a\nwidely-utilized commercial LLM (ChatGPT). Moreover, our work extends to the\nrealm of vision language models (MiniGPT-v2 and Gemini Vision Pro), showcasing\nGUARD's versatility and contributing valuable insights for the development of\nsafer, more reliable LLM-based applications across diverse modalities.\n","authors":["Haibo Jin","Ruoxi Chen","Andy Zhou","Yang Zhang","Haohan Wang"],"pdf_url":"https://arxiv.org/pdf/2402.03299v4.pdf","comment":"28 papges"},{"id":"http://arxiv.org/abs/2405.20477v1","updated":"2024-05-30T20:56:41Z","published":"2024-05-30T20:56:41Z","title":"Automated Focused Feedback Generation for Scientific Writing Assistance","summary":" Scientific writing is a challenging task, particularly for novice researchers\nwho often rely on feedback from experienced peers. Recent work has primarily\nfocused on improving surface form and style rather than manuscript content. In\nthis paper, we propose a novel task: automated focused feedback generation for\nscientific writing assistance. We present SWIF$^{2}$T: a Scientific WrIting\nFocused Feedback Tool. It is designed to generate specific, actionable and\ncoherent comments, which identify weaknesses in a scientific paper and/or\npropose revisions to it. Our approach consists of four components - planner,\ninvestigator, reviewer and controller - leveraging multiple Large Language\nModels (LLMs) to implement them. We compile a dataset of 300 peer reviews\nciting weaknesses in scientific papers and conduct human evaluation. The\nresults demonstrate the superiority in specificity, reading comprehension, and\noverall helpfulness of SWIF$^{2}$T's feedback compared to other approaches. In\nour analysis, we also identified cases where automatically generated reviews\nwere judged better than human ones, suggesting opportunities for integration of\nAI-generated feedback in scientific writing.\n","authors":["Eric Chamoun","Michael Schlichktrull","Andreas Vlachos"],"pdf_url":"https://arxiv.org/pdf/2405.20477v1.pdf","comment":"Accepted to ACL 2024 (Findings)"},{"id":"http://arxiv.org/abs/2405.20468v1","updated":"2024-05-30T20:34:37Z","published":"2024-05-30T20:34:37Z","title":"Extending the Massive Text Embedding Benchmark to French","summary":" In recent years, numerous embedding models have been made available and\nwidely used for various NLP tasks. Choosing a model that performs well for\nseveral tasks in English has been largely simplified by the Massive Text\nEmbedding Benchmark (MTEB), but extensions to other languages remain\nchallenging. This is why we expand MTEB to propose the first massive benchmark\nof sentence embeddings for French. Not only we gather 22 existing datasets in\nan easy-to-use interface, but we also create three new French datasets for a\nglobal evaluation over 8 different tasks. We perform a large scale comparison\nwith 46 carefully selected embedding models, conduct comprehensive statistical\ntests, and analyze the correlation between model performance and many of their\ncharacteristics. We find out that even if no model is the best on all tasks,\nlarge multilingual models pre-trained on sentence similarity perform\nparticularly well. Our work comes with open-source code, new datasets and a\npublic leaderboard.\n","authors":["Mathieu Ciancone","Imene Kerboua","Marion Schaeffer","Wissam Siblini"],"pdf_url":"https://arxiv.org/pdf/2405.20468v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11052v3","updated":"2024-05-30T20:28:08Z","published":"2024-01-19T23:00:31Z","title":"Mining experimental data from Materials Science literature with Large\n Language Models: an evaluation study","summary":" This study is dedicated to assessing the capabilities of large language\nmodels (LLMs) such as GPT-3.5-Turbo, GPT-4, and GPT-4-Turbo in extracting\nstructured information from scientific documents in materials science. To this\nend, we primarily focus on two critical tasks of information extraction: (i) a\nnamed entity recognition (NER) of studied materials and physical properties and\n(ii) a relation extraction (RE) between these entities. Due to the evident lack\nof datasets within Materials Informatics (MI), we evaluated using SuperMat,\nbased on superconductor research, and MeasEval, a generic measurement\nevaluation corpus. The performance of LLMs in executing these tasks is\nbenchmarked against traditional models based on the BERT architecture and\nrule-based approaches (baseline). We introduce a novel methodology for the\ncomparative analysis of intricate material expressions, emphasising the\nstandardisation of chemical formulas to tackle the complexities inherent in\nmaterials science information assessment. For NER, LLMs fail to outperform the\nbaseline with zero-shot prompting and exhibit only limited improvement with\nfew-shot prompting. However, a GPT-3.5-Turbo fine-tuned with the appropriate\nstrategy for RE outperforms all models, including the baseline. Without any\nfine-tuning, GPT-4 and GPT-4-Turbo display remarkable reasoning and\nrelationship extraction capabilities after being provided with merely a couple\nof examples, surpassing the baseline. Overall, the results suggest that\nalthough LLMs demonstrate relevant reasoning skills in connecting concepts,\nspecialised models are currently a better choice for tasks requiring extracting\ncomplex domain-specific entities like materials. These insights provide initial\nguidance applicable to other materials science sub-domains in future work.\n","authors":["Luca Foppiano","Guillaume Lambard","Toshiyuki Amagasa","Masashi Ishii"],"pdf_url":"https://arxiv.org/pdf/2401.11052v3.pdf","comment":"40 pages: 5 figures and 1 table in the body. 32 Tables in the\n Appendix / Supplementary materials"},{"id":"http://arxiv.org/abs/2405.20461v1","updated":"2024-05-30T20:16:27Z","published":"2024-05-30T20:16:27Z","title":"Scalable Detection of Salient Entities in News Articles","summary":" News articles typically mention numerous entities, a large fraction of which\nare tangential to the story. Detecting the salience of entities in articles is\nthus important to applications such as news search, analysis and summarization.\nIn this work, we explore new approaches for efficient and effective salient\nentity detection by fine-tuning pretrained transformer models with\nclassification heads that use entity tags or contextualized entity\nrepresentations directly. Experiments show that these straightforward\ntechniques dramatically outperform prior work across datasets with varying\nsizes and salience definitions. We also study knowledge distillation techniques\nto effectively reduce the computational cost of these models without affecting\ntheir accuracy. Finally, we conduct extensive analyses and ablation experiments\nto characterize the behavior of the proposed models.\n","authors":["Eliyar Asgarieh","Kapil Thadani","Neil O'Hare"],"pdf_url":"https://arxiv.org/pdf/2405.20461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.15713v2","updated":"2024-05-30T19:55:58Z","published":"2024-01-28T17:34:42Z","title":"Contrastive Learning and Mixture of Experts Enables Precise Vector\n Embeddings","summary":" The advancement of transformer neural networks has significantly elevated the\ncapabilities of sentence similarity models, but they struggle with highly\ndiscriminative tasks and produce sub-optimal representations of important\ndocuments like scientific literature. With the increased reliance on retrieval\naugmentation and search, representing diverse documents as concise and\ndescriptive vectors is crucial. This paper improves upon the vectors embeddings\nof scientific literature by assembling niche datasets using co-citations as a\nsimilarity metric, focusing on biomedical domains. We apply a novel Mixture of\nExperts (MoE) extension pipeline to pretrained BERT models, where every\nmulti-layer perceptron section is enlarged and copied into multiple distinct\nexperts. Our MoE variants perform well over $N$ scientific domains with $N$\ndedicated experts, whereas standard BERT models excel in only one domain.\nNotably, extending just a single transformer block to MoE captures 85% of the\nbenefit seen from full MoE extension at every layer. This holds promise for\nversatile and efficient One-Size-Fits-All transformer networks for numerically\nrepresenting diverse inputs. Our methodology marks significant advancements in\nrepresenting scientific text and holds promise for enhancing vector database\nsearch and compilation.\n","authors":["Logan Hallee","Rohan Kapur","Arjun Patel","Jason P. Gleghorn","Bohdan Khomtchouk"],"pdf_url":"https://arxiv.org/pdf/2401.15713v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.09723v3","updated":"2024-05-30T19:40:21Z","published":"2024-02-15T05:31:13Z","title":"Efficient Prompt Optimization Through the Lens of Best Arm\n Identification","summary":" The remarkable instruction-following capability of large language models\n(LLMs) has sparked a growing interest in automatically finding good prompts,\ni.e., prompt optimization. Most existing works follow the scheme of selecting\nfrom a pre-generated pool of candidate prompts. However, these designs mainly\nfocus on the generation strategy, while limited attention has been paid to the\nselection method. Especially, the cost incurred during the selection (e.g.,\naccessing LLM and evaluating the responses) is rarely explicitly considered. To\novercome this limitation, this work provides a principled framework, TRIPLE, to\nefficiently perform prompt selection under an explicit budget constraint.\nTRIPLE is built on a novel connection established between prompt optimization\nand fixed-budget best arm identification (BAI-FB) in multi-armed bandits (MAB);\nthus, it is capable of leveraging the rich toolbox from BAI-FB systematically\nand also incorporating unique characteristics of prompt optimization. Extensive\nexperiments on multiple well-adopted tasks using various LLMs demonstrate the\nremarkable performance improvement of TRIPLE over baselines while satisfying\nthe limited budget constraints. As an extension, variants of TRIPLE are\nproposed to efficiently select examples for few-shot prompts, also achieving\nsuperior empirical performance.\n","authors":["Chengshuai Shi","Kun Yang","Zihan Chen","Jundong Li","Jing Yang","Cong Shen"],"pdf_url":"https://arxiv.org/pdf/2402.09723v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15793v2","updated":"2024-05-30T19:09:01Z","published":"2024-05-06T17:41:33Z","title":"SWE-agent: Agent-Computer Interfaces Enable Automated Software\n Engineering","summary":" Language model (LM) agents are increasingly being used to automate\ncomplicated tasks in digital environments. Just as humans benefit from powerful\nsoftware applications, such as integrated development environments, for complex\ntasks like software engineering, we posit that LM agents represent a new\ncategory of end users with their own needs and abilities, and would benefit\nfrom specially-built interfaces to the software they use. We investigate how\ninterface design affects the performance of language model agents. As a result\nof this exploration, we introduce SWE-agent: a system that facilitates LM\nagents to autonomously use computers to solve software engineering tasks.\nSWE-agent's custom agent-computer interface (ACI) significantly enhances an\nagent's ability to create and edit code files, navigate entire repositories,\nand execute tests and other programs. We evaluate SWE-agent on SWE-bench and\nHumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate\nof 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art\nachieved with non-interactive LMs. Finally, we provide insight on how the\ndesign of the ACI can impact agents' behavior and performance.\n","authors":["John Yang","Carlos E. Jimenez","Alexander Wettig","Kilian Lieret","Shunyu Yao","Karthik Narasimhan","Ofir Press"],"pdf_url":"https://arxiv.org/pdf/2405.15793v2.pdf","comment":"Code, data, and demo available at https://swe-agent.com"},{"id":"http://arxiv.org/abs/2405.20419v1","updated":"2024-05-30T18:53:53Z","published":"2024-05-30T18:53:53Z","title":"Enhancing Antibiotic Stewardship using a Natural Language Approach for\n Better Feature Representation","summary":" The rapid emergence of antibiotic-resistant bacteria is recognized as a\nglobal healthcare crisis, undermining the efficacy of life-saving antibiotics.\nThis crisis is driven by the improper and overuse of antibiotics, which\nescalates bacterial resistance. In response, this study explores the use of\nclinical decision support systems, enhanced through the integration of\nelectronic health records (EHRs), to improve antibiotic stewardship. However,\nEHR systems present numerous data-level challenges, complicating the effective\nsynthesis and utilization of data. In this work, we transform EHR data into a\nserialized textual representation and employ pretrained foundation models to\ndemonstrate how this enhanced feature representation can aid in antibiotic\nsusceptibility predictions. Our results suggest that this text representation,\ncombined with foundation models, provides a valuable tool to increase\ninterpretability and support antibiotic stewardship efforts.\n","authors":["Simon A. Lee","Trevor Brokowski","Jeffrey N. Chiang"],"pdf_url":"https://arxiv.org/pdf/2405.20419v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13350v2","updated":"2024-05-30T18:42:45Z","published":"2024-05-22T05:12:35Z","title":"Efficacy of ByT5 in Multilingual Translation of Biblical Texts for\n Underrepresented Languages","summary":" This study presents the development and evaluation of a ByT5-based\nmultilingual translation model tailored for translating the Bible into\nunderrepresented languages. Utilizing the comprehensive Johns Hopkins\nUniversity Bible Corpus, we trained the model to capture the intricate nuances\nof character-based and morphologically rich languages. Our results, measured by\nthe BLEU score and supplemented with sample translations, suggest the model can\nimprove accessibility to sacred texts. It effectively handles the distinctive\nbiblical lexicon and structure, thus bridging the linguistic divide. The study\nalso discusses the model's limitations and suggests pathways for future\nenhancements, focusing on expanding access to sacred literature across\nlinguistic boundaries.\n","authors":["Corinne Aars","Lauren Adams","Xiaokan Tian","Zhaoyu Wang","Colton Wismer","Jason Wu","Pablo Rivas","Korn Sooksatra","Matthew Fendt"],"pdf_url":"https://arxiv.org/pdf/2405.13350v2.pdf","comment":"LXAI Workshop at the 2024 Annual Conference of the North American\n Chapter of the Association for Computational Linguistics (NAACL 2024)"},{"id":"http://arxiv.org/abs/2405.20413v1","updated":"2024-05-30T18:38:36Z","published":"2024-05-30T18:38:36Z","title":"Jailbreaking Large Language Models Against Moderation Guardrails via\n Cipher Characters","summary":" Large Language Models (LLMs) are typically harmless but remain vulnerable to\ncarefully crafted prompts known as ``jailbreaks'', which can bypass protective\nmeasures and induce harmful behavior. Recent advancements in LLMs have\nincorporated moderation guardrails that can filter outputs, which trigger\nprocessing errors for certain malicious questions. Existing red-teaming\nbenchmarks often neglect to include questions that trigger moderation\nguardrails, making it difficult to evaluate jailbreak effectiveness. To address\nthis issue, we introduce JAMBench, a harmful behavior benchmark designed to\ntrigger and evaluate moderation guardrails. JAMBench involves 160 manually\ncrafted instructions covering four major risk categories at multiple severity\nlevels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against\nModeration), designed to attack moderation guardrails using jailbreak prefixes\nto bypass input-level filters and a fine-tuned shadow model functionally\nequivalent to the guardrail model to generate cipher characters to bypass\noutput-level filters. Our extensive experiments on four LLMs demonstrate that\nJAM achieves higher jailbreak success ($\\sim$ $\\times$ 19.88) and lower\nfiltered-out rates ($\\sim$ $\\times$ 1/6) than baselines.\n","authors":["Haibo Jin","Andy Zhou","Joe D. Menke","Haohan Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20413v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2405.20410v1","updated":"2024-05-30T18:28:31Z","published":"2024-05-30T18:28:31Z","title":"SeamlessExpressiveLM: Speech Language Model for Expressive\n Speech-to-Speech Translation with Chain-of-Thought","summary":" Expressive speech-to-speech translation (S2ST) is a key research topic in\nseamless communication, which focuses on the preservation of semantics and\nspeaker vocal style in translated speech. Early works synthesized speaker style\naligned speech in order to directly learn the mapping from speech to target\nspeech spectrogram. Without reliance on style aligned data, recent studies\nleverage the advances of language modeling (LM) and build cascaded LMs on\nsemantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single\nspeech language model for expressive S2ST. We decompose the complex\nsource-to-target speech mapping into intermediate generation steps with\nchain-of-thought prompting. The model is first guided to translate target\nsemantic content and then transfer the speaker style to multi-stream acoustic\nunits. Evaluated on Spanish-to-English and Hungarian-to-English translations,\nSeamlessExpressiveLM outperforms cascaded LMs in both semantic quality and\nstyle transfer, meanwhile achieving better parameter efficiency.\n","authors":["Hongyu Gong","Bandhav Veluri"],"pdf_url":"https://arxiv.org/pdf/2405.20410v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20404v1","updated":"2024-05-30T18:16:41Z","published":"2024-05-30T18:16:41Z","title":"XPrompt:Explaining Large Language Model's Generation via Joint Prompt\n Attribution","summary":" Large Language Models (LLMs) have demonstrated impressive performances in\ncomplex text generation tasks. However, the contribution of the input prompt to\nthe generated content still remains obscure to humans, underscoring the\nnecessity of elucidating and explaining the causality between input and output\npairs. Existing works for providing prompt-specific explanation often confine\nmodel output to be classification or next-word prediction. Few initial attempts\naiming to explain the entire language generation often treat input prompt texts\nindependently, ignoring their combinatorial effects on the follow-up\ngeneration. In this study, we introduce a counterfactual explanation framework\nbased on joint prompt attribution, XPrompt, which aims to explain how a few\nprompt texts collaboratively influences the LLM's complete generation.\nParticularly, we formulate the task of prompt attribution for generation\ninterpretation as a combinatorial optimization problem, and introduce a\nprobabilistic algorithm to search for the casual input combination in the\ndiscrete space. We define and utilize multiple metrics to evaluate the produced\nexplanations, demonstrating both faithfulness and efficiency of our framework.\n","authors":["Yurui Chang","Bochuan Cao","Yujia Wang","Jinghui Chen","Lu Lin"],"pdf_url":"https://arxiv.org/pdf/2405.20404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20362v1","updated":"2024-05-30T17:56:05Z","published":"2024-05-30T17:56:05Z","title":"Hallucination-Free? Assessing the Reliability of Leading AI Legal\n Research Tools","summary":" Legal practice has witnessed a sharp rise in products incorporating\nartificial intelligence (AI). Such tools are designed to assist with a wide\nrange of core legal tasks, from search and summarization of caselaw to document\ndrafting. But the large language models used in these tools are prone to\n\"hallucinate,\" or make up false information, making their use risky in\nhigh-stakes domains. Recently, certain legal research providers have touted\nmethods such as retrieval-augmented generation (RAG) as \"eliminating\"\n(Casetext, 2023) or \"avoid[ing]\" hallucinations (Thomson Reuters, 2023), or\nguaranteeing \"hallucination-free\" legal citations (LexisNexis, 2023). Because\nof the closed nature of these systems, systematically assessing these claims is\nchallenging. In this article, we design and report on the first preregistered\nempirical evaluation of AI-driven legal research tools. We demonstrate that the\nproviders' claims are overstated. While hallucinations are reduced relative to\ngeneral-purpose chatbots (GPT-4), we find that the AI research tools made by\nLexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and\nAsk Practical Law AI) each hallucinate between 17% and 33% of the time. We also\ndocument substantial differences between systems in responsiveness and\naccuracy. Our article makes four key contributions. It is the first to assess\nand report the performance of RAG-based proprietary legal AI tools. Second, it\nintroduces a comprehensive, preregistered dataset for identifying and\nunderstanding vulnerabilities in these systems. Third, it proposes a clear\ntypology for differentiating between hallucinations and accurate legal\nresponses. Last, it provides evidence to inform the responsibilities of legal\nprofessionals in supervising and verifying AI outputs, which remains a central\nopen question for the responsible integration of AI into law.\n","authors":["Varun Magesh","Faiz Surani","Matthew Dahl","Mirac Suzgun","Christopher D. Manning","Daniel E. Ho"],"pdf_url":"https://arxiv.org/pdf/2405.20362v1.pdf","comment":"Our dataset, tool outputs, and labels will be made available upon\n publication. This version of the manuscript (May 30, 2024) is updated to\n reflect an evaluation of Westlaw's AI-Assisted Research"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2405.20343v1","updated":"2024-05-30T17:59:54Z","published":"2024-05-30T17:59:54Z","title":"Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single\n Image","summary":" In this work, we introduce Unique3D, a novel image-to-3D framework for\nefficiently generating high-quality 3D meshes from single-view images,\nfeaturing state-of-the-art generation fidelity and strong generalizability.\nPrevious methods based on Score Distillation Sampling (SDS) can produce\ndiversified 3D results by distilling 3D knowledge from large 2D diffusion\nmodels, but they usually suffer from long per-case optimization time with\ninconsistent issues. Recent works address the problem and generate better 3D\nresults either by finetuning a multi-view diffusion model or training a fast\nfeed-forward model. However, they still lack intricate textures and complex\ngeometries due to inconsistency and limited generated resolution. To\nsimultaneously achieve high fidelity, consistency, and efficiency in single\nimage-to-3D, we propose a novel framework Unique3D that includes a multi-view\ndiffusion model with a corresponding normal diffusion model to generate\nmulti-view images with their normal maps, a multi-level upscale process to\nprogressively improve the resolution of generated orthographic multi-views, as\nwell as an instant and consistent mesh reconstruction algorithm called ISOMER,\nwhich fully integrates the color and geometric priors into mesh results.\nExtensive experiments demonstrate that our Unique3D significantly outperforms\nother image-to-3D baselines in terms of geometric and textural details.\n","authors":["Kailu Wu","Fangfu Liu","Zhihan Cai","Runjie Yan","Hanyang Wang","Yating Hu","Yueqi Duan","Kaisheng Ma"],"pdf_url":"https://arxiv.org/pdf/2405.20343v1.pdf","comment":"Project page: https://wukailu.github.io/Unique3D"},{"id":"http://arxiv.org/abs/2405.20340v1","updated":"2024-05-30T17:59:50Z","published":"2024-05-30T17:59:50Z","title":"MotionLLM: Understanding Human Behaviors from Human Motions and Videos","summary":" This study delves into the realm of multi-modality (i.e., video and motion\nmodalities) human behavior understanding by leveraging the powerful\ncapabilities of Large Language Models (LLMs). Diverging from recent LLMs\ndesigned for video-only or motion-only understanding, we argue that\nunderstanding human behavior necessitates joint modeling from both videos and\nmotion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics\nand semantics effectively. In light of this, we present MotionLLM, a\nstraightforward yet effective framework for human motion understanding,\ncaptioning, and reasoning. Specifically, MotionLLM adopts a unified\nvideo-motion training strategy that leverages the complementary advantages of\nexisting coarse video-text data and fine-grained motion-text data to glean rich\nspatial-temporal insights. Furthermore, we collect a substantial dataset,\nMoVid, comprising diverse videos, motions, captions, and instructions.\nAdditionally, we propose the MoVid-Bench, with carefully manual annotations,\nfor better evaluation of human behavior understanding on video and motion.\nExtensive experiments show the superiority of MotionLLM in the caption,\nspatial-temporal comprehension, and reasoning ability.\n","authors":["Ling-Hao Chen","Shunlin Lu","Ailing Zeng","Hao Zhang","Benyou Wang","Ruimao Zhang","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20340v1.pdf","comment":"MotionLLM version 1.0, project page see https://lhchen.top/MotionLLM"},{"id":"http://arxiv.org/abs/2405.20339v1","updated":"2024-05-30T17:59:47Z","published":"2024-05-30T17:59:47Z","title":"Visual Perception by Large Language Model's Weights","summary":" Existing Multimodal Large Language Models (MLLMs) follow the paradigm that\nperceives visual information by aligning visual features with the input space\nof Large Language Models (LLMs), and concatenating visual tokens with text\ntokens to form a unified sequence input for LLMs. These methods demonstrate\npromising results on various vision-language tasks but are limited by the high\ncomputational effort due to the extended input sequence resulting from the\ninvolvement of visual tokens. In this paper, instead of input space alignment,\nwe propose a novel parameter space alignment paradigm that represents visual\ninformation as model weights. For each input image, we use a vision encoder to\nextract visual features, convert features into perceptual weights, and merge\nthe perceptual weights with LLM's weights. In this way, the input of LLM does\nnot require visual tokens, which reduces the length of the input sequence and\ngreatly improves efficiency. Following this paradigm, we propose VLoRA with the\nperceptual weights generator. The perceptual weights generator is designed to\nconvert visual features to perceptual weights with low-rank property,\nexhibiting a form similar to LoRA. The experimental results show that our VLoRA\nachieves comparable performance on various benchmarks for MLLMs, while\nsignificantly reducing the computational costs for both training and inference.\nThe code and models will be made open-source.\n","authors":["Feipeng Ma","Hongwei Xue","Guangting Wang","Yizhou Zhou","Fengyun Rao","Shilin Yan","Yueyi Zhang","Siying Wu","Mike Zheng Shou","Xiaoyan Sun"],"pdf_url":"https://arxiv.org/pdf/2405.20339v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20337v1","updated":"2024-05-30T17:59:42Z","published":"2024-05-30T17:59:42Z","title":"OccSora: 4D Occupancy Generation Models as World Simulators for\n Autonomous Driving","summary":" Understanding the evolution of 3D scenes is important for effective\nautonomous driving. While conventional methods mode scene development with the\nmotion of individual instances, world models emerge as a generative framework\nto describe the general scene dynamics. However, most existing methods adopt an\nautoregressive framework to perform next-token prediction, which suffer from\ninefficiency in modeling long-term temporal evolutions. To address this, we\npropose a diffusion-based 4D occupancy generation model, OccSora, to simulate\nthe development of the 3D world for autonomous driving. We employ a 4D scene\ntokenizer to obtain compact discrete spatial-temporal representations for 4D\noccupancy input and achieve high-quality reconstruction for long-sequence\noccupancy videos. We then learn a diffusion transformer on the spatial-temporal\nrepresentations and generate 4D occupancy conditioned on a trajectory prompt.\nWe conduct extensive experiments on the widely used nuScenes dataset with Occ3D\noccupancy annotations. OccSora can generate 16s-videos with authentic 3D layout\nand temporal consistency, demonstrating its ability to understand the spatial\nand temporal distributions of driving scenes. With trajectory-aware 4D\ngeneration, OccSora has the potential to serve as a world simulator for the\ndecision-making of autonomous driving. Code is available at:\nhttps://github.com/wzzheng/OccSora.\n","authors":["Lening Wang","Wenzhao Zheng","Yilong Ren","Han Jiang","Zhiyong Cui","Haiyang Yu","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2405.20337v1.pdf","comment":"Code is available at: https://github.com/wzzheng/OccSora"},{"id":"http://arxiv.org/abs/2405.20336v1","updated":"2024-05-30T17:59:39Z","published":"2024-05-30T17:59:39Z","title":"RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text","summary":" In this work, we introduce a challenging task for simultaneously generating\n3D holistic body motions and singing vocals directly from textual lyrics\ninputs, advancing beyond existing works that typically address these two\nmodalities in isolation. To facilitate this, we first collect the RapVerse\ndataset, a large dataset containing synchronous rapping vocals, lyrics, and\nhigh-quality 3D holistic body meshes. With the RapVerse dataset, we investigate\nthe extent to which scaling autoregressive multimodal transformers across\nlanguage, audio, and motion can enhance the coherent and realistic generation\nof vocals and whole-body human motions. For modality unification, a\nvector-quantized variational autoencoder is employed to encode whole-body\nmotion sequences into discrete motion tokens, while a vocal-to-unit model is\nleveraged to obtain quantized audio tokens preserving content, prosodic\ninformation, and singer identity. By jointly performing transformer modeling on\nthese three modalities in a unified way, our framework ensures a seamless and\nrealistic blend of vocals and human motions. Extensive experiments demonstrate\nthat our unified generation framework not only produces coherent and realistic\nsinging vocals alongside human motions directly from textual inputs but also\nrivals the performance of specialized single-modality generation systems,\nestablishing new benchmarks for joint vocal-motion generation. The project page\nis available for research purposes at https://vis-www.cs.umass.edu/RapVerse.\n","authors":["Jiaben Chen","Xin Yan","Yihang Chen","Siyuan Cen","Qinwei Ma","Haoyu Zhen","Kaizhi Qian","Lie Lu","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2405.20336v1.pdf","comment":"Project website: https://vis-www.cs.umass.edu/RapVerse"},{"id":"http://arxiv.org/abs/2405.20334v1","updated":"2024-05-30T17:59:24Z","published":"2024-05-30T17:59:24Z","title":"VividDream: Generating 3D Scene with Ambient Dynamics","summary":" We introduce VividDream, a method for generating explorable 4D scenes with\nambient dynamics from a single input image or text prompt. VividDream first\nexpands an input image into a static 3D point cloud through iterative\ninpainting and geometry merging. An ensemble of animated videos is then\ngenerated using video diffusion models with quality refinement techniques and\nconditioned on renderings of the static 3D scene from the sampled camera\ntrajectories. We then optimize a canonical 4D scene representation using an\nanimated video ensemble, with per-video motion embeddings and visibility masks\nto mitigate inconsistencies. The resulting 4D scene enables free-view\nexploration of a 3D scene with plausible ambient scene dynamics. Experiments\ndemonstrate that VividDream can provide human viewers with compelling 4D\nexperiences generated based on diverse real images and text prompts.\n","authors":["Yao-Chih Lee","Yi-Ting Chen","Andrew Wang","Ting-Hsuan Liao","Brandon Y. Feng","Jia-Bin Huang"],"pdf_url":"https://arxiv.org/pdf/2405.20334v1.pdf","comment":"Project page: https://vivid-dream-4d.github.io"},{"id":"http://arxiv.org/abs/2405.20333v1","updated":"2024-05-30T17:59:10Z","published":"2024-05-30T17:59:10Z","title":"SurgiTrack: Fine-Grained Multi-Class Multi-Tool Tracking in Surgical\n Videos","summary":" Accurate tool tracking is essential for the success of computer-assisted\nintervention. Previous efforts often modeled tool trajectories rigidly,\noverlooking the dynamic nature of surgical procedures, especially tracking\nscenarios like out-of-body and out-of-camera views. Addressing this limitation,\nthe new CholecTrack20 dataset provides detailed labels that account for\nmultiple tool trajectories in three perspectives: (1) intraoperative, (2)\nintracorporeal, and (3) visibility, representing the different types of\ntemporal duration of tool tracks. These fine-grained labels enhance tracking\nflexibility but also increase the task complexity. Re-identifying tools after\nocclusion or re-insertion into the body remains challenging due to high visual\nsimilarity, especially among tools of the same category. This work recognizes\nthe critical role of the tool operators in distinguishing tool track instances,\nespecially those belonging to the same tool category. The operators'\ninformation are however not explicitly captured in surgical videos. We\ntherefore propose SurgiTrack, a novel deep learning method that leverages\nYOLOv7 for precise tool detection and employs an attention mechanism to model\nthe originating direction of the tools, as a proxy to their operators, for tool\nre-identification. To handle diverse tool trajectory perspectives, SurgiTrack\nemploys a harmonizing bipartite matching graph, minimizing conflicts and\nensuring accurate tool identity association. Experimental results on\nCholecTrack20 demonstrate SurgiTrack's effectiveness, outperforming baselines\nand state-of-the-art methods with real-time inference capability. This work\nsets a new standard in surgical tool tracking, providing dynamic trajectories\nfor more adaptable and precise assistance in minimally invasive surgeries.\n","authors":["Chinedu Innocent Nwoye","Nicolas Padoy"],"pdf_url":"https://arxiv.org/pdf/2405.20333v1.pdf","comment":"15 pages, 7 figures, 9 tables, 1 video. Supplementary video available\n at: https://vimeo.com/951853260"},{"id":"http://arxiv.org/abs/2405.20330v1","updated":"2024-05-30T17:59:02Z","published":"2024-05-30T17:59:02Z","title":"4DHands: Reconstructing Interactive Hands in 4D with Transformers","summary":" In this paper, we introduce 4DHands, a robust approach to recovering\ninteractive hand meshes and their relative movement from monocular inputs. Our\napproach addresses two major limitations of previous methods: lacking a unified\nsolution for handling various hand image inputs and neglecting the positional\nrelationship of two hands within images. To overcome these challenges, we\ndevelop a transformer-based architecture with novel tokenization and feature\nfusion strategies. Specifically, we propose a Relation-aware Two-Hand\nTokenization (RAT) method to embed positional relation information into the\nhand tokens. In this way, our network can handle both single-hand and two-hand\ninputs and explicitly leverage relative hand positions, facilitating the\nreconstruction of intricate hand interactions in real-world scenarios. As such\ntokenization indicates the relative relationship of two hands, it also supports\nmore effective feature fusion. To this end, we further develop a\nSpatio-temporal Interaction Reasoning (SIR) module to fuse hand tokens in 4D\nwith attention and decode them into 3D hand meshes and relative temporal\nmovements. The efficacy of our approach is validated on several benchmark\ndatasets. The results on in-the-wild videos and real-world scenarios\ndemonstrate the superior performances of our approach for interactive hand\nreconstruction. More video results can be found on the project page:\nhttps://4dhands.github.io.\n","authors":["Dixuan Lin","Yuxiang Zhang","Mengcheng Li","Yebin Liu","Wei Jing","Qi Yan","Qianying Wang","Hongwen Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20330v1.pdf","comment":"More demo videos can be seen at our project page:\n https://4dhands.github.io"},{"id":"http://arxiv.org/abs/2405.20327v1","updated":"2024-05-30T17:58:00Z","published":"2024-05-30T17:58:00Z","title":"GECO: Generative Image-to-3D within a SECOnd","summary":" 3D generation has seen remarkable progress in recent years. Existing\ntechniques, such as score distillation methods, produce notable results but\nrequire extensive per-scene optimization, impacting time efficiency.\nAlternatively, reconstruction-based approaches prioritize efficiency but\ncompromise quality due to their limited handling of uncertainty. We introduce\nGECO, a novel method for high-quality 3D generative modeling that operates\nwithin a second. Our approach addresses the prevalent issues of uncertainty and\ninefficiency in current methods through a two-stage approach. In the initial\nstage, we train a single-step multi-view generative model with score\ndistillation. Then, a second-stage distillation is applied to address the\nchallenge of view inconsistency from the multi-view prediction. This two-stage\nprocess ensures a balanced approach to 3D generation, optimizing both quality\nand efficiency. Our comprehensive experiments demonstrate that GECO achieves\nhigh-quality image-to-3D generation with an unprecedented level of efficiency.\n","authors":["Chen Wang","Jiatao Gu","Xiaoxiao Long","Yuan Liu","Lingjie Liu"],"pdf_url":"https://arxiv.org/pdf/2405.20327v1.pdf","comment":"Project Page: https://cwchenwang.github.io/geco"},{"id":"http://arxiv.org/abs/2405.20325v1","updated":"2024-05-30T17:57:30Z","published":"2024-05-30T17:57:30Z","title":"MotionFollower: Editing Video Motion via Lightweight Score-Guided\n Diffusion","summary":" Despite impressive advancements in diffusion-based video editing models in\naltering video attributes, there has been limited exploration into modifying\nmotion information while preserving the original protagonist's appearance and\nbackground. In this paper, we propose MotionFollower, a lightweight\nscore-guided diffusion model for video motion editing. To introduce conditional\ncontrols to the denoising process, MotionFollower leverages two of our proposed\nlightweight signal controllers, one for poses and the other for appearances,\nboth of which consist of convolution blocks without involving heavy attention\ncalculations. Further, we design a score guidance principle based on a\ntwo-branch architecture, including the reconstruction and editing branches,\nwhich significantly enhance the modeling capability of texture details and\ncomplicated backgrounds. Concretely, we enforce several consistency\nregularizers and losses during the score estimation. The resulting gradients\nthus inject appropriate guidance to the intermediate latents, forcing the model\nto preserve the original background details and protagonists' appearances\nwithout interfering with the motion modification. Experiments demonstrate the\ncompetitive motion editing ability of MotionFollower qualitatively and\nquantitatively. Compared with MotionEditor, the most advanced motion editing\nmodel, MotionFollower achieves an approximately 80% reduction in GPU memory\nwhile delivering superior motion editing performance and exclusively supporting\nlarge camera movements and actions.\n","authors":["Shuyuan Tu","Qi Dai","Zihao Zhang","Sicheng Xie","Zhi-Qi Cheng","Chong Luo","Xintong Han","Zuxuan Wu","Yu-Gang Jiang"],"pdf_url":"https://arxiv.org/pdf/2405.20325v1.pdf","comment":"23 pages, 18 figures. Project page at\n https://francis-rings.github.io/MotionFollower/"},{"id":"http://arxiv.org/abs/2405.20324v1","updated":"2024-05-30T17:57:26Z","published":"2024-05-30T17:57:26Z","title":"Don't drop your samples! Coherence-aware training benefits Conditional\n diffusion","summary":" Conditional diffusion models are powerful generative models that can leverage\nvarious types of conditional information, such as class labels, segmentation\nmasks, or text captions. However, in many real-world scenarios, conditional\ninformation may be noisy or unreliable due to human annotation errors or weak\nalignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a\nnovel method that integrates coherence in conditional information into\ndiffusion models, allowing them to learn from noisy annotations without\ndiscarding data. We assume that each data point has an associated coherence\nscore that reflects the quality of the conditional information. We then\ncondition the diffusion model on both the conditional information and the\ncoherence score. In this way, the model learns to ignore or discount the\nconditioning when the coherence is low. We show that CAD is theoretically sound\nand empirically effective on various conditional generation tasks. Moreover, we\nshow that leveraging coherence generates realistic and diverse samples that\nrespect conditional information better than models trained on cleaned datasets\nwhere samples with low coherence have been discarded.\n","authors":["Nicolas Dufour","Victor Besnier","Vicky Kalogeiton","David Picard"],"pdf_url":"https://arxiv.org/pdf/2405.20324v1.pdf","comment":"Accepted at CVPR 2024 as a Highlight. Project page:\n https://nicolas-dufour.github.io/cad.html"},{"id":"http://arxiv.org/abs/2405.20323v1","updated":"2024-05-30T17:57:08Z","published":"2024-05-30T17:57:08Z","title":"$\\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous\n Driving","summary":" Photorealistic 3D reconstruction of street scenes is a critical technique for\ndeveloping real-world simulators for autonomous driving. Despite the efficacy\nof Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting\n(3DGS) emerges as a promising direction due to its faster speed and more\nexplicit representation. However, most existing street 3DGS methods require\ntracked 3D vehicle bounding boxes to decompose the static and dynamic elements\nfor effective reconstruction, limiting their applications for in-the-wild\nscenarios. To facilitate efficient 3D scene reconstruction without costly\nannotations, we propose a self-supervised street Gaussian\n($\\textit{S}^3$Gaussian) method to decompose dynamic and static elements from\n4D consistency. We represent each scene with 3D Gaussians to preserve the\nexplicitness and further accompany them with a spatial-temporal field network\nto compactly model the 4D dynamics. We conduct extensive experiments on the\nchallenging Waymo-Open dataset to evaluate the effectiveness of our method. Our\n$\\textit{S}^3$Gaussian demonstrates the ability to decompose static and dynamic\nscenes and achieves the best performance without using 3D annotations. Code is\navailable at: https://github.com/nnanhuang/S3Gaussian/.\n","authors":["Nan Huang","Xiaobao Wei","Wenzhao Zheng","Pengju An","Ming Lu","Wei Zhan","Masayoshi Tomizuka","Kurt Keutzer","Shanghang Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20323v1.pdf","comment":"Code is available at: https://github.com/nnanhuang/S3Gaussian/"},{"id":"http://arxiv.org/abs/2405.20321v1","updated":"2024-05-30T17:56:54Z","published":"2024-05-30T17:56:54Z","title":"Vision-based Manipulation from Single Human Video with Open-World Object\n Graphs","summary":" We present an object-centric approach to empower robots to learn vision-based\nmanipulation skills from human videos. We investigate the problem of imitating\nrobot manipulation from a single human video in the open-world setting, where a\nrobot must learn to manipulate novel objects from one video demonstration. We\nintroduce ORION, an algorithm that tackles the problem by extracting an\nobject-centric manipulation plan from a single RGB-D video and deriving a\npolicy that conditions on the extracted plan. Our method enables the robot to\nlearn from videos captured by daily mobile devices such as an iPad and\ngeneralize the policies to deployment environments with varying visual\nbackgrounds, camera angles, spatial layouts, and novel object instances. We\nsystematically evaluate our method on both short-horizon and long-horizon\ntasks, demonstrating the efficacy of ORION in learning from a single human\nvideo in the open world. Videos can be found in the project website\nhttps://ut-austin-rpl.github.io/ORION-release.\n","authors":["Yifeng Zhu","Arisrei Lim","Peter Stone","Yuke Zhu"],"pdf_url":"https://arxiv.org/pdf/2405.20321v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20320v1","updated":"2024-05-30T17:56:04Z","published":"2024-05-30T17:56:04Z","title":"Improving the Training of Rectified Flows","summary":" Diffusion models have shown great promise for image and video generation, but\nsampling from state-of-the-art models requires expensive numerical integration\nof a generative ODE. One approach for tackling this problem is rectified flows,\nwhich iteratively learn smooth ODE paths that are less susceptible to\ntruncation error. However, rectified flows still require a relatively large\nnumber of function evaluations (NFEs). In this work, we propose improved\ntechniques for training rectified flows, allowing them to compete with\nknowledge distillation methods even in the low NFE setting. Our main insight is\nthat under realistic settings, a single iteration of the Reflow algorithm for\ntraining rectified flows is sufficient to learn nearly straight trajectories;\nhence, the current practice of using multiple Reflow iterations is unnecessary.\nWe thus propose techniques to improve one-round training of rectified flows,\nincluding a U-shaped timestep distribution and LPIPS-Huber premetric. With\nthese techniques, we improve the FID of the previous 2-rectified flow by up to\n72% in the 1 NFE setting on CIFAR-10. On ImageNet 64$\\times$64, our improved\nrectified flow outperforms the state-of-the-art distillation methods such as\nconsistency distillation and progressive distillation in both one-step and\ntwo-step settings and rivals the performance of improved consistency training\n(iCT) in FID. Code is available at https://github.com/sangyun884/rfpp.\n","authors":["Sangyun Lee","Zinan Lin","Giulia Fanti"],"pdf_url":"https://arxiv.org/pdf/2405.20320v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20319v1","updated":"2024-05-30T17:55:46Z","published":"2024-05-30T17:55:46Z","title":"ParSEL: Parameterized Shape Editing with Language","summary":" The ability to edit 3D assets from natural language presents a compelling\nparadigm to aid in the democratization of 3D content creation. However, while\nnatural language is often effective at communicating general intent, it is\npoorly suited for specifying precise manipulation. To address this gap, we\nintroduce ParSEL, a system that enables controllable editing of high-quality 3D\nassets from natural language. Given a segmented 3D mesh and an editing request,\nParSEL produces a parameterized editing program. Adjusting the program\nparameters allows users to explore shape variations with a precise control over\nthe magnitudes of edits. To infer editing programs which align with an input\nedit request, we leverage the abilities of large-language models (LLMs).\nHowever, while we find that LLMs excel at identifying initial edit operations,\nthey often fail to infer complete editing programs, and produce outputs that\nviolate shape semantics. To overcome this issue, we introduce Analytical Edit\nPropagation (AEP), an algorithm which extends a seed edit with additional\noperations until a complete editing program has been formed. Unlike prior\nmethods, AEP searches for analytical editing operations compatible with a range\nof possible user edits through the integration of computer algebra systems for\ngeometric analysis. Experimentally we demonstrate ParSEL's effectiveness in\nenabling controllable editing of 3D objects through natural language requests\nover alternative system designs.\n","authors":["Aditya Ganeshan","Ryan Y. Huang","Xianghao Xu","R. Kenny Jones","Daniel Ritchie"],"pdf_url":"https://arxiv.org/pdf/2405.20319v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20310v1","updated":"2024-05-30T17:52:52Z","published":"2024-05-30T17:52:52Z","title":"A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D\n Reconstruction","summary":" Learning 3D scene representation from a single-view image is a long-standing\nfundamental problem in computer vision, with the inherent ambiguity in\npredicting contents unseen from the input view. Built on the recently proposed\n3D Gaussian Splatting (3DGS), the Splatter Image method has made promising\nprogress on fast single-image novel view synthesis via learning a single 3D\nGaussian for each pixel based on the U-Net feature map of an input image.\nHowever, it has limited expressive power to represent occluded components that\nare not observable in the input view. To address this problem, this paper\npresents a Hierarchical Splatter Image method in which a pixel is worth more\nthan one 3D Gaussians. Specifically,\n each pixel is represented by a parent 3D Gaussian and a small number of child\n3D Gaussians. Parent 3D Gaussians are learned as done in the vanilla Splatter\nImage. Child 3D Gaussians are learned via a lightweight Multi-Layer Perceptron\n(MLP) which takes as input the projected image features of a parent 3D Gaussian\nand the embedding of a target camera view. Both parent and child 3D Gaussians\nare learned end-to-end in a stage-wise way. The joint condition of input image\nfeatures from eyes of the parent Gaussians and the target camera position\nfacilitates learning to allocate child Gaussians to ``see the unseen'',\nrecovering the occluded details that are often missed by parent Gaussians.\n In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D\ndatasets with state-of-the-art performance obtained, especially showing\npromising capabilities of reconstructing occluded contents in the input view.\n","authors":["Jianghao Shen","Tianfu Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20310v1.pdf","comment":"preprint, under review"},{"id":"http://arxiv.org/abs/2209.15210v5","updated":"2024-05-30T17:51:36Z","published":"2022-09-30T03:40:10Z","title":"Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation","summary":" Most existing methods for unsupervised domain adaptation (UDA) rely on a\nshared network to extract domain-invariant features. However, when facing\nmultiple source domains, optimizing such a network involves updating the\nparameters of the entire network, making it both computationally expensive and\nchallenging, particularly when coupled with min-max objectives. Inspired by\nrecent advances in prompt learning that adapts high-capacity models for\ndownstream tasks in a computationally economic way, we introduce Multi-Prompt\nAlignment (MPA), a simple yet efficient framework for multi-source UDA. Given a\nsource and target domain pair, MPA first trains an individual prompt to\nminimize the domain gap through a contrastive loss. Then, MPA denoises the\nlearned prompts through an auto-encoding process and aligns them by maximizing\nthe agreement of all the reconstructed prompts. Moreover, we show that the\nresulting subspace acquired from the auto-encoding process can easily\ngeneralize to a streamlined set of target domains, making our method more\nefficient for practical usage. Extensive experiments show that MPA achieves\nstate-of-the-art results on three popular datasets with an impressive average\naccuracy of 54.1% on DomainNet.\n","authors":["Haoran Chen","Xintong Han","Zuxuan Wu","Yu-Gang Jiang"],"pdf_url":"https://arxiv.org/pdf/2209.15210v5.pdf","comment":"NeurIPS 2023 camera-ready version"},{"id":"http://arxiv.org/abs/2405.20305v1","updated":"2024-05-30T17:50:08Z","published":"2024-05-30T17:50:08Z","title":"Can't make an Omelette without Breaking some Eggs: Plausible Action\n Anticipation using Large Video-Language Models","summary":" We introduce PlausiVL, a large video-language model for anticipating action\nsequences that are plausible in the real-world. While significant efforts have\nbeen made towards anticipating future actions, prior approaches do not take\ninto account the aspect of plausibility in an action sequence. To address this\nlimitation, we explore the generative capability of a large video-language\nmodel in our work and further, develop the understanding of plausibility in an\naction sequence by introducing two objective functions, a counterfactual-based\nplausible action sequence learning loss and a long-horizon action repetition\nloss. We utilize temporal logical constraints as well as verb-noun action pair\nlogical constraints to create implausible/counterfactual action sequences and\nuse them to train the model with plausible action sequence learning loss. This\nloss helps the model to differentiate between plausible and not plausible\naction sequences and also helps the model to learn implicit temporal cues\ncrucial for the task of action anticipation. The long-horizon action repetition\nloss puts a higher penalty on the actions that are more prone to repetition\nover a longer temporal window. With this penalization, the model is able to\ngenerate diverse, plausible action sequences. We evaluate our approach on two\nlarge-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the\ntask of action anticipation.\n","authors":["Himangi Mittal","Nakul Agarwal","Shao-Yuan Lo","Kwonjoon Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20305v1.pdf","comment":"CVPR 2024"},{"id":"http://arxiv.org/abs/2405.20299v1","updated":"2024-05-30T17:46:23Z","published":"2024-05-30T17:46:23Z","title":"Scaling White-Box Transformers for Vision","summary":" CRATE, a white-box transformer architecture designed to learn compressed and\nsparse representations, offers an intriguing alternative to standard vision\ntransformers (ViTs) due to its inherent mathematical interpretability. Despite\nextensive investigations into the scaling behaviors of language and vision\ntransformers, the scalability of CRATE remains an open question which this\npaper aims to address. Specifically, we propose CRATE-$\\alpha$, featuring\nstrategic yet minimal modifications to the sparse coding block in the CRATE\narchitecture design, and a light training recipe designed to improve the\nscalability of CRATE. Through extensive experiments, we demonstrate that\nCRATE-$\\alpha$ can effectively scale with larger model sizes and datasets. For\nexample, our CRATE-$\\alpha$-B substantially outperforms the prior best CRATE-B\nmodel accuracy on ImageNet classification by 3.7%, achieving an accuracy of\n83.2%. Meanwhile, when scaling further, our CRATE-$\\alpha$-L obtains an\nImageNet classification accuracy of 85.1%. More notably, these model\nperformance improvements are achieved while preserving, and potentially even\nenhancing the interpretability of learned CRATE models, as we demonstrate\nthrough showing that the learned token representations of increasingly larger\ntrained CRATE-$\\alpha$ models yield increasingly higher-quality unsupervised\nobject segmentation of images. The project page is\nhttps://rayjryang.github.io/CRATE-alpha/.\n","authors":["Jinrui Yang","Xianhang Li","Druv Pai","Yuyin Zhou","Yi Ma","Yaodong Yu","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2405.20299v1.pdf","comment":"project page: https://rayjryang.github.io/CRATE-alpha/"},{"id":"http://arxiv.org/abs/2403.01643v2","updated":"2024-05-30T17:46:22Z","published":"2024-03-03T23:40:35Z","title":"You Need to Pay Better Attention: Rethinking the Mathematics of\n Attention Mechanism","summary":" Scaled Dot Product Attention (SDPA) is the backbone of many modern\ndeep-learning models. It is so versatile that it has been used in natural\nlanguage, vision, and multi-modal domains with very little change compared to\nits original formulation. This paper discusses why the current formulation is\ninefficient by delving into the mathematical details of the attention\nmechanism. We propose three improvements to mitigate these inefficiencies,\nthereby, introducing three enhanced attention mechanisms: Optimised, Efficient,\nand Super Attention. Optimised and Efficient Attention have one and two matrix\nmultiplications fewer per head, respectively, and 25% and 50% fewer parameters,\nrespectively, than standard SDPA, but perform similarly to standard SDPA in\nboth vision and natural language tasks. They can be used in all applications\nwhere SDPA is used while offering smaller model sizes and faster training and\ninference without noticeable loss in performance. Super Attention introduces a\nnew linear transformation on the values, transforming them from the left. It\noutperforms standard SPDA on vision and natural language tasks by up to 17%\nwhile having one fewer matrix multiplication per head and 25% fewer parameters\nthan standard SDPA. Consequently, it is also faster than standard SDPA. Super\nAttention is ideal in applications where the attention layer's context length\nis fixed, such as Vision Transformers. In addition to providing mathematical\nreasoning, we evaluate the presented attention mechanisms on several datasets\nincluding MNIST, CIFAR100, ImageNet, IMDB Movie Reviews, and Amazon Reviews\ndatasets, as well as combined Europarl and Anki English-Spanish datasets for\nneural machine translation.\n","authors":["Mehran Hosseini","Peyman Hosseini"],"pdf_url":"https://arxiv.org/pdf/2403.01643v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20291v1","updated":"2024-05-30T17:41:32Z","published":"2024-05-30T17:41:32Z","title":"Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning\n Weight Changes and Backdoor Activeness","summary":" The security threat of backdoor attacks is a central concern for deep neural\nnetworks (DNNs). Recently, without poisoned data, unlearning models with clean\ndata and then learning a pruning mask have contributed to backdoor defense.\nAdditionally, vanilla fine-tuning with those clean data can help recover the\nlost clean accuracy. However, the behavior of clean unlearning is still\nunder-explored, and vanilla fine-tuning unintentionally induces back the\nbackdoor effect. In this work, we first investigate model unlearning from the\nperspective of weight changes and gradient norms, and find two interesting\nobservations in the backdoored model: 1) the weight changes between poison and\nclean unlearning are positively correlated, making it possible for us to\nidentify the backdoored-related neurons without using poisoned data; 2) the\nneurons of the backdoored model are more active (i.e., larger changes in\ngradient norm) than those in the clean model, suggesting the need to suppress\nthe gradient norm during fine-tuning. Then, we propose an effective two-stage\ndefense method. In the first stage, an efficient Neuron Weight Change\n(NWC)-based Backdoor Reinitialization is proposed based on observation 1). In\nthe second stage, based on observation 2), we design an Activeness-Aware\nFine-Tuning to replace the vanilla fine-tuning. Extensive experiments,\ninvolving eight backdoor attacks on three benchmark datasets, demonstrate the\nsuperior performance of our proposed method compared to recent state-of-the-art\nbackdoor defense approaches.\n","authors":["Weilin Lin","Li Liu","Shaokui Wei","Jianze Li","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2405.20291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20283v1","updated":"2024-05-30T17:35:49Z","published":"2024-05-30T17:35:49Z","title":"TetSphere Splatting: Representing High-Quality Geometry with Lagrangian\n Volumetric Meshes","summary":" We present TetSphere splatting, an explicit, Lagrangian representation for\nreconstructing 3D shapes with high-quality geometry. In contrast to\nconventional object reconstruction methods which predominantly use Eulerian\nrepresentations, including both neural implicit (e.g., NeRF, NeuS) and explicit\nrepresentations (e.g., DMTet), and often struggle with high computational\ndemands and suboptimal mesh quality, TetSphere splatting utilizes an underused\nbut highly effective geometric primitive -- tetrahedral meshes. This approach\ndirectly yields superior mesh quality without relying on neural networks or\npost-processing. It deforms multiple initial tetrahedral spheres to accurately\nreconstruct the 3D shape through a combination of differentiable rendering and\ngeometric energy optimization, resulting in significant computational\nefficiency. Serving as a robust and versatile geometry representation,\nTet-Sphere splatting seamlessly integrates into diverse applications, including\nsingle-view 3D reconstruction, image-/text-to-3D content generation.\nExperimental results demonstrate that TetSphere splatting outperforms existing\nrepresentations, delivering faster optimization speed, enhanced mesh quality,\nand reliable preservation of thin structures.\n","authors":["Minghao Guo","Bohan Wang","Kaiming He","Wojciech Matusik"],"pdf_url":"https://arxiv.org/pdf/2405.20283v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20282v1","updated":"2024-05-30T17:34:40Z","published":"2024-05-30T17:34:40Z","title":"SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified\n Flow","summary":" Semantic segmentation and semantic image synthesis are two representative\ntasks in visual perception and generation. While existing methods consider them\nas two distinct tasks, we propose a unified diffusion-based framework (SemFlow)\nand model them as a pair of reverse problems. Specifically, motivated by\nrectified flow theory, we train an ordinary differential equation (ODE) model\nto transport between the distributions of real images and semantic masks. As\nthe training object is symmetric, samples belonging to the two distributions,\nimages and semantic masks, can be effortlessly transferred reversibly. For\nsemantic segmentation, our approach solves the contradiction between the\nrandomness of diffusion outputs and the uniqueness of segmentation results. For\nimage synthesis, we propose a finite perturbation approach to enhance the\ndiversity of generated results without changing the semantic categories.\nExperiments show that our SemFlow achieves competitive results on semantic\nsegmentation and semantic image synthesis tasks. We hope this simple framework\nwill motivate people to rethink the unification of low-level and high-level\nvision. Project page: https://github.com/wang-chaoyang/SemFlow.\n","authors":["Chaoyang Wang","Xiangtai Li","Lu Qi","Henghui Ding","Yunhai Tong","Ming-Hsuan Yang"],"pdf_url":"https://arxiv.org/pdf/2405.20282v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20279v1","updated":"2024-05-30T17:33:10Z","published":"2024-05-30T17:33:10Z","title":"CV-VAE: A Compatible Video VAE for Latent Generative Video Models","summary":" Spatio-temporal compression of videos, utilizing networks such as Variational\nAutoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other\nvideo generative models. For instance, many LLM-like video models learn the\ndistribution of discrete tokens derived from 3D VAEs within the VQVAE\nframework, while most diffusion-based video models capture the distribution of\ncontinuous latent extracted by 2D VAEs without quantization. The temporal\ncompression is simply realized by uniform frame sampling which results in\nunsmooth motion between consecutive frames. Currently, there lacks of a\ncommonly used continuous video (3D) VAE for latent diffusion-based video models\nin the research community. Moreover, since current diffusion-based approaches\nare often implemented using pre-trained text-to-image (T2I) models, directly\ntraining a video VAE without considering the compatibility with existing T2I\nmodels will result in a latent space gap between them, which will take huge\ncomputational resources for training to bridge the gap even with the T2I models\nas initialization. To address this issue, we propose a method for training a\nvideo VAE of latent video models, namely CV-VAE, whose latent space is\ncompatible with that of a given image VAE, e.g., image VAE of Stable Diffusion\n(SD). The compatibility is achieved by the proposed novel latent space\nregularization, which involves formulating a regularization loss using the\nimage VAE. Benefiting from the latent space compatibility, video models can be\ntrained seamlessly from pre-trained T2I or video models in a truly\nspatio-temporally compressed latent space, rather than simply sampling video\nframes at equal intervals. With our CV-VAE, existing video models can generate\nfour times more frames with minimal finetuning. Extensive experiments are\nconducted to demonstrate the effectiveness of the proposed video VAE.\n","authors":["Sijie Zhao","Yong Zhang","Xiaodong Cun","Shaoshu Yang","Muyao Niu","Xiaoyu Li","Wenbo Hu","Ying Shan"],"pdf_url":"https://arxiv.org/pdf/2405.20279v1.pdf","comment":"Project Page: https://ailab-cvc.github.io/cvvae/index.html"},{"id":"http://arxiv.org/abs/2405.20271v1","updated":"2024-05-30T17:26:02Z","published":"2024-05-30T17:26:02Z","title":"ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane\n Reflections","summary":" Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt\nfoundation models to downstream task requirements while retaining their\ngeneralization ability. However, the amount of additionally introduced\nparameters and compute for successful adaptation and hyperparameter searches\ncan explode quickly, especially when deployed at scale to serve numerous\nindividual requests. To ensure effective, parameter-efficient, and\nhyperparameter-robust adaptation, we propose the ETHER transformation family,\nwhich performs Efficient fineTuning via HypErplane Reflections. By design,\nETHER transformations require a minimal number of parameters, are less likely\nto deteriorate model performance, and exhibit robustness to hyperparameter and\nlearning rate choices. In particular, we introduce ETHER and its relaxation\nETHER+, which match or outperform existing PEFT methods with significantly\nfewer parameters ($\\sim$$10$-$100$ times lower than LoRA or OFT) across\nmultiple image synthesis and natural language tasks without exhaustive\nhyperparameter tuning. Finally, we investigate the recent emphasis on\nHyperspherical Energy retention for adaptation and raise questions on its\npractical utility. The code is available at https://github.com/mwbini/ether.\n","authors":["Massimo Bini","Karsten Roth","Zeynep Akata","Anna Khoreva"],"pdf_url":"https://arxiv.org/pdf/2405.20271v1.pdf","comment":"Accepted to ICML 2024. Code available at\n https://github.com/mwbini/ether"},{"id":"http://arxiv.org/abs/2405.20259v1","updated":"2024-05-30T17:09:05Z","published":"2024-05-30T17:09:05Z","title":"FaceMixup: Enhancing Facial Expression Recognition through Mixed Face\n Regularization","summary":" The proliferation of deep learning solutions and the scarcity of large\nannotated datasets pose significant challenges in real-world applications.\nVarious strategies have been explored to overcome this challenge, with data\naugmentation (DA) approaches emerging as prominent solutions. DA approaches\ninvolve generating additional examples by transforming existing labeled data,\nthereby enriching the dataset and helping deep learning models achieve improved\ngeneralization without succumbing to overfitting. In real applications, where\nsolutions based on deep learning are widely used, there is facial expression\nrecognition (FER), which plays an essential role in human communication,\nimproving a range of knowledge areas (e.g., medicine, security, and marketing).\nIn this paper, we propose a simple and comprehensive face data augmentation\napproach based on mixed face component regularization that outperforms the\nclassical DA approaches from the literature, including the MixAugment which is\na specific approach for the target task in two well-known FER datasets existing\nin the literature.\n","authors":["Fabio A. Faria","Mateus M. Souza","Raoni F. da S. Teixeira","Mauricio P. Segundo"],"pdf_url":"https://arxiv.org/pdf/2405.20259v1.pdf","comment":"29 pages, 9 figures, paper is under review on journal"},{"id":"http://arxiv.org/abs/2405.20247v1","updated":"2024-05-30T16:58:34Z","published":"2024-05-30T16:58:34Z","title":"KerasCV and KerasNLP: Vision and Language Power-Ups","summary":" We present the Keras domain packages KerasCV and KerasNLP, extensions of the\nKeras API for Computer Vision and Natural Language Processing workflows,\ncapable of running on either JAX, TensorFlow, or PyTorch. These domain packages\nare designed to enable fast experimentation, with a focus on ease-of-use and\nperformance. We adopt a modular, layered design: at the library's lowest level\nof abstraction, we provide building blocks for creating models and data\npreprocessing pipelines, and at the library's highest level of abstraction, we\nprovide pretrained ``task\" models for popular architectures such as Stable\nDiffusion, YOLOv8, GPT2, BERT, Mistral, CLIP, Gemma, T5, etc. Task models have\nbuilt-in preprocessing, pretrained weights, and can be fine-tuned on raw\ninputs. To enable efficient training, we support XLA compilation for all\nmodels, and run all preprocessing via a compiled graph of TensorFlow operations\nusing the tf.data API. The libraries are fully open-source (Apache 2.0 license)\nand available on GitHub.\n","authors":["Matthew Watson","Divyashree Shivakumar Sreepathihalli","Francois Chollet","Martin Gorner","Kiranbir Sodhia","Ramesh Sampath","Tirth Patel","Haifeng Jin","Neel Kovelamudi","Gabriel Rasskin","Samaneh Saadat","Luke Wood","Chen Qian","Jonathan Bischof","Ian Stenbit"],"pdf_url":"https://arxiv.org/pdf/2405.20247v1.pdf","comment":"Submitted to Journal of Machine Learning Open Source Software"},{"id":"http://arxiv.org/abs/2405.16470v2","updated":"2024-05-30T16:57:57Z","published":"2024-05-26T07:45:12Z","title":"Image Deraining with Frequency-Enhanced State Space Model","summary":" Removing rain artifacts in images is recognized as a significant issue. In\nthis field, deep learning-based approaches, such as convolutional neural\nnetworks (CNNs) and Transformers, have succeeded. Recently, State Space Models\n(SSMs) have exhibited superior performance across various tasks in both natural\nlanguage processing and image processing due to their ability to model\nlong-range dependencies. This study introduces SSM to rain removal and proposes\na Deraining Frequency-Enhanced State Space Model (DFSSM). To effectively remove\nrain streaks, which produce high-intensity frequency components in specific\ndirections, we employ frequency domain processing concurrently with SSM.\nAdditionally, we develop a novel mixed-scale gated-convolutional block, which\nuses convolutions with multiple kernel sizes to capture various scale\ndegradations effectively and integrates a gating mechanism to manage the flow\nof information. Finally, experiments on synthetic and real-world rainy image\ndatasets show that our method surpasses state-of-the-art methods.\n","authors":["Shugo Yamashita","Masaaki Ikehara"],"pdf_url":"https://arxiv.org/pdf/2405.16470v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20222v1","updated":"2024-05-30T16:22:22Z","published":"2024-05-30T16:22:22Z","title":"MOFA-Video: Controllable Image Animation via Generative Motion Field\n Adaptions in Frozen Image-to-Video Diffusion Model","summary":" We present MOFA-Video, an advanced controllable image animation method that\ngenerates video from the given image using various additional controllable\nsignals (such as human landmarks reference, manual trajectories, and another\neven provided video) or their combinations. This is different from previous\nmethods which only can work on a specific motion domain or show weak control\nabilities with diffusion prior. To achieve our goal, we design several\ndomain-aware motion field adapters (\\ie, MOFA-Adapters) to control the\ngenerated motions in the video generation pipeline. For MOFA-Adapters, we\nconsider the temporal motion consistency of the video and generate the dense\nmotion flow from the given sparse control conditions first, and then, the\nmulti-scale features of the given image are wrapped as a guided feature for\nstable video diffusion generation. We naively train two motion adapters for the\nmanual trajectories and the human landmarks individually since they both\ncontain sparse information about the control. After training, the MOFA-Adapters\nin different domains can also work together for more controllable video\ngeneration.\n","authors":["Muyao Niu","Xiaodong Cun","Xintao Wang","Yong Zhang","Ying Shan","Yinqiang Zheng"],"pdf_url":"https://arxiv.org/pdf/2405.20222v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20216v1","updated":"2024-05-30T16:18:05Z","published":"2024-05-30T16:18:05Z","title":"Boost Your Own Human Image Generation Model via Direct Preference\n Optimization with AI Feedback","summary":" The generation of high-quality human images through text-to-image (T2I)\nmethods is a significant yet challenging task. Distinct from general image\ngeneration, human image synthesis must satisfy stringent criteria related to\nhuman pose, anatomy, and alignment with textual prompts, making it particularly\ndifficult to achieve realistic results. Recent advancements in T2I generation\nbased on diffusion models have shown promise, yet challenges remain in meeting\nhuman-specific preferences. In this paper, we introduce a novel approach\ntailored specifically for human image generation utilizing Direct Preference\nOptimization (DPO). Specifically, we introduce an efficient method for\nconstructing a specialized DPO dataset for training human image generation\nmodels without the need for costly human feedback. We also propose a modified\nloss function that enhances the DPO training process by minimizing artifacts\nand improving image fidelity. Our method demonstrates its versatility and\neffectiveness in generating human images, including personalized text-to-image\ngeneration. Through comprehensive evaluations, we show that our approach\nsignificantly advances the state of human image generation, achieving superior\nresults in terms of natural anatomies, poses, and text-image alignment.\n","authors":["Sanghyeon Na","Yonggyu Kim","Hyunjoon Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20216v1.pdf","comment":"28 pages, 18 figures"},{"id":"http://arxiv.org/abs/2405.20204v1","updated":"2024-05-30T16:07:54Z","published":"2024-05-30T16:07:54Z","title":"Jina CLIP: Your CLIP Model Is Also Your Text Retriever","summary":" Contrastive Language-Image Pretraining (CLIP) is widely used to train models\nto align images and texts in a common embedding space by mapping them to\nfixed-sized vectors. These models are key to multimodal information retrieval\nand related tasks. However, CLIP models generally underperform in text-only\ntasks compared to specialized text models. This creates inefficiencies for\ninformation retrieval systems that keep separate embeddings and models for\ntext-only and multimodal tasks. We propose a novel, multi-task contrastive\ntraining method to address this issue, which we use to train the jina-clip-v1\nmodel to achieve the state-of-the-art performance on both text-image and\ntext-text retrieval tasks.\n","authors":["Andreas Koukounas","Georgios Mastrapas","Michael Günther","Bo Wang","Scott Martens","Isabelle Mohr","Saba Sturua","Mohammad Kalim Akram","Joan Fontanals Martínez","Saahil Ognawala","Susana Guzman","Maximilian Werk","Nan Wang","Han Xiao"],"pdf_url":"https://arxiv.org/pdf/2405.20204v1.pdf","comment":"4 pages, ICML2024 workshop submission"},{"id":"http://arxiv.org/abs/2405.20188v1","updated":"2024-05-30T15:55:04Z","published":"2024-05-30T15:55:04Z","title":"SPARE: Symmetrized Point-to-Plane Distance for Robust Non-Rigid\n Registration","summary":" Existing optimization-based methods for non-rigid registration typically\nminimize an alignment error metric based on the point-to-point or\npoint-to-plane distance between corresponding point pairs on the source surface\nand target surface. However, these metrics can result in slow convergence or a\nloss of detail. In this paper, we propose SPARE, a novel formulation that\nutilizes a symmetrized point-to-plane distance for robust non-rigid\nregistration. The symmetrized point-to-plane distance relies on both the\npositions and normals of the corresponding points, resulting in a more accurate\napproximation of the underlying geometry and can achieve higher accuracy than\nexisting methods. To solve this optimization problem efficiently, we propose an\nalternating minimization solver using a majorization-minimization strategy.\nMoreover, for effective initialization of the solver, we incorporate a\ndeformation graph-based coarse alignment that improves registration quality and\nefficiency. Extensive experiments show that the proposed method greatly\nimproves the accuracy of non-rigid registration problems and maintains\nrelatively high solution efficiency. The code is publicly available at\nhttps://github.com/yaoyx689/spare.\n","authors":["Yuxin Yao","Bailin Deng","Junhui Hou","Juyong Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20188v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20180v1","updated":"2024-05-30T15:48:04Z","published":"2024-05-30T15:48:04Z","title":"Transformers and Slot Encoding for Sample Efficient Physical World\n Modelling","summary":" World modelling, i.e. building a representation of the rules that govern the\nworld so as to predict its evolution, is an essential ability for any agent\ninteracting with the physical world. Recent applications of the Transformer\narchitecture to the problem of world modelling from video input show notable\nimprovements in sample efficiency. However, existing approaches tend to work\nonly at the image level thus disregarding that the environment is composed of\nobjects interacting with each other. In this paper, we propose an architecture\ncombining Transformers for world modelling with the slot-attention paradigm, an\napproach for learning representations of objects appearing in a scene. We\ndescribe the resulting neural architecture and report experimental results\nshowing an improvement over the existing solutions in terms of sample\nefficiency and a reduction of the variation of the performance over the\ntraining examples. The code for our architecture and experiments is available\nat https://github.com/torchipeppo/transformers-and-slot-encoding-for-wm\n","authors":["Francesco Petri","Luigi Asprino","Aldo Gangemi"],"pdf_url":"https://arxiv.org/pdf/2405.20180v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20161v1","updated":"2024-05-30T15:33:32Z","published":"2024-05-30T15:33:32Z","title":"Landslide mapping from Sentinel-2 imagery through change detection","summary":" Landslides are one of the most critical and destructive geohazards.\nWidespread development of human activities and settlements combined with the\neffects of climate change on weather are resulting in a high increase in the\nfrequency and destructive power of landslides, making them a major threat to\nhuman life and the economy. In this paper, we explore methodologies to map\nnewly-occurred landslides using Sentinel-2 imagery automatically. All\napproaches presented are framed as a bi-temporal change detection problem,\nrequiring only a pair of Sentinel-2 images, taken respectively before and after\na landslide-triggering event. Furthermore, we introduce a novel deep learning\narchitecture for fusing Sentinel-2 bi-temporal image pairs with Digital\nElevation Model (DEM) data, showcasing its promising performances w.r.t. other\nchange detection models in the literature. As a parallel task, we address\nlimitations in existing datasets by creating a novel geodatabase, which\nincludes manually validated open-access landslide inventories over\nheterogeneous ecoregions of the world. We release both code and dataset with an\nopen-source license.\n","authors":["Tommaso Monopoli","Fabio Montello","Claudio Rossi"],"pdf_url":"https://arxiv.org/pdf/2405.20161v1.pdf","comment":"to be published in IEEE IGARSS 2024 conference proceedings"},{"id":"http://arxiv.org/abs/2405.20155v1","updated":"2024-05-30T15:30:38Z","published":"2024-05-30T15:30:38Z","title":"MotionDreamer: Zero-Shot 3D Mesh Animation from Video Diffusion Models","summary":" Animation techniques bring digital 3D worlds and characters to life. However,\nmanual animation is tedious and automated techniques are often specialized to\nnarrow shape classes. In our work, we propose a technique for automatic\nre-animation of arbitrary 3D shapes based on a motion prior extracted from a\nvideo diffusion model. Unlike existing 4D generation methods, we focus solely\non the motion, and we leverage an explicit mesh-based representation compatible\nwith existing computer-graphics pipelines. Furthermore, our utilization of\ndiffusion features enhances accuracy of our motion fitting. We analyze efficacy\nof these features for animation fitting and we experimentally validate our\napproach for two different diffusion models and four animation models. Finally,\nwe demonstrate that our time-efficient zero-shot method achieves a superior\nperformance re-animating a diverse set of 3D shapes when compared to existing\ntechniques in a user study. The project website is located at\nhttps://lukas.uzolas.com/MotionDreamer.\n","authors":["Lukas Uzolas","Elmar Eisemann","Petr Kellnhofer"],"pdf_url":"https://arxiv.org/pdf/2405.20155v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20152v1","updated":"2024-05-30T15:27:56Z","published":"2024-05-30T15:27:56Z","title":"Uncovering Bias in Large Vision-Language Models at Scale with\n Counterfactuals","summary":" With the advent of Large Language Models (LLMs) possessing increasingly\nimpressive capabilities, a number of Large Vision-Language Models (LVLMs) have\nbeen proposed to augment LLMs with visual inputs. Such models condition\ngenerated text on both an input image and a text prompt, enabling a variety of\nuse cases such as visual question answering and multimodal chat. While prior\nstudies have examined the social biases contained in text generated by LLMs,\nthis topic has been relatively unexplored in LVLMs. Examining social biases in\nLVLMs is particularly challenging due to the confounding contributions of bias\ninduced by information contained across the text and visual modalities. To\naddress this challenging problem, we conduct a large-scale study of text\ngenerated by different LVLMs under counterfactual changes to input images.\nSpecifically, we present LVLMs with identical open-ended text prompts while\nconditioning on images from different counterfactual sets, where each set\ncontains images which are largely identical in their depiction of a common\nsubject (e.g., a doctor), but vary only in terms of intersectional social\nattributes (e.g., race and gender). We comprehensively evaluate the text\nproduced by different models under this counterfactual generation setting at\nscale, producing over 57 million responses from popular LVLMs. Our\nmulti-dimensional analysis reveals that social attributes such as race, gender,\nand physical characteristics depicted in input images can significantly\ninfluence the generation of toxic content, competency-associated words, harmful\nstereotypes, and numerical ratings of depicted individuals. We additionally\nexplore the relationship between social bias in LVLMs and their corresponding\nLLMs, as well as inference-time strategies to mitigate bias.\n","authors":["Phillip Howard","Kathleen C. Fraser","Anahita Bhiwandiwalla","Svetlana Kiritchenko"],"pdf_url":"https://arxiv.org/pdf/2405.20152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20141v1","updated":"2024-05-30T15:16:06Z","published":"2024-05-30T15:16:06Z","title":"OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation","summary":" The advent of Vision Language Models (VLMs) transformed image understanding\nfrom closed-set classifications to dynamic image-language interactions,\nenabling open-vocabulary segmentation. Despite this flexibility, VLMs often\nfall behind closed-set classifiers in accuracy due to their reliance on\nambiguous image captions and lack of domain-specific knowledge. We, therefore,\nintroduce a new task domain adaptation for open-vocabulary segmentation,\nenhancing VLMs with domain-specific priors while preserving their\nopen-vocabulary nature. Existing adaptation methods, when applied to\nsegmentation tasks, improve performance on training queries but can reduce VLM\nperformance on zero-shot text inputs. To address this shortcoming, we propose\nan approach that combines parameter-efficient prompt tuning with a\ntriplet-loss-based training strategy. This strategy is designed to enhance\nopen-vocabulary generalization while adapting to the visual domain. Our results\noutperform other parameter-efficient adaptation strategies in open-vocabulary\nsegment classification tasks across indoor and outdoor datasets. Notably, our\napproach is the only one that consistently surpasses the original VLM on\nzero-shot queries. Our adapted VLMs can be plug-and-play integrated into\nexisting open-vocabulary segmentation pipelines, improving OV-Seg by +6.0% mIoU\non ADE20K, and OpenMask3D by +4.1% AP on ScanNet++ Offices without any changes\nto the methods.\n","authors":["Gonca Yilmaz","Songyou Peng","Francis Engelmann","Marc Pollefeys","Hermann Blum"],"pdf_url":"https://arxiv.org/pdf/2405.20141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20136v1","updated":"2024-05-30T15:12:18Z","published":"2024-05-30T15:12:18Z","title":"A Multimodal Dangerous State Recognition and Early Warning System for\n Elderly with Intermittent Dementia","summary":" In response to the social issue of the increasing number of elderly\nvulnerable groups going missing due to the aggravating aging population in\nChina, our team has developed a wearable anti-loss device and intelligent early\nwarning system for elderly individuals with intermittent dementia using\nartificial intelligence and IoT technology. This system comprises an anti-loss\nsmart helmet, a cloud computing module, and an intelligent early warning\napplication on the caregiver's mobile device. The smart helmet integrates a\nminiature camera module, a GPS module, and a 5G communication module to collect\nfirst-person images and location information of the elderly. Data is\ntransmitted remotely via 5G, FTP, and TCP protocols. In the cloud computing\nmodule, our team has proposed for the first time a multimodal dangerous state\nrecognition network based on scene and location information to accurately\nassess the risk of elderly individuals going missing. Finally, the application\nsoftware interface designed for the caregiver's mobile device implements\nmulti-level early warnings. The system developed by our team requires no\noperation or response from the elderly, achieving fully automatic environmental\nperception, risk assessment, and proactive alarming. This overcomes the\nlimitations of traditional monitoring devices, which require active operation\nand response, thus avoiding the issue of the digital divide for the elderly. It\neffectively prevents accidental loss and potential dangers for elderly\nindividuals with dementia.\n","authors":["Liyun Deng","Lei Jin","Guangcheng Wang","Quan Shi","Han Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20136v1.pdf","comment":"13 pages,9 figures"},{"id":"http://arxiv.org/abs/2401.17981v2","updated":"2024-05-30T15:09:49Z","published":"2024-01-31T16:38:32Z","title":"Enhancing Multimodal Large Language Models with Vision Detection Models:\n An Empirical Study","summary":" Despite the impressive capabilities of Multimodal Large Language Models\n(MLLMs) in integrating text and image modalities, challenges remain in\naccurately interpreting detailed visual elements. This paper presents an\nempirical study on enhancing MLLMs with state-of-the-art (SOTA) object\ndetection and Optical Character Recognition (OCR) models to improve\nfine-grained understanding and reduce hallucination in responses. We\ninvestigate the embedding-based infusion of textual detection information, the\nimpact of such infusion on MLLMs' original abilities, and the\ninterchangeability of detection models. We conduct systematic and extensive\nexperiments with representative models such as LLaVA-1.5, DINO, PaddleOCRv2,\nand Grounding DINO, revealing that our simple yet general approach not only\nrefines MLLMs' performance in fine-grained visual tasks but also maintains\ntheir original strengths. Notably, the enhanced LLaVA-1.5 outperforms its\noriginal 7B/13B models on all 10 benchmarks, achieving an improvement of up to\n12.5% on the normalized average score. We release our codes to facilitate\nfurther exploration into the fine-grained multimodal capabilities of MLLMs.\n","authors":["Qirui Jiao","Daoyuan Chen","Yilun Huang","Yaliang Li","Ying Shen"],"pdf_url":"https://arxiv.org/pdf/2401.17981v2.pdf","comment":"25 pages, 18 tables, 7 figures"},{"id":"http://arxiv.org/abs/2405.20126v1","updated":"2024-05-30T15:07:30Z","published":"2024-05-30T15:07:30Z","title":"Federated and Transfer Learning for Cancer Detection Based on Image\n Analysis","summary":" This review article discusses the roles of federated learning (FL) and\ntransfer learning (TL) in cancer detection based on image analysis. These two\nstrategies powered by machine learning have drawn a lot of attention due to\ntheir potential to increase the precision and effectiveness of cancer diagnosis\nin light of the growing importance of machine learning techniques in cancer\ndetection. FL enables the training of machine learning models on data\ndistributed across multiple sites without the need for centralized data\nsharing, while TL allows for the transfer of knowledge from one task to\nanother. A comprehensive assessment of the two methods, including their\nstrengths, and weaknesses is presented. Moving on, their applications in cancer\ndetection are discussed, including potential directions for the future.\nFinally, this article offers a thorough description of the functions of TL and\nFL in image-based cancer detection. The authors also make insightful\nsuggestions for additional study in this rapidly developing area.\n","authors":["Amine Bechar","Youssef Elmir","Yassine Himeur","Rafik Medjoudj","Abbes Amira"],"pdf_url":"https://arxiv.org/pdf/2405.20126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.05168v2","updated":"2024-05-30T14:55:45Z","published":"2024-01-10T14:03:05Z","title":"CLIP-Guided Source-Free Object Detection in Aerial Images","summary":" Domain adaptation is crucial in aerial imagery, as the visual representation\nof these images can significantly vary based on factors such as geographic\nlocation, time, and weather conditions. Additionally, high-resolution aerial\nimages often require substantial storage space and may not be readily\naccessible to the public. To address these challenges, we propose a novel\nSource-Free Object Detection (SFOD) method. Specifically, our approach begins\nwith a self-training framework, which significantly enhances the performance of\nbaseline methods. To alleviate the noisy labels in self-training, we utilize\nContrastive Language-Image Pre-training (CLIP) to guide the generation of\npseudo-labels, termed CLIP-guided Aggregation (CGA). By leveraging CLIP's\nzero-shot classification capability, we aggregate its scores with the original\npredicted bounding boxes, enabling us to obtain refined scores for the\npseudo-labels. To validate the effectiveness of our method, we constructed two\nnew datasets from different domains based on the DIOR dataset, named DIOR-C and\nDIOR-Cloudy. Experimental results demonstrate that our method outperforms other\ncomparative algorithms. The code is available at\nhttps://github.com/Lans1ng/SFOD-RS.\n","authors":["Nanqing Liu","Xun Xu","Yongyi Su","Chengxin Liu","Peiliang Gong","Heng-Chao Li"],"pdf_url":"https://arxiv.org/pdf/2401.05168v2.pdf","comment":"Accepted by IGARSS2024"},{"id":"http://arxiv.org/abs/2405.20117v1","updated":"2024-05-30T14:54:26Z","published":"2024-05-30T14:54:26Z","title":"Infinite 3D Landmarks: Improving Continuous 2D Facial Landmark Detection","summary":" In this paper, we examine 3 important issues in the practical use of\nstate-of-the-art facial landmark detectors and show how a combination of\nspecific architectural modifications can directly improve their accuracy and\ntemporal stability. First, many facial landmark detectors require face\nnormalization as a preprocessing step, which is accomplished by a\nseparately-trained neural network that crops and resizes the face in the input\nimage. There is no guarantee that this pre-trained network performs the optimal\nface normalization for landmark detection. We instead analyze the use of a\nspatial transformer network that is trained alongside the landmark detector in\nan unsupervised manner, and jointly learn optimal face normalization and\nlandmark detection. Second, we show that modifying the output head of the\nlandmark predictor to infer landmarks in a canonical 3D space can further\nimprove accuracy. To convert the predicted 3D landmarks into screen-space, we\nadditionally predict the camera intrinsics and head pose from the input image.\nAs a side benefit, this allows to predict the 3D face shape from a given image\nonly using 2D landmarks as supervision, which is useful in determining landmark\nvisibility among other things. Finally, when training a landmark detector on\nmultiple datasets at the same time, annotation inconsistencies across datasets\nforces the network to produce a suboptimal average. We propose to add a\nsemantic correction network to address this issue. This additional lightweight\nneural network is trained alongside the landmark detector, without requiring\nany additional supervision. While the insights of this paper can be applied to\nmost common landmark detectors, we specifically target a recently-proposed\ncontinuous 2D landmark detector to demonstrate how each of our additions leads\nto meaningful improvements over the state-of-the-art on standard benchmarks.\n","authors":["Prashanth Chandran","Gaspard Zoss","Paulo Gotardo","Derek Bradley"],"pdf_url":"https://arxiv.org/pdf/2405.20117v1.pdf","comment":"12 pages, 13 figures"},{"id":"http://arxiv.org/abs/2405.20112v1","updated":"2024-05-30T14:49:54Z","published":"2024-05-30T14:49:54Z","title":"RIGID: A Training-free and Model-Agnostic Framework for Robust\n AI-Generated Image Detection","summary":" The rapid advances in generative AI models have empowered the creation of\nhighly realistic images with arbitrary content, raising concerns about\npotential misuse and harm, such as Deepfakes. Current research focuses on\ntraining detectors using large datasets of generated images. However, these\ntraining-based solutions are often computationally expensive and show limited\ngeneralization to unseen generated images. In this paper, we propose a\ntraining-free method to distinguish between real and AI-generated images. We\nfirst observe that real images are more robust to tiny noise perturbations than\nAI-generated images in the representation space of vision foundation models.\nBased on this observation, we propose RIGID, a training-free and model-agnostic\nmethod for robust AI-generated image detection. RIGID is a simple yet effective\napproach that identifies whether an image is AI-generated by comparing the\nrepresentation similarity between the original and the noise-perturbed\ncounterpart. Our evaluation on a diverse set of AI-generated images and\nbenchmarks shows that RIGID significantly outperforms existing trainingbased\nand training-free detectors. In particular, the average performance of RIGID\nexceeds the current best training-free method by more than 25%. Importantly,\nRIGID exhibits strong generalization across different image generation methods\nand robustness to image corruptions.\n","authors":["Zhiyuan He","Pin-Yu Chen","Tsung-Yi Ho"],"pdf_url":"https://arxiv.org/pdf/2405.20112v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20109v1","updated":"2024-05-30T14:45:02Z","published":"2024-05-30T14:45:02Z","title":"FMARS: Annotating Remote Sensing Images for Disaster Management using\n Foundation Models","summary":" Very-High Resolution (VHR) remote sensing imagery is increasingly accessible,\nbut often lacks annotations for effective machine learning applications. Recent\nfoundation models like GroundingDINO and Segment Anything (SAM) provide\nopportunities to automatically generate annotations. This study introduces\nFMARS (Foundation Model Annotations in Remote Sensing), a methodology\nleveraging VHR imagery and foundation models for fast and robust annotation. We\nfocus on disaster management and provide a large-scale dataset with labels\nobtained from pre-event imagery over 19 disaster events, derived from the Maxar\nOpen Data initiative. We train segmentation models on the generated labels,\nusing Unsupervised Domain Adaptation (UDA) techniques to increase\ntransferability to real-world scenarios. Our results demonstrate the\neffectiveness of leveraging foundation models to automatically annotate remote\nsensing data at scale, enabling robust downstream models for critical\napplications. Code and dataset are available at\n\\url{https://github.com/links-ads/igarss-fmars}.\n","authors":["Edoardo Arnaudo","Jacopo Lungo Vaschetti","Lorenzo Innocenti","Luca Barco","Davide Lisi","Vanina Fissore","Claudio Rossi"],"pdf_url":"https://arxiv.org/pdf/2405.20109v1.pdf","comment":"Accepted at IGARSS 2024, 5 pages"},{"id":"http://arxiv.org/abs/2405.20093v1","updated":"2024-05-30T14:31:46Z","published":"2024-05-30T14:31:46Z","title":"Rapid Wildfire Hotspot Detection Using Self-Supervised Learning on\n Temporal Remote Sensing Data","summary":" Rapid detection and well-timed intervention are essential to mitigate the\nimpacts of wildfires. Leveraging remote sensed data from satellite networks and\nadvanced AI models to automatically detect hotspots (i.e., thermal anomalies\ncaused by active fires) is an effective way to build wildfire monitoring\nsystems. In this work, we propose a novel dataset containing time series of\nremotely sensed data related to European fire events and a Self-Supervised\nLearning (SSL)-based model able to analyse multi-temporal data and identify\nhotspots in potentially near real time. We train and evaluate the performance\nof our model using our dataset and Thraws, a dataset of thermal anomalies\nincluding several fire events, obtaining an F1 score of 63.58.\n","authors":["Luca Barco","Angelica Urbanelli","Claudio Rossi"],"pdf_url":"https://arxiv.org/pdf/2405.20093v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20091v1","updated":"2024-05-30T14:27:40Z","published":"2024-05-30T14:27:40Z","title":"Visual Attention Analysis in Online Learning","summary":" In this paper, we present an approach in the Multimodal Learning Analytics\nfield. Within this approach, we have developed a tool to visualize and analyze\neye movement data collected during learning sessions in online courses. The\ntool is named VAAD (an acronym for Visual Attention Analysis Dashboard). These\neye movement data have been gathered using an eye-tracker and subsequently\nprocessed and visualized for interpretation. The purpose of the tool is to\nconduct a descriptive analysis of the data by facilitating its visualization,\nenabling the identification of differences and learning patterns among various\nlearner populations. Additionally, it integrates a predictive module capable of\nanticipating learner activities during a learning session. Consequently, VAAD\nholds the potential to offer valuable insights into online learning behaviors\nfrom both descriptive and predictive perspectives.\n","authors":["Navarro Miriam","Becerra Álvaro","Daza Roberto","Cobos Ruth","Morales Aythami","Fierrez Julian"],"pdf_url":"https://arxiv.org/pdf/2405.20091v1.pdf","comment":"Accepted in CEDI 2024 (VII Congreso Espa\\~nol de Inform\\'atica), A\n Coru\\~na, Spain"},{"id":"http://arxiv.org/abs/2405.20090v1","updated":"2024-05-30T14:27:20Z","published":"2024-05-30T14:27:20Z","title":"Typography Leads Semantic Diversifying: Amplifying Adversarial\n Transferability across Multimodal Large Language Models","summary":" Following the advent of the Artificial Intelligence (AI) era of large models,\nMultimodal Large Language Models (MLLMs) with the ability to understand\ncross-modal interactions between vision and text have attracted wide attention.\nAdversarial examples with human-imperceptible perturbation are shown to possess\na characteristic known as transferability, which means that a perturbation\ngenerated by one model could also mislead another different model. Augmenting\nthe diversity in input data is one of the most significant methods for\nenhancing adversarial transferability. This method has been certified as a way\nto significantly enlarge the threat impact under black-box conditions. Research\nworks also demonstrate that MLLMs can be exploited to generate adversarial\nexamples in the white-box scenario. However, the adversarial transferability of\nsuch perturbations is quite limited, failing to achieve effective black-box\nattacks across different models. In this paper, we propose the\nTypographic-based Semantic Transfer Attack (TSTA), which is inspired by: (1)\nMLLMs tend to process semantic-level information; (2) Typographic Attack could\neffectively distract the visual information captured by MLLMs. In the scenarios\nof Harmful Word Insertion and Important Information Protection, our TSTA\ndemonstrates superior performance.\n","authors":["Hao Cheng","Erjia Xiao","Jiahang Cao","Le Yang","Kaidi Xu","Jindong Gu","Renjing Xu"],"pdf_url":"https://arxiv.org/pdf/2405.20090v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14870v2","updated":"2024-05-30T14:23:21Z","published":"2024-05-23T17:59:57Z","title":"An Empirical Study of Training State-of-the-Art LiDAR Segmentation\n Models","summary":" In the rapidly evolving field of autonomous driving, precise segmentation of\nLiDAR data is crucial for understanding complex 3D environments. Traditional\napproaches often rely on disparate, standalone codebases, hindering unified\nadvancements and fair benchmarking across models. To address these challenges,\nwe introduce MMDetection3D-lidarseg, a comprehensive toolbox designed for the\nefficient training and evaluation of state-of-the-art LiDAR segmentation\nmodels. We support a wide range of segmentation models and integrate advanced\ndata augmentation techniques to enhance robustness and generalization.\nAdditionally, the toolbox provides support for multiple leading sparse\nconvolution backends, optimizing computational efficiency and performance. By\nfostering a unified framework, MMDetection3D-lidarseg streamlines development\nand benchmarking, setting new standards for research and application. Our\nextensive benchmark experiments on widely-used datasets demonstrate the\neffectiveness of the toolbox. The codebase and trained models have been\npublicly available, promoting further research and innovation in the field of\nLiDAR segmentation for autonomous driving.\n","authors":["Jiahao Sun","Chunmei Qing","Xiang Xu","Lingdong Kong","Youquan Liu","Li Li","Chenming Zhu","Jingwei Zhang","Zeqi Xiao","Runnan Chen","Tai Wang","Wenwei Zhang","Kai Chen"],"pdf_url":"https://arxiv.org/pdf/2405.14870v2.pdf","comment":"Preprint; 17 pages, 4 figures, 7 tables; Code at\n https://github.com/open-mmlab/mmdetection3d"},{"id":"http://arxiv.org/abs/2405.20084v1","updated":"2024-05-30T14:14:39Z","published":"2024-05-30T14:14:39Z","title":"Estimating Human Poses Across Datasets: A Unified Skeleton and\n Multi-Teacher Distillation Approach","summary":" Human pose estimation is a key task in computer vision with various\napplications such as activity recognition and interactive systems. However, the\nlack of consistency in the annotated skeletons across different datasets poses\nchallenges in developing universally applicable models. To address this\nchallenge, we propose a novel approach integrating multi-teacher knowledge\ndistillation with a unified skeleton representation. Our networks are jointly\ntrained on the COCO and MPII datasets, containing 17 and 16 keypoints,\nrespectively. We demonstrate enhanced adaptability by predicting an extended\nset of 21 keypoints, 4 (COCO) and 5 (MPII) more than original annotations,\nimproving cross-dataset generalization. Our joint models achieved an average\naccuracy of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a\nsingle dataset and evaluated on both. Moreover, we also evaluate all 21\npredicted points by our two models by reporting an AP of 66.84 and 72.75 on the\nHalpe dataset. This highlights the potential of our technique to address one of\nthe most pressing challenges in pose estimation research and application - the\ninconsistency in skeletal annotations.\n","authors":["Muhammad Saif Ullah Khan","Dhavalkumar Limbachiya","Didier Stricker","Muhammad Zeshan Afzal"],"pdf_url":"https://arxiv.org/pdf/2405.20084v1.pdf","comment":"15 pages (with references)"},{"id":"http://arxiv.org/abs/2405.18751v2","updated":"2024-05-30T14:13:05Z","published":"2024-05-29T04:29:12Z","title":"On the Limits of Multi-modal Meta-Learning with Auxiliary Task\n Modulation Using Conditional Batch Normalization","summary":" Few-shot learning aims to learn representations that can tackle novel tasks\ngiven a small number of examples. Recent studies show that cross-modal learning\ncan improve representations for few-shot classification. More specifically,\nlanguage is a rich modality that can be used to guide visual learning. In this\nwork, we experiment with a multi-modal architecture for few-shot learning that\nconsists of three components: a classifier, an auxiliary network, and a bridge\nnetwork. While the classifier performs the main classification task, the\nauxiliary network learns to predict language representations from the same\ninput, and the bridge network transforms high-level features of the auxiliary\nnetwork into modulation parameters for layers of the few-shot classifier using\nconditional batch normalization. The bridge should encourage a form of\nlightweight semantic alignment between language and vision which could be\nuseful for the classifier. However, after evaluating the proposed approach on\ntwo popular few-shot classification benchmarks we find that a) the improvements\ndo not reproduce across benchmarks, and b) when they do, the improvements are\ndue to the additional compute and parameters introduced by the bridge network.\nWe contribute insights and recommendations for future work in multi-modal\nmeta-learning, especially when using language representations.\n","authors":["Jordi Armengol-Estapé","Vincent Michalski","Ramnath Kumar","Pierre-Luc St-Charles","Doina Precup","Samira Ebrahimi Kahou"],"pdf_url":"https://arxiv.org/pdf/2405.18751v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20081v1","updated":"2024-05-30T14:11:27Z","published":"2024-05-30T14:11:27Z","title":"NoiseBoost: Alleviating Hallucination with Noise Perturbation for\n Multimodal Large Language Models","summary":" Multimodal large language models (MLLMs) contribute a powerful mechanism to\nunderstanding visual information building on large language models. However,\nMLLMs are notorious for suffering from hallucinations, especially when\ngenerating lengthy, detailed descriptions for images. Our analysis reveals that\nhallucinations stem from the inherent summarization mechanism of large language\nmodels, leading to excessive dependence on linguistic tokens while neglecting\nvision information. In this paper, we propose NoiseBoost, a broadly applicable\nand simple method for alleviating hallucinations for MLLMs through the\nintegration of noise feature perturbations. Noise perturbation acts as a\nregularizer, facilitating a balanced distribution of attention weights among\nvisual and linguistic tokens. Despite its simplicity, NoiseBoost consistently\nenhances the performance of MLLMs across common training strategies, including\nsupervised fine-tuning and reinforcement learning. Further, NoiseBoost\npioneerly enables semi-supervised learning for MLLMs, unleashing the power of\nunlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves\ndense caption accuracy by 8.1% with human evaluation and achieves comparable\nresults with 50% of the data by mining unlabeled data. Code and models are\navailable at https://kaiwu5.github.io/noiseboost.\n","authors":["Kai Wu","Boyuan Jiang","Zhengkai Jiang","Qingdong He","Donghao Luo","Shengzhi Wang","Qingwen Liu","Chengjie Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20081v1.pdf","comment":"updating"},{"id":"http://arxiv.org/abs/2405.20072v1","updated":"2024-05-30T14:02:40Z","published":"2024-05-30T14:02:40Z","title":"Faces of the Mind: Unveiling Mental Health States Through Facial\n Expressions in 11,427 Adolescents","summary":" Mood disorders, including depression and anxiety, often manifest through\nfacial expressions. While previous research has explored the connection between\nfacial features and emotions, machine learning algorithms for estimating mood\ndisorder severity have been hindered by small datasets and limited real-world\napplication. To address this gap, we analyzed facial videos of 11,427\nparticipants, a dataset two orders of magnitude larger than previous studies.\nThis comprehensive collection includes standardized facial expression videos\nfrom reading tasks, along with a detailed psychological scale that measures\ndepression, anxiety, and stress. By examining the relationships among these\nemotional states and employing clustering analysis, we identified distinct\nsubgroups embodying different emotional profiles. We then trained tree-based\nclassifiers and deep learning models to estimate emotional states from facial\nfeatures. Results indicate that models previously effective on small datasets\nexperienced decreased performance when applied to our large dataset,\nhighlighting the importance of data scale and mitigating overfitting in\npractical settings. Notably, our study identified subtle shifts in pupil\ndynamics and gaze orientation as potential markers of mood disorders, providing\nvaluable information on the interaction between facial expressions and mental\nhealth. This research marks the first large-scale and comprehensive\ninvestigation of facial expressions in the context of mental health, laying the\ngroundwork for future data-driven advancements in this field.\n","authors":["Xiao Xu","Keyin Zhou","Yan Zhang","Yang Wang","Fei Wang","Xizhe Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20072v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20067v1","updated":"2024-05-30T13:56:58Z","published":"2024-05-30T13:56:58Z","title":"N-Dimensional Gaussians for Fitting of High Dimensional Functions","summary":" In the wake of many new ML-inspired approaches for reconstructing and\nrepresenting high-quality 3D content, recent hybrid and explicitly learned\nrepresentations exhibit promising performance and quality characteristics.\nHowever, their scaling to higher dimensions is challenging, e.g. when\naccounting for dynamic content with respect to additional parameters such as\nmaterial properties, illumination, or time. In this paper, we tackle these\nchallenges for an explicit representations based on Gaussian mixture models.\nWith our solutions, we arrive at efficient fitting of compact N-dimensional\nGaussian mixtures and enable efficient evaluation at render time: For fast\nfitting and evaluation, we introduce a high-dimensional culling scheme that\nefficiently bounds N-D Gaussians, inspired by Locality Sensitive Hashing. For\nadaptive refinement yet compact representation, we introduce a loss-adaptive\ndensity control scheme that incrementally guides the use of additional capacity\ntowards missing details. With these tools we can for the first time represent\ncomplex appearance that depends on many input dimensions beyond position or\nviewing angle within a compact, explicit representation optimized in minutes\nand rendered in milliseconds.\n","authors":["Stavros Diolatzis","Tobias Zirr","Alexandr Kuznetsov","Georgios Kopanas","Anton Kaplanyan"],"pdf_url":"https://arxiv.org/pdf/2405.20067v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20062v1","updated":"2024-05-30T13:50:39Z","published":"2024-05-30T13:50:39Z","title":"Can the accuracy bias by facial hairstyle be reduced through balancing\n the training data?","summary":" Appearance of a face can be greatly altered by growing a beard and mustache.\nThe facial hairstyles in a pair of images can cause marked changes to the\nimpostor distribution and the genuine distribution. Also, different\ndistributions of facial hairstyle across demographics could cause a false\nimpression of relative accuracy across demographics. We first show that, even\nthough larger training sets boost the recognition accuracy on all facial\nhairstyles, accuracy variations caused by facial hairstyles persist regardless\nof the size of the training set. Then, we analyze the impact of having\ndifferent fractions of the training data represent facial hairstyles. We\ncreated balanced training sets using a set of identities available in\nWebface42M that both have clean-shaven and facial hair images. We find that,\neven when a face recognition model is trained with a balanced clean-shaven /\nfacial hair training set, accuracy variation on the test data does not\ndiminish. Next, data augmentation is employed to further investigate the effect\nof facial hair distribution in training data by manipulating facial hair pixels\nwith the help of facial landmark points and a facial hair segmentation model.\nOur results show facial hair causes an accuracy gap between clean-shaven and\nfacial hair images, and this impact can be significantly different between\nAfrican-Americans and Caucasians.\n","authors":["Kagan Ozturk","Haiyu Wu","Kevin W. Bowyer"],"pdf_url":"https://arxiv.org/pdf/2405.20062v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20058v1","updated":"2024-05-30T13:46:56Z","published":"2024-05-30T13:46:56Z","title":"Enhancing Plant Disease Detection: A Novel CNN-Based Approach with\n Tensor Subspace Learning and HOWSVD-MD","summary":" Machine learning has revolutionized the field of agricultural science,\nparticularly in the early detection and management of plant diseases, which are\ncrucial for maintaining crop health and productivity. Leveraging advanced\nalgorithms and imaging technologies, researchers are now able to identify and\nclassify plant diseases with unprecedented accuracy and speed. Effective\nmanagement of tomato diseases is crucial for enhancing agricultural\nproductivity. The development and application of tomato disease classification\nmethods are central to this objective. This paper introduces a cutting-edge\ntechnique for the detection and classification of tomato leaf diseases,\nutilizing insights from the latest pre-trained Convolutional Neural Network\n(CNN) models. We propose a sophisticated approach within the domain of tensor\nsubspace learning, known as Higher-Order Whitened Singular Value Decomposition\n(HOWSVD), designed to boost the discriminatory power of the system. Our\napproach to Tensor Subspace Learning is methodically executed in two phases,\nbeginning with HOWSVD and culminating in Multilinear Discriminant Analysis\n(MDA). The efficacy of this innovative method was rigorously tested through\ncomprehensive experiments on two distinct datasets, namely PlantVillage and the\nTaiwan dataset. The findings reveal that HOWSVD-MDA outperforms existing\nmethods, underscoring its capability to markedly enhance the precision and\ndependability of diagnosing tomato leaf diseases. For instance, up to 98.36\\%\nand 89.39\\% accuracy scores have been achieved under PlantVillage and the\nTaiwan datasets, respectively.\n","authors":["Abdelmalik Ouamane","Ammar Chouchane","Yassine Himeur","Abderrazak Debilou","Abbes Amira","Shadi Atalla","Wathiq Mansoor","Hussain Al Ahmad"],"pdf_url":"https://arxiv.org/pdf/2405.20058v1.pdf","comment":"17 pages, 9 figures and 8 tables"},{"id":"http://arxiv.org/abs/2306.08970v2","updated":"2024-05-30T13:46:34Z","published":"2023-06-15T09:05:36Z","title":"An Efficient and Multi-private Key Secure Aggregation for Federated\n Learning","summary":" With the emergence of privacy leaks in federated learning, secure aggregation\nprotocols that mainly adopt either homomorphic encryption or threshold secret\nsharing have been widely developed for federated learning to protect the\nprivacy of the local training data of each client. However, these existing\nprotocols suffer from many shortcomings, such as the dependence on a trusted\nthird party, the vulnerability to clients being corrupted, low efficiency, the\ntrade-off between security and fault tolerance, etc. To solve these\ndisadvantages, we propose an efficient and multi-private key secure aggregation\nscheme for federated learning. Specifically, we skillfully modify the variant\nElGamal encryption technique to achieve homomorphic addition operation, which\nhas two important advantages: 1) The server and each client can freely select\npublic and private keys without introducing a trust third party and 2) Compared\nto the variant ElGamal encryption, the plaintext space is relatively large,\nwhich is more suitable for the deep model. Besides, for the high dimensional\ndeep model parameter, we introduce a super-increasing sequence to compress\nmulti-dimensional data into 1-D, which can greatly reduce encryption and\ndecryption times as well as communication for ciphertext transmission. Detailed\nsecurity analyses show that our proposed scheme achieves the semantic security\nof both individual local gradients and the aggregated result while achieving\noptimal robustness in tolerating both client collusion and dropped clients.\nExtensive simulations demonstrate that the accuracy of our scheme is almost the\nsame as the non-private approach, while the efficiency of our scheme is much\nbetter than the state-of-the-art homomorphic encryption-based secure\naggregation schemes. More importantly, the efficiency advantages of our scheme\nwill become increasingly prominent as the number of model parameters increases.\n","authors":["Xue Yang","Zifeng Liu","Xiaohu Tang","Rongxing Lu","Bo Liu"],"pdf_url":"https://arxiv.org/pdf/2306.08970v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17817v2","updated":"2024-05-30T13:40:23Z","published":"2024-05-28T04:29:10Z","title":"Benchmarking Skeleton-based Motion Encoder Models for Clinical\n Applications: Estimating Parkinson's Disease Severity in Walking Sequences","summary":" This study investigates the application of general human motion encoders\ntrained on large-scale human motion datasets for analyzing gait patterns in PD\npatients. Although these models have learned a wealth of human biomechanical\nknowledge, their effectiveness in analyzing pathological movements, such as\nparkinsonian gait, has yet to be fully validated. We propose a comparative\nframework and evaluate six pre-trained state-of-the-art human motion encoder\nmodels on their ability to predict the Movement Disorder Society - Unified\nParkinson's Disease Rating Scale (MDS-UPDRS-III) gait scores from motion\ncapture data. We compare these against a traditional gait feature-based\npredictive model in a recently released large public PD dataset, including PD\npatients on and off medication. The feature-based model currently shows higher\nweighted average accuracy, precision, recall, and F1-score. Motion encoder\nmodels with closely comparable results demonstrate promise for scalability and\nefficiency in clinical settings. This potential is underscored by the enhanced\nperformance of the encoder model upon fine-tuning on PD training set. Four of\nthe six human motion models examined provided prediction scores that were\nsignificantly different between on- and off-medication states. This finding\nreveals the sensitivity of motion encoder models to nuanced clinical changes.\nIt also underscores the necessity for continued customization of these models\nto better capture disease-specific features, thereby reducing the reliance on\nlabor-intensive feature engineering. Lastly, we establish a benchmark for the\nanalysis of skeleton-based motion encoder models in clinical settings. To the\nbest of our knowledge, this is the first study to provide a benchmark that\nenables state-of-the-art models to be tested and compete in a clinical context.\nCodes and benchmark leaderboard are available at code.\n","authors":["Vida Adeli","Soroush Mehraban","Irene Ballester","Yasamin Zarghami","Andrea Sabo","Andrea Iaboni","Babak Taati"],"pdf_url":"https://arxiv.org/pdf/2405.17817v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.04173v2","updated":"2024-05-30T13:38:18Z","published":"2024-03-07T03:07:59Z","title":"Image Coding for Machines with Edge Information Learning Using Segment\n Anything","summary":" Image Coding for Machines (ICM) is an image compression technique for image\nrecognition.\n This technique is essential due to the growing demand for image recognition\nAI.\n In this paper, we propose a method for ICM that focuses on encoding and\ndecoding only the edge information of object parts in an image, which we call\nSA-ICM.\n This is an Learned Image Compression (LIC) model trained using edge\ninformation created by Segment Anything.\n Our method can be used for image recognition models with various tasks.\n SA-ICM is also robust to changes in input data, making it effective for a\nvariety of use cases.\n Additionally, our method provides benefits from a privacy point of view, as\nit removes human facial information on the encoder's side, thus protecting\none's privacy.\n Furthermore, this LIC model training method can be used to train Neural\nRepresentations for Videos (NeRV), which is a video compression model.\n By training NeRV using edge information created by Segment Anything, it is\npossible to create a NeRV that is effective for image recognition (SA-NeRV).\n Experimental results confirm the advantages of SA-ICM, presenting the best\nperformance in image compression for image recognition.\n We also show that SA-NeRV is superior to ordinary NeRV in video compression\nfor machines.\n","authors":["Takahiro Shindo","Kein Yamada","Taiju Watanabe","Hiroshi Watanabe"],"pdf_url":"https://arxiv.org/pdf/2403.04173v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19149v2","updated":"2024-05-30T13:26:43Z","published":"2024-05-29T14:52:10Z","title":"CaLa: Complementary Association Learning for Augmenting Composed Image\n Retrieval","summary":" Composed Image Retrieval (CIR) involves searching for target images based on\nan image-text pair query. While current methods treat this as a query-target\nmatching problem, we argue that CIR triplets contain additional associations\nbeyond this primary relation. In our paper, we identify two new relations\nwithin triplets, treating each triplet as a graph node. Firstly, we introduce\nthe concept of text-bridged image alignment, where the query text serves as a\nbridge between the query image and the target image. We propose a hinge-based\ncross-attention mechanism to incorporate this relation into network learning.\nSecondly, we explore complementary text reasoning, considering CIR as a form of\ncross-modal retrieval where two images compose to reason about complementary\ntext. To integrate these perspectives effectively, we design a twin\nattention-based compositor. By combining these complementary associations with\nthe explicit query pair-target image relation, we establish a comprehensive set\nof constraints for CIR. Our framework, CaLa (Complementary Association Learning\nfor Augmenting Composed Image Retrieval), leverages these insights. We evaluate\nCaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating\nits superiority in composed image retrieval.\n","authors":["Xintong Jiang","Yaxiong Wang","Mengjian Li","Yujiao Wu","Bingwen Hu","Xueming Qian"],"pdf_url":"https://arxiv.org/pdf/2405.19149v2.pdf","comment":"To appear at SIGIR 2024. arXiv admin note: text overlap with\n arXiv:2309.02169"},{"id":"http://arxiv.org/abs/2405.20044v1","updated":"2024-05-30T13:25:25Z","published":"2024-05-30T13:25:25Z","title":"A Point-Neighborhood Learning Framework for Nasal Endoscope Image\n Segmentation","summary":" The lesion segmentation on endoscopic images is challenging due to its\ncomplex and ambiguous features. Fully-supervised deep learning segmentation\nmethods can receive good performance based on entirely pixel-level labeled\ndataset but greatly increase experts' labeling burden. Semi-supervised and\nweakly supervised methods can ease labeling burden, but heavily strengthen the\nlearning difficulty. To alleviate this difficulty, weakly semi-supervised\nsegmentation adopts a new annotation protocol of adding a large number of point\nannotation samples into a few pixel-level annotation samples. However, existing\nmethods only mine points' limited information while ignoring reliable prior\nsurrounding the point annotations. In this paper, we propose a weakly\nsemi-supervised method called Point-Neighborhood Learning (PNL) framework. To\nmine the prior of the pixels surrounding the annotated point, we transform a\nsingle-point annotation into a circular area named a point-neighborhood. We\npropose point-neighborhood supervision loss and pseudo-label scoring mechanism\nto enhance training supervision. Point-neighborhoods are also used to augment\nthe data diversity. Our method greatly improves performance without changing\nthe structure of segmentation network. Comprehensive experiments show the\nsuperiority of our method over the other existing methods, demonstrating its\neffectiveness in point-annotated medical images. The project code will be\navailable on: https://github.com/ParryJay/PNL.\n","authors":["Pengyu Jie","Wanquan Liu","Chenqiang Gao","Yihui Wen","Rui He","Pengcheng Li","Jintao Zhang","Deyu Meng"],"pdf_url":"https://arxiv.org/pdf/2405.20044v1.pdf","comment":"10 pages, 10 figures,"},{"id":"http://arxiv.org/abs/2405.20031v1","updated":"2024-05-30T13:16:17Z","published":"2024-05-30T13:16:17Z","title":"Structure Gaussian SLAM with Manhattan World Hypothesis","summary":" Gaussian SLAM systems have made significant advancements in improving the\nefficiency and fidelity of real-time reconstructions. However, these systems\noften encounter incomplete reconstructions in complex indoor environments,\ncharacterized by substantial holes due to unobserved geometry caused by\nobstacles or limited view angles. To address this challenge, we present\nManhattan Gaussian SLAM (MG-SLAM), an RGB-D system that leverages the Manhattan\nWorld hypothesis to enhance geometric accuracy and completeness. By seamlessly\nintegrating fused line segments derived from structured scenes, MG-SLAM ensures\nrobust tracking in textureless indoor areas. Moreover, The extracted lines and\nplanar surface assumption allow strategic interpolation of new Gaussians in\nregions of missing geometry, enabling efficient scene completion. Extensive\nexperiments conducted on both synthetic and real-world scenes demonstrate that\nthese advancements enable our method to achieve state-of-the-art performance,\nmarking a substantial improvement in the capabilities of Gaussian SLAM systems.\n","authors":["Shuhong Liu","Heng Zhou","Liuzhuozheng Li","Yun Liu","Tianchen Deng","Yiming Zhou","Mingrui Li"],"pdf_url":"https://arxiv.org/pdf/2405.20031v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20030v1","updated":"2024-05-30T13:15:18Z","published":"2024-05-30T13:15:18Z","title":"EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from\n Egocentric Videos","summary":" Predicting future human behavior from egocentric videos is a challenging but\ncritical task for human intention understanding. Existing methods for\nforecasting 2D hand positions rely on visual representations and mainly focus\non hand-object interactions. In this paper, we investigate the hand forecasting\ntask and tackle two significant issues that persist in the existing methods:\n(1) 2D hand positions in future frames are severely affected by ego-motions in\negocentric videos; (2) prediction based on visual information tends to overfit\nto background or scene textures, posing a challenge for generalization on novel\nscenes or human behaviors. To solve the aforementioned problems, we propose\nEMAG, an ego-motion-aware and generalizable 2D hand forecasting method. In\nresponse to the first problem, we propose a method that considers ego-motion,\nrepresented by a sequence of homography matrices of two consecutive frames. We\nfurther leverage modalities such as optical flow, trajectories of hands and\ninteracting objects, and ego-motions, thereby alleviating the second issue.\nExtensive experiments on two large-scale egocentric video datasets, Ego4D and\nEPIC-Kitchens 55, verify the effectiveness of the proposed method. In\nparticular, our model outperforms prior methods by $7.0$\\% on cross-dataset\nevaluations. Project page: https://masashi-hatano.github.io/EMAG/\n","authors":["Masashi Hatano","Ryo Hachiuma","Hideo Saito"],"pdf_url":"https://arxiv.org/pdf/2405.20030v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20025v1","updated":"2024-05-30T13:11:08Z","published":"2024-05-30T13:11:08Z","title":"From Forest to Zoo: Great Ape Behavior Recognition with ChimpBehave","summary":" This paper addresses the significant challenge of recognizing behaviors in\nnon-human primates, specifically focusing on chimpanzees. Automated behavior\nrecognition is crucial for both conservation efforts and the advancement of\nbehavioral research. However, it is significantly hindered by the\nlabor-intensive process of manual video annotation. Despite the availability of\nlarge-scale animal behavior datasets, the effective application of machine\nlearning models across varied environmental settings poses a critical\nchallenge, primarily due to the variability in data collection contexts and the\nspecificity of annotations.\n In this paper, we introduce ChimpBehave, a novel dataset featuring over 2\nhours of video (approximately 193,000 video frames) of zoo-housed chimpanzees,\nmeticulously annotated with bounding boxes and behavior labels for action\nrecognition. ChimpBehave uniquely aligns its behavior classes with existing\ndatasets, allowing for the study of domain adaptation and cross-dataset\ngeneralization methods between different visual settings. Furthermore, we\nbenchmark our dataset using a state-of-the-art CNN-based action recognition\nmodel, providing the first baseline results for both within and cross-dataset\nsettings. The dataset, models, and code can be accessed at:\nhttps://github.com/MitchFuchs/ChimpBehave\n","authors":["Michael Fuchs","Emilie Genty","Adrian Bangerter","Klaus Zuberbühler","Paul Cotofrei"],"pdf_url":"https://arxiv.org/pdf/2405.20025v1.pdf","comment":"CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling\n In conjunction with Computer Vision and Pattern Recognition 2024"},{"id":"http://arxiv.org/abs/2401.14535v2","updated":"2024-05-30T13:09:47Z","published":"2024-01-25T22:01:07Z","title":"CaRiNG: Learning Temporal Causal Representation under Non-Invertible\n Generation Process","summary":" Identifying the underlying time-delayed latent causal processes in sequential\ndata is vital for grasping temporal dynamics and making downstream reasoning.\nWhile some recent methods can robustly identify these latent causal variables,\nthey rely on strict assumptions about the invertible generation process from\nlatent variables to observed data. However, these assumptions are often hard to\nsatisfy in real-world applications containing information loss. For instance,\nthe visual perception process translates a 3D space into 2D images, or the\nphenomenon of persistence of vision incorporates historical data into current\nperceptions. To address this challenge, we establish an identifiability theory\nthat allows for the recovery of independent latent components even when they\ncome from a nonlinear and non-invertible mix. Using this theory as a\nfoundation, we propose a principled approach, CaRiNG, to learn the CAusal\nRepresentatIon of Non-invertible Generative temporal data with identifiability\nguarantees. Specifically, we utilize temporal context to recover lost latent\ninformation and apply the conditions in our theory to guide the training\nprocess. Through experiments conducted on synthetic datasets, we validate that\nour CaRiNG method reliably identifies the causal process, even when the\ngeneration process is non-invertible. Moreover, we demonstrate that our\napproach considerably improves temporal understanding and reasoning in\npractical applications.\n","authors":["Guangyi Chen","Yifan Shen","Zhenhao Chen","Xiangchen Song","Yuewen Sun","Weiran Yao","Xiao Liu","Kun Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.14535v2.pdf","comment":"To appear at ICML 2024, 24 pages"},{"id":"http://arxiv.org/abs/2402.07865v2","updated":"2024-05-30T13:08:48Z","published":"2024-02-12T18:21:14Z","title":"Prismatic VLMs: Investigating the Design Space of Visually-Conditioned\n Language Models","summary":" Visually-conditioned language models (VLMs) have seen growing adoption in\napplications such as visual dialogue, scene understanding, and robotic task\nplanning; adoption that has fueled a wealth of new models such as LLaVa,\nInstructBLIP, and PaLI-3. Despite the volume of new releases, key design\ndecisions around image preprocessing, architecture, and optimization are\nunder-explored, making it challenging to understand what factors account for\nmodel performance $-$ a challenge further complicated by the lack of objective,\nconsistent evaluations. To address these gaps, we first compile a suite of\nstandardized evaluations spanning visual question answering, object\nlocalization, and challenge sets that probe properties such as hallucination;\nevaluations that provide fine-grained insight VLM capabilities. Second, we\nrigorously investigate VLMs along key design axes, including pretrained visual\nrepresentations and training from base vs. instruct-tuned language models,\namongst others. We couple our analysis with three resource contributions: (1) a\nunified framework for evaluating VLMs, (2) optimized, flexible training code,\nand (3) checkpoints for all models, including a family of VLMs at the 7-13B\nscale that strictly outperform InstructBLIP and LLaVa v1.5, the\nstate-of-the-art in open VLMs.\n","authors":["Siddharth Karamcheti","Suraj Nair","Ashwin Balakrishna","Percy Liang","Thomas Kollar","Dorsa Sadigh"],"pdf_url":"https://arxiv.org/pdf/2402.07865v2.pdf","comment":"Published at ICML 2024. 22 pages, 11 figures. Training code and\n models: https://github.com/TRI-ML/prismatic-vlms. Evaluation code:\n https://github.com/TRI-ML/vlm-evaluation"},{"id":"http://arxiv.org/abs/2201.04435v3","updated":"2024-05-30T13:07:43Z","published":"2022-01-12T12:09:24Z","title":"Beyond the Visible: A Survey on Cross-spectral Face Recognition","summary":" Cross-spectral face recognition (CFR) refers to recognizing individuals using\nface images stemming from different spectral bands, such as infrared vs.\nvisible. While CFR is inherently more challenging than classical face\nrecognition due to significant variation in facial appearance caused by the\nmodality gap, it is useful in many scenarios including night-vision biometrics\nand detecting presentation attacks. Recent advances in convolutional neural\nnetworks (CNNs) have resulted in significant improvement in the performance of\nCFR systems. Given these developments, the contributions of this survey are\nthree-fold. First, we provide an overview of CFR, by formalizing the CFR\nproblem and presenting related applications. Secondly, we discuss the\nappropriate spectral bands for face recognition and discuss recent CFR methods,\nplacing emphasis on deep neural networks. In particular we describe techniques\nthat have been proposed to extract and compare heterogeneous features emerging\nfrom different spectral bands. We also discuss the datasets that have been used\nfor evaluating CFR methods. Finally, we discuss the challenges and future lines\nof research on this topic.\n","authors":["David Anghelone","Cunjian Chen","Arun Ross","Antitza Dantcheva"],"pdf_url":"https://arxiv.org/pdf/2201.04435v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17835v3","updated":"2024-05-30T12:55:14Z","published":"2024-05-28T05:14:57Z","title":"Deform3DGS: Flexible Deformation for Fast Surgical Scene Reconstruction\n with Gaussian Splatting","summary":" Tissue deformation poses a key challenge for accurate surgical scene\nreconstruction. Despite yielding high reconstruction quality, existing methods\nsuffer from slow rendering speeds and long training times, limiting their\nintraoperative applicability. Motivated by recent progress in 3D Gaussian\nSplatting, an emerging technology in real-time 3D rendering, this work presents\na novel fast reconstruction framework, termed Deform3DGS, for deformable\ntissues during endoscopic surgery. Specifically, we introduce 3D GS into\nsurgical scenes by integrating a point cloud initialization to improve\nreconstruction. Furthermore, we propose a novel flexible deformation modeling\nscheme (FDM) to learn tissue deformation dynamics at the level of individual\nGaussians. Our FDM can model the surface deformation with efficient\nrepresentations, allowing for real-time rendering performance. More\nimportantly, FDM significantly accelerates surgical scene reconstruction,\ndemonstrating considerable clinical values, particularly in intraoperative\nsettings where time efficiency is crucial. Experiments on DaVinci robotic\nsurgery videos indicate the efficacy of our approach, showcasing superior\nreconstruction fidelity PSNR: (37.90) and rendering speed (338.8 FPS) while\nsubstantially reducing training time to only 1 minute/scene. Our code is\navailable at https://github.com/jinlab-imvr/Deform3DGS.\n","authors":["Shuojue Yang","Qian Li","Daiyun Shen","Bingchen Gong","Qi Dou","Yueming Jin"],"pdf_url":"https://arxiv.org/pdf/2405.17835v3.pdf","comment":"Early accepted at MICCAI 2024, 10 pages, 2 figures"},{"id":"http://arxiv.org/abs/2405.20008v1","updated":"2024-05-30T12:45:34Z","published":"2024-05-30T12:45:34Z","title":"Sharing Key Semantics in Transformer Makes Efficient Image Restoration","summary":" Image Restoration (IR), a classic low-level vision task, has witnessed\nsignificant advancements through deep models that effectively model global\ninformation. Notably, the Vision Transformers (ViTs) emergence has further\npropelled these advancements. When computing, the self-attention mechanism, a\ncornerstone of ViTs, tends to encompass all global cues, even those from\nsemantically unrelated objects or regions. This inclusivity introduces\ncomputational inefficiencies, particularly noticeable with high input\nresolution, as it requires processing irrelevant information, thereby impeding\nefficiency. Additionally, for IR, it is commonly noted that small segments of a\ndegraded image, particularly those closely aligned semantically, provide\nparticularly relevant information to aid in the restoration process, as they\ncontribute essential contextual cues crucial for accurate reconstruction. To\naddress these challenges, we propose boosting IR's performance by sharing the\nkey semantics via Transformer for IR (i.e., SemanIR) in this paper.\nSpecifically, SemanIR initially constructs a sparse yet comprehensive\nkey-semantic dictionary within each transformer stage by establishing essential\nsemantic connections for every degraded patch. Subsequently, this dictionary is\nshared across all subsequent transformer blocks within the same stage. This\nstrategy optimizes attention calculation within each block by focusing\nexclusively on semantically related components stored in the key-semantic\ndictionary. As a result, attention calculation achieves linear computational\ncomplexity within each window. Extensive experiments across 6 IR tasks confirm\nthe proposed SemanIR's state-of-the-art performance, quantitatively and\nqualitatively showcasing advancements.\n","authors":["Bin Ren","Yawei Li","Jingyun Liang","Rakesh Ranjan","Mengyuan Liu","Rita Cucchiara","Luc Van Gool","Ming-Hsuan Yang","Nicu Sebe"],"pdf_url":"https://arxiv.org/pdf/2405.20008v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2405.19996v1","updated":"2024-05-30T12:32:35Z","published":"2024-05-30T12:32:35Z","title":"DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in\n the Wild","summary":" Image quality assessment (IQA) plays a critical role in selecting\nhigh-quality images and guiding compression and enhancement methods in a series\nof applications. The blind IQA, which assesses the quality of in-the-wild\nimages containing complex authentic distortions without reference images, poses\ngreater challenges. Existing methods are limited to modeling a uniform\ndistribution with local patches and are bothered by the gap between low and\nhigh-level visions (caused by widely adopted pre-trained classification\nnetworks). In this paper, we propose a novel IQA method called diffusion\npriors-based IQA (DP-IQA), which leverages the prior knowledge from the\npre-trained diffusion model with its excellent powers to bridge semantic gaps\nin the perception of the visual quality of images. Specifically, we use\npre-trained stable diffusion as the backbone, extract multi-level features from\nthe denoising U-Net during the upsampling process at a specified timestep, and\ndecode them to estimate the image quality score. The text and image adapters\nare adopted to mitigate the domain gap for downstream tasks and correct the\ninformation loss caused by the variational autoencoder bottleneck. Finally, we\ndistill the knowledge in the above model into a CNN-based student model,\nsignificantly reducing the parameter to enhance applicability, with the student\nmodel performing similarly or even better than the teacher model surprisingly.\nExperimental results demonstrate that our DP-IQA achieves state-of-the-art\nresults on various in-the-wild datasets with better generalization capability,\nwhich shows the superiority of our method in global modeling and utilizing the\nhierarchical feature clues of diffusion for evaluating image quality.\n","authors":["Honghao Fu","Yufei Wang","Wenhan Yang","Bihan Wen"],"pdf_url":"https://arxiv.org/pdf/2405.19996v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19990v1","updated":"2024-05-30T12:22:06Z","published":"2024-05-30T12:22:06Z","title":"DiffPhysBA: Diffusion-based Physical Backdoor Attack against Person\n Re-Identification in Real-World","summary":" Person Re-Identification (ReID) systems pose a significant security risk from\nbackdoor attacks, allowing adversaries to evade tracking or impersonate others.\nBeyond recognizing this issue, we investigate how backdoor attacks can be\ndeployed in real-world scenarios, where a ReID model is typically trained on\ndata collected in the digital domain and then deployed in a physical\nenvironment. This attack scenario requires an attack flow that embeds backdoor\ntriggers in the digital domain realistically enough to also activate the buried\nbackdoor in person ReID models in the physical domain. This paper realizes this\nattack flow by leveraging a diffusion model to generate realistic accessories\non pedestrian images (e.g., bags, hats, etc.) as backdoor triggers. However,\nthe noticeable domain gap between the triggers generated by the off-the-shelf\ndiffusion model and their physical counterparts results in a low attack success\nrate. Therefore, we introduce a novel diffusion-based physical backdoor attack\n(DiffPhysBA) method that adopts a training-free similarity-guided sampling\nprocess to enhance the resemblance between generated and physical triggers.\nConsequently, DiffPhysBA can generate realistic attributes as semantic-level\ntriggers in the digital domain and provides higher physical ASR compared to the\ndirect paste method by 25.6% on the real-world test set. Through evaluations on\nnewly proposed real-world and synthetic ReID test sets, DiffPhysBA demonstrates\nan impressive success rate exceeding 90% in both the digital and physical\ndomains. Notably, it excels in digital stealth metrics and can effectively\nevade state-of-the-art defense methods.\n","authors":["Wenli Sun","Xinyang Jiang","Dongsheng Li","Cairong Zhao"],"pdf_url":"https://arxiv.org/pdf/2405.19990v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15240v2","updated":"2024-05-30T12:14:05Z","published":"2024-05-24T06:06:41Z","title":"Towards Real World Debiasing: A Fine-grained Analysis On Spurious\n Correlation","summary":" Spurious correlations in training data significantly hinder the\ngeneralization capability of machine learning models when faced with\ndistribution shifts in real-world scenarios. To tackle the problem, numerous\ndebias approaches have been proposed and benchmarked on datasets intentionally\ndesigned with severe biases. However, it remains to be asked: \\textit{1. Do\nexisting benchmarks really capture biases in the real world? 2. Can existing\ndebias methods handle biases in the real world?} To answer the questions, we\nrevisit biased distributions in existing benchmarks and real-world datasets,\nand propose a fine-grained framework for analyzing dataset bias by\ndisentangling it into the magnitude and prevalence of bias. We observe and\ntheoretically demonstrate that existing benchmarks poorly represent real-world\nbiases. We further introduce two novel biased distributions to bridge this gap,\nforming a nuanced evaluation framework for real-world debiasing. Building upon\nthese results, we evaluate existing debias methods with our evaluation\nframework. Results show that existing methods are incapable of handling\nreal-world biases. Through in-depth analysis, we propose a simple yet effective\napproach that can be easily applied to existing debias methods, named Debias in\nDestruction (DiD). Empirical results demonstrate the superiority of DiD,\nimproving the performance of existing methods on all types of biases within the\nproposed evaluation framework.\n","authors":["Zhibo Wang","Peng Kuang","Zhixuan Chu","Jingyi Wang","Kui Ren"],"pdf_url":"https://arxiv.org/pdf/2405.15240v2.pdf","comment":"9 pages of main paper, 10 pages of appendix"},{"id":"http://arxiv.org/abs/2308.14746v3","updated":"2024-05-30T11:52:33Z","published":"2023-08-28T17:55:33Z","title":"CoVR: Learning Composed Video Retrieval from Web Video Captions","summary":" Composed Image Retrieval (CoIR) has recently gained popularity as a task that\nconsiders both text and image queries together, to search for relevant images\nin a database. Most CoIR approaches require manually annotated datasets,\ncomprising image-text-image triplets, where the text describes a modification\nfrom the query image to the target image. However, manual curation of CoIR\ntriplets is expensive and prevents scalability. In this work, we instead\npropose a scalable automatic dataset creation methodology that generates\ntriplets given video-caption pairs, while also expanding the scope of the task\nto include composed video retrieval (CoVR). To this end, we mine paired videos\nwith a similar caption from a large database, and leverage a large language\nmodel to generate the corresponding modification text. Applying this\nmethodology to the extensive WebVid2M collection, we automatically construct\nour WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we\nintroduce a new benchmark for CoVR with a manually annotated evaluation set,\nalong with baseline results. Our experiments further demonstrate that training\na CoVR model on our dataset effectively transfers to CoIR, leading to improved\nstate-of-the-art performance in the zero-shot setup on both the CIRR and\nFashionIQ benchmarks. Our code, datasets, and models are publicly available at\nhttps://imagine.enpc.fr/~ventural/covr.\n","authors":["Lucas Ventura","Antoine Yang","Cordelia Schmid","Gül Varol"],"pdf_url":"https://arxiv.org/pdf/2308.14746v3.pdf","comment":"AAAI 2024, Updated the results on CIRR with the correct evaluation.\n Project page: Project page: https://imagine.enpc.fr/~ventural/covr/"},{"id":"http://arxiv.org/abs/2405.18416v2","updated":"2024-05-30T11:52:04Z","published":"2024-05-28T17:57:12Z","title":"3D StreetUnveiler with Semantic-Aware 2DGS","summary":" Unveiling an empty street from crowded observations captured by in-car\ncameras is crucial for autonomous driving. However, removing all temporarily\nstatic objects, such as stopped vehicles and standing pedestrians, presents a\nsignificant challenge. Unlike object-centric 3D inpainting, which relies on\nthorough observation in a small scene, street scene cases involve long\ntrajectories that differ from previous 3D inpainting tasks. The camera-centric\nmoving environment of captured videos further complicates the task due to the\nlimited degree and time duration of object observation. To address these\nobstacles, we introduce StreetUnveiler to reconstruct an empty street.\nStreetUnveiler learns a 3D representation of the empty street from crowded\nobservations. Our representation is based on the hard-label semantic 2D\nGaussian Splatting (2DGS) for its scalability and ability to identify Gaussians\nto be removed. We inpaint rendered image after removing unwanted Gaussians to\nprovide pseudo-labels and subsequently re-optimize the 2DGS. Given its temporal\ncontinuous movement, we divide the empty street scene into observed,\npartial-observed, and unobserved regions, which we propose to locate through a\nrendered alpha map. This decomposition helps us to minimize the regions that\nneed to be inpainted. To enhance the temporal consistency of the inpainting, we\nintroduce a novel time-reversal framework to inpaint frames in reverse order\nand use later frames as references for earlier frames to fully utilize the\nlong-trajectory observations. Our experiments conducted on the street scene\ndataset successfully reconstructed a 3D representation of the empty street. The\nmesh representation of the empty street can be extracted for further\napplications. The project page and more visualizations can be found at:\nhttps://streetunveiler.github.io\n","authors":["Jingwei Xu","Yikai Wang","Yiqun Zhao","Yanwei Fu","Shenghua Gao"],"pdf_url":"https://arxiv.org/pdf/2405.18416v2.pdf","comment":"Project page: https://streetunveiler.github.io"},{"id":"http://arxiv.org/abs/2402.03286v3","updated":"2024-05-30T11:42:15Z","published":"2024-02-05T18:42:34Z","title":"Training-Free Consistent Text-to-Image Generation","summary":" Text-to-image models offer a new level of creative flexibility by allowing\nusers to guide the image generation process through natural language. However,\nusing these models to consistently portray the same subject across diverse\nprompts remains challenging. Existing approaches fine-tune the model to teach\nit new words that describe specific user-provided subjects or add image\nconditioning to the model. These methods require lengthy per-subject\noptimization or large-scale pre-training. Moreover, they struggle to align\ngenerated images with text prompts and face difficulties in portraying multiple\nsubjects. Here, we present ConsiStory, a training-free approach that enables\nconsistent subject generation by sharing the internal activations of the\npretrained model. We introduce a subject-driven shared attention block and\ncorrespondence-based feature injection to promote subject consistency between\nimages. Additionally, we develop strategies to encourage layout diversity while\nmaintaining subject consistency. We compare ConsiStory to a range of baselines,\nand demonstrate state-of-the-art performance on subject consistency and text\nalignment, without requiring a single optimization step. Finally, ConsiStory\ncan naturally extend to multi-subject scenarios, and even enable training-free\npersonalization for common objects.\n","authors":["Yoad Tewel","Omri Kaduri","Rinon Gal","Yoni Kasten","Lior Wolf","Gal Chechik","Yuval Atzmon"],"pdf_url":"https://arxiv.org/pdf/2402.03286v3.pdf","comment":"Accepted to journal track of SIGGRAPH 2024 (TOG). Project page is at\n https://consistory-paper.github.io"},{"id":"http://arxiv.org/abs/2306.17574v2","updated":"2024-05-30T11:33:08Z","published":"2023-06-30T11:49:00Z","title":"SpATr: MoCap 3D Human Action Recognition based on Spiral Auto-encoder\n and Transformer Network","summary":" Recent technological advancements have significantly expanded the potential\nof human action recognition through harnessing the power of 3D data. This data\nprovides a richer understanding of actions, including depth information that\nenables more accurate analysis of spatial and temporal characteristics. In this\ncontext, We study the challenge of 3D human action recognition.Unlike prior\nmethods, that rely on sampling 2D depth images, skeleton points, or point\nclouds, often leading to substantial memory requirements and the ability to\nhandle only short sequences, we introduce a novel approach for 3D human action\nrecognition, denoted as SpATr (Spiral Auto-encoder and Transformer Network),\nspecifically designed for fixed-topology mesh sequences. The SpATr model\ndisentangles space and time in the mesh sequences. A lightweight auto-encoder,\nbased on spiral convolutions, is employed to extract spatial geometrical\nfeatures from each 3D mesh. These convolutions are lightweight and specifically\ndesigned for fix-topology mesh data. Subsequently, a temporal transformer,\nbased on self-attention, captures the temporal context within the feature\nsequence. The self-attention mechanism enables long-range dependencies\ncapturing and parallel processing, ensuring scalability for long sequences. The\nproposed method is evaluated on three prominent 3D human action datasets:\nBabel, MoVi, and BMLrub, from the Archive of Motion Capture As Surface Shapes\n(AMASS). Our results analysis demonstrates the competitive performance of our\nSpATr model in 3D human action recognition while maintaining efficient memory\nusage. The code and the training results will soon be made publicly available\nat https://github.com/h-bouzid/spatr.\n","authors":["Hamza Bouzid","Lahoucine Ballihi"],"pdf_url":"https://arxiv.org/pdf/2306.17574v2.pdf","comment":"Accepted in CVIU"},{"id":"http://arxiv.org/abs/2405.19957v1","updated":"2024-05-30T11:23:01Z","published":"2024-05-30T11:23:01Z","title":"PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting","summary":" As text-conditioned diffusion models (DMs) achieve breakthroughs in image,\nvideo, and 3D generation, the research community's focus has shifted to the\nmore challenging task of text-to-4D synthesis, which introduces a temporal\ndimension to generate dynamic 3D objects. In this context, we identify Score\nDistillation Sampling (SDS), a widely used technique for text-to-3D synthesis,\nas a significant hindrance to text-to-4D performance due to its Janus-faced and\ntexture-unrealistic problems coupled with high computational costs. In this\npaper, we propose \\textbf{P}ixel-\\textbf{L}evel \\textbf{A}lignments for\nText-to-\\textbf{4D} Gaussian Splatting (\\textbf{PLA4D}), a novel method that\nutilizes text-to-video frames as explicit pixel alignment targets to generate\nstatic 3D objects and inject motion into them. Specifically, we introduce Focal\nAlignment to calibrate camera poses for rendering and GS-Mesh Contrastive\nLearning to distill geometry priors from rendered image contrasts at the pixel\nlevel. Additionally, we develop Motion Alignment using a deformation network to\ndrive changes in Gaussians and implement Reference Refinement for smooth 4D\nobject surfaces. These techniques enable 4D Gaussian Splatting to align\ngeometry, texture, and motion with generated videos at the pixel level.\nCompared to previous methods, PLA4D produces synthesized outputs with better\ntexture details in less time and effectively mitigates the Janus-faced problem.\nPLA4D is fully implemented using open-source models, offering an accessible,\nuser-friendly, and promising direction for 4D digital content creation. Our\nproject page:\n\\href{https://github.com/MiaoQiaowei/PLA4D.github.io}{https://github.com/MiaoQiaowei/PLA4D.github.io}.\n","authors":["Qiaowei Miao","Yawei Luo","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2405.19957v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.06681v2","updated":"2024-05-30T11:16:49Z","published":"2024-03-11T12:56:36Z","title":"Trustworthy Partial Label Learning with Out-of-distribution Detection","summary":" Partial Label Learning (PLL) grapples with learning from ambiguously labelled\ndata, and it has been successfully applied in fields such as image recognition.\nNevertheless, traditional PLL methods rely on the closed-world assumption,\nwhich can be limiting in open-world scenarios and negatively impact model\nperformance and generalization. To tackle these challenges, our study\nintroduces a novel method called PLL-OOD, which is the first to incorporate\nOut-of-Distribution (OOD) detection into the PLL framework. PLL-OOD\nsignificantly enhances model adaptability and accuracy by merging\nself-supervised learning with partial label loss and pioneering the\nPartial-Energy (PE) score for OOD detection. This approach improves data\nfeature representation and effectively disambiguates candidate labels, using a\ndynamic label confidence matrix to refine predictions. The PE score, adjusted\nby label confidence, precisely identifies OOD instances, optimizing model\ntraining towards in-distribution data. This innovative method markedly boosts\nPLL model robustness and performance in open-world settings. To validate our\napproach, we conducted a comprehensive comparative experiment combining the\nexisting state-of-the-art PLL model with multiple OOD scores on the CIFAR-10\nand CIFAR-100 datasets with various OOD datasets. The results demonstrate that\nthe proposed PLL-OOD framework is highly effective and effectiveness\noutperforms existing models, showcasing its superiority and effectiveness.\n","authors":["Jintao Huang","Yiu-Ming Cheung"],"pdf_url":"https://arxiv.org/pdf/2403.06681v2.pdf","comment":"There are many errors in the Abstract, Introduction, Related Work,\n Proposed Method, Experiment and References of this paper, which need to be\n further corrected to avoid misleading. Therefore, it needs to be withdrawn"},{"id":"http://arxiv.org/abs/2405.19949v1","updated":"2024-05-30T11:11:54Z","published":"2024-05-30T11:11:54Z","title":"Hyper-Transformer for Amodal Completion","summary":" Amodal object completion is a complex task that involves predicting the\ninvisible parts of an object based on visible segments and background\ninformation. Learning shape priors is crucial for effective amodal completion,\nbut traditional methods often rely on two-stage processes or additional\ninformation, leading to inefficiencies and potential error accumulation. To\naddress these shortcomings, we introduce a novel framework named the\nHyper-Transformer Amodal Network (H-TAN). This framework utilizes a hyper\ntransformer equipped with a dynamic convolution head to directly learn shape\npriors and accurately predict amodal masks. Specifically, H-TAN uses a\ndual-branch structure to extract multi-scale features from both images and\nmasks. The multi-scale features from the image branch guide the hyper\ntransformer in learning shape priors and in generating the weights for dynamic\nconvolution tailored to each instance. The dynamic convolution head then uses\nthe features from the mask branch to predict precise amodal masks. We\nextensively evaluate our model on three benchmark datasets: KINS, COCOA-cls,\nand D2SA, where H-TAN demonstrated superior performance compared to existing\nmethods. Additional experiments validate the effectiveness and stability of the\nnovel hyper transformer in our framework.\n","authors":["Jianxiong Gao","Xuelin Qian","Longfei Liang","Junwei Han","Yanwei Fu"],"pdf_url":"https://arxiv.org/pdf/2405.19949v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19943v1","updated":"2024-05-30T11:03:27Z","published":"2024-05-30T11:03:27Z","title":"Multi-View People Detection in Large Scenes via Supervised View-Wise\n Contribution Weighting","summary":" Recent deep learning-based multi-view people detection (MVD) methods have\nshown promising results on existing datasets. However, current methods are\nmainly trained and evaluated on small, single scenes with a limited number of\nmulti-view frames and fixed camera views. As a result, these methods may not be\npractical for detecting people in larger, more complex scenes with severe\nocclusions and camera calibration errors. This paper focuses on improving\nmulti-view people detection by developing a supervised view-wise contribution\nweighting approach that better fuses multi-camera information under large\nscenes. Besides, a large synthetic dataset is adopted to enhance the model's\ngeneralization ability and enable more practical evaluation and comparison. The\nmodel's performance on new testing scenes is further improved with a simple\ndomain adaptation technique. Experimental results demonstrate the effectiveness\nof our approach in achieving promising cross-scene multi-view people detection\nperformance. See code here: https://vcc.tech/research/2024/MVD.\n","authors":["Qi Zhang","Yunfei Gong","Daijie Chen","Antoni B. Chan","Hui Huang"],"pdf_url":"https://arxiv.org/pdf/2405.19943v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2405.18762v2","updated":"2024-05-30T10:58:56Z","published":"2024-05-29T05:04:07Z","title":"Inpaint Biases: A Pathway to Accurate and Unbiased Image Generation","summary":" This paper examines the limitations of advanced text-to-image models in\naccurately rendering unconventional concepts which are scarcely represented or\nabsent in their training datasets. We identify how these limitations not only\nconfine the creative potential of these models but also pose risks of\nreinforcing stereotypes. To address these challenges, we introduce the Inpaint\nBiases framework, which employs user-defined masks and inpainting techniques to\nenhance the accuracy of image generation, particularly for novel or\ninaccurately rendered objects. Through experimental validation, we demonstrate\nhow this framework significantly improves the fidelity of generated images to\nthe user's intent, thereby expanding the models' creative capabilities and\nmitigating the risk of perpetuating biases. Our study contributes to the\nadvancement of text-to-image models as unbiased, versatile tools for creative\nexpression.\n","authors":["Jiyoon Myung","Jihyeon Park"],"pdf_url":"https://arxiv.org/pdf/2405.18762v2.pdf","comment":"Paper accepted in CVPRW 2024"},{"id":"http://arxiv.org/abs/2405.19931v1","updated":"2024-05-30T10:47:48Z","published":"2024-05-30T10:47:48Z","title":"Exploring Diffusion Models' Corruption Stage in Few-Shot Fine-tuning and\n Mitigating with Bayesian Neural Networks","summary":" Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement,\nsignificantly reducing training costs and enabling personalized AI\napplications. However, we explore the training dynamics of DMs and observe an\nunanticipated phenomenon: during the training process, image fidelity initially\nimproves, then unexpectedly deteriorates with the emergence of noisy patterns,\nonly to recover later with severe overfitting. We term the stage with generated\nnoisy patterns as corruption stage. To understand this corruption stage, we\nbegin by theoretically modeling the one-shot fine-tuning scenario, and then\nextend this modeling to more general cases. Through this modeling, we identify\nthe primary cause of this corruption stage: a narrowed learning distribution\ninherent in the nature of few-shot fine-tuning. To tackle this, we apply\nBayesian Neural Networks (BNNs) on DMs with variational inference to implicitly\nbroaden the learned distribution, and present that the learning target of the\nBNNs can be naturally regarded as an expectation of the diffusion loss and a\nfurther regularization with the pretrained DMs. This approach is highly\ncompatible with current few-shot fine-tuning methods in DMs and does not\nintroduce any extra inference costs. Experimental results demonstrate that our\nmethod significantly mitigates corruption, and improves the fidelity, quality\nand diversity of the generated images in both object-driven and subject-driven\ngeneration tasks.\n","authors":["Xiaoyu Wu","Jiaru Zhang","Yang Hua","Bohan Lyu","Hao Wang","Tao Song","Haibing Guan"],"pdf_url":"https://arxiv.org/pdf/2405.19931v1.pdf","comment":"Preprint. Under review"},{"id":"http://arxiv.org/abs/2309.15703v3","updated":"2024-05-30T10:46:20Z","published":"2023-09-27T14:46:01Z","title":"Physics-Based Rigid Body Object Tracking and Friction Filtering From\n RGB-D Videos","summary":" Physics-based understanding of object interactions from sensory observations\nis an essential capability in augmented reality and robotics. It enables to\ncapture the properties of a scene for simulation and control. In this paper, we\npropose a novel approach for real-to-sim which tracks rigid objects in 3D from\nRGB-D images and infers physical properties of the objects. We use a\ndifferentiable physics simulation as state-transition model in an Extended\nKalman Filter which can model contact and friction for arbitrary mesh-based\nshapes and in this way estimate physically plausible trajectories. We\ndemonstrate that our approach can filter position, orientation, velocities, and\nconcurrently can estimate the coefficient of friction of the objects. We\nanalyze our approach on various sliding scenarios in synthetic image sequences\nof single objects and colliding objects. We also demonstrate and evaluate our\napproach on a real-world dataset. We make our novel benchmark datasets publicly\navailable to foster future research in this novel problem setting and\ncomparison with our method.\n","authors":["Rama Krishna Kandukuri","Michael Strecke","Joerg Stueckler"],"pdf_url":"https://arxiv.org/pdf/2309.15703v3.pdf","comment":"33 pages, 35 figures, accepted for publication at 3DV 2024, includes\n supplementary material of the conference submission"},{"id":"http://arxiv.org/abs/2405.19203v2","updated":"2024-05-30T10:38:09Z","published":"2024-05-29T15:43:49Z","title":"$E^{3}$Gen: Efficient, Expressive and Editable Avatars Generation","summary":" This paper aims to introduce 3D Gaussian for efficient, expressive, and\neditable digital avatar generation. This task faces two major challenges: (1)\nThe unstructured nature of 3D Gaussian makes it incompatible with current\ngeneration pipelines; (2) the expressive animation of 3D Gaussian in a\ngenerative setting that involves training with multiple subjects remains\nunexplored. In this paper, we propose a novel avatar generation method named\n$E^3$Gen, to effectively address these challenges. First, we propose a novel\ngenerative UV features plane representation that encodes unstructured 3D\nGaussian onto a structured 2D UV space defined by the SMPL-X parametric model.\nThis novel representation not only preserves the representation ability of the\noriginal 3D Gaussian but also introduces a shared structure among subjects to\nenable generative learning of the diffusion model. To tackle the second\nchallenge, we propose a part-aware deformation module to achieve robust and\naccurate full-body expressive pose control. Extensive experiments demonstrate\nthat our method achieves superior performance in avatar generation and enables\nexpressive full-body pose control and editing. Our project page is\nhttps://olivia23333.github.io/E3Gen.\n","authors":["Weitian Zhang","Yichao Yan","Yunhui Liu","Xingdong Sheng","Xiaokang Yang"],"pdf_url":"https://arxiv.org/pdf/2405.19203v2.pdf","comment":"Project Page: https://olivia23333.github.io/E3Gen"},{"id":"http://arxiv.org/abs/2405.19921v1","updated":"2024-05-30T10:33:14Z","published":"2024-05-30T10:33:14Z","title":"MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by\n Filtering with Self-Supervised Geometry and Motion","summary":" Autonomous systems, such as self-driving cars, rely on reliable semantic\nenvironment perception for decision making. Despite great advances in video\nsemantic segmentation, existing approaches ignore important inductive biases\nand lack structured and interpretable internal representations. In this work,\nwe propose MCDS-VSS, a structured filter model that learns in a self-supervised\nmanner to estimate scene geometry and ego-motion of the camera, while also\nestimating the motion of external objects. Our model leverages these\nrepresentations to improve the temporal consistency of semantic segmentation\nwithout sacrificing segmentation accuracy. MCDS-VSS follows a prediction-fusion\napproach in which scene geometry and camera motion are first used to compensate\nfor ego-motion, then residual flow is used to compensate motion of dynamic\nobjects, and finally the predicted scene features are fused with the current\nfeatures to obtain a temporally consistent scene segmentation. Our model parses\nautomotive scenes into multiple decoupled interpretable representations such as\nscene geometry, ego-motion, and object motion. Quantitative evaluation shows\nthat MCDS-VSS achieves superior temporal consistency on video sequences while\nretaining competitive segmentation performance.\n","authors":["Angel Villar-Corrales","Moritz Austermann","Sven Behnke"],"pdf_url":"https://arxiv.org/pdf/2405.19921v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19917v1","updated":"2024-05-30T10:30:07Z","published":"2024-05-30T10:30:07Z","title":"Multimodal Cross-Domain Few-Shot Learning for Egocentric Action\n Recognition","summary":" We address a novel cross-domain few-shot learning task (CD-FSL) with\nmultimodal input and unlabeled target data for egocentric action recognition.\nThis paper simultaneously tackles two critical challenges associated with\negocentric action recognition in CD-FSL settings: (1) the extreme domain gap in\negocentric videos (\\eg, daily life vs. industrial domain) and (2) the\ncomputational cost for real-world applications. We propose MM-CDFSL, a\ndomain-adaptive and computationally efficient approach designed to enhance\nadaptability to the target domain and improve inference speed. To address the\nfirst challenge, we propose the incorporation of multimodal distillation into\nthe student RGB model using teacher models. Each teacher model is trained\nindependently on source and target data for its respective modality. Leveraging\nonly unlabeled target data during multimodal distillation enhances the student\nmodel's adaptability to the target domain. We further introduce ensemble masked\ninference, a technique that reduces the number of input tokens through masking.\nIn this approach, ensemble prediction mitigates the performance degradation\ncaused by masking, effectively addressing the second issue. Our approach\noutperformed the state-of-the-art CD-FSL approaches with a substantial margin\non multiple egocentric datasets, improving by an average of 6.12/6.10 points\nfor 1-shot/5-shot settings while achieving $2.2$ times faster inference speed.\nProject page: https://masashi-hatano.github.io/MM-CDFSL/\n","authors":["Masashi Hatano","Ryo Hachiuma","Ryo Fuji","Hideo Saito"],"pdf_url":"https://arxiv.org/pdf/2405.19917v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19914v1","updated":"2024-05-30T10:25:50Z","published":"2024-05-30T10:25:50Z","title":"Towards RGB-NIR Cross-modality Image Registration and Beyond","summary":" This paper focuses on the area of RGB(visible)-NIR(near-infrared)\ncross-modality image registration, which is crucial for many downstream vision\ntasks to fully leverage the complementary information present in visible and\ninfrared images. In this field, researchers face two primary challenges - the\nabsence of a correctly-annotated benchmark with viewpoint variations for\nevaluating RGB-NIR cross-modality registration methods and the problem of\ninconsistent local features caused by the appearance discrepancy between\nRGB-NIR cross-modality images. To address these challenges, we first present\nthe RGB-NIR Image Registration (RGB-NIR-IRegis) benchmark, which, for the first\ntime, enables fair and comprehensive evaluations for the task of RGB-NIR\ncross-modality image registration. Evaluations of previous methods highlight\nthe significant challenges posed by our RGB-NIR-IRegis benchmark, especially on\nRGB-NIR image pairs with viewpoint variations. To analyze the causes of the\nunsatisfying performance, we then design several metrics to reveal the toxic\nimpact of inconsistent local features between visible and infrared images on\nthe model performance. This further motivates us to develop a baseline method\nnamed Semantic Guidance Transformer (SGFormer), which utilizes high-level\nsemantic guidance to mitigate the negative impact of local inconsistent\nfeatures. Despite the simplicity of our motivation, extensive experimental\nresults show the effectiveness of our method.\n","authors":["Huadong Li","Shichao Dong","Jin Wang","Rong Fu","Minhao Jing","Jiajun Liang","Haoqiang Fan","Renhe Ji"],"pdf_url":"https://arxiv.org/pdf/2405.19914v1.pdf","comment":"18 pages, 7 figures"},{"id":"http://arxiv.org/abs/2306.00530v3","updated":"2024-05-30T10:18:44Z","published":"2023-06-01T10:29:58Z","title":"CL-MRI: Self-Supervised Contrastive Learning to Improve the Accuracy of\n Undersampled MRI Reconstruction","summary":" In Magnetic Resonance Imaging (MRI), image acquisitions are often\nundersampled in the measurement domain to accelerate the scanning process, at\nthe expense of image quality. However, image quality is a crucial factor that\ninfluences the accuracy of clinical diagnosis; hence, high-quality image\nreconstruction from undersampled measurements has been a key area of research.\nRecently, deep learning (DL) methods have emerged as the state-of-the-art for\nMRI reconstruction, typically involving deep neural networks to transform\nundersampled MRI images into high-quality MRI images through data-driven\nprocesses. Nevertheless, there is clear and significant room for improvement in\nundersampled DL MRI reconstruction to meet the high standards required for\nclinical diagnosis, in terms of eliminating aliasing artifacts and reducing\nimage noise. In this paper, we introduce a self-supervised pretraining\nprocedure using contrastive learning to improve the accuracy of undersampled DL\nMRI reconstruction. We use contrastive learning to transform the MRI image\nrepresentations into a latent space that maximizes mutual information among\ndifferent undersampled representations and optimizes the information content at\nthe input of the downstream DL reconstruction models. Our experiments\ndemonstrate improved reconstruction accuracy across a range of acceleration\nfactors and datasets, both quantitatively and qualitatively. Furthermore, our\nextended experiments validate the proposed framework's robustness under\nadversarial conditions, such as measurement noise, different k-space sampling\npatterns, and pathological abnormalities, and also prove the transfer learning\ncapabilities on MRI datasets with completely different anatomy. Additionally,\nwe conducted experiments to visualize and analyze the properties of the\nproposed MRI contrastive learning latent space.\n","authors":["Mevan Ekanayake","Zhifeng Chen","Mehrtash Harandi","Gary Egan","Zhaolin Chen"],"pdf_url":"https://arxiv.org/pdf/2306.00530v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.00618v2","updated":"2024-05-30T10:16:04Z","published":"2024-03-31T09:10:32Z","title":"A Multi-Branched Radial Basis Network Approach to Predicting Complex\n Chaotic Behaviours","summary":" In this study, we propose a multi branched network approach to predict the\ndynamics of a physics attractor characterized by intricate and chaotic\nbehavior. We introduce a unique neural network architecture comprised of Radial\nBasis Function (RBF) layers combined with an attention mechanism designed to\neffectively capture nonlinear inter-dependencies inherent in the attractor's\ntemporal evolution. Our results demonstrate successful prediction of the\nattractor's trajectory across 100 predictions made using a real-world dataset\nof 36,700 time-series observations encompassing approximately 28 minutes of\nactivity. To further illustrate the performance of our proposed technique, we\nprovide comprehensive visualizations depicting the attractor's original and\npredicted behaviors alongside quantitative measures comparing observed versus\nestimated outcomes. Overall, this work showcases the potential of advanced\nmachine learning algorithms in elucidating hidden structures in complex\nphysical systems while offering practical applications in various domains\nrequiring accurate short-term forecasting capabilities.\n","authors":["Aarush Sinha"],"pdf_url":"https://arxiv.org/pdf/2404.00618v2.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2402.02407v2","updated":"2024-05-30T10:13:13Z","published":"2024-02-04T08:57:42Z","title":"Defining Neural Network Architecture through Polytope Structures of\n Dataset","summary":" Current theoretical and empirical research in neural networks suggests that\ncomplex datasets require large network architectures for thorough\nclassification, yet the precise nature of this relationship remains unclear.\nThis paper tackles this issue by defining upper and lower bounds for neural\nnetwork widths, which are informed by the polytope structure of the dataset in\nquestion. We also delve into the application of these principles to simplicial\ncomplexes and specific manifold shapes, explaining how the requirement for\nnetwork width varies in accordance with the geometric complexity of the\ndataset. Moreover, we develop an algorithm to investigate a converse situation\nwhere the polytope structure of a dataset can be inferred from its\ncorresponding trained neural networks. Through our algorithm, it is established\nthat popular datasets such as MNIST, Fashion-MNIST, and CIFAR10 can be\nefficiently encapsulated using no more than two polytopes with a small number\nof faces.\n","authors":["Sangmin Lee","Abbas Mammadov","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2402.02407v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19899v1","updated":"2024-05-30T09:55:19Z","published":"2024-05-30T09:55:19Z","title":"Open-Set Domain Adaptation for Semantic Segmentation","summary":" Unsupervised domain adaptation (UDA) for semantic segmentation aims to\ntransfer the pixel-wise knowledge from the labeled source domain to the\nunlabeled target domain. However, current UDA methods typically assume a shared\nlabel space between source and target, limiting their applicability in\nreal-world scenarios where novel categories may emerge in the target domain. In\nthis paper, we introduce Open-Set Domain Adaptation for Semantic Segmentation\n(OSDA-SS) for the first time, where the target domain includes unknown classes.\nWe identify two major problems in the OSDA-SS scenario as follows: 1) the\nexisting UDA methods struggle to predict the exact boundary of the unknown\nclasses, and 2) they fail to accurately predict the shape of the unknown\nclasses. To address these issues, we propose Boundary and Unknown Shape-Aware\nopen-set domain adaptation, coined BUS. Our BUS can accurately discern the\nboundaries between known and unknown classes in a contrastive manner using a\nnovel dilation-erosion-based contrastive loss. In addition, we propose\nOpenReMix, a new domain mixing augmentation method that guides our model to\neffectively learn domain and size-invariant features for improving the shape\ndetection of the known and unknown classes. Through extensive experiments, we\ndemonstrate that our proposed BUS effectively detects unknown classes in the\nchallenging OSDA-SS scenario compared to the previous methods by a large\nmargin. The code is available at https://github.com/KHU-AGI/BUS.\n","authors":["Seun-An Choe","Ah-Hyung Shin","Keon-Hee Park","Jinwoo Choi","Gyeong-Moon Park"],"pdf_url":"https://arxiv.org/pdf/2405.19899v1.pdf","comment":"14 pages, 5 figures, 13 tables, CVPR 2024 Poster"},{"id":"http://arxiv.org/abs/2403.16539v2","updated":"2024-05-30T09:42:26Z","published":"2024-03-25T08:31:14Z","title":"Data-Efficient 3D Visual Grounding via Order-Aware Referring","summary":" 3D visual grounding aims to identify the target object within a 3D point\ncloud scene referred to by a natural language description. Previous works\nusually require significant data relating to point color and their descriptions\nto exploit the corresponding complicated verbo-visual relations. In our work,\nwe introduce Vigor, a novel Data-Efficient 3D Visual Grounding framework via\nOrder-aware Referring. Vigor leverages LLM to produce a desirable referential\norder from the input description for 3D visual grounding. With the proposed\nstacked object-referring blocks, the predicted anchor objects in the above\norder allow one to locate the target object progressively without supervision\non the identities of anchor objects or exact relations between anchor/target\nobjects. In addition, we present an order-aware warm-up training strategy,\nwhich augments referential orders for pre-training the visual grounding\nframework. This allows us to better capture the complex verbo-visual relations\nand benefit the desirable data-efficient learning scheme. Experimental results\non the NR3D and ScanRefer datasets demonstrate our superiority in low-resource\nscenarios. In particular, Vigor surpasses current state-of-the-art frameworks\nby 9.3% and 7.6% grounding accuracy under 1% data and 10% data settings on the\nNR3D dataset, respectively.\n","authors":["Tung-Yu Wu","Sheng-Yu Huang","Yu-Chiang Frank Wang"],"pdf_url":"https://arxiv.org/pdf/2403.16539v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19882v1","updated":"2024-05-30T09:41:10Z","published":"2024-05-30T09:41:10Z","title":"PixOOD: Pixel-Level Out-of-Distribution Detection","summary":" We propose a dense image prediction out-of-distribution detection algorithm,\ncalled PixOOD, which does not require training on samples of anomalous data and\nis not designed for a specific application which avoids traditional training\nbiases. In order to model the complex intra-class variability of the\nin-distribution data at the pixel level, we propose an online data condensation\nalgorithm which is more robust than standard K-means and is easily trainable\nthrough SGD. We evaluate PixOOD on a wide range of problems. It achieved\nstate-of-the-art results on four out of seven datasets, while being competitive\non the rest. The source code is available at https://github.com/vojirt/PixOOD.\n","authors":["Tomáš Vojíř","Jan Šochman","Jiří Matas"],"pdf_url":"https://arxiv.org/pdf/2405.19882v1.pdf","comment":"under review at ECCV 2024"},{"id":"http://arxiv.org/abs/2405.19876v1","updated":"2024-05-30T09:30:28Z","published":"2024-05-30T09:30:28Z","title":"IReNe: Instant Recoloring in Neural Radiance Fields","summary":" Advances in NERFs have allowed for 3D scene reconstructions and novel view\nsynthesis. Yet, efficiently editing these representations while retaining\nphotorealism is an emerging challenge. Recent methods face three primary\nlimitations: they're slow for interactive use, lack precision at object\nboundaries, and struggle to ensure multi-view consistency. We introduce IReNe\nto address these limitations, enabling swift, near real-time color editing in\nNeRF. Leveraging a pre-trained NeRF model and a single training image with\nuser-applied color edits, IReNe swiftly adjusts network parameters in seconds.\nThis adjustment allows the model to generate new scene views, accurately\nrepresenting the color changes from the training image while also controlling\nobject boundaries and view-specific effects. Object boundary control is\nachieved by integrating a trainable segmentation module into the model. The\nprocess gains efficiency by retraining only the weights of the last network\nlayer. We observed that neurons in this layer can be classified into those\nresponsible for view-dependent appearance and those contributing to diffuse\nappearance. We introduce an automated classification approach to identify these\nneuron types and exclusively fine-tune the weights of the diffuse neurons. This\nfurther accelerates training and ensures consistent color edits across\ndifferent views. A thorough validation on a new dataset, with edited object\ncolors, shows significant quantitative and qualitative advancements over\ncompetitors, accelerating speeds by 5x to 500x.\n","authors":["Alessio Mazzucchelli","Adrian Garcia-Garcia","Elena Garces","Fernando Rivas-Manzaneque","Francesc Moreno-Noguer","Adrian Penate-Sanchez"],"pdf_url":"https://arxiv.org/pdf/2405.19876v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.03917v3","updated":"2024-05-30T09:15:06Z","published":"2024-02-06T11:35:02Z","title":"Elastic Feature Consolidation for Cold Start Exemplar-Free Incremental\n Learning","summary":" Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a\nsequence of tasks without having access to previous task data. In this paper,\nwe consider the challenging Cold Start scenario in which insufficient data is\navailable in the first task to learn a high-quality backbone. This is\nespecially challenging for EFCIL since it requires high plasticity, which\nresults in feature drift which is difficult to compensate for in the\nexemplar-free setting. To address this problem, we propose a simple and\neffective approach that consolidates feature representations by regularizing\ndrift in directions highly relevant to previous tasks and employs prototypes to\nreduce task-recency bias. Our method, called Elastic Feature Consolidation\n(EFC), exploits a tractable second-order approximation of feature drift based\non an Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in\nfeature space which we use to regularize feature drift in important directions\nand to update Gaussian prototypes used in a novel asymmetric cross entropy loss\nwhich effectively balances prototype rehearsal with data from new tasks.\nExperimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset and\nImageNet-1K demonstrate that Elastic Feature Consolidation is better able to\nlearn new tasks by maintaining model plasticity and significantly outperform\nthe state-of-the-art.\n","authors":["Simone Magistri","Tomaso Trinci","Albin Soutif-Cormerais","Joost van de Weijer","Andrew D. Bagdanov"],"pdf_url":"https://arxiv.org/pdf/2402.03917v3.pdf","comment":"Accepted at Twelfth International Conference on Learning\n Representations (ICLR 2024)"},{"id":"http://arxiv.org/abs/2405.19861v1","updated":"2024-05-30T09:10:33Z","published":"2024-05-30T09:10:33Z","title":"Hierarchical Object-Centric Learning with Capsule Networks","summary":" Capsule networks (CapsNets) were introduced to address convolutional neural\nnetworks limitations, learning object-centric representations that are more\nrobust, pose-aware, and interpretable. They organize neurons into groups called\ncapsules, where each capsule encodes the instantiation parameters of an object\nor one of its parts. Moreover, a routing algorithm connects capsules in\ndifferent layers, thereby capturing hierarchical part-whole relationships in\nthe data.\n This thesis investigates the intriguing aspects of CapsNets and focuses on\nthree key questions to unlock their full potential. First, we explore the\neffectiveness of the routing algorithm, particularly in small-sized networks.\nWe propose a novel method that anneals the number of routing iterations during\ntraining, enhancing performance in architectures with fewer parameters.\n Secondly, we investigate methods to extract more effective first-layer\ncapsules, also known as primary capsules. By exploiting pruned backbones, we\naim to improve computational efficiency by reducing the number of capsules\nwhile achieving high generalization. This approach reduces CapsNets memory\nrequirements and computational effort.\n Third, we explore part-relationship learning in CapsNets. Through extensive\nresearch, we demonstrate that capsules with low entropy can extract more\nconcise and discriminative part-whole relationships compared to traditional\ncapsule networks, even with reasonable network sizes.\n Lastly, we showcase how CapsNets can be utilized in real-world applications,\nincluding autonomous localization of unmanned aerial vehicles, quaternion-based\nrotations prediction in synthetic datasets, and lung nodule segmentation in\nbiomedical imaging.\n The findings presented in this thesis contribute to a deeper understanding of\nCapsNets and highlight their potential to address complex computer vision\nchallenges.\n","authors":["Riccardo Renzulli"],"pdf_url":"https://arxiv.org/pdf/2405.19861v1.pdf","comment":"Updated version of my PhD thesis (Nov 2023), with fixed typos. Will\n keep updated as new typos are discovered!"},{"id":"http://arxiv.org/abs/2405.19092v2","updated":"2024-05-30T09:06:07Z","published":"2024-05-29T13:54:12Z","title":"Benchmarking and Improving Detail Image Caption","summary":" Image captioning has long been regarded as a fundamental task in visual\nunderstanding. Recently, however, few large vision-language model (LVLM)\nresearch discusses model's image captioning performance because of the outdated\nshort-caption benchmarks and unreliable evaluation metrics. In this work, we\npropose to benchmark detail image caption task by curating high-quality\nevaluation datasets annotated by human experts, GPT-4V and Gemini-1.5-Pro. We\nalso design a more reliable caption evaluation metric called CAPTURE (CAPtion\nevaluation by exTracting and coUpling coRE information). CAPTURE extracts\nvisual elements, e.g., objects, attributes and relations from captions, and\nthen matches these elements through three stages, achieving the highest\nconsistency with expert judgements over other rule-based or model-based caption\nmetrics. The proposed benchmark and metric provide reliable evaluation for\nLVLM's detailed image captioning ability. Guided by this evaluation, we further\nexplore to unleash LVLM's detail caption capabilities by synthesizing\nhigh-quality data through a five-stage data construction pipeline. Our pipeline\nonly uses a given LVLM itself and other open-source tools, without any human or\nGPT-4V annotation in the loop. Experiments show that the proposed data\nconstruction strategy significantly improves model-generated detail caption\ndata quality for LVLMs with leading performance, and the data quality can be\nfurther improved in a self-looping paradigm. All code and dataset will be\npublicly available at https://github.com/foundation-multimodal-models/CAPTURE.\n","authors":["Hongyuan Dong","Jiawen Li","Bohong Wu","Jiacong Wang","Yuan Zhang","Haoyuan Guo"],"pdf_url":"https://arxiv.org/pdf/2405.19092v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19854v1","updated":"2024-05-30T09:03:23Z","published":"2024-05-30T09:03:23Z","title":"RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection","summary":" Open-vocabulary object detection (OVD) requires solid modeling of the\nregion-semantic relationship, which could be learned from massive region-text\npairs. However, such data is limited in practice due to significant annotation\ncosts. In this work, we propose RTGen to generate scalable open-vocabulary\nregion-text pairs and demonstrate its capability to boost the performance of\nopen-vocabulary object detection. RTGen includes both text-to-region and\nregion-to-text generation processes on scalable image-caption data. The\ntext-to-region generation is powered by image inpainting, directed by our\nproposed scene-aware inpainting guider for overall layout harmony. For\nregion-to-text generation, we perform multiple region-level image captioning\nwith various prompts and select the best matching text according to CLIP\nsimilarity. To facilitate detection training on region-text pairs, we also\nintroduce a localization-aware region-text contrastive loss that learns object\nproposals tailored with different localization qualities. Extensive experiments\ndemonstrate that our RTGen can serve as a scalable, semantically rich, and\neffective source for open-vocabulary object detection and continue to improve\nthe model performance when more data is utilized, delivering superior\nperformance compared to the existing state-of-the-art methods.\n","authors":["Fangyi Chen","Han Zhang","Zhantao Yang","Hao Chen","Kai Hu","Marios Savvides"],"pdf_url":"https://arxiv.org/pdf/2405.19854v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2405.19833v1","updated":"2024-05-30T08:44:12Z","published":"2024-05-30T08:44:12Z","title":"KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation","summary":" 2D keypoints are commonly used as an additional cue to refine estimated 3D\nhuman meshes. Current methods optimize the pose and shape parameters with a\nreprojection loss on the provided 2D keypoints. Such an approach, while simple\nand intuitive, has limited effectiveness because the optimal solution is hard\nto find in ambiguous parameter space and may sacrifice depth. Additionally,\ndivergent gradients from distal joints complicate and deviate the refinement of\nproximal joints in the kinematic chain. To address these, we introduce\nKinematic-Tree Rotation (KITRO), a novel mesh refinement strategy that\nexplicitly models depth and human kinematic-tree structure. KITRO treats\nrefinement from a bone-wise perspective. Unlike previous methods which perform\ngradient-based optimizations, our method calculates bone directions in closed\nform. By accounting for the 2D pose, bone length, and parent joint's depth, the\ncalculation results in two possible directions for each child joint. We then\nuse a decision tree to trace binary choices for all bones along the human\nskeleton's kinematic-tree to select the most probable hypothesis. Our\nexperiments across various datasets and baseline models demonstrate that KITRO\nsignificantly improves 3D joint estimation accuracy and achieves an ideal 2D\nfit simultaneously. Our code available at: https://github.com/MartaYang/KITRO.\n","authors":["Fengyuan Yang","Kerui Gu","Angela Yao"],"pdf_url":"https://arxiv.org/pdf/2405.19833v1.pdf","comment":"Accepted by CVPR24"},{"id":"http://arxiv.org/abs/2210.14562v2","updated":"2024-05-30T08:38:50Z","published":"2022-10-26T08:46:24Z","title":"FairCLIP: Social Bias Elimination based on Attribute Prototype Learning\n and Representation Neutralization","summary":" The Vision-Language Pre-training (VLP) models like CLIP have gained\npopularity in recent years. However, many works found that the social biases\nhidden in CLIP easily manifest in downstream tasks, especially in image\nretrieval, which can have harmful effects on human society. In this work, we\npropose FairCLIP to eliminate the social bias in CLIP-based image retrieval\nwithout damaging the retrieval performance achieving the compatibility between\nthe debiasing effect and the retrieval performance. FairCLIP is divided into\ntwo steps: Attribute Prototype Learning (APL) and Representation Neutralization\n(RN). In the first step, we extract the concepts needed for debiasing in CLIP.\nWe use the query with learnable word vector prefixes as the extraction\nstructure. In the second step, we first divide the attributes into target and\nbias attributes. By analysis, we find that both attributes have an impact on\nthe bias. Therefore, we try to eliminate the bias by using Re-Representation\nMatrix (RRM) to achieve the neutralization of the representation. We compare\nthe debiasing effect and retrieval performance with other methods, and\nexperiments demonstrate that FairCLIP can achieve the best compatibility.\nAlthough FairCLIP is used to eliminate bias in image retrieval, it achieves the\nneutralization of the representation which is common to all CLIP downstream\ntasks. This means that FairCLIP can be applied as a general debiasing method\nfor other fairness issues related to CLIP.\n","authors":["Junyang Wang","Yi Zhang","Jitao Sang"],"pdf_url":"https://arxiv.org/pdf/2210.14562v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19822v1","updated":"2024-05-30T08:31:01Z","published":"2024-05-30T08:31:01Z","title":"Improving Object Detector Training on Synthetic Data by Starting With a\n Strong Baseline Methodology","summary":" Collecting and annotating real-world data for the development of object\ndetection models is a time-consuming and expensive process. In the military\ndomain in particular, data collection can also be dangerous or infeasible.\nTraining models on synthetic data may provide a solution for cases where access\nto real-world training data is restricted. However, bridging the reality gap\nbetween synthetic and real data remains a challenge. Existing methods usually\nbuild on top of baseline Convolutional Neural Network (CNN) models that have\nbeen shown to perform well when trained on real data, but have limited ability\nto perform well when trained on synthetic data. For example, some architectures\nallow for fine-tuning with the expectation of large quantities of training data\nand are prone to overfitting on synthetic data. Related work usually ignores\nvarious best practices from object detection on real data, e.g. by training on\nsynthetic data from a single environment with relatively little variation. In\nthis paper we propose a methodology for improving the performance of a\npre-trained object detector when training on synthetic data. Our approach\nfocuses on extracting the salient information from synthetic data without\nforgetting useful features learned from pre-training on real images. Based on\nthe state of the art, we incorporate data augmentation methods and a\nTransformer backbone. Besides reaching relatively strong performance without\nany specialized synthetic data transfer methods, we show that our methods\nimprove the state of the art on synthetic data trained object detection for the\nRarePlanes and DGTA-VisDrone datasets, and reach near-perfect performance on an\nin-house vehicle detection dataset.\n","authors":["Frank A. Ruis","Alma M. Liezenga","Friso G. Heslinga","Luca Ballan","Thijs A. Eker","Richard J. M. den Hollander","Martin C. van Leeuwen","Judith Dijk","Wyke Huizinga"],"pdf_url":"https://arxiv.org/pdf/2405.19822v1.pdf","comment":"Submitted to and presented at SPIE Defense + Commercial Sensing 2024,\n 13 pages, 4 figures, 3 tables"},{"id":"http://arxiv.org/abs/2405.17825v2","updated":"2024-05-30T08:28:32Z","published":"2024-05-28T04:47:54Z","title":"Diffusion Model Patching via Mixture-of-Prompts","summary":" We present Diffusion Model Patching (DMP), a simple method to boost the\nperformance of pre-trained diffusion models that have already reached\nconvergence, with a negligible increase in parameters. DMP inserts a small,\nlearnable set of prompts into the model's input space while keeping the\noriginal model frozen. The effectiveness of DMP is not merely due to the\naddition of parameters but stems from its dynamic gating mechanism, which\nselects and combines a subset of learnable prompts at every step of the\ngenerative process (e.g., reverse denoising steps). This strategy, which we\nterm \"mixture-of-prompts\", enables the model to draw on the distinct expertise\nof each prompt, essentially \"patching\" the model's functionality at every step\nwith minimal yet specialized parameters. Uniquely, DMP enhances the model by\nfurther training on the same dataset on which it was originally trained, even\nin a scenario where significant improvements are typically not expected due to\nmodel convergence. Experiments show that DMP significantly enhances the\nconverged FID of DiT-L/2 on FFHQ 256x256 by 10.38%, achieved with only a 1.43%\nparameter increase and 50K additional training iterations.\n","authors":["Seokil Ham","Sangmin Woo","Jin-Young Kim","Hyojun Go","Byeongjun Park","Changick Kim"],"pdf_url":"https://arxiv.org/pdf/2405.17825v2.pdf","comment":"Project page: https://sangminwoo.github.io/DMP/"},{"id":"http://arxiv.org/abs/2405.19819v1","updated":"2024-05-30T08:26:47Z","published":"2024-05-30T08:26:47Z","title":"Gated Fields: Learning Scene Reconstruction from Gated Videos","summary":" Reconstructing outdoor 3D scenes from temporal observations is a challenge\nthat recent work on neural fields has offered a new avenue for. However,\nexisting methods that recover scene properties, such as geometry, appearance,\nor radiance, solely from RGB captures often fail when handling poorly-lit or\ntexture-deficient regions. Similarly, recovering scenes with scanning LiDAR\nsensors is also difficult due to their low angular sampling rate which makes\nrecovering expansive real-world scenes difficult. Tackling these gaps, we\nintroduce Gated Fields - a neural scene reconstruction method that utilizes\nactive gated video sequences. To this end, we propose a neural rendering\napproach that seamlessly incorporates time-gated capture and illumination. Our\nmethod exploits the intrinsic depth cues in the gated videos, achieving precise\nand dense geometry reconstruction irrespective of ambient illumination\nconditions. We validate the method across day and night scenarios and find that\nGated Fields compares favorably to RGB and LiDAR reconstruction methods. Our\ncode and datasets are available at https://light.princeton.edu/gatedfields/.\n","authors":["Andrea Ramazzina","Stefanie Walz","Pragyan Dahal","Mario Bijelic","Felix Heide"],"pdf_url":"https://arxiv.org/pdf/2405.19819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19818v1","updated":"2024-05-30T08:25:21Z","published":"2024-05-30T08:25:21Z","title":"WebUOT-1M: Advancing Deep Underwater Object Tracking with A\n Million-Scale Benchmark","summary":" Underwater object tracking (UOT) is a foundational task for identifying and\ntracing submerged entities in underwater video sequences. However, current UOT\ndatasets suffer from limitations in scale, diversity of target categories and\nscenarios covered, hindering the training and evaluation of modern tracking\nalgorithms. To bridge this gap, we take the first step and introduce WebUOT-1M,\n\\ie, the largest public UOT benchmark to date, sourced from complex and\nrealistic underwater environments. It comprises 1.1 million frames across 1,500\nvideo clips filtered from 408 target categories, largely surpassing previous\nUOT datasets, \\eg, UVOT400. Through meticulous manual annotation and\nverification, we provide high-quality bounding boxes for underwater targets.\nAdditionally, WebUOT-1M includes language prompts for video sequences,\nexpanding its application areas, \\eg, underwater vision-language tracking. Most\nexisting trackers are tailored for open-air environments, leading to\nperformance degradation when applied to UOT due to domain gaps. Retraining and\nfine-tuning these trackers are challenging due to sample imbalances and limited\nreal-world underwater datasets. To tackle these challenges, we propose a novel\nomni-knowledge distillation framework based on WebUOT-1M, incorporating various\nstrategies to guide the learning of the student Transformer. To the best of our\nknowledge, this framework is the first to effectively transfer open-air domain\nknowledge to the UOT model through knowledge distillation, as demonstrated by\nresults on both existing UOT datasets and the newly proposed WebUOT-1M.\nFurthermore, we comprehensively evaluate WebUOT-1M using 30 deep trackers,\nshowcasing its value as a benchmark for UOT research by presenting new\nchallenges and opportunities for future studies. The complete dataset, codes\nand tracking results, will be made publicly available.\n","authors":["Chunhui Zhang","Li Liu","Guanjie Huang","Hao Wen","Xi Zhou","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2405.19818v1.pdf","comment":"GitHub project:\n https://github.com/983632847/Awesome-Multimodal-Object-Tracking"},{"id":"http://arxiv.org/abs/2405.19817v1","updated":"2024-05-30T08:24:00Z","published":"2024-05-30T08:24:00Z","title":"Performance Examination of Symbolic Aggregate Approximation in IoT\n Applications","summary":" Symbolic Aggregate approXimation (SAX) is a common dimensionality reduction\napproach for time-series data which has been employed in a variety of domains,\nincluding classification and anomaly detection in time-series data. Domains\nalso include shape recognition where the shape outline is converted into\ntime-series data forinstance epoch classification of archived arrowheads. In\nthis paper we propose a dimensionality reduction and shape recognition approach\nbased on the SAX algorithm, an application which requires responses on cost\nefficient, IoT-like, platforms. The challenge is largely dealing with the\ncomputational expense of the SAX algorithm in IoT-like applications, from\nsimple time-series dimension reduction through shape recognition. The approach\nis based on lowering the dimensional space while capturing and preserving the\nmost representative features of the shape. We present three scenarios of\nincreasing computational complexity backing up our statements with measurement\nof performance characteristics\n","authors":["Suzana Veljanovska","Hans Dermot Doran"],"pdf_url":"https://arxiv.org/pdf/2405.19817v1.pdf","comment":"Embedded World Conference, Nuremberg, 2024"},{"id":"http://arxiv.org/abs/2405.19794v1","updated":"2024-05-30T08:02:05Z","published":"2024-05-30T08:02:05Z","title":"Video Question Answering for People with Visual Impairments Using an\n Egocentric 360-Degree Camera","summary":" This paper addresses the daily challenges encountered by visually impaired\nindividuals, such as limited access to information, navigation difficulties,\nand barriers to social interaction. To alleviate these challenges, we introduce\na novel visual question answering dataset. Our dataset offers two significant\nadvancements over previous datasets: Firstly, it features videos captured using\na 360-degree egocentric wearable camera, enabling observation of the entire\nsurroundings, departing from the static image-centric nature of prior datasets.\nSecondly, unlike datasets centered on singular challenges, ours addresses\nmultiple real-life obstacles simultaneously through an innovative\nvisual-question answering framework. We validate our dataset using various\nstate-of-the-art VideoQA methods and diverse metrics. Results indicate that\nwhile progress has been made, satisfactory performance levels for AI-powered\nassistive services remain elusive for visually impaired individuals.\nAdditionally, our evaluation highlights the distinctive features of the\nproposed dataset, featuring ego-motion in videos captured via 360-degree\ncameras across varied scenarios.\n","authors":["Inpyo Song","Minjun Joo","Joonhyung Kwon","Jangwon Lee"],"pdf_url":"https://arxiv.org/pdf/2405.19794v1.pdf","comment":"CVPR2024 EgoVis Workshop"},{"id":"http://arxiv.org/abs/2403.16794v2","updated":"2024-05-30T07:53:21Z","published":"2024-03-25T14:13:09Z","title":"CurbNet: Curb Detection Framework Based on LiDAR Point Cloud\n Segmentation","summary":" Curb detection is a crucial function in intelligent driving, essential for\ndetermining drivable areas on the road. However, the complexity of road\nenvironments makes curb detection challenging. This paper introduces CurbNet, a\nnovel framework for curb detection utilizing point cloud segmentation. To\naddress the lack of comprehensive curb datasets with 3D annotations, we have\ndeveloped the 3D-Curb dataset based on SemanticKITTI, currently the largest and\nmost diverse collection of curb point clouds. Recognizing that the primary\ncharacteristic of curbs is height variation, our approach leverages spatially\nrich 3D point clouds for training. To tackle the challenges posed by the uneven\ndistribution of curb features on the xy-plane and their dependence on\nhigh-frequency features along the z-axis, we introduce the Multi-Scale and\nChannel Attention (MSCA) module, a customized solution designed to optimize\ndetection performance. Additionally, we propose an adaptive weighted loss\nfunction group specifically formulated to counteract the imbalance in the\ndistribution of curb point clouds relative to other categories. Extensive\nexperiments conducted on 2 major datasets demonstrate that our method surpasses\nexisting benchmarks set by leading curb detection and point cloud segmentation\nmodels. Through the post-processing refinement of the detection results, we\nhave significantly reduced noise in curb detection, thereby improving precision\nby 4.5 points. Similarly, our tolerance experiments also achieved\nstate-of-the-art results. Furthermore, real-world experiments and dataset\nanalyses mutually validate each other, reinforcing CurbNet's superior detection\ncapability and robust generalizability. The project website is available at:\nhttps://github.com/guoyangzhao/CurbNet/.\n","authors":["Guoyang Zhao","Fulong Ma","Weiqing Qi","Yuxuan Liu","Ming Liu"],"pdf_url":"https://arxiv.org/pdf/2403.16794v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19783v1","updated":"2024-05-30T07:48:32Z","published":"2024-05-30T07:48:32Z","title":"Instruction-Guided Visual Masking","summary":" Instruction following is crucial in contemporary LLM. However, when extended\nto multimodal setting, it often suffers from misalignment between specific\ntextual instruction and targeted local region of an image. To achieve more\naccurate and nuanced multimodal instruction following, we introduce\nInstruction-guided Visual Masking (IVM), a new versatile visual grounding model\nthat is compatible with diverse multimodal models, such as LMM and robot model.\nBy constructing visual masks for instruction-irrelevant regions, IVM-enhanced\nmultimodal models can effectively focus on task-relevant image regions to\nbetter align with complex instructions. Specifically, we design a visual\nmasking data generation pipeline and create an IVM-Mix-1M dataset with 1\nmillion image-instruction pairs. We further introduce a new learning technique,\nDiscriminator Weighted Supervised Learning (DWSL) for preferential IVM training\nthat prioritizes high-quality data samples. Experimental results on generic\nmultimodal tasks such as VQA and embodied robotic control demonstrate the\nversatility of IVM, which as a plug-and-play tool, significantly boosts the\nperformance of diverse multimodal models, yielding new state-of-the-art results\nacross challenging multimodal benchmarks. Code is available at\nhttps://github.com/2toinf/IVM.\n","authors":["Jinliang Zheng","Jianxiong Li","Sijie Cheng","Yinan Zheng","Jiaming Li","Jihao Liu","Yu Liu","Jingjing Liu","Xianyuan Zhan"],"pdf_url":"https://arxiv.org/pdf/2405.19783v1.pdf","comment":"preprint, 21 pages"},{"id":"http://arxiv.org/abs/2405.19775v1","updated":"2024-05-30T07:41:07Z","published":"2024-05-30T07:41:07Z","title":"Puff-Net: Efficient Style Transfer with Pure Content and Style Feature\n Fusion Network","summary":" Style transfer aims to render an image with the artistic features of a style\nimage, while maintaining the original structure. Various methods have been put\nforward for this task, but some challenges still exist. For instance, it is\ndifficult for CNN-based methods to handle global information and long-range\ndependencies between input images, for which transformer-based methods have\nbeen proposed. Although transformers can better model the relationship between\ncontent and style images, they require high-cost hardware and time-consuming\ninference. To address these issues, we design a novel transformer model that\nincludes only the encoder, thus significantly reducing the computational cost.\nIn addition, we also find that existing style transfer methods may lead to\nimages under-stylied or missing content. In order to achieve better\nstylization, we design a content feature extractor and a style feature\nextractor, based on which pure content and style images can be fed to the\ntransformer. Finally, we propose a novel network termed Puff-Net, i.e., pure\ncontent and style feature fusion network. Through qualitative and quantitative\nexperiments, we demonstrate the advantages of our model compared to\nstate-of-the-art ones in the literature.\n","authors":["Sizhe Zheng","Pan Gao","Peng Zhou","Jie Qin"],"pdf_url":"https://arxiv.org/pdf/2405.19775v1.pdf","comment":"11 pages, 11 figures, to be published in IEEE Conference on Computer\n Vision and Pattern Recognition (CVPR 2024)"},{"id":"http://arxiv.org/abs/2405.19773v1","updated":"2024-05-30T07:38:58Z","published":"2024-05-30T07:38:58Z","title":"VQA Training Sets are Self-play Environments for Generating Few-shot\n Pools","summary":" Large-language models and large-vision models are increasingly capable of\nsolving compositional reasoning tasks, as measured by breakthroughs in\nvisual-question answering benchmarks. However, state-of-the-art solutions often\ninvolve careful construction of large pre-training and fine-tuning datasets,\nwhich can be expensive. The use of external tools, whether other ML models,\nsearch engines, or APIs, can significantly improve performance by breaking down\nhigh-level reasoning questions into sub-questions that are answerable by\nindividual tools, but this approach has similar dataset construction costs to\nteach fine-tuned models how to use the available tools. We propose a technique\nin which existing training sets can be directly used for constructing\ncomputational environments with task metrics as rewards. This enables a model\nto autonomously teach itself to use itself or another model as a tool. By doing\nso, we augment training sets by integrating external signals. The proposed\nmethod starts with zero-shot prompts and iteratively refines them by selecting\nfew-shot examples that maximize the task metric on the training set. Our\nexperiments showcase how Gemini learns how to use itself, or another smaller\nand specialized model such as ScreenAI, to iteratively improve performance on\ntraining sets. Our approach successfully generalizes and improves upon zeroshot\nperformance on charts, infographics, and document visual question-answering\ndatasets\n","authors":["Tautvydas Misiunas","Hassan Mansoor","Jasper Uijlings","Oriana Riva","Victor Carbune"],"pdf_url":"https://arxiv.org/pdf/2405.19773v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19769v1","updated":"2024-05-30T07:34:05Z","published":"2024-05-30T07:34:05Z","title":"All-In-One Medical Image Restoration via Task-Adaptive Routing","summary":" Although single-task medical image restoration (MedIR) has witnessed\nremarkable success, the limited generalizability of these methods poses a\nsubstantial obstacle to wider application. In this paper, we focus on the task\nof all-in-one medical image restoration, aiming to address multiple distinct\nMedIR tasks with a single universal model. Nonetheless, due to significant\ndifferences between different MedIR tasks, training a universal model often\nencounters task interference issues, where different tasks with shared\nparameters may conflict with each other in the gradient update direction. This\ntask interference leads to deviation of the model update direction from the\noptimal path, thereby affecting the model's performance. To tackle this issue,\nwe propose a task-adaptive routing strategy, allowing conflicting tasks to\nselect different network paths in spatial and channel dimensions, thereby\nmitigating task interference. Experimental results demonstrate that our\nproposed \\textbf{A}ll-in-one \\textbf{M}edical \\textbf{I}mage\n\\textbf{R}estoration (\\textbf{AMIR}) network achieves state-of-the-art\nperformance in three MedIR tasks: MRI super-resolution, CT denoising, and PET\nsynthesis, both in single-task and all-in-one settings. The code and data will\nbe available at\n\\href{https://github.com/Yaziwel/All-In-One-Medical-Image-Restoration-via-Task-Adaptive-Routing.git}{https://github.com/Yaziwel/AMIR}.\n","authors":["Zhiwen Yang","Haowei Chen","Ziniu Qian","Yang Yi","Hui Zhang","Dan Zhao","Bingzheng Wei","Yan Xu"],"pdf_url":"https://arxiv.org/pdf/2405.19769v1.pdf","comment":"This article has been early accepted by MICCAI 2024"},{"id":"http://arxiv.org/abs/2405.19765v1","updated":"2024-05-30T07:25:23Z","published":"2024-05-30T07:25:23Z","title":"Towards Unified Multi-granularity Text Detection with Interactive\n Attention","summary":" Existing OCR engines or document image analysis systems typically rely on\ntraining separate models for text detection in varying scenarios and\ngranularities, leading to significant computational complexity and resource\ndemands. In this paper, we introduce \"Detect Any Text\" (DAT), an advanced\nparadigm that seamlessly unifies scene text detection, layout analysis, and\ndocument page detection into a cohesive, end-to-end model. This design enables\nDAT to efficiently manage text instances at different granularities, including\n*word*, *line*, *paragraph* and *page*. A pivotal innovation in DAT is the\nacross-granularity interactive attention module, which significantly enhances\nthe representation learning of text instances at varying granularities by\ncorrelating structural information across different text queries. As a result,\nit enables the model to achieve mutually beneficial detection performances\nacross multiple text granularities. Additionally, a prompt-based segmentation\nmodule refines detection outcomes for texts of arbitrary curvature and complex\nlayouts, thereby improving DAT's accuracy and expanding its real-world\napplicability. Experimental results demonstrate that DAT achieves\nstate-of-the-art performances across a variety of text-related benchmarks,\nincluding multi-oriented/arbitrarily-shaped scene text detection, document\nlayout analysis and page detection tasks.\n","authors":["Xingyu Wan","Chengquan Zhang","Pengyuan Lyu","Sen Fan","Zihan Ni","Kun Yao","Errui Ding","Jingdong Wang"],"pdf_url":"https://arxiv.org/pdf/2405.19765v1.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2405.19754v1","updated":"2024-05-30T07:02:50Z","published":"2024-05-30T07:02:50Z","title":"Mitigating annotation shift in cancer classification using single image\n generative models","summary":" Artificial Intelligence (AI) has emerged as a valuable tool for assisting\nradiologists in breast cancer detection and diagnosis. However, the success of\nAI applications in this domain is restricted by the quantity and quality of\navailable data, posing challenges due to limited and costly data annotation\nprocedures that often lead to annotation shifts. This study simulates, analyses\nand mitigates annotation shifts in cancer classification in the breast\nmammography domain. First, a high-accuracy cancer risk prediction model is\ndeveloped, which effectively distinguishes benign from malignant lesions. Next,\nmodel performance is used to quantify the impact of annotation shift. We\nuncover a substantial impact of annotation shift on multiclass classification\nperformance particularly for malignant lesions. We thus propose a training data\naugmentation approach based on single-image generative models for the affected\nclass, requiring as few as four in-domain annotations to considerably mitigate\nannotation shift, while also addressing dataset imbalance. Lastly, we further\nincrease performance by proposing and validating an ensemble architecture based\non multiple models trained under different data augmentation regimes. Our study\noffers key insights into annotation shift in deep learning breast cancer\nclassification and explores the potential of single-image generative models to\novercome domain shift challenges.\n","authors":["Marta Buetas Arcas","Richard Osuala","Karim Lekadir","Oliver Díaz"],"pdf_url":"https://arxiv.org/pdf/2405.19754v1.pdf","comment":"Preprint of paper accepted at SPIE IWBI 2024 Conference"},{"id":"http://arxiv.org/abs/2405.19751v1","updated":"2024-05-30T06:56:11Z","published":"2024-05-30T06:56:11Z","title":"HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization","summary":" Diffusion Transformers (DiTs) have recently gained substantial attention in\nboth industrial and academic fields for their superior visual generation\ncapabilities, outperforming traditional diffusion models that use U-Net.\nHowever,the enhanced performance of DiTs also comes with high parameter counts\nand implementation costs, seriously restricting their use on resource-limited\ndevices such as mobile phones. To address these challenges, we introduce the\nHybrid Floating-point Quantization for DiT(HQ-DiT), an efficient post-training\nquantization method that utilizes 4-bit floating-point (FP) precision on both\nweights and activations for DiT inference. Compared to fixed-point quantization\n(e.g., INT8), FP quantization, complemented by our proposed clipping range\nselection mechanism, naturally aligns with the data distribution within DiT,\nresulting in a minimal quantization error. Furthermore, HQ-DiT also implements\na universal identity mathematical transform to mitigate the serious\nquantization error caused by the outliers. The experimental results demonstrate\nthat DiT can achieve extremely low-precision quantization (i.e., 4 bits) with\nnegligible impact on performance. Our approach marks the first instance where\nboth weights and activations in DiTs are quantized to just 4 bits, with only a\n0.12 increase in sFID on ImageNet.\n","authors":["Wenxuan Liu","Saiqian Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.19751v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07865v3","updated":"2024-05-30T06:55:15Z","published":"2023-12-13T03:04:22Z","title":"SimAC: A Simple Anti-Customization Method for Protecting Face Privacy\n against Text-to-Image Synthesis of Diffusion Models","summary":" Despite the success of diffusion-based customization methods on visual\ncontent creation, increasing concerns have been raised about such techniques\nfrom both privacy and political perspectives. To tackle this issue, several\nanti-customization methods have been proposed in very recent months,\npredominantly grounded in adversarial attacks. Unfortunately, most of these\nmethods adopt straightforward designs, such as end-to-end optimization with a\nfocus on adversarially maximizing the original training loss, thereby\nneglecting nuanced internal properties intrinsic to the diffusion model, and\neven leading to ineffective optimization in some diffusion time steps.In this\npaper, we strive to bridge this gap by undertaking a comprehensive exploration\nof these inherent properties, to boost the performance of current\nanti-customization approaches. Two aspects of properties are investigated: 1)\nWe examine the relationship between time step selection and the model's\nperception in the frequency domain of images and find that lower time steps can\ngive much more contributions to adversarial noises. This inspires us to propose\nan adaptive greedy search for optimal time steps that seamlessly integrates\nwith existing anti-customization methods. 2) We scrutinize the roles of\nfeatures at different layers during denoising and devise a sophisticated\nfeature-based optimization framework for anti-customization.Experiments on\nfacial benchmarks demonstrate that our approach significantly increases\nidentity disruption, thereby protecting user privacy and copyright. Our code is\navailable at: https://github.com/somuchtome/SimAC.\n","authors":["Feifei Wang","Zhentao Tan","Tianyi Wei","Yue Wu","Qidong Huang"],"pdf_url":"https://arxiv.org/pdf/2312.07865v3.pdf","comment":"Accepted by CVPR2024"},{"id":"http://arxiv.org/abs/2405.15549v2","updated":"2024-05-30T06:52:37Z","published":"2024-05-24T13:35:56Z","title":"SEP: Self-Enhanced Prompt Tuning for Visual-Language Model","summary":" Prompt tuning based on Context Optimization (CoOp) effectively adapts\nvisual-language models (VLMs) to downstream tasks by inferring additional\nlearnable prompt tokens. However, these tokens are less discriminative as they\nare independent of the pre-trained tokens and fail to capture input-specific\nknowledge, such as class-aware textual or instance-aware visual knowledge.\nLeveraging the discriminative and generalization capabilities inherent in\npre-trained tokens, we introduce a novel approach named Self-Enhanced Prompt\nTuning (SEP). The core principle of SEP involves adapting the learnable prompt\ntokens at each encoder layer from the corresponding self-pretrained tokens,\nthereby explicitly incorporating discriminative prior knowledge to enhance both\ntextual-level and visual-level embeddings. Furthermore, SEP's self-enhanced\ntokens not only boost discrimination but also mitigate domain shifts in unseen\ndomains, enhancing generalization. In practice, SEP selects several\nrepresentative tokens from all pre-trained tokens for each input data at every\nlayer of the text/visual encoders. Subsequently, a Token Fusion Module (TFM) is\nintroduced to generate a self-enhanced token by merging these representative\ntokens with the learnable tokens using a cross-attention mechanism. This\nself-enhanced token is then concatenated with all pre-trained tokens, serving\nas input for subsequent encoder layers to produce the relevant embeddings.\nComprehensive evaluations across various benchmarks and tasks confirm SEP's\nefficacy in prompt tuning. Code: \\href{Code}{https://github.com/htyao89/SEP}.\n","authors":["Hantao Yao","Rui Zhang","Lu Yu","Changsheng Xu"],"pdf_url":"https://arxiv.org/pdf/2405.15549v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19746v1","updated":"2024-05-30T06:49:59Z","published":"2024-05-30T06:49:59Z","title":"DenseSeg: Joint Learning for Semantic Segmentation and Landmark\n Detection Using Dense Image-to-Shape Representation","summary":" Purpose: Semantic segmentation and landmark detection are fundamental tasks\nof medical image processing, facilitating further analysis of anatomical\nobjects. Although deep learning-based pixel-wise classification has set a\nnew-state-of-the-art for segmentation, it falls short in landmark detection, a\nstrength of shape-based approaches.\n Methods: In this work, we propose a dense image-to-shape representation that\nenables the joint learning of landmarks and semantic segmentation by employing\na fully convolutional architecture. Our method intuitively allows the\nextraction of arbitrary landmarks due to its representation of anatomical\ncorrespondences. We benchmark our method against the state-of-the-art for\nsemantic segmentation (nnUNet), a shape-based approach employing geometric deep\nlearning and a CNN-based method for landmark detection.\n Results: We evaluate our method on two medical dataset: one common benchmark\nfeaturing the lungs, heart, and clavicle from thorax X-rays, and another with\n17 different bones in the paediatric wrist. While our method is on pair with\nthe landmark detection baseline in the thorax setting (error in mm of\n$2.6\\pm0.9$ vs $2.7\\pm0.9$), it substantially surpassed it in the more complex\nwrist setting ($1.1\\pm0.6$ vs $1.9\\pm0.5$).\n Conclusion: We demonstrate that dense geometric shape representation is\nbeneficial for challenging landmark detection tasks and outperforms previous\nstate-of-the-art using heatmap regression. While it does not require explicit\ntraining on the landmarks themselves, allowing for the addition of new\nlandmarks without necessitating retraining.}\n","authors":["Ron Keuth","Lasse Hansen","Maren Balks","Ronja Jäger","Anne-Nele Schröder","Ludger Tüshaus","Mattias Heinrich"],"pdf_url":"https://arxiv.org/pdf/2405.19746v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19745v1","updated":"2024-05-30T06:47:55Z","published":"2024-05-30T06:47:55Z","title":"GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion\n Extrapolation and Free View Synthesis","summary":" Forecasting future scenarios in dynamic environments is essential for\nintelligent decision-making and navigation, a challenge yet to be fully\nrealized in computer vision and robotics. Traditional approaches like video\nprediction and novel-view synthesis either lack the ability to forecast from\narbitrary viewpoints or to predict temporal dynamics. In this paper, we\nintroduce GaussianPrediction, a novel framework that empowers 3D Gaussian\nrepresentations with dynamic scene modeling and future scenario synthesis in\ndynamic environments. GaussianPrediction can forecast future states from any\nviewpoint, using video observations of dynamic scenes. To this end, we first\npropose a 3D Gaussian canonical space with deformation modeling to capture the\nappearance and geometry of dynamic scenes, and integrate the lifecycle property\ninto Gaussians for irreversible deformations. To make the prediction feasible\nand efficient, a concentric motion distillation approach is developed by\ndistilling the scene motion with key points. Finally, a Graph Convolutional\nNetwork is employed to predict the motions of key points, enabling the\nrendering of photorealistic images of future scenarios. Our framework shows\noutstanding performance on both synthetic and real-world datasets,\ndemonstrating its efficacy in predicting and rendering future environments.\n","authors":["Boming Zhao","Yuan Li","Ziyu Sun","Lin Zeng","Yujun Shen","Rui Ma","Yinda Zhang","Hujun Bao","Zhaopeng Cui"],"pdf_url":"https://arxiv.org/pdf/2405.19745v1.pdf","comment":"Accepted to SIGGRAPH 2024 Conference. Project Page:\n https://zju3dv.github.io/gaussian-prediction/"},{"id":"http://arxiv.org/abs/2405.19743v1","updated":"2024-05-30T06:43:55Z","published":"2024-05-30T06:43:55Z","title":"May the Dance be with You: Dance Generation Framework for Non-Humanoids","summary":" We hypothesize dance as a motion that forms a visual rhythm from music, where\nthe visual rhythm can be perceived from an optical flow. If an agent can\nrecognize the relationship between visual rhythm and music, it will be able to\ndance by generating a motion to create a visual rhythm that matches the music.\nBased on this, we propose a framework for any kind of non-humanoid agents to\nlearn how to dance from human videos. Our framework works in two processes: (1)\ntraining a reward model which perceives the relationship between optical flow\n(visual rhythm) and music from human dance videos, (2) training the\nnon-humanoid dancer based on that reward model, and reinforcement learning. Our\nreward model consists of two feature encoders for optical flow and music. They\nare trained based on contrastive learning which makes the higher similarity\nbetween concurrent optical flow and music features. With this reward model, the\nagent learns dancing by getting a higher reward when its action creates an\noptical flow whose feature has a higher similarity with the given music\nfeature. Experiment results show that generated dance motion can align with the\nmusic beat properly, and user study result indicates that our framework is more\npreferred by humans compared to the baselines. To the best of our knowledge,\nour work of non-humanoid agents which learn dance from human videos is\nunprecedented. An example video can be found at https://youtu.be/dOUPvo-O3QY.\n","authors":["Hyemin Ahn"],"pdf_url":"https://arxiv.org/pdf/2405.19743v1.pdf","comment":"13 pages, 6 Figures, Rejected at Neurips 2023"},{"id":"http://arxiv.org/abs/2405.09215v2","updated":"2024-05-30T06:33:03Z","published":"2024-05-15T09:47:59Z","title":"Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model","summary":" We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It\nis designed for efficient deployment on consumer GPU servers. Our work directly\nconfronts a pivotal industry issue by grappling with the prohibitive service\ncosts that hinder the broad adoption of large-scale multimodal systems. Through\nrigorous training, we have developed a 1B-scale language model from the ground\nup, employing the LLaVA paradigm for modal alignment. The result, which we call\nXmodel-VLM, is a lightweight yet powerful multimodal vision language model.\nExtensive testing across numerous classic multimodal benchmarks has revealed\nthat despite its smaller size and faster execution, Xmodel-VLM delivers\nperformance comparable to that of larger models. Our model checkpoints and code\nare publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.\n","authors":["Wanting Xu","Yang Liu","Langping He","Xucheng Huang","Ling Jiang"],"pdf_url":"https://arxiv.org/pdf/2405.09215v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19735v1","updated":"2024-05-30T06:31:03Z","published":"2024-05-30T06:31:03Z","title":"Twin Deformable Point Convolutions for Point Cloud Semantic Segmentation\n in Remote Sensing Scenes","summary":" Thanks to the application of deep learning technology in point cloud\nprocessing of the remote sensing field, point cloud segmentation has become a\nresearch hotspot in recent years, which can be applied to real-world 3D, smart\ncities, and other fields. Although existing solutions have made unprecedented\nprogress, they ignore the inherent characteristics of point clouds in remote\nsensing fields that are strictly arranged according to latitude, longitude, and\naltitude, which brings great convenience to the segmentation of point clouds in\nremote sensing fields. To consider this property cleverly, we propose novel\nconvolution operators, termed Twin Deformable point Convolutions (TDConvs),\nwhich aim to achieve adaptive feature learning by learning deformable sampling\npoints in the latitude-longitude plane and altitude direction, respectively.\nFirst, to model the characteristics of the latitude-longitude plane, we propose\na Cylinder-wise Deformable point Convolution (CyDConv) operator, which\ngenerates a two-dimensional cylinder map by constructing a cylinder-like grid\nin the latitude-longitude direction. Furthermore, to better integrate the\nfeatures of the latitude-longitude plane and the spatial geometric features, we\nperform a multi-scale fusion of the extracted latitude-longitude features and\nspatial geometric features, and realize it through the aggregation of adjacent\npoint features of different scales. In addition, a Sphere-wise Deformable point\nConvolution (SpDConv) operator is introduced to adaptively offset the sampling\npoints in three-dimensional space by constructing a sphere grid structure,\naiming at modeling the characteristics in the altitude direction. Experiments\non existing popular benchmarks conclude that our TDConvs achieve the best\nsegmentation performance, surpassing the existing state-of-the-art methods.\n","authors":["Yong-Qiang Mao","Hanbo Bi","Xuexue Li","Kaiqiang Chen","Zhirui Wang","Xian Sun","Kun Fu"],"pdf_url":"https://arxiv.org/pdf/2405.19735v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19732v1","updated":"2024-05-30T06:24:14Z","published":"2024-05-30T06:24:14Z","title":"Two Optimizers Are Better Than One: LLM Catalyst for Enhancing\n Gradient-Based Optimization","summary":" Learning a skill generally relies on both practical experience by doer and\ninsightful high-level guidance by instructor. Will this strategy also work well\nfor solving complex non-convex optimization problems? Here, a common\ngradient-based optimizer acts like a disciplined doer, making locally optimal\nupdate at each step. Recent methods utilize large language models (LLMs) to\noptimize solutions for concrete problems by inferring from natural language\ninstructions, akin to a high-level instructor. In this paper, we show that\nthese two optimizers are complementary to each other, suggesting a\ncollaborative optimization approach. The gradient-based optimizer and LLM-based\noptimizer are combined in an interleaved manner. We instruct LLMs using task\ndescriptions and timely optimization trajectories recorded during\ngradient-based optimization. Inferred results from LLMs are used as restarting\npoints for the next stage of gradient optimization. By leveraging both the\nlocally rigorous gradient-based optimizer and the high-level deductive\nLLM-based optimizer, our combined optimization method consistently yields\nimprovements over competitive baseline prompt tuning methods. Our results\ndemonstrate the synergistic effect of conventional gradient-based optimization\nand the inference ability of LLMs. The code is released at\nhttps://github.com/guozix/LLM-catalyst.\n","authors":["Zixian Guo","Ming Liu","Zhilong Ji","Jinfeng Bai","Yiwen Guo","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2405.19732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19727v1","updated":"2024-05-30T06:19:01Z","published":"2024-05-30T06:19:01Z","title":"Automatic Dance Video Segmentation for Understanding Choreography","summary":" Segmenting dance video into short movements is a popular way to easily\nunderstand dance choreography. However, it is currently done manually and\nrequires a significant amount of effort by experts. That is, even if many dance\nvideos are available on social media (e.g., TikTok and YouTube), it remains\ndifficult for people, especially novices, to casually watch short video\nsegments to practice dance choreography. In this paper, we propose a method to\nautomatically segment a dance video into each movement. Given a dance video as\ninput, we first extract visual and audio features: the former is computed from\nthe keypoints of the dancer in the video, and the latter is computed from the\nMel spectrogram of the music in the video. Next, these features are passed to a\nTemporal Convolutional Network (TCN), and segmentation points are estimated by\npicking peaks of the network output. To build our training dataset, we annotate\nsegmentation points to dance videos in the AIST Dance Video Database, which is\na shared database containing original street dance videos with\ncopyright-cleared dance music. The evaluation study shows that the proposed\nmethod (i.e., combining the visual and audio features) can estimate\nsegmentation points with high accuracy. In addition, we developed an\napplication to help dancers practice choreography using the proposed method.\n","authors":["Koki Endo","Shuhei Tsuchida","Tsukasa Fukusato","Takeo Igarashi"],"pdf_url":"https://arxiv.org/pdf/2405.19727v1.pdf","comment":"9 pages, 11 figures"},{"id":"http://arxiv.org/abs/2405.19726v1","updated":"2024-05-30T06:16:33Z","published":"2024-05-30T06:16:33Z","title":"Streaming Video Diffusion: Online Video Editing with Diffusion Models","summary":" We present a novel task called online video editing, which is designed to\nedit \\textbf{streaming} frames while maintaining temporal consistency. Unlike\nexisting offline video editing assuming all frames are pre-established and\naccessible, online video editing is tailored to real-life applications such as\nlive streaming and online chat, requiring (1) fast continual step inference,\n(2) long-term temporal modeling, and (3) zero-shot video editing capability. To\nsolve these issues, we propose Streaming Video Diffusion (SVDiff), which\nincorporates the compact spatial-aware temporal recurrence into off-the-shelf\nStable Diffusion and is trained with the segment-level scheme on large-scale\nlong videos. This simple yet effective setup allows us to obtain a single model\nthat is capable of executing a broad range of videos and editing each streaming\nframe with temporal coherence. Our experiments indicate that our model can edit\nlong, high-quality videos with remarkable results, achieving a real-time\ninference speed of 15.2 FPS at a resolution of 512x512.\n","authors":["Feng Chen","Zhen Yang","Bohan Zhuang","Qi Wu"],"pdf_url":"https://arxiv.org/pdf/2405.19726v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19725v1","updated":"2024-05-30T06:15:08Z","published":"2024-05-30T06:15:08Z","title":"Quantum Visual Feature Encoding Revisited","summary":" Although quantum machine learning has been introduced for a while, its\napplications in computer vision are still limited. This paper, therefore,\nrevisits the quantum visual encoding strategies, the initial step in quantum\nmachine learning. Investigating the root cause, we uncover that the existing\nquantum encoding design fails to ensure information preservation of the visual\nfeatures after the encoding process, thus complicating the learning process of\nthe quantum machine learning models. In particular, the problem, termed\n\"Quantum Information Gap\" (QIG), leads to a gap of information between\nclassical and corresponding quantum features. We provide theoretical proof and\npractical demonstrations of that found and underscore the significance of QIG,\nas it directly impacts the performance of quantum machine learning algorithms.\nTo tackle this challenge, we introduce a simple but efficient new loss function\nnamed Quantum Information Preserving (QIP) to minimize this gap, resulting in\nenhanced performance of quantum machine learning algorithms. Extensive\nexperiments validate the effectiveness of our approach, showcasing superior\nperformance compared to current methodologies and consistently achieving\nstate-of-the-art results in quantum modeling.\n","authors":["Xuan-Bac Nguyen","Hoang-Quan Nguyen","Hugh Churchill","Samee U. Khan","Khoa Luu"],"pdf_url":"https://arxiv.org/pdf/2405.19725v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19723v1","updated":"2024-05-30T06:10:10Z","published":"2024-05-30T06:10:10Z","title":"Encoding and Controlling Global Semantics for Long-form Video Question\n Answering","summary":" Seeking answers effectively for long videos is essential to build video\nquestion answering (videoQA) systems. Previous methods adaptively select frames\nand regions from long videos to save computations. However, this fails to\nreason over the whole sequence of video, leading to sub-optimal performance. To\naddress this problem, we introduce a state space layer (SSL) into multi-modal\nTransformer to efficiently integrate global semantics of the video, which\nmitigates the video information loss caused by frame and region selection\nmodules. Our SSL includes a gating unit to enable controllability over the flow\nof global semantics into visual representations. To further enhance the\ncontrollability, we introduce a cross-modal compositional congruence (C^3)\nobjective to encourage global semantics aligned with the question. To\nrigorously evaluate long-form videoQA capacity, we construct two new benchmarks\nEgo-QA and MAD-QA featuring videos of considerably long length, i.e. 17.5\nminutes and 1.9 hours, respectively. Extensive experiments demonstrate the\nsuperiority of our framework on these new as well as existing datasets.\n","authors":["Thong Thanh Nguyen","Zhiyuan Hu","Xiaobao Wu","Cong-Duy T Nguyen","See-Kiong Ng","Anh Tuan Luu"],"pdf_url":"https://arxiv.org/pdf/2405.19723v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2405.19722v1","updated":"2024-05-30T06:07:57Z","published":"2024-05-30T06:07:57Z","title":"QClusformer: A Quantum Transformer-based Framework for Unsupervised\n Visual Clustering","summary":" Unsupervised vision clustering, a cornerstone in computer vision, has been\nstudied for decades, yielding significant outcomes across numerous vision\ntasks. However, these algorithms involve substantial computational demands when\nconfronted with vast amounts of unlabeled data. Conversely, Quantum computing\nholds promise in expediting unsupervised algorithms when handling large-scale\ndatabases. In this study, we introduce QClusformer, a pioneering\nTransformer-based framework leveraging Quantum machines to tackle unsupervised\nvision clustering challenges. Specifically, we design the Transformer\narchitecture, including the self-attention module and transformer blocks, from\na Quantum perspective to enable execution on Quantum hardware. In addition, we\npresent QClusformer, a variant based on the Transformer architecture, tailored\nfor unsupervised vision clustering tasks. By integrating these elements into an\nend-to-end framework, QClusformer consistently outperforms previous methods\nrunning on classical computers. Empirical evaluations across diverse\nbenchmarks, including MS-Celeb-1M and DeepFashion, underscore the superior\nperformance of QClusformer compared to state-of-the-art methods.\n","authors":["Xuan-Bac Nguyen","Hoang-Quan Nguyen","Samuel Yen-Chi Chen","Samee U. Khan","Hugh Churchill","Khoa Luu"],"pdf_url":"https://arxiv.org/pdf/2405.19722v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19718v1","updated":"2024-05-30T06:02:35Z","published":"2024-05-30T06:02:35Z","title":"LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising","summary":" Event camera has significant advantages in capturing dynamic scene\ninformation while being prone to noise interference, particularly in\nchallenging conditions like low threshold and low illumination. However, most\nexisting research focuses on gentle situations, hindering event camera\napplications in realistic complex scenarios. To tackle this limitation and\nadvance the field, we construct a new paired real-world event denoising dataset\n(LED), including 3K sequences with 18K seconds of high-resolution (1200*680)\nevent streams and showing three notable distinctions compared to others:\ndiverse noise levels and scenes, larger-scale with high-resolution, and\nhigh-quality GT. Specifically, it contains stepped parameters and varying\nillumination with diverse scenarios. Moreover, based on the property of noise\nevents inconsistency and signal events consistency, we propose a novel\neffective denoising framework(DED) using homogeneous dual events to generate\nthe GT with better separating noise from the raw. Furthermore, we design a\nbio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with\ndynamic thresholds to realize accurate denoising. The experimental results\ndemonstrate that the remarkable performance of the proposed approach on\ndifferent datasets.The dataset and code are at https://github.com/Yee-Sing/led.\n","authors":["Yuxing Duan","Shihan Peng","Lin Zhu","Wei Zhang","Yi Chang","Sheng Zhong","Luxin Yan"],"pdf_url":"https://arxiv.org/pdf/2405.19718v1.pdf","comment":"Accepted by CVPR 2024"},{"id":"http://arxiv.org/abs/2405.19716v1","updated":"2024-05-30T05:53:49Z","published":"2024-05-30T05:53:49Z","title":"Enhancing Large Vision Language Models with Self-Training on Image\n Comprehension","summary":" Large vision language models (LVLMs) integrate large language models (LLMs)\nwith pre-trained vision encoders, thereby activating the perception capability\nof the model to understand image inputs for different queries and conduct\nsubsequent reasoning. Improving this capability requires high-quality\nvision-language data, which is costly and labor-intensive to acquire.\nSelf-training approaches have been effective in single-modal settings to\nalleviate the need for labeled data by leveraging model's own generation.\nHowever, effective self-training remains a challenge regarding the unique\nvisual perception and reasoning capability of LVLMs. To address this, we\nintroduce Self-Training on Image Comprehension (STIC), which emphasizes a\nself-training approach specifically for image comprehension. First, the model\nself-constructs a preference dataset for image descriptions using unlabeled\nimages. Preferred responses are generated through a step-by-step prompt, while\ndis-preferred responses are generated from either corrupted images or\nmisleading prompts. To further self-improve reasoning on the extracted visual\ninformation, we let the model reuse a small portion of existing\ninstruction-tuning data and append its self-generated image descriptions to the\nprompts. We validate the effectiveness of STIC across seven different\nbenchmarks, demonstrating substantial performance gains of 4.0% on average\nwhile using 70% less supervised fine-tuning data than the current method.\nFurther studies investigate various components of STIC and highlight its\npotential to leverage vast quantities of unlabeled images for self-training.\nCode and data are made publicly available.\n","authors":["Yihe Deng","Pan Lu","Fan Yin","Ziniu Hu","Sheng Shen","James Zou","Kai-Wei Chang","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2405.19716v1.pdf","comment":"19 pages, 14 figures, 6 tables"},{"id":"http://arxiv.org/abs/2310.17455v2","updated":"2024-05-30T05:53:23Z","published":"2023-10-26T15:01:54Z","title":"OTMatch: Improving Semi-Supervised Learning with Optimal Transport","summary":" Semi-supervised learning has made remarkable strides by effectively utilizing\na limited amount of labeled data while capitalizing on the abundant information\npresent in unlabeled data. However, current algorithms often prioritize\naligning image predictions with specific classes generated through\nself-training techniques, thereby neglecting the inherent relationships that\nexist within these classes. In this paper, we present a new approach called\nOTMatch, which leverages semantic relationships among classes by employing an\noptimal transport loss function to match distributions. We conduct experiments\non many standard vision and language datasets. The empirical results show\nimprovements in our method above baseline, this demonstrates the effectiveness\nand superiority of our approach in harnessing semantic relationships to enhance\nlearning performance in a semi-supervised setting.\n","authors":["Zhiquan Tan","Kaipeng Zheng","Weiran Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17455v2.pdf","comment":"Accepted at ICML 2024"},{"id":"http://arxiv.org/abs/2405.19712v1","updated":"2024-05-30T05:43:09Z","published":"2024-05-30T05:43:09Z","title":"HINT: Learning Complete Human Neural Representations from Limited\n Viewpoints","summary":" No augmented application is possible without animated humanoid avatars. At\nthe same time, generating human replicas from real-world monocular hand-held or\nrobotic sensor setups is challenging due to the limited availability of views.\nPrevious work showed the feasibility of virtual avatars but required the\npresence of 360 degree views of the targeted subject. To address this issue, we\npropose HINT, a NeRF-based algorithm able to learn a detailed and complete\nhuman model from limited viewing angles. We achieve this by introducing a\nsymmetry prior, regularization constraints, and training cues from large human\ndatasets. In particular, we introduce a sagittal plane symmetry prior to the\nappearance of the human, directly supervise the density function of the human\nmodel using explicit 3D body modeling, and leverage a co-learned human\ndigitization network as additional supervision for the unseen angles. As a\nresult, our method can reconstruct complete humans even from a few viewing\nangles, increasing performance by more than 15% PSNR compared to previous\nstate-of-the-art algorithms.\n","authors":["Alessandro Sanvito","Andrea Ramazzina","Stefanie Walz","Mario Bijelic","Felix Heide"],"pdf_url":"https://arxiv.org/pdf/2405.19712v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19005v2","updated":"2024-05-30T05:42:46Z","published":"2024-05-29T11:42:02Z","title":"Auto-selected Knowledge Adapters for Lifelong Person Re-identification","summary":" Lifelong Person Re-Identification (LReID) extends traditional ReID by\nrequiring systems to continually learn from non-overlapping datasets across\ndifferent times and locations, adapting to new identities while preserving\nknowledge of previous ones. Existing approaches, either rehearsal-free or\nrehearsal-based, still suffer from the problem of catastrophic forgetting since\nthey try to cram diverse knowledge into one fixed model. To overcome this\nlimitation, we introduce a novel framework AdalReID, that adopts knowledge\nadapters and a parameter-free auto-selection mechanism for lifelong learning.\nConcretely, we incrementally build distinct adapters to learn domain-specific\nknowledge at each step, which can effectively learn and preserve knowledge\nacross different datasets. Meanwhile, the proposed auto-selection strategy\nadaptively calculates the knowledge similarity between the input set and the\nadapters. On the one hand, the appropriate adapters are selected for the inputs\nto process ReID, and on the other hand, the knowledge interaction and fusion\nbetween adapters are enhanced to improve the generalization ability of the\nmodel. Extensive experiments are conducted to demonstrate the superiority of\nour AdalReID, which significantly outperforms SOTAs by about 10$\\sim$20\\% mAP\non both seen and unseen domains.\n","authors":["Xuelin Qian","Ruiqi Wu","Gong Cheng","Junwei Han"],"pdf_url":"https://arxiv.org/pdf/2405.19005v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.09655v3","updated":"2024-05-30T05:42:03Z","published":"2023-11-16T08:17:02Z","title":"Multi-View Spectrogram Transformer for Respiratory Sound Classification","summary":" Deep neural networks have been applied to audio spectrograms for respiratory\nsound classification. Existing models often treat the spectrogram as a\nsynthetic image while overlooking its physical characteristics. In this paper,\na Multi-View Spectrogram Transformer (MVST) is proposed to embed different\nviews of time-frequency characteristics into the vision transformer.\nSpecifically, the proposed MVST splits the mel-spectrogram into different sized\npatches, representing the multi-view acoustic elements of a respiratory sound.\nThese patches and positional embeddings are then fed into transformer encoders\nto extract the attentional information among patches through a self-attention\nmechanism. Finally, a gated fusion scheme is designed to automatically weigh\nthe multi-view features to highlight the best one in a specific scenario.\nExperimental results on the ICBHI dataset demonstrate that the proposed MVST\nsignificantly outperforms state-of-the-art methods for classifying respiratory\nsounds.\n","authors":["Wentao He","Yuchen Yan","Jianfeng Ren","Ruibin Bai","Xudong Jiang"],"pdf_url":"https://arxiv.org/pdf/2311.09655v3.pdf","comment":"The paper was published at ICASSP 2024"},{"id":"http://arxiv.org/abs/2405.19708v1","updated":"2024-05-30T05:36:32Z","published":"2024-05-30T05:36:32Z","title":"Text Guided Image Editing with Automatic Concept Locating and Forgetting","summary":" With the advancement of image-to-image diffusion models guided by text,\nsignificant progress has been made in image editing. However, a persistent\nchallenge remains in seamlessly incorporating objects into images based on\ntextual instructions, without relying on extra user-provided guidance. Text and\nimages are inherently distinct modalities, bringing out difficulties in fully\ncapturing the semantic intent conveyed through language and accurately\ntranslating that into the desired visual modifications. Therefore, text-guided\nimage editing models often produce generations with residual object attributes\nthat do not fully align with human expectations. To address this challenge, the\nmodels should comprehend the image content effectively away from a disconnect\nbetween the provided textual editing prompts and the actual modifications made\nto the image. In our paper, we propose a novel method called Locate and Forget\n(LaF), which effectively locates potential target concepts in the image for\nmodification by comparing the syntactic trees of the target prompt and scene\ndescriptions in the input image, intending to forget their existence clues in\nthe generated image. Compared to the baselines, our method demonstrates its\nsuperiority in text-guided image editing tasks both qualitatively and\nquantitatively.\n","authors":["Jia Li","Lijie Hu","Zhixian He","Jingfeng Zhang","Tianhang Zheng","Di Wang"],"pdf_url":"https://arxiv.org/pdf/2405.19708v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19707v1","updated":"2024-05-30T05:36:12Z","published":"2024-05-30T05:36:12Z","title":"DeMamba: AI-Generated Video Detection on Million-Scale GenVideo\n Benchmark","summary":" Recently, video generation techniques have advanced rapidly. Given the\npopularity of video content on social media platforms, these models intensify\nconcerns about the spread of fake information. Therefore, there is a growing\ndemand for detectors capable of distinguishing between fake AI-generated videos\nand mitigating the potential harm caused by fake information. However, the lack\nof large-scale datasets from the most advanced video generators poses a barrier\nto the development of such detectors. To address this gap, we introduce the\nfirst AI-generated video detection dataset, GenVideo. It features the following\ncharacteristics: (1) a large volume of videos, including over one million\nAI-generated and real videos collected; (2) a rich diversity of generated\ncontent and methodologies, covering a broad spectrum of video categories and\ngeneration techniques. We conducted extensive studies of the dataset and\nproposed two evaluation methods tailored for real-world-like scenarios to\nassess the detectors' performance: the cross-generator video classification\ntask assesses the generalizability of trained detectors on generators; the\ndegraded video classification task evaluates the robustness of detectors to\nhandle videos that have degraded in quality during dissemination. Moreover, we\nintroduced a plug-and-play module, named Detail Mamba (DeMamba), designed to\nenhance the detectors by identifying AI-generated videos through the analysis\nof inconsistencies in temporal and spatial dimensions. Our extensive\nexperiments demonstrate DeMamba's superior generalizability and robustness on\nGenVideo compared to existing detectors. We believe that the GenVideo dataset\nand the DeMamba module will significantly advance the field of AI-generated\nvideo detection. Our code and dataset will be aviliable at\n\\url{https://github.com/chenhaoxing/DeMamba}.\n","authors":["Haoxing Chen","Yan Hong","Zizheng Huang","Zhuoer Xu","Zhangxuan Gu","Yaohui Li","Jun Lan","Huijia Zhu","Jianfu Zhang","Weiqiang Wang","Huaxiong Li"],"pdf_url":"https://arxiv.org/pdf/2405.19707v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19703v1","updated":"2024-05-30T05:27:46Z","published":"2024-05-30T05:27:46Z","title":"Towards a Better Evaluation of Out-of-Domain Generalization","summary":" The objective of Domain Generalization (DG) is to devise algorithms and\nmodels capable of achieving high performance on previously unseen test\ndistributions. In the pursuit of this objective, average measure has been\nemployed as the prevalent measure for evaluating models and comparing\nalgorithms in the existing DG studies. Despite its significance, a\ncomprehensive exploration of the average measure has been lacking and its\nsuitability in approximating the true domain generalization performance has\nbeen questionable. In this study, we carefully investigate the limitations\ninherent in the average measure and propose worst+gap measure as a robust\nalternative. We establish theoretical grounds of the proposed measure by\nderiving two theorems starting from two different assumptions. We conduct\nextensive experimental investigations to compare the proposed worst+gap measure\nwith the conventional average measure. Given the indispensable need to access\nthe true DG performance for studying measures, we modify five existing datasets\nto come up with SR-CMNIST, C-Cats&Dogs, L-CIFAR10, PACS-corrupted, and\nVLCS-corrupted datasets. The experiment results unveil an inferior performance\nof the average measure in approximating the true DG performance and confirm the\nrobustness of the theoretically supported worst+gap measure.\n","authors":["Duhun Hwang","Suhyun Kang","Moonjung Eo","Jimyeong Kim","Wonjong Rhee"],"pdf_url":"https://arxiv.org/pdf/2405.19703v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19695v1","updated":"2024-05-30T05:15:38Z","published":"2024-05-30T05:15:38Z","title":"Distribution Aligned Semantics Adaption for Lifelong Person\n Re-Identification","summary":" In real-world scenarios, person Re-IDentification (Re-ID) systems need to be\nadaptable to changes in space and time. Therefore, the adaptation of Re-ID\nmodels to new domains while preserving previously acquired knowledge is\ncrucial, known as Lifelong person Re-IDentification (LReID). Advanced LReID\nmethods rely on replaying exemplars from old domains and applying knowledge\ndistillation in logits with old models. However, due to privacy concerns,\nretaining previous data is inappropriate. Additionally, the fine-grained and\nopen-set characteristics of Re-ID limit the effectiveness of the distillation\nparadigm for accumulating knowledge. We argue that a Re-ID model trained on\ndiverse and challenging pedestrian images at a large scale can acquire robust\nand general human semantic knowledge. These semantics can be readily utilized\nas shared knowledge for lifelong applications. In this paper, we identify the\nchallenges and discrepancies associated with adapting a pre-trained model to\neach application domain, and introduce the Distribution Aligned Semantics\nAdaption (DASA) framework. It efficiently adjusts Batch Normalization (BN) to\nmitigate interference from data distribution discrepancy and freezes the\npre-trained convolutional layers to preserve shared knowledge. Additionally, we\npropose the lightweight Semantics Adaption (SA) module, which effectively\nadapts learned semantics to enhance pedestrian representations. Extensive\nexperiments demonstrate the remarkable superiority of our proposed framework\nover advanced LReID methods, and it exhibits significantly reduced storage\nconsumption. DASA presents a novel and cost-effective perspective on\neffectively adapting pre-trained models for LReID.\n","authors":["Qizao Wang","Xuelin Qian","Bin Li","Xiangyang Xue"],"pdf_url":"https://arxiv.org/pdf/2405.19695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19689v1","updated":"2024-05-30T05:04:01Z","published":"2024-05-30T05:04:01Z","title":"Uncertainty-aware sign language video retrieval with probability\n distribution modeling","summary":" Sign language video retrieval plays a key role in facilitating information\naccess for the deaf community. Despite significant advances in video-text\nretrieval, the complexity and inherent uncertainty of sign language preclude\nthe direct application of these techniques. Previous methods achieve the\nmapping between sign language video and text through fine-grained modal\nalignment. However, due to the scarcity of fine-grained annotation, the\nuncertainty inherent in sign language video is underestimated, limiting the\nfurther development of sign language retrieval tasks. To address this\nchallenge, we propose a novel Uncertainty-aware Probability Distribution\nRetrieval (UPRet), that conceptualizes the mapping process of sign language\nvideo and text in terms of probability distributions, explores their potential\ninterrelationships, and enables flexible mappings. Experiments on three\nbenchmarks demonstrate the effectiveness of our method, which achieves\nstate-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and\nCSL-Daily (78.4%).\n","authors":["Xuan Wu","Hongxiang Li","Yuanjiang Luo","Xuxin Cheng","Xianwei Zhuang","Meng Cao","Keren Fu"],"pdf_url":"https://arxiv.org/pdf/2405.19689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19688v1","updated":"2024-05-30T04:57:55Z","published":"2024-05-30T04:57:55Z","title":"DNPM: A Neural Parametric Model for the Synthesis of Facial Geometric\n Details","summary":" Parametric 3D models have enabled a wide variety of computer vision and\ngraphics tasks, such as modeling human faces, bodies and hands. In 3D face\nmodeling, 3DMM is the most widely used parametric model, but can't generate\nfine geometric details solely from identity and expression inputs. To tackle\nthis limitation, we propose a neural parametric model named DNPM for the facial\ngeometric details, which utilizes deep neural network to extract latent codes\nfrom facial displacement maps encoding details and wrinkles. Built upon DNPM, a\nnovel 3DMM named Detailed3DMM is proposed, which augments traditional 3DMMs by\nincluding the synthesis of facial details only from the identity and expression\ninputs. Moreover, we show that DNPM and Detailed3DMM can facilitate two\ndownstream applications: speech-driven detailed 3D facial animation and 3D face\nreconstruction from a degraded image. Extensive experiments have shown the\nusefulness of DNPM and Detailed3DMM, and the progressiveness of two proposed\napplications.\n","authors":["Haitao Cao","Baoping Cheng","Qiran Pu","Haocheng Zhang","Bin Luo","Yixiang Zhuang","Juncong Lin","Liyan Chen","Xuan Cheng"],"pdf_url":"https://arxiv.org/pdf/2405.19688v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19687v1","updated":"2024-05-30T04:57:54Z","published":"2024-05-30T04:57:54Z","title":"Autonomous Driving with Spiking Neural Networks","summary":" Autonomous driving demands an integrated approach that encompasses\nperception, prediction, and planning, all while operating under strict energy\nconstraints to enhance scalability and environmental sustainability. We present\nSpiking Autonomous Driving (\\name{}), the first unified Spiking Neural Network\n(SNN) to address the energy challenges faced by autonomous driving systems\nthrough its event-driven and energy-efficient nature. SAD is trained end-to-end\nand consists of three main modules: perception, which processes inputs from\nmulti-view cameras to construct a spatiotemporal bird's eye view; prediction,\nwhich utilizes a novel dual-pathway with spiking neurons to forecast future\nstates; and planning, which generates safe trajectories considering predicted\noccupancy, traffic rules, and ride comfort. Evaluated on the nuScenes dataset,\nSAD achieves competitive performance in perception, prediction, and planning\ntasks, while drawing upon the energy efficiency of SNNs. This work highlights\nthe potential of neuromorphic computing to be applied to energy-efficient\nautonomous driving, a critical step toward sustainable and safety-critical\nautomotive technology. Our code is available at\n\\url{https://github.com/ridgerchu/SAD}.\n","authors":["Rui-Jie Zhu","Ziqing Wang","Leilani Gilpin","Jason K. Eshraghian"],"pdf_url":"https://arxiv.org/pdf/2405.19687v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19684v1","updated":"2024-05-30T04:46:40Z","published":"2024-05-30T04:46:40Z","title":"A Comprehensive Survey on Underwater Image Enhancement Based on Deep\n Learning","summary":" Underwater image enhancement (UIE) is a challenging research task in the\nfield of computer vision. Although hundreds of UIE algorithms have been\nproposed, a comprehensive and systematic review is still lacking. To promote\nfuture research, we summarize the UIE task from multiple perspectives. First,\nthe physical models, data construction processes, evaluation metrics, and loss\nfunctions are introduced. Second, according to the contributions brought by\ndifferent literatures, recent proposed algorithms are discussed and classified\nfrom six perspectives, namely network architecture, learning strategy, learning\nstage, assistance task, domain perspective and disentanglement fusion,\nrespectively. Third, considering the inconsistencies in experimental settings\nin different literatures, a comprehensive and fair comparison does not yet\nexist. To this end, we quantitatively and qualitatively evaluate\nstate-of-the-art algorithms on multiple benchmark datasets. Finally, issues\nworthy of further research in the UIE task are raised. A collection of useful\nmaterials is available at https://github.com/YuZhao1999/UIE.\n","authors":["Xiaofeng Cong","Yu Zhao","Jie Gui","Junming Hou","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2405.19684v1.pdf","comment":"A survey on the underwater image enhancement task"},{"id":"http://arxiv.org/abs/2405.16134v2","updated":"2024-05-30T04:45:11Z","published":"2024-05-25T08:57:30Z","title":"Breaking the False Sense of Security in Backdoor Defense through\n Re-Activation Attack","summary":" Deep neural networks face persistent challenges in defending against backdoor\nattacks, leading to an ongoing battle between attacks and defenses. While\nexisting backdoor defense strategies have shown promising performance on\nreducing attack success rates, can we confidently claim that the backdoor\nthreat has truly been eliminated from the model? To address it, we\nre-investigate the characteristics of the backdoored models after defense\n(denoted as defense models). Surprisingly, we find that the original backdoors\nstill exist in defense models derived from existing post-training defense\nstrategies, and the backdoor existence is measured by a novel metric called\nbackdoor existence coefficient. It implies that the backdoors just lie dormant\nrather than being eliminated. To further verify this finding, we empirically\nshow that these dormant backdoors can be easily re-activated during inference,\nby manipulating the original trigger with well-designed tiny perturbation using\nuniversal adversarial attack. More practically, we extend our backdoor\nreactivation to black-box scenario, where the defense model can only be queried\nby the adversary during inference, and develop two effective methods, i.e.,\nquery-based and transfer-based backdoor re-activation attacks. The\neffectiveness of the proposed methods are verified on both image classification\nand multimodal contrastive learning (i.e., CLIP) tasks. In conclusion, this\nwork uncovers a critical vulnerability that has never been explored in existing\ndefense strategies, emphasizing the urgency of designing more robust and\nadvanced backdoor defense mechanisms in the future.\n","authors":["Mingli Zhu","Siyuan Liang","Baoyuan Wu"],"pdf_url":"https://arxiv.org/pdf/2405.16134v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19682v1","updated":"2024-05-30T04:37:57Z","published":"2024-05-30T04:37:57Z","title":"Fully Test-Time Adaptation for Monocular 3D Object Detection","summary":" Monocular 3D object detection (Mono 3Det) aims to identify 3D objects from a\nsingle RGB image. However, existing methods often assume training and test data\nfollow the same distribution, which may not hold in real-world test scenarios.\nTo address the out-of-distribution (OOD) problems, we explore a new adaptation\nparadigm for Mono 3Det, termed Fully Test-time Adaptation. It aims to adapt a\nwell-trained model to unlabeled test data by handling potential data\ndistribution shifts at test time without access to training data and test\nlabels. However, applying this paradigm in Mono 3Det poses significant\nchallenges due to OOD test data causing a remarkable decline in object\ndetection scores. This decline conflicts with the pre-defined score thresholds\nof existing detection methods, leading to severe object omissions (i.e., rare\npositive detections and many false negatives). Consequently, the limited\npositive detection and plenty of noisy predictions cause test-time adaptation\nto fail in Mono 3Det. To handle this problem, we propose a novel Monocular\nTest-Time Adaptation (MonoTTA) method, based on two new strategies. 1)\nReliability-driven adaptation: we empirically find that high-score objects are\nstill reliable and the optimization of high-score objects can enhance\nconfidence across all detections. Thus, we devise a self-adaptive strategy to\nidentify reliable objects for model adaptation, which discovers potential\nobjects and alleviates omissions. 2) Noise-guard adaptation: since high-score\nobjects may be scarce, we develop a negative regularization term to exploit the\nnumerous low-score objects via negative learning, preventing overfitting to\nnoise and trivial solutions. Experimental results show that MonoTTA brings\nsignificant performance gains for Mono 3Det models in OOD test scenarios,\napproximately 190% gains by average on KITTI and 198% gains on nuScenes.\n","authors":["Hongbin Lin","Yifan Zhang","Shuaicheng Niu","Shuguang Cui","Zhen Li"],"pdf_url":"https://arxiv.org/pdf/2405.19682v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19678v1","updated":"2024-05-30T04:14:58Z","published":"2024-05-30T04:14:58Z","title":"View-Consistent Hierarchical 3D SegmentationUsing Ultrametric Feature\n Fields","summary":" Large-scale vision foundation models such as Segment Anything (SAM)\ndemonstrate impressive performance in zero-shot image segmentation at multiple\nlevels of granularity. However, these zero-shot predictions are rarely\n3D-consistent. As the camera viewpoint changes in a scene, so do the\nsegmentation predictions, as well as the characterizations of ``coarse\" or\n``fine\" granularity. In this work, we address the challenging task of lifting\nmulti-granular and view-inconsistent image segmentations into a hierarchical\nand 3D-consistent representation. We learn a novel feature field within a\nNeural Radiance Field (NeRF) representing a 3D scene, whose segmentation\nstructure can be revealed at different scales by simply using different\nthresholds on feature distance. Our key idea is to learn an ultrametric feature\nspace, which unlike a Euclidean space, exhibits transitivity in distance-based\ngrouping, naturally leading to a hierarchical clustering. Put together, our\nmethod takes view-inconsistent multi-granularity 2D segmentations as input and\nproduces a hierarchy of 3D-consistent segmentations as output. We evaluate our\nmethod and several baselines on synthetic datasets with multi-view images and\nmulti-granular segmentation, showcasing improved accuracy and\nviewpoint-consistency. We additionally provide qualitative examples of our\nmodel's 3D hierarchical segmentations in real world scenes.\\footnote{The code\nand dataset are available at:\n","authors":["Haodi He","Colton Stearns","Adam W. Harley","Leonidas J. Guibas"],"pdf_url":"https://arxiv.org/pdf/2405.19678v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19675v1","updated":"2024-05-30T04:04:36Z","published":"2024-05-30T04:04:36Z","title":"Knowledge-grounded Adaptation Strategy for Vision-language Models:\n Building Unique Case-set for Screening Mammograms for Residents Training","summary":" A visual-language model (VLM) pre-trained on natural images and text pairs\nposes a significant barrier when applied to medical contexts due to domain\nshift. Yet, adapting or fine-tuning these VLMs for medical use presents\nconsiderable hurdles, including domain misalignment, limited access to\nextensive datasets, and high-class imbalances. Hence, there is a pressing need\nfor strategies to effectively adapt these VLMs to the medical domain, as such\nadaptations would prove immensely valuable in healthcare applications. In this\nstudy, we propose a framework designed to adeptly tailor VLMs to the medical\ndomain, employing selective sampling and hard-negative mining techniques for\nenhanced performance in retrieval tasks. We validate the efficacy of our\nproposed approach by implementing it across two distinct VLMs: the in-domain\nVLM (MedCLIP) and out-of-domain VLMs (ALBEF). We assess the performance of\nthese models both in their original off-the-shelf state and after undergoing\nour proposed training strategies, using two extensive datasets containing\nmammograms and their corresponding reports. Our evaluation spans zero-shot,\nfew-shot, and supervised scenarios. Through our approach, we observe a notable\nenhancement in Recall@K performance for the image-text retrieval task.\n","authors":["Aisha Urooj Khan","John Garrett","Tyler Bradshaw","Lonie Salkowski","Jiwoong Jason Jeong","Amara Tariq","Imon Banerjee"],"pdf_url":"https://arxiv.org/pdf/2405.19675v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19672v1","updated":"2024-05-30T03:56:01Z","published":"2024-05-30T03:56:01Z","title":"CRIS: Collaborative Refinement Integrated with Segmentation for Polyp\n Segmentation","summary":" Accurate detection of colorectal cancer and early prevention heavily rely on\nprecise polyp identification during gastrointestinal colonoscopy. Due to\nlimited data, many current state-of-the-art deep learning methods for polyp\nsegmentation often rely on post-processing of masks to reduce noise and enhance\nresults. In this study, we propose an approach that integrates mask refinement\nand binary semantic segmentation, leveraging a novel collaborative training\nstrategy that surpasses current widely-used refinement strategies. We\ndemonstrate the superiority of our approach through comprehensive evaluation on\nestablished benchmark datasets and its successful application across various\nmedical image segmentation architectures.\n","authors":["Ankush Gajanan Arudkar","Bernard J. E. Evans"],"pdf_url":"https://arxiv.org/pdf/2405.19672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.08816v2","updated":"2024-05-30T03:49:20Z","published":"2024-05-14T17:59:57Z","title":"The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition","summary":" In the realm of autonomous driving, robust perception under\nout-of-distribution conditions is paramount for the safe deployment of\nvehicles. Challenges such as adverse weather, sensor malfunctions, and\nenvironmental unpredictability can severely impact the performance of\nautonomous systems. The 2024 RoboDrive Challenge was crafted to propel the\ndevelopment of driving perception technologies that can withstand and adapt to\nthese real-world variabilities. Focusing on four pivotal tasks -- BEV\ndetection, map segmentation, semantic occupancy prediction, and multi-view\ndepth estimation -- the competition laid down a gauntlet to innovate and\nenhance system resilience against typical and atypical disturbances. This\nyear's challenge consisted of five distinct tracks and attracted 140 registered\nteams from 93 institutes across 11 countries, resulting in nearly one thousand\nsubmissions evaluated through our servers. The competition culminated in 15\ntop-performing solutions, which introduced a range of innovative approaches\nincluding advanced data augmentation, multi-sensor fusion, self-supervised\nlearning for error correction, and new algorithmic strategies to enhance sensor\nrobustness. These contributions significantly advanced the state of the art,\nparticularly in handling sensor inconsistencies and environmental variability.\nParticipants, through collaborative efforts, pushed the boundaries of current\ntechnologies, showcasing their potential in real-world scenarios. Extensive\nevaluations and analyses provided insights into the effectiveness of these\nsolutions, highlighting key trends and successful strategies for improving the\nresilience of driving perception systems. This challenge has set a new\nbenchmark in the field, providing a rich repository of techniques expected to\nguide future research in this field.\n","authors":["Lingdong Kong","Shaoyuan Xie","Hanjiang Hu","Yaru Niu","Wei Tsang Ooi","Benoit R. Cottereau","Lai Xing Ng","Yuexin Ma","Wenwei Zhang","Liang Pan","Kai Chen","Ziwei Liu","Weichao Qiu","Wei Zhang","Xu Cao","Hao Lu","Ying-Cong Chen","Caixin Kang","Xinning Zhou","Chengyang Ying","Wentao Shang","Xingxing Wei","Yinpeng Dong","Bo Yang","Shengyin Jiang","Zeliang Ma","Dengyi Ji","Haiwen Li","Xingliang Huang","Yu Tian","Genghua Kou","Fan Jia","Yingfei Liu","Tiancai Wang","Ying Li","Xiaoshuai Hao","Yifan Yang","Hui Zhang","Mengchuan Wei","Yi Zhou","Haimei Zhao","Jing Zhang","Jinke Li","Xiao He","Xiaoqiang Cheng","Bingyang Zhang","Lirong Zhao","Dianlei Ding","Fangsheng Liu","Yixiang Yan","Hongming Wang","Nanfei Ye","Lun Luo","Yubo Tian","Yiwei Zuo","Zhe Cao","Yi Ren","Yunfan Li","Wenjie Liu","Xun Wu","Yifan Mao","Ming Li","Jian Liu","Jiayang Liu","Zihan Qin","Cunxi Chu","Jialei Xu","Wenbo Zhao","Junjun Jiang","Xianming Liu","Ziyan Wang","Chiwei Li","Shilong Li","Chendong Yuan","Songyue Yang","Wentao Liu","Peng Chen","Bin Zhou","Yubo Wang","Chi Zhang","Jianhang Sun","Hai Chen","Xiao Yang","Lizhong Wang","Dongyi Fu","Yongchun Lin","Huitong Yang","Haoang Li","Yadan Luo","Xianjing Cheng","Yong Xu"],"pdf_url":"https://arxiv.org/pdf/2405.08816v2.pdf","comment":"ICRA 2024; 32 pages, 24 figures, 5 tables; Code at\n https://robodrive-24.github.io/"},{"id":"http://arxiv.org/abs/2405.19671v1","updated":"2024-05-30T03:46:59Z","published":"2024-05-30T03:46:59Z","title":"GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and\n Monocular Cues for Indoor Scene Reconstruction","summary":" Recently, 3D Gaussian Splatting(3DGS) has revolutionized neural rendering\nwith its high-quality rendering and real-time speed. However, when it comes to\nindoor scenes with a significant number of textureless areas, 3DGS yields\nincomplete and noisy reconstruction results due to the poor initialization of\nthe point cloud and under-constrained optimization. Inspired by the continuity\nof signed distance field (SDF), which naturally has advantages in modeling\nsurfaces, we present a unified optimizing framework integrating neural SDF with\n3DGS. This framework incorporates a learnable neural SDF field to guide the\ndensification and pruning of Gaussians, enabling Gaussians to accurately model\nscenes even with poor initialized point clouds. At the same time, the geometry\nrepresented by Gaussians improves the efficiency of the SDF field by piloting\nits point sampling. Additionally, we regularize the optimization with normal\nand edge priors to eliminate geometry ambiguity in textureless areas and\nimprove the details. Extensive experiments in ScanNet and ScanNet++ show that\nour method achieves state-of-the-art performance in both surface reconstruction\nand novel view synthesis.\n","authors":["Haodong Xiang","Xinghui Li","Xiansong Lai","Wanting Zhang","Zhichao Liao","Kai Cheng","Xueping Liu"],"pdf_url":"https://arxiv.org/pdf/2405.19671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19669v1","updated":"2024-05-30T03:38:44Z","published":"2024-05-30T03:38:44Z","title":"Texture-guided Coding for Deep Features","summary":" With the rapid development of machine vision technology in recent years, many\nresearchers have begun to focus on feature compression that is better suited\nfor machine vision tasks. The target of feature compression is deep features,\nwhich arise from convolution in the middle layer of a pre-trained convolutional\nneural network. However, due to the large volume of data and high level of\nabstraction of deep features, their application is primarily limited to\nmachine-centric scenarios, which poses significant constraints in situations\nrequiring human-computer interaction. This paper investigates features and\ntextures and proposes a texture-guided feature compression strategy based on\ntheir characteristics. Specifically, the strategy comprises feature layers and\ntexture layers. The feature layers serve the machine, including a feature\nselection module and a feature reconstruction network. With the assistance of\ntexture images, they selectively compress and transmit channels relevant to\nvisual tasks, reducing feature data while providing high-quality features for\nthe machine. The texture layers primarily serve humans and consist of an image\nreconstruction network. This image reconstruction network leverages features\nand texture images to reconstruct preview images for humans. Our method fully\nexploits the characteristics of texture and features. It eliminates feature\nredundancy, reconstructs high-quality preview images for humans, and supports\ndecision-making. The experimental results demonstrate excellent performance\nwhen employing our proposed method to compress the deep features.\n","authors":["Lei Xiong","Xin Luo","Zihao Wang","Chaofan He","Shuyuan Zhu","Bing Zeng"],"pdf_url":"https://arxiv.org/pdf/2405.19669v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19668v1","updated":"2024-05-30T03:38:31Z","published":"2024-05-30T03:38:31Z","title":"AutoBreach: Universal and Adaptive Jailbreaking with Efficient\n Wordplay-Guided Optimization","summary":" Despite the widespread application of large language models (LLMs) across\nvarious tasks, recent studies indicate that they are susceptible to jailbreak\nattacks, which can render their defense mechanisms ineffective. However,\nprevious jailbreak research has frequently been constrained by limited\nuniversality, suboptimal efficiency, and a reliance on manual crafting. In\nresponse, we rethink the approach to jailbreaking LLMs and formally define\nthree essential properties from the attacker' s perspective, which contributes\nto guiding the design of jailbreak methods. We further introduce AutoBreach, a\nnovel method for jailbreaking LLMs that requires only black-box access.\nInspired by the versatility of wordplay, AutoBreach employs a wordplay-guided\nmapping rule sampling strategy to generate a variety of universal mapping rules\nfor creating adversarial prompts. This generation process leverages LLMs'\nautomatic summarization and reasoning capabilities, thus alleviating the manual\nburden. To boost jailbreak success rates, we further suggest sentence\ncompression and chain-of-thought-based mapping rules to correct errors and\nwordplay misinterpretations in target LLMs. Additionally, we propose a\ntwo-stage mapping rule optimization strategy that initially optimizes mapping\nrules before querying target LLMs to enhance the efficiency of AutoBreach.\nAutoBreach can efficiently identify security vulnerabilities across various\nLLMs, including three proprietary models: Claude-3, GPT-3.5, GPT-4 Turbo, and\ntwo LLMs' web platforms: Bingchat, GPT-4 Web, achieving an average success rate\nof over 80% with fewer than 10 queries\n","authors":["Jiawei Chen","Xiao Yang","Zhengwei Fang","Yu Tian","Yinpeng Dong","Zhaoxia Yin","Hang Su"],"pdf_url":"https://arxiv.org/pdf/2405.19668v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2403.16578v4","updated":"2024-05-30T03:35:06Z","published":"2024-03-25T09:43:56Z","title":"SegICL: A Multimodal In-context Learning Framework for Enhanced\n Segmentation in Medical Imaging","summary":" In the field of medical image segmentation, tackling Out-of-Distribution\n(OOD) segmentation tasks in a cost-effective manner remains a significant\nchallenge. Universal segmentation models is a solution, which aim to generalize\nacross the diverse modality of medical images, yet their effectiveness often\ndiminishes when applied to OOD data modalities and tasks, requiring intricate\nfine-tuning of model for optimal performance. Few-shot learning segmentation\nmethods are typically designed for specific modalities of data and cannot be\ndirectly transferred for use with another modality. Therefore, we introduce\nSegICL, a novel approach leveraging In-Context Learning (ICL) for image\nsegmentation. Unlike existing methods, SegICL has the capability to employ\ntext-guided segmentation and conduct in-context learning with a small set of\nimage-mask pairs, eliminating the need for training the model from scratch or\nfine-tuning for OOD tasks (including OOD modality and dataset). Extensive\nexperimental demonstrates a positive correlation between the number of shots\nand segmentation performance on OOD tasks. The performance of segmentation when\nprovided thre-shots is approximately 1.5 times better than the performance in a\nzero-shot setting. This indicates that SegICL effectively address new\nsegmentation tasks based on contextual information. Additionally, SegICL also\nexhibits comparable performance to mainstream models on OOD and in-distribution\ntasks. Our code will be released after paper review.\n","authors":["Lingdong Shen","Fangxin Shang","Xiaoshuang Huang","Yehui Yang","Haifeng Huang","Shiming Xiang"],"pdf_url":"https://arxiv.org/pdf/2403.16578v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19988v1","updated":"2024-05-30T12:18:06Z","published":"2024-05-30T12:18:06Z","title":"Video-Language Critic: Transferable Reward Functions for\n Language-Conditioned Robotics","summary":" Natural language is often the easiest and most convenient modality for humans\nto specify tasks for robots. However, learning to ground language to behavior\ntypically requires impractical amounts of diverse, language-annotated\ndemonstrations collected on each target robot. In this work, we aim to separate\nthe problem of what to accomplish from how to accomplish it, as the former can\nbenefit from substantial amounts of external observation-only data, and only\nthe latter depends on a specific robot embodiment. To this end, we propose\nVideo-Language Critic, a reward model that can be trained on readily available\ncross-embodiment data using contrastive learning and a temporal ranking\nobjective, and use it to score behavior traces from a separate reinforcement\nlearning actor. When trained on Open X-Embodiment data, our reward model\nenables 2x more sample-efficient policy training on Meta-World tasks than a\nsparse reward only, despite a significant domain gap. Using in-domain data but\nin a challenging task generalization setting on Meta-World, we further\ndemonstrate more sample-efficient training than is possible with prior\nlanguage-conditioned reward models that are either trained with binary\nclassification, use static images, or do not leverage the temporal information\npresent in video data.\n","authors":["Minttu Alakuijala","Reginald McLean","Isaac Woungang","Nariman Farsad","Samuel Kaski","Pekka Marttinen","Kai Yuan"],"pdf_url":"https://arxiv.org/pdf/2405.19988v1.pdf","comment":"10 pages in the main text, 16 pages including references and\n supplementary materials. 4 figures and 3 tables in the main text, 1 table in\n supplementary materials"},{"id":"http://arxiv.org/abs/2405.19730v1","updated":"2024-05-30T06:21:34Z","published":"2024-05-30T06:21:34Z","title":"Research on Foundation Model for Spatial Data Intelligence: China's 2024\n White Paper on Strategic Development of Spatial Data Intelligence","summary":" This report focuses on spatial data intelligent large models, delving into\nthe principles, methods, and cutting-edge applications of these models. It\nprovides an in-depth discussion on the definition, development history, current\nstatus, and trends of spatial data intelligent large models, as well as the\nchallenges they face. The report systematically elucidates the key technologies\nof spatial data intelligent large models and their applications in urban\nenvironments, aerospace remote sensing, geography, transportation, and other\nscenarios. Additionally, it summarizes the latest application cases of spatial\ndata intelligent large models in themes such as urban development, multimodal\nsystems, remote sensing, smart transportation, and resource environments.\nFinally, the report concludes with an overview and outlook on the development\nprospects of spatial data intelligent large models.\n","authors":["Shaohua Wang","Xing Xie","Yong Li","Danhuai Guo","Zhi Cai","Yu Liu","Yang Yue","Xiao Pan","Feng Lu","Huayi Wu","Zhipeng Gui","Zhiming Ding","Bolong Zheng","Fuzheng Zhang","Tao Qin","Jingyuan Wang","Chuang Tao","Zhengchao Chen","Hao Lu","Jiayi Li","Hongyang Chen","Peng Yue","Wenhao Yu","Yao Yao","Leilei Sun","Yong Zhang","Longbiao Chen","Xiaoping Du","Xiang Li","Xueying Zhang","Kun Qin","Zhaoya Gong","Weihua Dong","Xiaofeng Meng"],"pdf_url":"https://arxiv.org/pdf/2405.19730v1.pdf","comment":"in Chinese language"},{"id":"http://arxiv.org/abs/2405.20525v1","updated":"2024-05-30T22:56:15Z","published":"2024-05-30T22:56:15Z","title":"Comparing Quantum Annealing and Spiking Neuromorphic Computing for\n Sampling Binary Sparse Coding QUBO Problems","summary":" We consider the problem of computing a sparse binary representation of an\nimage. To be precise, given an image and an overcomplete, non-orthonormal\nbasis, we aim to find a sparse binary vector indicating the minimal set of\nbasis vectors that when added together best reconstruct the given input. We\nformulate this problem with an $L_2$ loss on the reconstruction error, and an\n$L_0$ (or, equivalently, an $L_1$) loss on the binary vector enforcing\nsparsity. This yields a quadratic binary optimization problem (QUBO), whose\noptimal solution(s) in general is NP-hard to find. The method of unsupervised\nand unnormalized dictionary feature learning for a desired sparsity level to\nbest match the data is presented. Next, we solve the sparse representation QUBO\nby implementing it both on a D-Wave quantum annealer with Pegasus chip\nconnectivity via minor embedding, as well as on the Intel Loihi 2 spiking\nneuromorphic processor. On the quantum annealer, we sample from the sparse\nrepresentation QUBO using parallel quantum annealing combined with quantum\nevolution Monte Carlo, also known as iterated reverse annealing. On Loihi 2, we\nuse a stochastic winner take all network of neurons. The solutions are\nbenchmarked against simulated annealing, a classical heuristic, and the optimal\nsolutions are computed using CPLEX. Iterated reverse quantum annealing performs\nsimilarly to simulated annealing, although simulated annealing is always able\nto sample the optimal solution whereas quantum annealing was not always able\nto. The Loihi 2 solutions that are sampled are on average more sparse than the\nsolutions from any of the other methods. Loihi 2 outperforms a D-Wave quantum\nannealer standard linear-schedule anneal, while iterated reverse quantum\nannealing performs much better than both unmodified linear-schedule quantum\nannealing and iterated warm starting on Loihi 2.\n","authors":["Kyle Henke","Elijah Pelofske","Garrett Kenyon","Georg Hahn"],"pdf_url":"https://arxiv.org/pdf/2405.20525v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.17138v2","updated":"2024-05-30T22:22:54Z","published":"2023-11-28T18:59:06Z","title":"Shadows Don't Lie and Lines Can't Bend! Generative Models don't know\n Projective Geometry...for now","summary":" Generative models can produce impressively realistic images. This paper\ndemonstrates that generated images have geometric features different from those\nof real images. We build a set of collections of generated images, prequalified\nto fool simple, signal-based classifiers into believing they are real. We then\nshow that prequalified generated images can be identified reliably by\nclassifiers that only look at geometric properties. We use three such\nclassifiers. All three classifiers are denied access to image pixels, and look\nonly at derived geometric features. The first classifier looks at the\nperspective field of the image, the second looks at lines detected in the\nimage, and the third looks at relations between detected objects and shadows.\nOur procedure detects generated images more reliably than SOTA local signal\nbased detectors, for images from a number of distinct generators. Saliency maps\nsuggest that the classifiers can identify geometric problems reliably. We\nconclude that current generators cannot reliably reproduce geometric properties\nof real images.\n","authors":["Ayush Sarkar","Hanlin Mai","Amitabh Mahapatra","Svetlana Lazebnik","D. A. Forsyth","Anand Bhattad"],"pdf_url":"https://arxiv.org/pdf/2311.17138v2.pdf","comment":"Project Page: https://projective-geometry.github.io | First three\n authors contributed equally"},{"id":"http://arxiv.org/abs/2405.20513v1","updated":"2024-05-30T22:13:17Z","published":"2024-05-30T22:13:17Z","title":"Deep Modeling of Non-Gaussian Aleatoric Uncertainty","summary":" Deep learning offers promising new ways to accurately model aleatoric\nuncertainty in robotic estimation systems, particularly when the uncertainty\ndistributions do not conform to traditional assumptions of being fixed and\nGaussian. In this study, we formulate and evaluate three fundamental deep\nlearning approaches for conditional probability density modeling to quantify\nnon-Gaussian aleatoric uncertainty: parametric, discretized, and generative\nmodeling. We systematically compare the respective strengths and weaknesses of\nthese three methods on simulated non-Gaussian densities as well as on\nreal-world terrain-relative navigation data. Our results show that these deep\nlearning methods can accurately capture complex uncertainty patterns,\nhighlighting their potential for improving the reliability and robustness of\nestimation systems.\n","authors":["Aastha Acharya","Caleb Lee","Marissa D'Alonzo","Jared Shamwell","Nisar R. Ahmed","Rebecca Russell"],"pdf_url":"https://arxiv.org/pdf/2405.20513v1.pdf","comment":"8 pages, 7 figures"},{"id":"http://arxiv.org/abs/2405.20510v1","updated":"2024-05-30T21:59:29Z","published":"2024-05-30T21:59:29Z","title":"Physically Compatible 3D Object Modeling from a Single Image","summary":" We present a computational framework that transforms single images into 3D\nphysical objects. The visual geometry of a physical object in an image is\ndetermined by three orthogonal attributes: mechanical properties, external\nforces, and rest-shape geometry. Existing single-view 3D reconstruction methods\noften overlook this underlying composition, presuming rigidity or neglecting\nexternal forces. Consequently, the reconstructed objects fail to withstand\nreal-world physical forces, resulting in instability or undesirable deformation\n-- diverging from their intended designs as depicted in the image. Our\noptimization framework addresses this by embedding physical compatibility into\nthe reconstruction process. We explicitly decompose the three physical\nattributes and link them through static equilibrium, which serves as a hard\nconstraint, ensuring that the optimized physical shapes exhibit desired\nphysical behaviors. Evaluations on a dataset collected from Objaverse\ndemonstrate that our framework consistently enhances the physical realism of 3D\nmodels over existing methods. The utility of our framework extends to practical\napplications in dynamic simulations and 3D printing, where adherence to\nphysical compatibility is paramount.\n","authors":["Minghao Guo","Bohan Wang","Pingchuan Ma","Tianyuan Zhang","Crystal Elaine Owens","Chuang Gan","Joshua B. Tenenbaum","Kaiming He","Wojciech Matusik"],"pdf_url":"https://arxiv.org/pdf/2405.20510v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20501v1","updated":"2024-05-30T21:42:54Z","published":"2024-05-30T21:42:54Z","title":"ShelfHelp: Empowering Humans to Perform Vision-Independent Manipulation\n Tasks with a Socially Assistive Robotic Cane","summary":" The ability to shop independently, especially in grocery stores, is important\nfor maintaining a high quality of life. This can be particularly challenging\nfor people with visual impairments (PVI). Stores carry thousands of products,\nwith approximately 30,000 new products introduced each year in the US market\nalone, presenting a challenge even for modern computer vision solutions.\nThrough this work, we present a proof-of-concept socially assistive robotic\nsystem we call ShelfHelp, and propose novel technical solutions for enhancing\ninstrumented canes traditionally meant for navigation tasks with additional\ncapability within the domain of shopping. ShelfHelp includes a novel visual\nproduct locator algorithm designed for use in grocery stores and a novel\nplanner that autonomously issues verbal manipulation guidance commands to guide\nthe user during product retrieval. Through a human subjects study, we show the\nsystem's success in locating and providing effective manipulation guidance to\nretrieve desired products with novice users. We compare two autonomous verbal\nguidance modes achieving comparable performance to a human assistance baseline\nand present encouraging findings that validate our system's efficiency and\neffectiveness and through positive subjective metrics including competence,\nintelligence, and ease of use.\n","authors":["Shivendra Agrawal","Suresh Nayak","Ashutosh Naik","Bradley Hayes"],"pdf_url":"https://arxiv.org/pdf/2405.20501v1.pdf","comment":"8 pages, 14 figures and charts"},{"id":"http://arxiv.org/abs/2405.20494v1","updated":"2024-05-30T21:35:48Z","published":"2024-05-30T21:35:48Z","title":"Slight Corruption in Pre-training Data Makes Better Diffusion Models","summary":" Diffusion models (DMs) have shown remarkable capabilities in generating\nrealistic high-quality images, audios, and videos. They benefit significantly\nfrom extensive pre-training on large-scale datasets, including web-crawled data\nwith paired data and conditions, such as image-text and image-class pairs.\nDespite rigorous filtering, these pre-training datasets often inevitably\ncontain corrupted pairs where conditions do not accurately describe the data.\nThis paper presents the first comprehensive study on the impact of such\ncorruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K\nand CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical\nfindings reveal that various types of slight corruption in pre-training can\nsignificantly enhance the quality, diversity, and fidelity of the generated\nimages across different DMs, both during pre-training and downstream adaptation\nstages. Theoretically, we consider a Gaussian mixture model and prove that\nslight corruption in the condition leads to higher entropy and a reduced\n2-Wasserstein distance to the ground truth of the data distribution generated\nby the corruptly trained DMs. Inspired by our analysis, we propose a simple\nmethod to improve the training of DMs on practical datasets by adding condition\nembedding perturbations (CEP). CEP significantly improves the performance of\nvarious DMs in both pre-training and downstream tasks. We hope that our study\nprovides new insights into understanding the data and pre-training processes of\nDMs.\n","authors":["Hao Chen","Yujin Han","Diganta Misra","Xiang Li","Kai Hu","Difan Zou","Masashi Sugiyama","Jindong Wang","Bhiksha Raj"],"pdf_url":"https://arxiv.org/pdf/2405.20494v1.pdf","comment":"50 pages, 33 figures, 4 tables"},{"id":"http://arxiv.org/abs/2309.02691v3","updated":"2024-05-30T21:16:29Z","published":"2023-09-06T03:54:57Z","title":"A Joint Study of Phrase Grounding and Task Performance in Vision and\n Language Models","summary":" Key to tasks that require reasoning about natural language in visual contexts\nis grounding words and phrases to image regions. However, observing this\ngrounding in contemporary models is complex, even if it is generally expected\nto take place if the task is addressed in a way that is conductive to\ngeneralization. We propose a framework to jointly study task performance and\nphrase grounding, and propose three benchmarks to study the relation between\nthe two. Our results show that contemporary models demonstrate inconsistency\nbetween their ability to ground phrases and solve tasks. We show how this can\nbe addressed through brute-force training on ground phrasing annotations, and\nanalyze the dynamics it creates. Code and at available at\nhttps://github.com/lil-lab/phrase_grounding.\n","authors":["Noriyuki Kojima","Hadar Averbuch-Elor","Yoav Artzi"],"pdf_url":"https://arxiv.org/pdf/2309.02691v3.pdf","comment":"This was published in TMLR in 2024, on January 24th"},{"id":"http://arxiv.org/abs/2402.03299v4","updated":"2024-05-30T21:14:26Z","published":"2024-02-05T18:54:43Z","title":"GUARD: Role-playing to Generate Natural-language Jailbreakings to Test\n Guideline Adherence of Large Language Models","summary":" The discovery of \"jailbreaks\" to bypass safety filters of Large Language\nModels (LLMs) and harmful responses have encouraged the community to implement\nsafety measures. One major safety measure is to proactively test the LLMs with\njailbreaks prior to the release. Therefore, such testing will require a method\nthat can generate jailbreaks massively and efficiently. In this paper, we\nfollow a novel yet intuitive strategy to generate jailbreaks in the style of\nthe human generation. We propose a role-playing system that assigns four\ndifferent roles to the user LLMs to collaborate on new jailbreaks. Furthermore,\nwe collect existing jailbreaks and split them into different independent\ncharacteristics using clustering frequency and semantic patterns sentence by\nsentence. We organize these characteristics into a knowledge graph, making them\nmore accessible and easier to retrieve. Our system of different roles will\nleverage this knowledge graph to generate new jailbreaks, which have proved\neffective in inducing LLMs to generate unethical or guideline-violating\nresponses. In addition, we also pioneer a setting in our system that will\nautomatically follow the government-issued guidelines to generate jailbreaks to\ntest whether LLMs follow the guidelines accordingly. We refer to our system as\nGUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We have\nempirically validated the effectiveness of GUARD on three cutting-edge\nopen-sourced LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a\nwidely-utilized commercial LLM (ChatGPT). Moreover, our work extends to the\nrealm of vision language models (MiniGPT-v2 and Gemini Vision Pro), showcasing\nGUARD's versatility and contributing valuable insights for the development of\nsafer, more reliable LLM-based applications across diverse modalities.\n","authors":["Haibo Jin","Ruoxi Chen","Andy Zhou","Yang Zhang","Haohan Wang"],"pdf_url":"https://arxiv.org/pdf/2402.03299v4.pdf","comment":"28 papges"},{"id":"http://arxiv.org/abs/2402.01335v2","updated":"2024-05-30T21:04:36Z","published":"2024-02-02T11:40:27Z","title":"Simulator-Free Visual Domain Randomization via Video Games","summary":" Domain randomization is an effective computer vision technique for improving\ntransferability of vision models across visually distinct domains exhibiting\nsimilar content. Existing approaches, however, rely extensively on tweaking\ncomplex and specialized simulation engines that are difficult to construct,\nsubsequently affecting their feasibility and scalability. This paper introduces\nBehAVE, a video understanding framework that uniquely leverages the plethora of\nexisting commercial video games for domain randomization, without requiring\naccess to their simulation engines. Under BehAVE (1) the inherent rich visual\ndiversity of video games acts as the source of randomization and (2) player\nbehavior -- represented semantically via textual descriptions of actions --\nguides the *alignment* of videos with similar content. We test BehAVE on 25\ngames of the first-person shooter (FPS) genre across various video and text\nfoundation models and we report its robustness for domain randomization. BehAVE\nsuccessfully aligns player behavioral patterns and is able to zero-shot\ntransfer them to multiple unseen FPS games when trained on just one FPS game.\nIn a more challenging setting, BehAVE manages to improve the zero-shot\ntransferability of foundation models to unseen FPS games (up to 22%) even when\ntrained on a game of a different genre (Minecraft). Code and dataset can be\nfound at https://github.com/nrasajski/BehAVE.\n","authors":["Chintan Trivedi","Nemanja Rašajski","Konstantinos Makantasis","Antonios Liapis","Georgios N. Yannakakis"],"pdf_url":"https://arxiv.org/pdf/2402.01335v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.10636v5","updated":"2024-05-30T20:58:39Z","published":"2022-11-19T09:57:01Z","title":"EVEREST: Efficient Masked Video Autoencoder by Removing Redundant\n Spatiotemporal Tokens","summary":" Masked Video Autoencoder (MVA) approaches have demonstrated their potential\nby significantly outperforming previous video representation learning methods.\nHowever, they waste an excessive amount of computations and memory in\npredicting uninformative tokens/frames due to random masking strategies. (e.g.,\nover 16 nodes with 128 NVIDIA A100 GPUs). To resolve this issue, we exploit the\nunequal information density among the patches in videos and propose EVEREST, a\nsurprisingly efficient MVA approach for video representation learning that\nfinds tokens containing rich motion features and discards uninformative ones\nduring both pre-training and fine-tuning. We further present an\ninformation-intensive frame selection strategy that allows the model to focus\non informative and causal frames with minimal redundancy. Our method\nsignificantly reduces the computation and memory requirements of MVA, enabling\nthe pre-training and fine-tuning on a single machine with 8 GPUs while\nachieving comparable performance to computation- and memory-heavy baselines on\nmultiple benchmarks and the uncurated Ego4D dataset. We hope that our work\ncontributes to reducing the barrier to further research on video understanding.\n","authors":["Sunil Hwang","Jaehong Yoon","Youngwan Lee","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2211.10636v5.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2405.20470v1","updated":"2024-05-30T20:41:12Z","published":"2024-05-30T20:41:12Z","title":"STHN: Deep Homography Estimation for UAV Thermal Geo-localization with\n Satellite Imagery","summary":" Accurate geo-localization of Unmanned Aerial Vehicles (UAVs) is crucial for a\nvariety of outdoor applications including search and rescue operations, power\nline inspections, and environmental monitoring. The vulnerability of Global\nNavigation Satellite Systems (GNSS) signals to interference and spoofing\nnecessitates the development of additional robust localization methods for\nautonomous navigation. Visual Geo-localization (VG), leveraging onboard cameras\nand reference satellite maps, offers a promising solution for absolute\nlocalization. Specifically, Thermal Geo-localization (TG), which relies on\nimage-based matching between thermal imagery with satellite databases, stands\nout by utilizing infrared cameras for effective night-time localization.\nHowever, the efficiency and effectiveness of current TG approaches, are\nhindered by dense sampling on satellite maps and geometric noises in thermal\nquery images. To overcome these challenges, in this paper, we introduce STHN, a\nnovel UAV thermal geo-localization approach that employs a coarse-to-fine deep\nhomography estimation method. This method attains reliable thermal\ngeo-localization within a 512-meter radius of the UAV's last known location\neven with a challenging 11% overlap between satellite and thermal images,\ndespite the presence of indistinct textures in thermal imagery and self-similar\npatterns in both spectra. Our research significantly enhances UAV thermal\ngeo-localization performance and robustness against the impacts of geometric\nnoises under low-visibility conditions in the wild. The code will be made\npublicly available.\n","authors":["Jiuhong Xiao","Ning Zhang","Daniel Tortei","Giuseppe Loianno"],"pdf_url":"https://arxiv.org/pdf/2405.20470v1.pdf","comment":"8 pages, 7 figures. This work has been submitted to the IEEE for\n possible publication. Copyright may be transferred without notice, after\n which this version may no longer be accessible"},{"id":"http://arxiv.org/abs/2405.20469v1","updated":"2024-05-30T20:37:34Z","published":"2024-05-30T20:37:34Z","title":"Is Synthetic Data all We Need? Benchmarking the Robustness of Models\n Trained with Synthetic Images","summary":" A long-standing challenge in developing machine learning approaches has been\nthe lack of high-quality labeled data. Recently, models trained with purely\nsynthetic data, here termed synthetic clones, generated using large-scale\npre-trained diffusion models have shown promising results in overcoming this\nannotation bottleneck. As these synthetic clone models progress, they are\nlikely to be deployed in challenging real-world settings, yet their suitability\nremains understudied. Our work addresses this gap by providing the first\nbenchmark for three classes of synthetic clone models, namely supervised,\nself-supervised, and multi-modal ones, across a range of robustness measures.\nWe show that existing synthetic self-supervised and multi-modal clones are\ncomparable to or outperform state-of-the-art real-image baselines for a range\nof robustness metrics - shape bias, background bias, calibration, etc. However,\nwe also find that synthetic clones are much more susceptible to adversarial and\nreal-world noise than models trained with real data. To address this, we find\nthat combining both real and synthetic data further increases the robustness,\nand that the choice of prompt used for generating synthetic images plays an\nimportant part in the robustness of synthetic clones.\n","authors":["Krishnakant Singh","Thanush Navaratnam","Jannik Holmer","Simone Schaub-Meyer","Stefan Roth"],"pdf_url":"https://arxiv.org/pdf/2405.20469v1.pdf","comment":"Accepted at CVPR 2024 Workshop: SyntaGen-Harnessing Generative Models\n for Synthetic Visual Datasets. Project page at\n https://synbenchmark.github.io/SynCloneBenchmark"},{"id":"http://arxiv.org/abs/2405.20465v1","updated":"2024-05-30T20:26:47Z","published":"2024-05-30T20:26:47Z","title":"ENTIRe-ID: An Extensive and Diverse Dataset for Person Re-Identification","summary":" The growing importance of person reidentification in computer vision has\nhighlighted the need for more extensive and diverse datasets. In response, we\nintroduce the ENTIRe-ID dataset, an extensive collection comprising over 4.45\nmillion images from 37 different cameras in varied environments. This dataset\nis uniquely designed to tackle the challenges of domain variability and model\ngeneralization, areas where existing datasets for person re-identification have\nfallen short. The ENTIRe-ID dataset stands out for its coverage of a wide array\nof real-world scenarios, encompassing various lighting conditions, angles of\nview, and diverse human activities. This design ensures a realistic and robust\ntraining platform for ReID models. The ENTIRe-ID dataset is publicly available\nat https://serdaryildiz.github.io/ENTIRe-ID\n","authors":["Serdar Yildiz","Ahmet Nezih Kasim"],"pdf_url":"https://arxiv.org/pdf/2405.20465v1.pdf","comment":"5 pages, 2024 18th International Conference on Automatic Face and\n Gesture Recognition (FG)"},{"id":"http://arxiv.org/abs/2405.20462v1","updated":"2024-05-30T20:19:42Z","published":"2024-05-30T20:19:42Z","title":"Multi-Label Guided Soft Contrastive Learning for Efficient Earth\n Observation Pretraining","summary":" Self-supervised pretraining on large-scale satellite data has raised great\ninterest in building Earth observation (EO) foundation models. However, many\nimportant resources beyond pure satellite imagery, such as land-cover-land-use\nproducts that provide free global semantic information, as well as vision\nfoundation models that hold strong knowledge of the natural world, tend to be\noverlooked. In this work, we show these free additional resources not only help\nresolve common contrastive learning bottlenecks, but also significantly boost\nthe efficiency and effectiveness of EO pretraining.\n Specifically, we first propose soft contrastive learning that optimizes\ncross-scene soft similarity based on land-cover-generated multi-label\nsupervision, naturally solving the issue of multiple positive samples and too\nstrict positive matching in complex scenes. Second, we explore cross-domain\ncontinual pretraining for both multispectral and SAR imagery, building\nefficient EO foundation models from strongest vision models such as DINOv2.\nIntegrating simple weight-initialization and Siamese masking strategies into\nour soft contrastive learning framework, we demonstrate impressive continual\npretraining performance even when the input channels and modalities are not\naligned.\n Without prohibitive training, we produce multispectral and SAR foundation\nmodels that achieve significantly better results in 9 out of 10 downstream\ntasks than most existing SOTA models. For example, our ResNet50/ViT-S achieve\n84.8/85.0 linear probing mAP scores on BigEarthNet-10\\% which are better than\nmost existing ViT-L models; under the same setting, our ViT-B sets a new record\nof 86.8 in multispectral, and 82.5 in SAR, the latter even better than many\nmultispectral models. Dataset and models are available at\nhttps://github.com/zhu-xlab/softcon.\n","authors":["Yi Wang","Conrad M Albrecht","Xiao Xiang Zhu"],"pdf_url":"https://arxiv.org/pdf/2405.20462v1.pdf","comment":"16 pages, 9 figures"},{"id":"http://arxiv.org/abs/2405.20459v1","updated":"2024-05-30T20:12:14Z","published":"2024-05-30T20:12:14Z","title":"On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines","summary":" Reliable usage of object detectors require them to be calibrated -- a crucial\nproblem that requires careful attention. Recent approaches towards this involve\n(1) designing new loss functions to obtain calibrated detectors by training\nthem from scratch, and (2) post-hoc Temperature Scaling (TS) that learns to\nscale the likelihood of a trained detector to output calibrated predictions.\nThese approaches are then evaluated based on a combination of Detection\nExpected Calibration Error (D-ECE) and Average Precision. In this work, via\nextensive analysis and insights, we highlight that these recent evaluation\nframeworks, evaluation metrics, and the use of TS have notable drawbacks\nleading to incorrect conclusions. As a step towards fixing these issues, we\npropose a principled evaluation framework to jointly measure calibration and\naccuracy of object detectors. We also tailor efficient and easy-to-use post-hoc\ncalibration approaches such as Platt Scaling and Isotonic Regression\nspecifically for object detection task. Contrary to the common notion, our\nexperiments show that once designed and evaluated properly, post-hoc\ncalibrators, which are extremely cheap to build and use, are much more powerful\nand effective than the recent train-time calibration methods. To illustrate,\nD-DETR with our post-hoc Isotonic Regression calibrator outperforms the recent\ntrain-time state-of-the-art calibration method Cal-DETR by more than 7 D-ECE on\nthe COCO dataset. Additionally, we propose improved versions of the recently\nproposed Localization-aware ECE and show the efficacy of our method on these\nmetrics as well. Code is available at:\nhttps://github.com/fiveai/detection_calibration.\n","authors":["Selim Kuzucu","Kemal Oksuz","Jonathan Sadeghi","Puneet K. Dokania"],"pdf_url":"https://arxiv.org/pdf/2405.20459v1.pdf","comment":"31 pages, 8 figures"},{"id":"http://arxiv.org/abs/2405.20443v1","updated":"2024-05-30T19:40:08Z","published":"2024-05-30T19:40:08Z","title":"P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image\n Segmentation","summary":" Diffusion models and multi-scale features are essential components in\nsemantic segmentation tasks that deal with remote-sensing images. They\ncontribute to improved segmentation boundaries and offer significant contextual\ninformation. U-net-like architectures are frequently employed in diffusion\nmodels for segmentation tasks. These architectural designs include dense skip\nconnections that may pose challenges for interpreting intermediate features.\nConsequently, they might not efficiently convey semantic information throughout\nvarious layers of the encoder-decoder architecture. To address these\nchallenges, we propose a new model for semantic segmentation known as the\ndiffusion model with parallel multi-scale branches. This model consists of\nParallel Multiscale Diffusion modules (P-MSDiff) and a Cross-Bridge Linear\nAttention mechanism (CBLA). P-MSDiff enhances the understanding of semantic\ninformation across multiple levels of granularity and detects repetitive\ndistribution data through the integration of recursive denoising branches. It\nfurther facilitates the amalgamation of data by connecting relevant branches to\nthe primary framework to enable concurrent denoising. Furthermore, within the\ninterconnected transformer architecture, the LA module has been substituted\nwith the CBLA module. This module integrates a semidefinite matrix linked to\nthe query into the dot product computation of keys and values. This integration\nenables the adaptation of queries within the LA framework. This adjustment\nenhances the structure for multi-head attention computation, leading to\nenhanced network performance and CBLA is a plug-and-play module. Our model\ndemonstrates superior performance based on the J1 metric on both the UAVid and\nVaihingen Building datasets, showing improvements of 1.60% and 1.40% over\nstrong baseline models, respectively.\n","authors":["Qi Zhang","Guohua Geng","Longquan Yan","Pengbo Zhou","Zhaodi Li","Kang Li","Qinglin Liu"],"pdf_url":"https://arxiv.org/pdf/2405.20443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.01716v4","updated":"2024-05-30T19:21:41Z","published":"2023-04-04T11:25:44Z","title":"Decoupling Dynamic Monocular Videos for Dynamic View Synthesis","summary":" The challenge of dynamic view synthesis from dynamic monocular videos, i.e.,\nsynthesizing novel views for free viewpoints given a monocular video of a\ndynamic scene captured by a moving camera, mainly lies in accurately modeling\nthe \\textbf{dynamic objects} of a scene using limited 2D frames, each with a\nvarying timestamp and viewpoint. Existing methods usually require pre-processed\n2D optical flow and depth maps by off-the-shelf methods to supervise the\nnetwork, making them suffer from the inaccuracy of the pre-processed\nsupervision and the ambiguity when lifting the 2D information to 3D. In this\npaper, we tackle this challenge in an unsupervised fashion. Specifically, we\ndecouple the motion of the dynamic objects into object motion and camera\nmotion, respectively regularized by proposed unsupervised surface consistency\nand patch-based multi-view constraints. The former enforces the 3D geometric\nsurfaces of moving objects to be consistent over time, while the latter\nregularizes their appearances to be consistent across different viewpoints.\nSuch a fine-grained motion formulation can alleviate the learning difficulty\nfor the network, thus enabling it to produce not only novel views with higher\nquality but also more accurate scene flows and depth than existing methods\nrequiring extra supervision.\n","authors":["Meng You","Junhui Hou"],"pdf_url":"https://arxiv.org/pdf/2304.01716v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20431v1","updated":"2024-05-30T19:21:33Z","published":"2024-05-30T19:21:33Z","title":"Exploring the Practicality of Federated Learning: A Survey Towards the\n Communication Perspective","summary":" Federated Learning (FL) is a promising paradigm that offers significant\nadvancements in privacy-preserving, decentralized machine learning by enabling\ncollaborative training of models across distributed devices without\ncentralizing data. However, the practical deployment of FL systems faces a\nsignificant bottleneck: the communication overhead caused by frequently\nexchanging large model updates between numerous devices and a central server.\nThis communication inefficiency can hinder training speed, model performance,\nand the overall feasibility of real-world FL applications. In this survey, we\ninvestigate various strategies and advancements made in communication-efficient\nFL, highlighting their impact and potential to overcome the communication\nchallenges inherent in FL systems. Specifically, we define measures for\ncommunication efficiency, analyze sources of communication inefficiency in FL\nsystems, and provide a taxonomy and comprehensive review of state-of-the-art\ncommunication-efficient FL methods. Additionally, we discuss promising future\nresearch directions for enhancing the communication efficiency of FL systems.\nBy addressing the communication bottleneck, FL can be effectively applied and\nenable scalable and practical deployment across diverse applications that\nrequire privacy-preserving, decentralized machine learning, such as IoT,\nhealthcare, or finance.\n","authors":["Khiem Le","Nhan Luong-Ha","Manh Nguyen-Duc","Danh Le-Phuoc","Cuong Do","Kok-Seng Wong"],"pdf_url":"https://arxiv.org/pdf/2405.20431v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20420v1","updated":"2024-05-30T18:55:50Z","published":"2024-05-30T18:55:50Z","title":"Back to the Basics on Predicting Transfer Performance","summary":" In the evolving landscape of deep learning, selecting the best pre-trained\nmodels from a growing number of choices is a challenge. Transferability scorers\npropose alleviating this scenario, but their recent proliferation, ironically,\nposes the challenge of their own assessment. In this work, we propose both\nrobust benchmark guidelines for transferability scorers, and a well-founded\ntechnique to combine multiple scorers, which we show consistently improves\ntheir results. We extensively evaluate 13 scorers from literature across 11\ndatasets, comprising generalist, fine-grained, and medical imaging datasets. We\nshow that few scorers match the predictive performance of the simple raw metric\nof models on ImageNet, and that all predictors suffer on medical datasets. Our\nresults highlight the potential of combining different information sources for\nreliably predicting transferability across varied domains.\n","authors":["Levy Chaves","Eduardo Valle","Alceu Bissoto","Sandra Avila"],"pdf_url":"https://arxiv.org/pdf/2405.20420v1.pdf","comment":"15 pages, 3 figures, 2 tables"},{"id":"http://arxiv.org/abs/2405.20413v1","updated":"2024-05-30T18:38:36Z","published":"2024-05-30T18:38:36Z","title":"Jailbreaking Large Language Models Against Moderation Guardrails via\n Cipher Characters","summary":" Large Language Models (LLMs) are typically harmless but remain vulnerable to\ncarefully crafted prompts known as ``jailbreaks'', which can bypass protective\nmeasures and induce harmful behavior. Recent advancements in LLMs have\nincorporated moderation guardrails that can filter outputs, which trigger\nprocessing errors for certain malicious questions. Existing red-teaming\nbenchmarks often neglect to include questions that trigger moderation\nguardrails, making it difficult to evaluate jailbreak effectiveness. To address\nthis issue, we introduce JAMBench, a harmful behavior benchmark designed to\ntrigger and evaluate moderation guardrails. JAMBench involves 160 manually\ncrafted instructions covering four major risk categories at multiple severity\nlevels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against\nModeration), designed to attack moderation guardrails using jailbreak prefixes\nto bypass input-level filters and a fine-tuned shadow model functionally\nequivalent to the guardrail model to generate cipher characters to bypass\noutput-level filters. Our extensive experiments on four LLMs demonstrate that\nJAM achieves higher jailbreak success ($\\sim$ $\\times$ 19.88) and lower\nfiltered-out rates ($\\sim$ $\\times$ 1/6) than baselines.\n","authors":["Haibo Jin","Andy Zhou","Joe D. Menke","Haohan Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20413v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2405.20392v1","updated":"2024-05-30T18:04:58Z","published":"2024-05-30T18:04:58Z","title":"Can No-Reference Quality-Assessment Methods Serve as Perceptual Losses\n for Super-Resolution?","summary":" Perceptual losses play an important role in constructing\ndeep-neural-network-based methods by increasing the naturalness and realism of\nprocessed images and videos. Use of perceptual losses is often limited to\nLPIPS, a fullreference method. Even though deep no-reference\nimage-qualityassessment methods are excellent at predicting human judgment,\nlittle research has examined their incorporation in loss functions. This paper\ninvestigates direct optimization of several video-superresolution models using\nno-reference image-quality-assessment methods as perceptual losses. Our\nexperimental results show that straightforward optimization of these methods\nproduce artifacts, but a special training procedure can mitigate them.\n","authors":["Egor Kashkarov","Egor Chistov","Ivan Molodetskikh","Dmitriy Vatolin"],"pdf_url":"https://arxiv.org/pdf/2405.20392v1.pdf","comment":"4 pages, 3 figures. The first two authors contributed equally to this\n work"},{"id":"http://arxiv.org/abs/2405.20380v1","updated":"2024-05-30T18:00:03Z","published":"2024-05-30T18:00:03Z","title":"Gradient Inversion of Federated Diffusion Models","summary":" Diffusion models are becoming defector generative models, which generate\nexceptionally high-resolution image data. Training effective diffusion models\nrequire massive real data, which is privately owned by distributed parties.\nEach data party can collaboratively train diffusion models in a federated\nlearning manner by sharing gradients instead of the raw data. In this paper, we\nstudy the privacy leakage risk of gradient inversion attacks. First, we design\na two-phase fusion optimization, GIDM, to leverage the well-trained generative\nmodel itself as prior knowledge to constrain the inversion search (latent)\nspace, followed by pixel-wise fine-tuning. GIDM is shown to be able to\nreconstruct images almost identical to the original ones. Considering a more\nprivacy-preserving training scenario, we then argue that locally initialized\nprivate training noise $\\epsilon$ and sampling step t may raise additional\nchallenges for the inversion attack. To solve this, we propose a\ntriple-optimization GIDM+ that coordinates the optimization of the unknown\ndata, $\\epsilon$ and $t$. Our extensive evaluation results demonstrate the\nvulnerability of sharing gradient for data protection of diffusion models, even\nhigh-resolution images can be reconstructed with high quality.\n","authors":["Jiyue Huang","Chi Hong","Lydia Y. Chen","Stefanie Roos"],"pdf_url":"https://arxiv.org/pdf/2405.20380v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20364v1","updated":"2024-05-30T17:59:51Z","published":"2024-05-30T17:59:51Z","title":"Learning 3D Robotics Perception using Inductive Priors","summary":" Recent advances in deep learning have led to a data-centric intelligence i.e.\nartificially intelligent models unlocking the potential to ingest a large\namount of data and be really good at performing digital tasks such as\ntext-to-image generation, machine-human conversation, and image recognition.\nThis thesis covers the topic of learning with structured inductive bias and\npriors to design approaches and algorithms unlocking the potential of\nprinciple-centric intelligence. Prior knowledge (priors for short), often\navailable in terms of past experience as well as assumptions of how the world\nworks, helps the autonomous agent generalize better and adapt their behavior\nbased on past experience. In this thesis, I demonstrate the use of prior\nknowledge in three different robotics perception problems. 1. object-centric 3D\nreconstruction, 2. vision and language for decision-making, and 3. 3D scene\nunderstanding. To solve these challenging problems, I propose various sources\nof prior knowledge including 1. geometry and appearance priors from synthetic\ndata, 2. modularity and semantic map priors and 3. semantic, structural, and\ncontextual priors. I study these priors for solving robotics 3D perception\ntasks and propose ways to efficiently encode them in deep learning models. Some\npriors are used to warm-start the network for transfer learning, others are\nused as hard constraints to restrict the action space of robotics agents. While\nclassical techniques are brittle and fail to generalize to unseen scenarios and\ndata-centric approaches require a large amount of labeled data, this thesis\naims to build intelligent agents which require very-less real-world data or\ndata acquired only from simulation to generalize to highly dynamic and\ncluttered environments in novel simulations (i.e. sim2sim) or real-world unseen\nenvironments (i.e. sim2real) for a holistic scene understanding of the 3D\nworld.\n","authors":["Muhammad Zubair Irshad"],"pdf_url":"https://arxiv.org/pdf/2405.20364v1.pdf","comment":"Georgia Tech Ph.D. Thesis, December 2023. For more details:\n https://zubairirshad.com/"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2403.19546v2","updated":"2024-05-30T16:20:04Z","published":"2024-03-28T16:27:26Z","title":"Croissant: A Metadata Format for ML-Ready Datasets","summary":" Data is a critical resource for Machine Learning (ML), yet working with data\nremains a key friction point. This paper introduces Croissant, a metadata\nformat for datasets that simplifies how data is used by ML tools and\nframeworks. Croissant makes datasets more discoverable, portable and\ninteroperable, thereby addressing significant challenges in ML data management\nand responsible AI. Croissant is already supported by several popular dataset\nrepositories, spanning hundreds of thousands of datasets, ready to be loaded\ninto the most popular ML frameworks.\n","authors":["Mubashara Akhtar","Omar Benjelloun","Costanza Conforti","Pieter Gijsbers","Joan Giner-Miguelez","Nitisha Jain","Michael Kuchnik","Quentin Lhoest","Pierre Marcenac","Manil Maskey","Peter Mattson","Luis Oala","Pierre Ruyssen","Rajat Shinde","Elena Simperl","Goeffry Thomas","Slava Tykhonov","Joaquin Vanschoren","Jos van der Velde","Steffen Vogler","Carole-Jean Wu"],"pdf_url":"https://arxiv.org/pdf/2403.19546v2.pdf","comment":"Published in Proceedings of ACM SIGMOD/PODS'24 Data Management for\n End-to-End Machine Learning (DEEM) Workshop\n https://dl.acm.org/doi/10.1145/3650203.3663326"},{"id":"http://arxiv.org/abs/2405.20204v1","updated":"2024-05-30T16:07:54Z","published":"2024-05-30T16:07:54Z","title":"Jina CLIP: Your CLIP Model Is Also Your Text Retriever","summary":" Contrastive Language-Image Pretraining (CLIP) is widely used to train models\nto align images and texts in a common embedding space by mapping them to\nfixed-sized vectors. These models are key to multimodal information retrieval\nand related tasks. However, CLIP models generally underperform in text-only\ntasks compared to specialized text models. This creates inefficiencies for\ninformation retrieval systems that keep separate embeddings and models for\ntext-only and multimodal tasks. We propose a novel, multi-task contrastive\ntraining method to address this issue, which we use to train the jina-clip-v1\nmodel to achieve the state-of-the-art performance on both text-image and\ntext-text retrieval tasks.\n","authors":["Andreas Koukounas","Georgios Mastrapas","Michael Günther","Bo Wang","Scott Martens","Isabelle Mohr","Saba Sturua","Mohammad Kalim Akram","Joan Fontanals Martínez","Saahil Ognawala","Susana Guzman","Maximilian Werk","Nan Wang","Han Xiao"],"pdf_url":"https://arxiv.org/pdf/2405.20204v1.pdf","comment":"4 pages, ICML2024 workshop submission"},{"id":"http://arxiv.org/abs/2405.19149v2","updated":"2024-05-30T13:26:43Z","published":"2024-05-29T14:52:10Z","title":"CaLa: Complementary Association Learning for Augmenting Composed Image\n Retrieval","summary":" Composed Image Retrieval (CIR) involves searching for target images based on\nan image-text pair query. While current methods treat this as a query-target\nmatching problem, we argue that CIR triplets contain additional associations\nbeyond this primary relation. In our paper, we identify two new relations\nwithin triplets, treating each triplet as a graph node. Firstly, we introduce\nthe concept of text-bridged image alignment, where the query text serves as a\nbridge between the query image and the target image. We propose a hinge-based\ncross-attention mechanism to incorporate this relation into network learning.\nSecondly, we explore complementary text reasoning, considering CIR as a form of\ncross-modal retrieval where two images compose to reason about complementary\ntext. To integrate these perspectives effectively, we design a twin\nattention-based compositor. By combining these complementary associations with\nthe explicit query pair-target image relation, we establish a comprehensive set\nof constraints for CIR. Our framework, CaLa (Complementary Association Learning\nfor Augmenting Composed Image Retrieval), leverages these insights. We evaluate\nCaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating\nits superiority in composed image retrieval.\n","authors":["Xintong Jiang","Yaxiong Wang","Mengjian Li","Yujiao Wu","Bingwen Hu","Xueming Qian"],"pdf_url":"https://arxiv.org/pdf/2405.19149v2.pdf","comment":"To appear at SIGIR 2024. arXiv admin note: text overlap with\n arXiv:2309.02169"},{"id":"http://arxiv.org/abs/2312.12728v3","updated":"2024-05-30T11:25:08Z","published":"2023-12-20T02:55:15Z","title":"Lookahead: An Inference Acceleration Framework for Large Language Model\n with Lossless Generation Accuracy","summary":" As Large Language Models (LLMs) have made significant advancements across\nvarious tasks, such as question answering, translation, text summarization, and\ndialogue systems, the need for accuracy in information becomes crucial,\nespecially for serious financial products serving billions of users like\nAlipay. However, for a real-world product serving millions of users, the\ninference speed of LLMs becomes a critical factor compared to a mere\nexperimental model.\n Hence, this paper presents a generic framework for accelerating the inference\nprocess, resulting in a substantial increase in speed and cost reduction for\nour LLM-based scenarios, with lossless generation accuracy. In the traditional\ninference process, each token is generated sequentially by the LLM, leading to\na time consumption proportional to the number of generated tokens. To enhance\nthis process, our framework, named \\textit{lookahead}, introduces a\n\\textit{multi-branch} strategy. Instead of generating a single token at a time,\nwe propose a Trie-based retrieval and verification mechanism to be able to\naccept several tokens at a forward step. Our strategy offers two distinct\nadvantages: (1) it guarantees absolute correctness of the output, avoiding any\napproximation algorithms, and (2) the worst-case performance of our approach is\nequivalent to the conventional process. We conduct extensive experiments to\ndemonstrate the significant improvements achieved by applying our inference\nacceleration framework. Our framework is widely deployed in Alipay since April\n2023, and obtain remarkable 2.66x to 6.26x speedup. Our code is available at\nhttps://github.com/alipay/PainlessInferenceAcceleration.\n","authors":["Yao Zhao","Zhitian Xie","Chen Liang","Chenyi Zhuang","Jinjie Gu"],"pdf_url":"https://arxiv.org/pdf/2312.12728v3.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.09497v2","updated":"2024-05-30T10:03:27Z","published":"2023-10-14T05:20:02Z","title":"A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking\n with Large Language Models","summary":" We propose a novel zero-shot document ranking approach based on Large\nLanguage Models (LLMs): the Setwise prompting approach. Our approach\ncomplements existing prompting approaches for LLM-based zero-shot ranking:\nPointwise, Pairwise, and Listwise. Through the first-of-its-kind comparative\nevaluation within a consistent experimental framework and considering factors\nlike model size, token consumption, latency, among others, we show that\nexisting approaches are inherently characterised by trade-offs between\neffectiveness and efficiency. We find that while Pointwise approaches score\nhigh on efficiency, they suffer from poor effectiveness. Conversely, Pairwise\napproaches demonstrate superior effectiveness but incur high computational\noverhead. Our Setwise approach, instead, reduces the number of LLM inferences\nand the amount of prompt token consumption during the ranking procedure,\ncompared to previous methods. This significantly improves the efficiency of\nLLM-based zero-shot ranking, while also retaining high zero-shot ranking\neffectiveness. We make our code and results publicly available at\n\\url{https://github.com/ielab/llm-rankers}.\n","authors":["Shengyao Zhuang","Honglei Zhuang","Bevan Koopman","Guido Zuccon"],"pdf_url":"https://arxiv.org/pdf/2310.09497v2.pdf","comment":"SIGIR2024 full paper"},{"id":"http://arxiv.org/abs/2208.09612v2","updated":"2024-05-30T07:46:20Z","published":"2022-08-20T06:03:35Z","title":"AntCritic: Argument Mining for Free-Form and Visually-Rich Financial\n Comments","summary":" Argument mining aims to detect all possible argumentative components and\nidentify their relationships automatically. As a thriving task in natural\nlanguage processing, there has been a large amount of corpus for academic study\nand application development in this field. However, the research in this area\nis still constrained by the inherent limitations of existing datasets.\nSpecifically, all the publicly available datasets are relatively small in\nscale, and few of them provide information from other modalities to facilitate\nthe learning process. Moreover, the statements and expressions in these corpora\nare usually in a compact form, which restricts the generalization ability of\nmodels. To this end, we collect a novel dataset AntCritic to serve as a helpful\ncomplement to this area, which consists of about 10k free-form and\nvisually-rich financial comments and supports both argument component detection\nand argument relation prediction tasks. Besides, to cope with the challenges\nbrought by scenario expansion, we thoroughly explore the fine-grained relation\nprediction and structure reconstruction scheme and discuss the encoding\nmechanism for visual styles and layouts. On this basis, we design two simple\nbut effective model architectures and conduct various experiments on this\ndataset to provide benchmark performances as a reference and verify the\npracticability of our proposed architecture. We release our data and code in\nthis link, and this dataset follows CC BY-NC-ND 4.0 license.\n","authors":["Huadai Liu","Wenqiang Xu","Xuan Lin","Jingjing Huo","Hong Chen","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2208.09612v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19749v1","updated":"2024-05-30T06:52:01Z","published":"2024-05-30T06:52:01Z","title":"Generating Query Recommendations via LLMs","summary":" Query recommendation systems are ubiquitous in modern search engines,\nassisting users in producing effective queries to meet their information needs.\nHowever, these systems require a large amount of data to produce good\nrecommendations, such as a large collection of documents to index and query\nlogs. In particular, query logs and user data are not available in cold start\nscenarios. Query logs are expensive to collect and maintain and require complex\nand time-consuming cascading pipelines for creating, combining, and ranking\nrecommendations. To address these issues, we frame the query recommendation\nproblem as a generative task, proposing a novel approach called Generative\nQuery Recommendation (GQR). GQR uses an LLM as its foundation and does not\nrequire to be trained or fine-tuned to tackle the query recommendation problem.\nWe design a prompt that enables the LLM to understand the specific\nrecommendation task, even using a single example. We then improved our system\nby proposing a version that exploits query logs called Retriever-Augmented GQR\n(RA-GQR). RA-GQr dynamically composes its prompt by retrieving similar queries\nfrom query logs. GQR approaches reuses a pre-existing neural architecture\nresulting in a simpler and more ready-to-market approach, even in a cold start\nscenario. Our proposed GQR obtains state-of-the-art performance in terms of\nNDCG@10 and clarity score against two commercial search engines and the\nprevious state-of-the-art approach on the Robust04 and ClueWeb09B collections,\nimproving on average the NDCG@10 performance up to ~4% on Robust04 and\nClueWeb09B w.r.t the previous best competitor. RA-GQR further improve the\nNDCG@10 obtaining an increase of ~11%, ~6\\% on Robust04 and ClueWeb09B w.r.t\nthe best competitor. Furthermore, our system obtained ~59% of user preferences\nin a blind user study, proving that our method produces the most engaging\nqueries.\n","authors":["Andrea Bacciu","Enrico Palumbo","Andreas Damianou","Nicola Tonellotto","Fabrizio Silvestri"],"pdf_url":"https://arxiv.org/pdf/2405.19749v1.pdf","comment":"Generating Query Recommendations via LLMs"},{"id":"http://arxiv.org/abs/2306.08121v2","updated":"2024-05-30T05:53:39Z","published":"2023-06-13T20:34:15Z","title":"Better Generalization with Semantic IDs: A Case Study in Ranking for\n Recommendations","summary":" Randomly-hashed item ids are used ubiquitously in recommendation models.\nHowever, the learned representations from random hashing prevents\ngeneralization across similar items, causing problems of learning unseen and\nlong-tail items, especially when item corpus is large, power-law distributed,\nand evolving dynamically. In this paper, we propose using content-derived\nfeatures as a replacement for random ids. We show that simply replacing ID\nfeatures with content-based embeddings can cause a drop in quality due to\nreduced memorization capability. To strike a good balance of memorization and\ngeneralization, we propose to use Semantic IDs -- a compact discrete item\nrepresentation learned from frozen content embeddings using RQ-VAE that\ncaptures the hierarchy of concepts in items -- as a replacement for random item\nids. Similar to content embeddings, the compactness of Semantic IDs poses a\nproblem of easy adaption in recommendation models. We propose novel methods for\nadapting Semantic IDs in industry-scale ranking models, through hashing\nsub-pieces of of the Semantic-ID sequences. In particular, we find that the\nSentencePiece model that is commonly used in LLM tokenization outperforms\nmanually crafted pieces such as N-grams. To the end, we evaluate our approaches\nin a real-world ranking model for YouTube recommendations. Our experiments\ndemonstrate that Semantic IDs can replace the direct use of video IDs by\nimproving the generalization ability on new and long-tail item slices without\nsacrificing overall model quality.\n","authors":["Anima Singh","Trung Vu","Nikhil Mehta","Raghunandan Keshavan","Maheswaran Sathiamoorthy","Yilin Zheng","Lichan Hong","Lukasz Heldt","Li Wei","Devansh Tandon","Ed H. Chi","Xinyang Yi"],"pdf_url":"https://arxiv.org/pdf/2306.08121v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19689v1","updated":"2024-05-30T05:04:01Z","published":"2024-05-30T05:04:01Z","title":"Uncertainty-aware sign language video retrieval with probability\n distribution modeling","summary":" Sign language video retrieval plays a key role in facilitating information\naccess for the deaf community. Despite significant advances in video-text\nretrieval, the complexity and inherent uncertainty of sign language preclude\nthe direct application of these techniques. Previous methods achieve the\nmapping between sign language video and text through fine-grained modal\nalignment. However, due to the scarcity of fine-grained annotation, the\nuncertainty inherent in sign language video is underestimated, limiting the\nfurther development of sign language retrieval tasks. To address this\nchallenge, we propose a novel Uncertainty-aware Probability Distribution\nRetrieval (UPRet), that conceptualizes the mapping process of sign language\nvideo and text in terms of probability distributions, explores their potential\ninterrelationships, and enables flexible mappings. Experiments on three\nbenchmarks demonstrate the effectiveness of our method, which achieves\nstate-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and\nCSL-Daily (78.4%).\n","authors":["Xuan Wu","Hongxiang Li","Yuanjiang Luo","Xuxin Cheng","Xianwei Zhuang","Meng Cao","Keren Fu"],"pdf_url":"https://arxiv.org/pdf/2405.19689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19612v1","updated":"2024-05-30T02:00:03Z","published":"2024-05-30T02:00:03Z","title":"Keyword-driven Retrieval-Augmented Large Language Models for Cold-start\n User Recommendations","summary":" Recent advancements in Large Language Models (LLMs) have shown significant\npotential in enhancing recommender systems. However, addressing the cold-start\nrecommendation problem, where users lack historical data, remains a\nconsiderable challenge. In this paper, we introduce KALM4Rec (Keyword-driven\nRetrieval-Augmented Large Language Models for Cold-start User Recommendations),\na novel framework specifically designed to tackle this problem by requiring\nonly a few input keywords from users in a practical scenario of cold-start user\nrestaurant recommendations. KALM4Rec operates in two main stages: candidates\nretrieval and LLM-based candidates re-ranking. In the first stage,\nkeyword-driven retrieval models are used to identify potential candidates,\naddressing LLMs' limitations in processing extensive tokens and reducing the\nrisk of generating misleading information. In the second stage, we employ LLMs\nwith various prompting strategies, including zero-shot and few-shot techniques,\nto re-rank these candidates by integrating multiple examples directly into the\nLLM prompts. Our evaluation, using a Yelp restaurant dataset with user reviews\nfrom three English-speaking cities, shows that our proposed framework\nsignificantly improves recommendation quality. Specifically, the integration of\nin-context instructions with LLMs for re-ranking markedly enhances the\nperformance of the cold-start user recommender system.\n","authors":["Hai-Dang Kieu","Minh Duc Nguyen","Thanh-Son Nguyen","Dung D. Le"],"pdf_url":"https://arxiv.org/pdf/2405.19612v1.pdf","comment":"10 pages, 10 figures, 4 tables"},{"id":"http://arxiv.org/abs/2405.20245v1","updated":"2024-05-30T16:54:42Z","published":"2024-05-30T16:54:42Z","title":"Retrieval Augmented Structured Generation: Business Document Information\n Extraction As Tool Use","summary":" Business Document Information Extraction (BDIE) is the problem of\ntransforming a blob of unstructured information (raw text, scanned documents,\netc.) into a structured format that downstream systems can parse and use. It\nhas two main tasks: Key-Information Extraction (KIE) and Line Items Recognition\n(LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem,\nwhere the tools are these downstream systems. We then present Retrieval\nAugmented Structured Generation (RASG), a novel general framework for BDIE that\nachieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE\nbenchmarks.\n The contributions of this paper are threefold: (1) We show, with ablation\nbenchmarks, that Large Language Models (LLMs) with RASG are already competitive\nwith or surpasses current SOTA Large Multimodal Models (LMMs) without RASG on\nBDIE benchmarks. (2) We propose a new metric class for Line Items Recognition,\nGeneral Line Items Recognition Metric (GLIRM), that is more aligned with\npractical BDIE use cases compared to existing metrics, such as ANLS*, DocILE,\nand GriTS. (3) We provide a heuristic algorithm for backcalculating bounding\nboxes of predicted line items and tables without the need for vision encoders.\nFinally, we claim that, while LMMs might sometimes offer marginal performance\nbenefits, LLMs + RASG is oftentimes superior given real-world applications and\nconstraints of BDIE.\n","authors":["Franz Louis Cesista","Rui Aguiar","Jason Kim","Paolo Acilo"],"pdf_url":"https://arxiv.org/pdf/2405.20245v1.pdf","comment":"Accepted by IEEE 7th International Conference on Multimedia\n Information Processing and Retrieval (MIPR), 2024"},{"id":"http://arxiv.org/abs/2405.20468v1","updated":"2024-05-30T20:34:37Z","published":"2024-05-30T20:34:37Z","title":"Extending the Massive Text Embedding Benchmark to French","summary":" In recent years, numerous embedding models have been made available and\nwidely used for various NLP tasks. Choosing a model that performs well for\nseveral tasks in English has been largely simplified by the Massive Text\nEmbedding Benchmark (MTEB), but extensions to other languages remain\nchallenging. This is why we expand MTEB to propose the first massive benchmark\nof sentence embeddings for French. Not only we gather 22 existing datasets in\nan easy-to-use interface, but we also create three new French datasets for a\nglobal evaluation over 8 different tasks. We perform a large scale comparison\nwith 46 carefully selected embedding models, conduct comprehensive statistical\ntests, and analyze the correlation between model performance and many of their\ncharacteristics. We find out that even if no model is the best on all tasks,\nlarge multilingual models pre-trained on sentence similarity perform\nparticularly well. Our work comes with open-source code, new datasets and a\npublic leaderboard.\n","authors":["Mathieu Ciancone","Imene Kerboua","Marion Schaeffer","Wissam Siblini"],"pdf_url":"https://arxiv.org/pdf/2405.20468v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20389v1","updated":"2024-05-30T18:00:21Z","published":"2024-05-30T18:00:21Z","title":"Designing an Evaluation Framework for Large Language Models in Astronomy\n Research","summary":" Large Language Models (LLMs) are shifting how scientific research is done. It\nis imperative to understand how researchers interact with these models and how\nscientific sub-communities like astronomy might benefit from them. However,\nthere is currently no standard for evaluating the use of LLMs in astronomy.\nTherefore, we present the experimental design for an evaluation study on how\nastronomy researchers interact with LLMs. We deploy a Slack chatbot that can\nanswer queries from users via Retrieval-Augmented Generation (RAG); these\nresponses are grounded in astronomy papers from arXiv. We record and anonymize\nuser questions and chatbot answers, user upvotes and downvotes to LLM\nresponses, user feedback to the LLM, and retrieved documents and similarity\nscores with the query. Our data collection method will enable future dynamic\nevaluations of LLM tools for astronomy.\n","authors":["John F. Wu","Alina Hyk","Kiera McCormick","Christine Ye","Simone Astarita","Elina Baral","Jo Ciuca","Jesse Cranney","Anjalie Field","Kartheik Iyer","Philipp Koehn","Jenn Kotler","Sandor Kruk","Michelle Ntampaka","Charles O'Neill","Joshua E. G. Peek","Sanjib Sharma","Mikaeel Yunus"],"pdf_url":"https://arxiv.org/pdf/2405.20389v1.pdf","comment":"7 pages, 3 figures. Code available at\n https://github.com/jsalt2024-evaluating-llms-for-astronomy/astro-arxiv-bot"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2405.20343v1","updated":"2024-05-30T17:59:54Z","published":"2024-05-30T17:59:54Z","title":"Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single\n Image","summary":" In this work, we introduce Unique3D, a novel image-to-3D framework for\nefficiently generating high-quality 3D meshes from single-view images,\nfeaturing state-of-the-art generation fidelity and strong generalizability.\nPrevious methods based on Score Distillation Sampling (SDS) can produce\ndiversified 3D results by distilling 3D knowledge from large 2D diffusion\nmodels, but they usually suffer from long per-case optimization time with\ninconsistent issues. Recent works address the problem and generate better 3D\nresults either by finetuning a multi-view diffusion model or training a fast\nfeed-forward model. However, they still lack intricate textures and complex\ngeometries due to inconsistency and limited generated resolution. To\nsimultaneously achieve high fidelity, consistency, and efficiency in single\nimage-to-3D, we propose a novel framework Unique3D that includes a multi-view\ndiffusion model with a corresponding normal diffusion model to generate\nmulti-view images with their normal maps, a multi-level upscale process to\nprogressively improve the resolution of generated orthographic multi-views, as\nwell as an instant and consistent mesh reconstruction algorithm called ISOMER,\nwhich fully integrates the color and geometric priors into mesh results.\nExtensive experiments demonstrate that our Unique3D significantly outperforms\nother image-to-3D baselines in terms of geometric and textural details.\n","authors":["Kailu Wu","Fangfu Liu","Zhihan Cai","Runjie Yan","Hanyang Wang","Yating Hu","Yueqi Duan","Kaisheng Ma"],"pdf_url":"https://arxiv.org/pdf/2405.20343v1.pdf","comment":"Project page: https://wukailu.github.io/Unique3D"},{"id":"http://arxiv.org/abs/2405.20341v1","updated":"2024-05-30T17:59:51Z","published":"2024-05-30T17:59:51Z","title":"From Zero to Hero: Cold-Start Anomaly Detection","summary":" When first deploying an anomaly detection system, e.g., to detect\nout-of-scope queries in chatbots, there are no observed data, making\ndata-driven approaches ineffective. Zero-shot anomaly detection methods offer a\nsolution to such \"cold-start\" cases, but unfortunately they are often not\naccurate enough. This paper studies the realistic but underexplored cold-start\nsetting where an anomaly detection model is initialized using zero-shot\nguidance, but subsequently receives a small number of contaminated observations\n(namely, that may include anomalies). The goal is to make efficient use of both\nthe zero-shot guidance and the observations. We propose ColdFusion, a method\nthat effectively adapts the zero-shot anomaly detector to contaminated\nobservations. To support future development of this new setting, we propose an\nevaluation suite consisting of evaluation protocols and metrics.\n","authors":["Tal Reiss","George Kour","Naama Zwerdling","Ateret Anaby-Tavor","Yedid Hoshen"],"pdf_url":"https://arxiv.org/pdf/2405.20341v1.pdf","comment":"ACL 2024. Our code is available at\n https://github.com/talreiss/ColdFusion"},{"id":"http://arxiv.org/abs/2405.20331v1","updated":"2024-05-30T17:59:04Z","published":"2024-05-30T17:59:04Z","title":"CoSy: Evaluating Textual Explanations of Neurons","summary":" A crucial aspect of understanding the complex nature of Deep Neural Networks\n(DNNs) is the ability to explain learned concepts within their latent\nrepresentations. While various methods exist to connect neurons to textual\ndescriptions of human-understandable concepts, evaluating the quality of these\nexplanation methods presents a major challenge in the field due to a lack of\nunified, general-purpose quantitative evaluation. In this work, we introduce\nCoSy (Concept Synthesis) -- a novel, architecture-agnostic framework to\nevaluate the quality of textual explanations for latent neurons. Given textual\nexplanations, our proposed framework leverages a generative model conditioned\non textual input to create data points representing the textual explanation.\nThen, the neuron's response to these explanation data points is compared with\nthe response to control data points, providing a quality estimate of the given\nexplanation. We ensure the reliability of our proposed framework in a series of\nmeta-evaluation experiments and demonstrate practical value through insights\nfrom benchmarking various concept-based textual explanation methods for\nComputer Vision tasks, showing that tested explanation methods significantly\ndiffer in quality.\n","authors":["Laura Kopf","Philine Lou Bommer","Anna Hedström","Sebastian Lapuschkin","Marina M. -C. Höhne","Kirill Bykov"],"pdf_url":"https://arxiv.org/pdf/2405.20331v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2405.20324v1","updated":"2024-05-30T17:57:26Z","published":"2024-05-30T17:57:26Z","title":"Don't drop your samples! Coherence-aware training benefits Conditional\n diffusion","summary":" Conditional diffusion models are powerful generative models that can leverage\nvarious types of conditional information, such as class labels, segmentation\nmasks, or text captions. However, in many real-world scenarios, conditional\ninformation may be noisy or unreliable due to human annotation errors or weak\nalignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a\nnovel method that integrates coherence in conditional information into\ndiffusion models, allowing them to learn from noisy annotations without\ndiscarding data. We assume that each data point has an associated coherence\nscore that reflects the quality of the conditional information. We then\ncondition the diffusion model on both the conditional information and the\ncoherence score. In this way, the model learns to ignore or discount the\nconditioning when the coherence is low. We show that CAD is theoretically sound\nand empirically effective on various conditional generation tasks. Moreover, we\nshow that leveraging coherence generates realistic and diverse samples that\nrespect conditional information better than models trained on cleaned datasets\nwhere samples with low coherence have been discarded.\n","authors":["Nicolas Dufour","Victor Besnier","Vicky Kalogeiton","David Picard"],"pdf_url":"https://arxiv.org/pdf/2405.20324v1.pdf","comment":"Accepted at CVPR 2024 as a Highlight. Project page:\n https://nicolas-dufour.github.io/cad.html"},{"id":"http://arxiv.org/abs/2405.20321v1","updated":"2024-05-30T17:56:54Z","published":"2024-05-30T17:56:54Z","title":"Vision-based Manipulation from Single Human Video with Open-World Object\n Graphs","summary":" We present an object-centric approach to empower robots to learn vision-based\nmanipulation skills from human videos. We investigate the problem of imitating\nrobot manipulation from a single human video in the open-world setting, where a\nrobot must learn to manipulate novel objects from one video demonstration. We\nintroduce ORION, an algorithm that tackles the problem by extracting an\nobject-centric manipulation plan from a single RGB-D video and deriving a\npolicy that conditions on the extracted plan. Our method enables the robot to\nlearn from videos captured by daily mobile devices such as an iPad and\ngeneralize the policies to deployment environments with varying visual\nbackgrounds, camera angles, spatial layouts, and novel object instances. We\nsystematically evaluate our method on both short-horizon and long-horizon\ntasks, demonstrating the efficacy of ORION in learning from a single human\nvideo in the open world. Videos can be found in the project website\nhttps://ut-austin-rpl.github.io/ORION-release.\n","authors":["Yifeng Zhu","Arisrei Lim","Peter Stone","Yuke Zhu"],"pdf_url":"https://arxiv.org/pdf/2405.20321v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20320v1","updated":"2024-05-30T17:56:04Z","published":"2024-05-30T17:56:04Z","title":"Improving the Training of Rectified Flows","summary":" Diffusion models have shown great promise for image and video generation, but\nsampling from state-of-the-art models requires expensive numerical integration\nof a generative ODE. One approach for tackling this problem is rectified flows,\nwhich iteratively learn smooth ODE paths that are less susceptible to\ntruncation error. However, rectified flows still require a relatively large\nnumber of function evaluations (NFEs). In this work, we propose improved\ntechniques for training rectified flows, allowing them to compete with\nknowledge distillation methods even in the low NFE setting. Our main insight is\nthat under realistic settings, a single iteration of the Reflow algorithm for\ntraining rectified flows is sufficient to learn nearly straight trajectories;\nhence, the current practice of using multiple Reflow iterations is unnecessary.\nWe thus propose techniques to improve one-round training of rectified flows,\nincluding a U-shaped timestep distribution and LPIPS-Huber premetric. With\nthese techniques, we improve the FID of the previous 2-rectified flow by up to\n72% in the 1 NFE setting on CIFAR-10. On ImageNet 64$\\times$64, our improved\nrectified flow outperforms the state-of-the-art distillation methods such as\nconsistency distillation and progressive distillation in both one-step and\ntwo-step settings and rivals the performance of improved consistency training\n(iCT) in FID. Code is available at https://github.com/sangyun884/rfpp.\n","authors":["Sangyun Lee","Zinan Lin","Giulia Fanti"],"pdf_url":"https://arxiv.org/pdf/2405.20320v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20318v1","updated":"2024-05-30T17:55:28Z","published":"2024-05-30T17:55:28Z","title":"CausalQuest: Collecting Natural Causal Questions for AI Agents","summary":" Humans have an innate drive to seek out causality. Whether fuelled by\ncuriosity or specific goals, we constantly question why things happen, how they\nare interconnected, and many other related phenomena. To develop AI agents\ncapable of addressing this natural human quest for causality, we urgently need\na comprehensive dataset of natural causal questions. Unfortunately, existing\ndatasets either contain only artificially-crafted questions that do not reflect\nreal AI usage scenarios or have limited coverage of questions from specific\nsources. To address this gap, we present CausalQuest, a dataset of 13,500\nnaturally occurring questions sourced from social networks, search engines, and\nAI assistants. We formalize the definition of causal questions and establish a\ntaxonomy for finer-grained classification. Through a combined effort of human\nannotators and large language models (LLMs), we carefully label the dataset. We\nfind that 42% of the questions humans ask are indeed causal, with the majority\nseeking to understand the causes behind given effects. Using this dataset, we\ntrain efficient classifiers (up to 2.85B parameters) for the binary task of\nidentifying causal questions, achieving high performance with F1 scores of up\nto 0.877. We conclude with a rich set of future research directions that can\nbuild upon our data and models.\n","authors":["Roberto Ceraolo","Dmitrii Kharlapenko","Amélie Reymond","Rada Mihalcea","Mrinmaya Sachan","Bernhard Schölkopf","Zhijing Jin"],"pdf_url":"https://arxiv.org/pdf/2405.20318v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09919v3","updated":"2024-05-30T17:55:19Z","published":"2024-03-14T23:40:56Z","title":"Recurrent Drafter for Fast Speculative Decoding in Large Language Models","summary":" In this paper, we introduce an improved approach of speculative decoding\naimed at enhancing the efficiency of serving large language models. Our method\ncapitalizes on the strengths of two established techniques: the classic\ntwo-model speculative decoding approach, and the more recent single-model\napproach, Medusa. Drawing inspiration from Medusa, our approach adopts a\nsingle-model strategy for speculative decoding. However, our method\ndistinguishes itself by employing a single, lightweight draft head with a\nrecurrent dependency design, akin in essence to the small, draft model uses in\nclassic speculative decoding, but without the complexities of the full\ntransformer architecture. And because of the recurrent dependency, we can use\nbeam search to swiftly filter out undesired candidates with the draft head. The\noutcome is a method that combines the simplicity of single-model design and\navoids the need to create a data-dependent tree attention structure only for\ninference in Medusa. We empirically demonstrate the effectiveness of the\nproposed method on several popular open source language models, along with a\ncomprehensive analysis of the trade-offs involved in adopting this approach.\n","authors":["Aonan Zhang","Chong Wang","Yi Wang","Xuanyu Zhang","Yunfei Cheng"],"pdf_url":"https://arxiv.org/pdf/2403.09919v3.pdf","comment":"11 pages, 6 figures"},{"id":"http://arxiv.org/abs/2402.05928v2","updated":"2024-05-30T17:54:02Z","published":"2024-02-08T18:57:42Z","title":"Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation\n for the Square Loss","summary":" In this work, we study statistical learning with dependent ($\\beta$-mixing)\ndata and square loss in a hypothesis class $\\mathscr{F}\\subset L_{\\Psi_p}$\nwhere $\\Psi_p$ is the norm $\\|f\\|_{\\Psi_p} \\triangleq \\sup_{m\\geq 1} m^{-1/p}\n\\|f\\|_{L^m} $ for some $p\\in [2,\\infty]$. Our inquiry is motivated by the\nsearch for a sharp noise interaction term, or variance proxy, in learning with\ndependent data. Absent any realizability assumption, typical non-asymptotic\nresults exhibit variance proxies that are deflated multiplicatively by the\nmixing time of the underlying covariates process. We show that whenever the\ntopologies of $L^2$ and $\\Psi_p$ are comparable on our hypothesis class\n$\\mathscr{F}$ -- that is, $\\mathscr{F}$ is a weakly sub-Gaussian class:\n$\\|f\\|_{\\Psi_p} \\lesssim \\|f\\|_{L^2}^\\eta$ for some $\\eta\\in (0,1]$ -- the\nempirical risk minimizer achieves a rate that only depends on the complexity of\nthe class and second order statistics in its leading term. Our result holds\nwhether the problem is realizable or not and we refer to this as a \\emph{near\nmixing-free rate}, since direct dependence on mixing is relegated to an\nadditive higher order term. We arrive at our result by combining the above\nnotion of a weakly sub-Gaussian class with mixed tail generic chaining. This\ncombination allows us to compute sharp, instance-optimal rates for a wide range\nof problems. Examples that satisfy our framework include sub-Gaussian linear\nregression, more general smoothly parameterized function classes, finite\nhypothesis classes, and bounded smoothness classes.\n","authors":["Ingvar Ziemann","Stephen Tu","George J. Pappas","Nikolai Matni"],"pdf_url":"https://arxiv.org/pdf/2402.05928v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20313v1","updated":"2024-05-30T17:53:50Z","published":"2024-05-30T17:53:50Z","title":"Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone\n Generation","summary":" Proteins are essential for almost all biological processes and derive their\ndiverse functions from complex 3D structures, which are in turn determined by\ntheir amino acid sequences. In this paper, we exploit the rich biological\ninductive bias of amino acid sequences and introduce FoldFlow-2, a novel\nsequence-conditioned SE(3)-equivariant flow matching model for protein\nstructure generation. FoldFlow-2 presents substantial new architectural\nfeatures over the previous FoldFlow family of models including a protein large\nlanguage model to encode sequence, a new multi-modal fusion trunk that combines\nstructure and sequence representations, and a geometric transformer based\ndecoder. To increase diversity and novelty of generated samples -- crucial for\nde-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an\norder of magnitude larger than PDB datasets of prior works, containing both\nknown proteins in PDB and high-quality synthetic structures achieved through\nfiltering. We further demonstrate the ability to align FoldFlow-2 to arbitrary\nrewards, e.g. increasing secondary structures diversity, by introducing a\nReinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2\noutperforms previous state-of-the-art protein structure-based generative\nmodels, improving over RFDiffusion in terms of unconditional generation across\nall metrics including designability, diversity, and novelty across all protein\nlengths, as well as exhibiting generalization on the task of equilibrium\nconformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2\nmakes progress on challenging conditional design tasks such as designing\nscaffolds for the VHH nanobody.\n","authors":["Guillaume Huguet","James Vuckovic","Kilian Fatras","Eric Thibodeau-Laufer","Pablo Lemos","Riashat Islam","Cheng-Hao Liu","Jarrid Rector-Brooks","Tara Akhound-Sadegh","Michael Bronstein","Alexander Tong","Avishek Joey Bose"],"pdf_url":"https://arxiv.org/pdf/2405.20313v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2405.20309v1","updated":"2024-05-30T17:52:36Z","published":"2024-05-30T17:52:36Z","title":"Large Language Models Can Self-Improve At Web Agent Tasks","summary":" Training models to act as agents that can effectively navigate and perform\nactions in a complex environment, such as a web browser, has typically been\nchallenging due to lack of training data. Large language models (LLMs) have\nrecently demonstrated some capability to navigate novel environments as agents\nin a zero-shot or few-shot fashion, purely guided by natural language\ninstructions as prompts. Recent research has also demonstrated LLMs have the\ncapability to exceed their base performance through self-improvement, i.e.\nfine-tuning on data generated by the model itself. In this work, we explore the\nextent to which LLMs can self-improve their performance as agents in\nlong-horizon tasks in a complex environment using the WebArena benchmark. In\nWebArena, an agent must autonomously navigate and perform actions on web pages\nto achieve a specified objective. We explore fine-tuning on three distinct\nsynthetic training data mixtures and achieve a 31\\% improvement in task\ncompletion rate over the base model on the WebArena benchmark through a\nself-improvement procedure. We additionally contribute novel evaluation metrics\nfor assessing the performance, robustness, capabilities, and quality of\ntrajectories of our fine-tuned agent models to a greater degree than simple,\naggregate-level benchmark scores currently used to measure self-improvement.\n","authors":["Ajay Patel","Markus Hofmarcher","Claudiu Leoveanu-Condrei","Marius-Constantin Dinu","Chris Callison-Burch","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2405.20309v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20304v1","updated":"2024-05-30T17:50:04Z","published":"2024-05-30T17:50:04Z","title":"Group Robust Preference Optimization in Reward-free RLHF","summary":" Adapting large language models (LLMs) for specific tasks usually involves\nfine-tuning through reinforcement learning with human feedback (RLHF) on\npreference data. While these data often come from diverse labelers' groups\n(e.g., different demographics, ethnicities, company teams, etc.), traditional\nRLHF approaches adopt a \"one-size-fits-all\" approach, i.e., they\nindiscriminately assume and optimize a single preference model, thus not being\nrobust to unique characteristics and needs of the various groups. To address\nthis limitation, we propose a novel Group Robust Preference Optimization (GRPO)\nmethod to align LLMs to individual groups' preferences robustly. Our approach\nbuilds upon reward-free direct preference optimization methods, but unlike\nprevious approaches, it seeks a robust policy which maximizes the worst-case\ngroup performance. To achieve this, GRPO adaptively and sequentially weights\nthe importance of different groups, prioritizing groups with worse cumulative\nloss. We theoretically study the feasibility of GRPO and analyze its\nconvergence for the log-linear policy class. By fine-tuning LLMs with GRPO\nusing diverse group-based global opinion data, we significantly improved\nperformance for the worst-performing groups, reduced loss imbalances across\ngroups, and improved probability accuracies compared to non-robust baselines.\n","authors":["Shyam Sundhar Ramesh","Yifan Hu","Iason Chaimalas","Viraj Mehta","Pier Giuseppe Sessa","Haitham Bou Ammar","Ilija Bogunovic"],"pdf_url":"https://arxiv.org/pdf/2405.20304v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2403.01643v2","updated":"2024-05-30T17:46:22Z","published":"2024-03-03T23:40:35Z","title":"You Need to Pay Better Attention: Rethinking the Mathematics of\n Attention Mechanism","summary":" Scaled Dot Product Attention (SDPA) is the backbone of many modern\ndeep-learning models. It is so versatile that it has been used in natural\nlanguage, vision, and multi-modal domains with very little change compared to\nits original formulation. This paper discusses why the current formulation is\ninefficient by delving into the mathematical details of the attention\nmechanism. We propose three improvements to mitigate these inefficiencies,\nthereby, introducing three enhanced attention mechanisms: Optimised, Efficient,\nand Super Attention. Optimised and Efficient Attention have one and two matrix\nmultiplications fewer per head, respectively, and 25% and 50% fewer parameters,\nrespectively, than standard SDPA, but perform similarly to standard SDPA in\nboth vision and natural language tasks. They can be used in all applications\nwhere SDPA is used while offering smaller model sizes and faster training and\ninference without noticeable loss in performance. Super Attention introduces a\nnew linear transformation on the values, transforming them from the left. It\noutperforms standard SPDA on vision and natural language tasks by up to 17%\nwhile having one fewer matrix multiplication per head and 25% fewer parameters\nthan standard SDPA. Consequently, it is also faster than standard SDPA. Super\nAttention is ideal in applications where the attention layer's context length\nis fixed, such as Vision Transformers. In addition to providing mathematical\nreasoning, we evaluate the presented attention mechanisms on several datasets\nincluding MNIST, CIFAR100, ImageNet, IMDB Movie Reviews, and Amazon Reviews\ndatasets, as well as combined Europarl and Anki English-Spanish datasets for\nneural machine translation.\n","authors":["Mehran Hosseini","Peyman Hosseini"],"pdf_url":"https://arxiv.org/pdf/2403.01643v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20289v1","updated":"2024-05-30T17:40:11Z","published":"2024-05-30T17:40:11Z","title":"DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music\n Generation","summary":" Controllable music generation methods are critical for human-centered\nAI-based music creation, but are currently limited by speed, quality, and\ncontrol design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in\nparticular, offers state-of-the-art results, but is over 10x slower than\nreal-time, limiting practical use. We propose Distilled Diffusion\nInference-Time T -Optimization (or DITTO-2), a new method to speed up\ninference-time optimization-based control and unlock faster-than-real-time\ngeneration for a wide-variety of applications such as music inpainting,\noutpainting, intensity, melody, and musical structure control. Our method works\nby (1) distilling a pre-trained diffusion model for fast sampling via an\nefficient, modified consistency or consistency trajectory distillation process\n(2) performing inference-time optimization using our distilled model with\none-step sampling as an efficient surrogate optimization task and (3) running a\nfinal multi-step sampling generation (decoding) using our estimated noise\nlatents for best-quality, fast, controllable generation. Through thorough\nevaluation, we find our method not only speeds up generation over 10-20x, but\nsimultaneously improves control adherence and generation quality all at once.\nFurthermore, we apply our approach to a new application of maximizing text\nadherence (CLAP score) and show we can convert an unconditional diffusion model\nwithout text inputs into a model that yields state-of-the-art text control.\nSound examples can be found at https://ditto-music.github.io/ditto2/.\n","authors":["Zachary Novack","Julian McAuley","Taylor Berg-Kirkpatrick","Nicholas Bryan"],"pdf_url":"https://arxiv.org/pdf/2405.20289v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20287v1","updated":"2024-05-30T17:39:15Z","published":"2024-05-30T17:39:15Z","title":"Flexible SE(2) graph neural networks with applications to PDE surrogates","summary":" This paper presents a novel approach for constructing graph neural networks\nequivariant to 2D rotations and translations and leveraging them as PDE\nsurrogates on non-gridded domains. We show that aligning the representations\nwith the principal axis allows us to sidestep many constraints while preserving\nSE(2) equivariance. By applying our model as a surrogate for fluid flow\nsimulations and conducting thorough benchmarks against non-equivariant models,\nwe demonstrate significant gains in terms of both data efficiency and accuracy.\n","authors":["Maria Bånkestad","Olof Mogren","Aleksis Pirinen"],"pdf_url":"https://arxiv.org/pdf/2405.20287v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2402.05140v2","updated":"2024-05-30T17:37:06Z","published":"2024-02-06T20:11:54Z","title":"Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains","summary":" Large Language Models (LLMs) have demonstrated remarkable proficiency in\nunderstanding and generating natural language. However, their capabilities wane\nin highly specialized domains underrepresented in the pretraining corpus, such\nas physical and biomedical sciences. This work explores how to repurpose\ngeneral LLMs into effective task solvers for specialized domains. We introduce\na novel, model-agnostic framework for learning custom input tags, which are\nparameterized as continuous vectors appended to the LLM's embedding layer, to\ncondition the LLM. We design two types of input tags: domain tags are used to\ndelimit specialized representations (e.g., chemical formulas) and provide\ndomain-relevant context; function tags are used to represent specific functions\n(e.g., predicting molecular properties) and compress function-solving\ninstructions. We develop a three-stage protocol to learn these tags using\nauxiliary data and domain knowledge. By explicitly disentangling task domains\nfrom task functions, our method enables zero-shot generalization to unseen\nproblems through diverse combinations of the input tags. It also boosts LLM's\nperformance in various specialized domains, such as predicting protein or\nchemical properties and modeling drug-target interactions, outperforming expert\nmodels tailored to these tasks.\n","authors":["Junhong Shen","Neil Tenenholtz","James Brian Hall","David Alvarez-Melis","Nicolo Fusi"],"pdf_url":"https://arxiv.org/pdf/2402.05140v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20278v1","updated":"2024-05-30T17:32:46Z","published":"2024-05-30T17:32:46Z","title":"Length independent generalization bounds for deep SSM architectures with\n stability constraints","summary":" Many state-of-the-art models trained on long-range sequences, for example S4,\nS5 or LRU, are made of sequential blocks combining State-Space Models (SSMs)\nwith neural networks. In this paper we provide a PAC bound that holds for these\nkind of architectures with stable SSM blocks and does not depend on the length\nof the input sequence. Imposing stability of the SSM blocks is a standard\npractice in the literature, and it is known to help performance. Our results\nprovide a theoretical justification for the use of stable SSM blocks as the\nproposed PAC bound decreases as the degree of stability of the SSM blocks\nincreases.\n","authors":["Dániel Rácz","Mihály Petreczky","Bálint Daróczy"],"pdf_url":"https://arxiv.org/pdf/2405.20278v1.pdf","comment":"25 pages, no figures, under submission"},{"id":"http://arxiv.org/abs/2210.16299v4","updated":"2024-05-30T17:31:41Z","published":"2022-10-28T17:52:18Z","title":"Nonuniqueness and Convergence to Equivalent Solutions in Observer-based\n Inverse Reinforcement Learning","summary":" A key challenge in solving the deterministic inverse reinforcement learning\n(IRL) problem online and in real-time is the existence of multiple solutions.\nNonuniqueness necessitates the study of the notion of equivalent solutions,\ni.e., solutions that result in a different cost functional but same feedback\nmatrix, and convergence to such solutions. While offline algorithms that result\nin convergence to equivalent solutions have been developed in the literature,\nonline, real-time techniques that address nonuniqueness are not available. In\nthis paper, a regularized history stack observer that converges to\napproximately equivalent solutions of the IRL problem is developed. Novel\ndata-richness conditions are developed to facilitate the analysis and\nsimulation results are provided to demonstrate the effectiveness of the\ndeveloped technique.\n","authors":["Jared Town","Zachary Morrison","Rushikesh Kamalapurkar"],"pdf_url":"https://arxiv.org/pdf/2210.16299v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16646v3","updated":"2024-05-30T17:30:42Z","published":"2024-05-26T17:52:58Z","title":"A Provably Effective Method for Pruning Experts in Fine-tuned Sparse\n Mixture-of-Experts","summary":" The sparsely gated mixture of experts (MoE) architecture sends different\ninputs to different subnetworks, i.e., experts, through trainable routers. MoE\nreduces the training computation significantly for large models, but its\ndeployment can be still memory or computation expensive for some downstream\ntasks. Model pruning is a popular approach to reduce inference computation, but\nits application in MoE architecture is largely unexplored. To the best of our\nknowledge, this paper provides the first provably efficient technique for\npruning experts in finetuned MoE models. We theoretically prove that\nprioritizing the pruning of the experts with a smaller change of the routers l2\nnorm from the pretrained model guarantees the preservation of test accuracy,\nwhile significantly reducing the model size and the computational requirements.\nAlthough our theoretical analysis is centered on binary classification tasks on\nsimplified MoE architecture, our expert pruning method is verified on large\nvision MoE models such as VMoE and E3MoE finetuned on benchmark datasets such\nas CIFAR10, CIFAR100, and ImageNet.\n","authors":["Mohammed Nowaz Rabbani Chowdhury","Meng Wang","Kaoutar El Maghraoui","Naigang Wang","Pin-Yu Chen","Christopher Carothers"],"pdf_url":"https://arxiv.org/pdf/2405.16646v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20274v1","updated":"2024-05-30T17:29:15Z","published":"2024-05-30T17:29:15Z","title":"ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection","summary":" Aspect-Based Sentiment Analysis (ABSA) has experienced tremendous expansion\nand diversity due to various shared tasks spanning several languages and fields\nand organized via SemEval workshops and Germeval. Nonetheless, a few\nshortcomings still need to be addressed, such as the lack of low-resource\nlanguage evaluations and the emphasis on sentence-level analysis. To thoroughly\nassess ABSA techniques in the context of complete reviews, this research\npresents a novel task, Review-Level Opinion Aspect Sentiment Target (ROAST).\nROAST seeks to close the gap between sentence-level and text-level ABSA by\nidentifying every ABSA constituent at the review level. We extend the available\ndatasets to enable ROAST, addressing the drawbacks noted in previous research\nby incorporating low-resource languages, numerous languages, and a variety of\ntopics. Through this effort, ABSA research will be able to cover more ground\nand get a deeper comprehension of the task and its practical application in a\nvariety of languages and domains (https://github.com/RiTUAL-UH/ROAST-ABSA).\n","authors":["Siva Uday Sampreeth Chebolu","Franck Dernoncourt","Nedim Lipka","Thamar Solorio"],"pdf_url":"https://arxiv.org/pdf/2405.20274v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2309.13297"},{"id":"http://arxiv.org/abs/2405.20272v1","updated":"2024-05-30T17:27:44Z","published":"2024-05-30T17:27:44Z","title":"Reconstruction Attacks on Machine Unlearning: Simple Models are\n Vulnerable","summary":" Machine unlearning is motivated by desire for data autonomy: a person can\nrequest to have their data's influence removed from deployed models, and those\nmodels should be updated as if they were retrained without the person's data.\nWe show that, counter-intuitively, these updates expose individuals to\nhigh-accuracy reconstruction attacks which allow the attacker to recover their\ndata in its entirety, even when the original models are so simple that privacy\nrisk might not otherwise have been a concern. We show how to mount a\nnear-perfect attack on the deleted data point from linear regression models. We\nthen generalize our attack to other loss functions and architectures, and\nempirically demonstrate the effectiveness of our attacks across a wide range of\ndatasets (capturing both tabular and image data). Our work highlights that\nprivacy risk is significant even for extremely simple model classes when\nindividuals can request deletion of their data from the model.\n","authors":["Martin Bertran","Shuai Tang","Michael Kearns","Jamie Morgenstern","Aaron Roth","Zhiwei Steven Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20272v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20271v1","updated":"2024-05-30T17:26:02Z","published":"2024-05-30T17:26:02Z","title":"ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane\n Reflections","summary":" Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt\nfoundation models to downstream task requirements while retaining their\ngeneralization ability. However, the amount of additionally introduced\nparameters and compute for successful adaptation and hyperparameter searches\ncan explode quickly, especially when deployed at scale to serve numerous\nindividual requests. To ensure effective, parameter-efficient, and\nhyperparameter-robust adaptation, we propose the ETHER transformation family,\nwhich performs Efficient fineTuning via HypErplane Reflections. By design,\nETHER transformations require a minimal number of parameters, are less likely\nto deteriorate model performance, and exhibit robustness to hyperparameter and\nlearning rate choices. In particular, we introduce ETHER and its relaxation\nETHER+, which match or outperform existing PEFT methods with significantly\nfewer parameters ($\\sim$$10$-$100$ times lower than LoRA or OFT) across\nmultiple image synthesis and natural language tasks without exhaustive\nhyperparameter tuning. Finally, we investigate the recent emphasis on\nHyperspherical Energy retention for adaptation and raise questions on its\npractical utility. The code is available at https://github.com/mwbini/ether.\n","authors":["Massimo Bini","Karsten Roth","Zeynep Akata","Anna Khoreva"],"pdf_url":"https://arxiv.org/pdf/2405.20271v1.pdf","comment":"Accepted to ICML 2024. Code available at\n https://github.com/mwbini/ether"},{"id":"http://arxiv.org/abs/2402.07240v3","updated":"2024-05-30T17:23:03Z","published":"2024-02-11T16:36:48Z","title":"Oja's Algorithm for Sparse PCA","summary":" Oja's algorithm for streaming Principal Component Analysis (PCA) for $n$\ndatapoints in a $d$ dimensional space achieves the same sin-squared error\n$O(r_\\mathsf{eff}/n)$ as the offline algorithm in $O(d)$ space and $O(nd)$ time\nand a single pass through the datapoints. Here $r_\\mathsf{eff}$ is the\neffective rank (ratio of the trace and the principal eigenvalue of the\npopulation covariance matrix $\\Sigma$). Under this computational budget, we\nconsider the problem of sparse PCA, where the principal eigenvector of $\\Sigma$\nis $s$-sparse, and $r_\\mathsf{eff}$ can be large. In this setting, to our\nknowledge, \\textit{there are no known single-pass algorithms} that achieve the\nminimax error bound in $O(d)$ space and $O(nd)$ time without either requiring\nstrong initialization conditions or assuming further structure (e.g., spiked)\nof the covariance matrix. We show that a simple single-pass procedure that\nthresholds the output of Oja's algorithm (the Oja vector) can achieve the\nminimax error bound under some regularity conditions in $O(d)$ space and\n$O(nd)$ time as long as $r_\\mathsf{eff}=O(n/\\log n)$. We present a nontrivial\nand novel analysis of the entries of the unnormalized Oja vector, which\ninvolves the projection of a product of independent random matrices on a random\ninitial vector. This is completely different from previous analyses of Oja's\nalgorithm and matrix products, which have been done when the $r_\\mathsf{eff}$\nis bounded.\n","authors":["Syamantak Kumar","Purnamrita Sarkar"],"pdf_url":"https://arxiv.org/pdf/2402.07240v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19327v2","updated":"2024-05-30T17:17:21Z","published":"2024-05-29T17:57:16Z","title":"MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model\n Series","summary":" Large Language Models (LLMs) have made great strides in recent years to\nachieve unprecedented performance across different tasks. However, due to\ncommercial interest, the most competitive models like GPT, Gemini, and Claude\nhave been gated behind proprietary interfaces without disclosing the training\ndetails. Recently, many institutions have open-sourced several strong LLMs like\nLLaMA-3, comparable to existing closed-source LLMs. However, only the model's\nweights are provided with most details (e.g., intermediate checkpoints,\npre-training corpus, and training code, etc.) being undisclosed. To improve the\ntransparency of LLMs, the research community has formed to open-source truly\nopen LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training\ncorpus and training code) are being provided. These models have greatly\nadvanced the scientific study of these large models including their strengths,\nweaknesses, biases and risks. However, we observe that the existing truly open\nLLMs on reasoning, knowledge, and coding tasks are still inferior to existing\nstate-of-the-art LLMs with similar model sizes. To this end, we open-source\nMAP-Neo, a highly capable and transparent bilingual language model with 7B\nparameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the\nfirst fully open-sourced bilingual LLM with comparable performance compared to\nexisting state-of-the-art LLMs. Moreover, we open-source all details to\nreproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning\npipeline, checkpoints, and well-optimized training/evaluation framework are\nprovided. Finally, we hope our MAP-Neo will enhance and strengthen the open\nresearch community and inspire more innovations and creativities to facilitate\nthe further improvements of LLMs.\n","authors":["Ge Zhang","Scott Qu","Jiaheng Liu","Chenchen Zhang","Chenghua Lin","Chou Leuang Yu","Danny Pan","Esther Cheng","Jie Liu","Qunshu Lin","Raven Yuan","Tuney Zheng","Wei Pang","Xinrun Du","Yiming Liang","Yinghao Ma","Yizhi Li","Ziyang Ma","Bill Lin","Emmanouil Benetos","Huan Yang","Junting Zhou","Kaijing Ma","Minghao Liu","Morry Niu","Noah Wang","Quehry Que","Ruibo Liu","Sine Liu","Shawn Guo","Soren Gao","Wangchunshu Zhou","Xinyue Zhang","Yizhi Zhou","Yubo Wang","Yuelin Bai","Yuhan Zhang","Yuxiang Zhang","Zenith Wang","Zhenzhu Yang","Zijian Zhao","Jiajun Zhang","Wanli Ouyang","Wenhao Huang","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2405.19327v2.pdf","comment":"https://map-neo.github.io/"},{"id":"http://arxiv.org/abs/2405.17476v3","updated":"2024-05-30T17:15:09Z","published":"2024-05-24T04:56:39Z","title":"How to Leverage Diverse Demonstrations in Offline Imitation Learning","summary":" Offline Imitation Learning (IL) with imperfect demonstrations has garnered\nincreasing attention owing to the scarcity of expert data in many real-world\ndomains. A fundamental problem in this scenario is how to extract positive\nbehaviors from noisy data. In general, current approaches to the problem select\ndata building on state-action similarity to given expert demonstrations,\nneglecting precious information in (potentially abundant) $\\textit{diverse}$\nstate-actions that deviate from expert ones. In this paper, we introduce a\nsimple yet effective data selection method that identifies positive behaviors\nbased on their resultant states -- a more informative criterion enabling\nexplicit utilization of dynamics information and effective extraction of both\nexpert and beneficial diverse behaviors. Further, we devise a lightweight\nbehavior cloning algorithm capable of leveraging the expert and selected data\ncorrectly. In the experiments, we evaluate our method on a suite of complex and\nhigh-dimensional offline IL benchmarks, including continuous-control and\nvision-based tasks. The results demonstrate that our method achieves\nstate-of-the-art performance, outperforming existing methods on\n$\\textbf{20/21}$ benchmarks, typically by $\\textbf{2-5x}$, while maintaining a\ncomparable runtime to Behavior Cloning ($\\texttt{BC}$).\n","authors":["Sheng Yue","Jiani Liu","Xingyuan Hua","Ju Ren","Sen Lin","Junshan Zhang","Yaoxue Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.17476v3.pdf","comment":"International Conference on Machine Learning (ICML)"},{"id":"http://arxiv.org/abs/2402.01057v2","updated":"2024-05-30T17:14:43Z","published":"2024-02-01T23:06:19Z","title":"Expert Proximity as Surrogate Rewards for Single Demonstration Imitation\n Learning","summary":" In this paper, we focus on single-demonstration imitation learning (IL), a\npractical approach for real-world applications where acquiring multiple expert\ndemonstrations is costly or infeasible and the ground truth reward function is\nnot available. In contrast to typical IL settings with multiple demonstrations,\nsingle-demonstration IL involves an agent having access to only one expert\ntrajectory. We highlight the issue of sparse reward signals in this setting and\npropose to mitigate this issue through our proposed Transition\nDiscriminator-based IL (TDIL) method. TDIL is an IRL method designed to address\nreward sparsity by introducing a denser surrogate reward function that\nconsiders environmental dynamics. This surrogate reward function encourages the\nagent to navigate towards states that are proximal to expert states. In\npractice, TDIL trains a transition discriminator to differentiate between valid\nand non-valid transitions in a given environment to compute the surrogate\nrewards. The experiments demonstrate that TDIL outperforms existing IL\napproaches and achieves expert-level performance in the single-demonstration IL\nsetting across five widely adopted MuJoCo benchmarks as well as the \"Adroit\nDoor\" robotic environment.\n","authors":["Chia-Cheng Chiang","Li-Cheng Lan","Wei-Fang Sun","Chien Feng","Cho-Jui Hsieh","Chun-Yi Lee"],"pdf_url":"https://arxiv.org/pdf/2402.01057v2.pdf","comment":"Published at ICML 2024. Code: https://github.com/stanl1y/tdil"},{"id":"http://arxiv.org/abs/2310.13230v5","updated":"2024-05-30T17:13:04Z","published":"2023-10-20T02:40:05Z","title":"Absolute Policy Optimization","summary":" In recent years, trust region on-policy reinforcement learning has achieved\nimpressive results in addressing complex control tasks and gaming scenarios.\nHowever, contemporary state-of-the-art algorithms within this category\nprimarily emphasize improvement in expected performance, lacking the ability to\ncontrol over the worst-case performance outcomes. To address this limitation,\nwe introduce a novel objective function, optimizing which leads to guaranteed\nmonotonic improvement in the lower probability bound of performance with high\nconfidence. Building upon this groundbreaking theoretical advancement, we\nfurther introduce a practical solution called Absolute Policy Optimization\n(APO). Our experiments demonstrate the effectiveness of our approach across\nchallenging continuous control benchmark tasks and extend its applicability to\nmastering Atari games. Our findings reveal that APO as well as its efficient\nvariation Proximal Absolute Policy Optimization (PAPO) significantly\noutperforms state-of-the-art policy gradient algorithms, resulting in\nsubstantial improvements in worst-case performance, as well as expected\nperformance.\n","authors":["Weiye Zhao","Feihan Li","Yifan Sun","Rui Chen","Tianhao Wei","Changliu Liu"],"pdf_url":"https://arxiv.org/pdf/2310.13230v5.pdf","comment":"published in ICML 2024"},{"id":"http://arxiv.org/abs/2405.17477v3","updated":"2024-05-30T17:11:46Z","published":"2024-05-24T04:57:25Z","title":"OLLIE: Imitation Learning from Offline Pretraining to Online Finetuning","summary":" In this paper, we study offline-to-online Imitation Learning (IL) that\npretrains an imitation policy from static demonstration data, followed by fast\nfinetuning with minimal environmental interaction. We find the na\\\"ive\ncombination of existing offline IL and online IL methods tends to behave poorly\nin this context, because the initial discriminator (often used in online IL)\noperates randomly and discordantly against the policy initialization, leading\nto misguided policy optimization and $\\textit{unlearning}$ of pretraining\nknowledge. To overcome this challenge, we propose a principled\noffline-to-online IL method, named $\\texttt{OLLIE}$, that simultaneously learns\na near-expert policy initialization along with an $\\textit{aligned\ndiscriminator initialization}$, which can be seamlessly integrated into online\nIL, achieving smooth and fast finetuning. Empirically, $\\texttt{OLLIE}$\nconsistently and significantly outperforms the baseline methods in\n$\\textbf{20}$ challenging tasks, from continuous control to vision-based\ndomains, in terms of performance, demonstration efficiency, and convergence\nspeed. This work may serve as a foundation for further exploration of\npretraining and finetuning in the context of IL.\n","authors":["Sheng Yue","Xingyuan Hua","Ju Ren","Sen Lin","Junshan Zhang","Yaoxue Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.17477v3.pdf","comment":"International Conference on Machine Learning (ICML)"},{"id":"http://arxiv.org/abs/2310.12815v2","updated":"2024-05-30T17:09:56Z","published":"2023-10-19T15:12:09Z","title":"Formalizing and Benchmarking Prompt Injection Attacks and Defenses","summary":" A prompt injection attack aims to inject malicious instruction/data into the\ninput of an LLM-Integrated Application such that it produces results as an\nattacker desires. Existing works are limited to case studies. As a result, the\nliterature lacks a systematic understanding of prompt injection attacks and\ntheir defenses. We aim to bridge the gap in this work. In particular, we\npropose a framework to formalize prompt injection attacks. Existing attacks are\nspecial cases in our framework. Moreover, based on our framework, we design a\nnew attack by combining existing ones. Using our framework, we conduct a\nsystematic evaluation on 5 prompt injection attacks and 10 defenses with 10\nLLMs and 7 tasks. Our work provides a common benchmark for quantitatively\nevaluating future prompt injection attacks and defenses. To facilitate research\non this topic, we make our platform public at\nhttps://github.com/liu00222/Open-Prompt-Injection.\n","authors":["Yupei Liu","Yuqi Jia","Runpeng Geng","Jinyuan Jia","Neil Zhenqiang Gong"],"pdf_url":"https://arxiv.org/pdf/2310.12815v2.pdf","comment":"To appear in USENIX Security Symposium 2024"},{"id":"http://arxiv.org/abs/2405.20250v1","updated":"2024-05-30T17:02:18Z","published":"2024-05-30T17:02:18Z","title":"Entropy annealing for policy mirror descent in continuous time and space","summary":" Entropy regularization has been extensively used in policy optimization\nalgorithms to regularize the optimization landscape and accelerate convergence;\nhowever, it comes at the cost of introducing an additional regularization bias.\nThis work quantifies the impact of entropy regularization on the convergence of\npolicy gradient methods for stochastic exit time control problems. We analyze a\ncontinuous-time policy mirror descent dynamics, which updates the policy based\non the gradient of an entropy-regularized value function and adjusts the\nstrength of entropy regularization as the algorithm progresses. We prove that\nwith a fixed entropy level, the dynamics converges exponentially to the optimal\nsolution of the regularized problem. We further show that when the entropy\nlevel decays at suitable polynomial rates, the annealed flow converges to the\nsolution of the unregularized problem at a rate of $\\mathcal O(1/S)$ for\ndiscrete action spaces and, under suitable conditions, at a rate of $\\mathcal\nO(1/\\sqrt{S})$ for general action spaces, with $S$ being the gradient flow\ntime. This paper explains how entropy regularization improves policy\noptimization, even with the true gradient, from the perspective of convergence\nrate.\n","authors":["Deven Sethi","David Šiška","Yufei Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20250v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.09904v3","updated":"2024-05-30T17:00:35Z","published":"2023-02-20T11:02:55Z","title":"WW-FL: Secure and Private Large-Scale Federated Learning","summary":" Federated learning (FL) is an efficient approach for large-scale distributed\nmachine learning that promises data privacy by keeping training data on client\ndevices. However, recent research has uncovered vulnerabilities in FL,\nimpacting both security and privacy through poisoning attacks and the potential\ndisclosure of sensitive information in individual model updates as well as the\naggregated global model. This paper explores the inadequacies of existing FL\nprotection measures when applied independently, and the challenges of creating\neffective compositions.\n Addressing these issues, we propose WW-FL, an innovative framework that\ncombines secure multi-party computation (MPC) with hierarchical FL to guarantee\ndata and global model privacy. One notable feature of WW-FL is its capability\nto prevent malicious clients from directly poisoning model parameters,\nconfining them to less destructive data poisoning attacks. We furthermore\nprovide a PyTorch-based FL implementation integrated with Meta's CrypTen MPC\nframework to systematically measure the performance and robustness of WW-FL.\nOur extensive evaluation demonstrates that WW-FL is a promising solution for\nsecure and private large-scale federated learning.\n","authors":["Felix Marx","Thomas Schneider","Ajith Suresh","Tobias Wehrle","Christian Weinert","Hossein Yalame"],"pdf_url":"https://arxiv.org/pdf/2302.09904v3.pdf","comment":"WWFL combines private training and inference with secure aggregation\n and hierarchical FL to provide end-to-end protection and to facilitate\n large-scale global deployment"},{"id":"http://arxiv.org/abs/2403.07723v2","updated":"2024-05-30T16:58:52Z","published":"2024-03-12T15:01:17Z","title":"On the Last-Iterate Convergence of Shuffling Gradient Methods","summary":" Shuffling gradient methods are widely implemented in practice, particularly\nincluding three popular algorithms: Random Reshuffle (RR), Shuffle Once (SO),\nand Incremental Gradient (IG). Compared to the empirical success, the\ntheoretical guarantee of shuffling gradient methods was not well-understood for\na long time. Until recently, the convergence rates had just been established\nfor the average iterate for convex functions and the last iterate for strongly\nconvex problems (using squared distance as the metric). However, when using the\nfunction value gap as the convergence criterion, existing theories cannot\ninterpret the good performance of the last iterate in different settings (e.g.,\nconstrained optimization). To bridge this gap between practice and theory, we\nprove the first last-iterate convergence rates for shuffling gradient methods\nwith respect to the objective value even without strong convexity. Our new\nresults either (nearly) match the existing last-iterate lower bounds or are as\nfast as the previous best upper bounds for the average iterate.\n","authors":["Zijian Liu","Zhengyuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2403.07723v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2405.20247v1","updated":"2024-05-30T16:58:34Z","published":"2024-05-30T16:58:34Z","title":"KerasCV and KerasNLP: Vision and Language Power-Ups","summary":" We present the Keras domain packages KerasCV and KerasNLP, extensions of the\nKeras API for Computer Vision and Natural Language Processing workflows,\ncapable of running on either JAX, TensorFlow, or PyTorch. These domain packages\nare designed to enable fast experimentation, with a focus on ease-of-use and\nperformance. We adopt a modular, layered design: at the library's lowest level\nof abstraction, we provide building blocks for creating models and data\npreprocessing pipelines, and at the library's highest level of abstraction, we\nprovide pretrained ``task\" models for popular architectures such as Stable\nDiffusion, YOLOv8, GPT2, BERT, Mistral, CLIP, Gemma, T5, etc. Task models have\nbuilt-in preprocessing, pretrained weights, and can be fine-tuned on raw\ninputs. To enable efficient training, we support XLA compilation for all\nmodels, and run all preprocessing via a compiled graph of TensorFlow operations\nusing the tf.data API. The libraries are fully open-source (Apache 2.0 license)\nand available on GitHub.\n","authors":["Matthew Watson","Divyashree Shivakumar Sreepathihalli","Francois Chollet","Martin Gorner","Kiranbir Sodhia","Ramesh Sampath","Tirth Patel","Haifeng Jin","Neel Kovelamudi","Gabriel Rasskin","Samaneh Saadat","Luke Wood","Chen Qian","Jonathan Bischof","Ian Stenbit"],"pdf_url":"https://arxiv.org/pdf/2405.20247v1.pdf","comment":"Submitted to Journal of Machine Learning Open Source Software"},{"id":"http://arxiv.org/abs/2405.20245v1","updated":"2024-05-30T16:54:42Z","published":"2024-05-30T16:54:42Z","title":"Retrieval Augmented Structured Generation: Business Document Information\n Extraction As Tool Use","summary":" Business Document Information Extraction (BDIE) is the problem of\ntransforming a blob of unstructured information (raw text, scanned documents,\netc.) into a structured format that downstream systems can parse and use. It\nhas two main tasks: Key-Information Extraction (KIE) and Line Items Recognition\n(LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem,\nwhere the tools are these downstream systems. We then present Retrieval\nAugmented Structured Generation (RASG), a novel general framework for BDIE that\nachieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE\nbenchmarks.\n The contributions of this paper are threefold: (1) We show, with ablation\nbenchmarks, that Large Language Models (LLMs) with RASG are already competitive\nwith or surpasses current SOTA Large Multimodal Models (LMMs) without RASG on\nBDIE benchmarks. (2) We propose a new metric class for Line Items Recognition,\nGeneral Line Items Recognition Metric (GLIRM), that is more aligned with\npractical BDIE use cases compared to existing metrics, such as ANLS*, DocILE,\nand GriTS. (3) We provide a heuristic algorithm for backcalculating bounding\nboxes of predicted line items and tables without the need for vision encoders.\nFinally, we claim that, while LMMs might sometimes offer marginal performance\nbenefits, LLMs + RASG is oftentimes superior given real-world applications and\nconstraints of BDIE.\n","authors":["Franz Louis Cesista","Rui Aguiar","Jason Kim","Paolo Acilo"],"pdf_url":"https://arxiv.org/pdf/2405.20245v1.pdf","comment":"Accepted by IEEE 7th International Conference on Multimedia\n Information Processing and Retrieval (MIPR), 2024"},{"id":"http://arxiv.org/abs/2405.20237v1","updated":"2024-05-30T16:40:28Z","published":"2024-05-30T16:40:28Z","title":"Training-efficient density quantum machine learning","summary":" Quantum machine learning requires powerful, flexible and efficiently\ntrainable models to be successful in solving challenging problems. In this\nwork, we present density quantum neural networks, a learning model\nincorporating randomisation over a set of trainable unitaries. These models\ngeneralise quantum neural networks using parameterised quantum circuits, and\nallow a trade-off between expressibility and efficient trainability,\nparticularly on quantum hardware. We demonstrate the flexibility of the\nformalism by applying it to two recently proposed model families. The first are\ncommuting-block quantum neural networks (QNNs) which are efficiently trainable\nbut may be limited in expressibility. The second are orthogonal (Hamming-weight\npreserving) quantum neural networks which provide well-defined and\ninterpretable transformations on data but are challenging to train at scale on\nquantum devices. Density commuting QNNs improve capacity with minimal gradient\ncomplexity overhead, and density orthogonal neural networks admit a\nquadratic-to-constant gradient query advantage with minimal to no performance\nloss. We conduct numerical experiments on synthetic translationally invariant\ndata and MNIST image data with hyperparameter optimisation to support our\nfindings. Finally, we discuss the connection to post-variational quantum neural\nnetworks, measurement-based quantum machine learning and the dropout mechanism.\n","authors":["Brian Coyle","El Amine Cherrat","Nishant Jain","Natansh Mathur","Snehal Raj","Skander Kazdaghli","Iordanis Kerenidis"],"pdf_url":"https://arxiv.org/pdf/2405.20237v1.pdf","comment":"17 pages main text, 9 pages appendices. 9 figures"},{"id":"http://arxiv.org/abs/2405.20236v1","updated":"2024-05-30T16:40:07Z","published":"2024-05-30T16:40:07Z","title":"Disentangling and Mitigating the Impact of Task Similarity for Continual\n Learning","summary":" Continual learning of partially similar tasks poses a challenge for\nartificial neural networks, as task similarity presents both an opportunity for\nknowledge transfer and a risk of interference and catastrophic forgetting.\nHowever, it remains unclear how task similarity in input features and readout\npatterns influences knowledge transfer and forgetting, as well as how they\ninteract with common algorithms for continual learning. Here, we develop a\nlinear teacher-student model with latent structure and show analytically that\nhigh input feature similarity coupled with low readout similarity is\ncatastrophic for both knowledge transfer and retention. Conversely, the\nopposite scenario is relatively benign. Our analysis further reveals that\ntask-dependent activity gating improves knowledge retention at the expense of\ntransfer, while task-dependent plasticity gating does not affect either\nretention or transfer performance at the over-parameterized limit. In contrast,\nweight regularization based on the Fisher information metric significantly\nimproves retention, regardless of task similarity, without compromising\ntransfer performance. Nevertheless, its diagonal approximation and\nregularization in the Euclidean space are much less robust against task\nsimilarity. We demonstrate consistent results in a permuted MNIST task with\nlatent variables. Overall, this work provides insights into when continual\nlearning is difficult and how to mitigate it.\n","authors":["Naoki Hiratani"],"pdf_url":"https://arxiv.org/pdf/2405.20236v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20233v1","updated":"2024-05-30T16:35:30Z","published":"2024-05-30T16:35:30Z","title":"Grokfast: Accelerated Grokking by Amplifying Slow Gradients","summary":" One puzzling artifact in machine learning dubbed grokking is where delayed\ngeneralization is achieved tenfolds of iterations after near perfect\noverfitting to the training data. Focusing on the long delay itself on behalf\nof machine learning practitioners, our goal is to accelerate generalization of\na model under grokking phenomenon. By regarding a series of gradients of a\nparameter over training iterations as a random signal over time, we can\nspectrally decompose the parameter trajectories under gradient descent into two\ncomponents: the fast-varying, overfitting-yielding component and the\nslow-varying, generalization-inducing component. This analysis allows us to\naccelerate the grokking phenomenon more than $\\times 50$ with only a few lines\nof code that amplifies the slow-varying components of gradients. The\nexperiments show that our algorithm applies to diverse tasks involving images,\nlanguages, and graphs, enabling practical availability of this peculiar\nartifact of sudden generalization. Our code is available at\n\\url{https://github.com/ironjr/grokfast}.\n","authors":["Jaerin Lee","Bong Gyun Kang","Kihoon Kim","Kyoung Mu Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20233v1.pdf","comment":"15 pages, 12 figures. Project page:\n https://jaerinlee.com/research/grokfast"},{"id":"http://arxiv.org/abs/2405.20231v1","updated":"2024-05-30T16:32:31Z","published":"2024-05-30T16:32:31Z","title":"The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof","summary":" Many algorithms and observed phenomena in deep learning appear to be affected\nby parameter symmetries -- transformations of neural network parameters that do\nnot change the underlying neural network function. These include linear mode\nconnectivity, model merging, Bayesian neural network inference, metanetworks,\nand several other characteristics of optimization or loss-landscapes. However,\ntheoretical analysis of the relationship between parameter space symmetries and\nthese phenomena is difficult. In this work, we empirically investigate the\nimpact of neural parameter symmetries by introducing new neural network\narchitectures that have reduced parameter space symmetries. We develop two\nmethods, with some provable guarantees, of modifying standard neural networks\nto reduce parameter space symmetries. With these new methods, we conduct a\ncomprehensive experimental study consisting of multiple tasks aimed at\nassessing the effect of removing parameter symmetries. Our experiments reveal\nseveral interesting observations on the empirical impact of parameter\nsymmetries; for instance, we observe linear mode connectivity between our\nnetworks without alignment of weight spaces, and we find that our networks\nallow for faster and more effective Bayesian neural network training.\n","authors":["Derek Lim","Moe Putterman","Robin Walters","Haggai Maron","Stefanie Jegelka"],"pdf_url":"https://arxiv.org/pdf/2405.20231v1.pdf","comment":"27 pages. Preparing code for release"},{"id":"http://arxiv.org/abs/2402.14800v2","updated":"2024-05-30T16:24:16Z","published":"2024-02-22T18:56:07Z","title":"Not All Experts are Equal: Efficient Expert Pruning and Skipping for\n Mixture-of-Experts Large Language Models","summary":" A pivotal advancement in the progress of large language models (LLMs) is the\nemergence of the Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs,\nMoE LLMs can achieve higher performance with fewer parameters, but it is still\nhard to deploy them due to their immense parameter sizes. Different from\nprevious weight pruning methods that rely on specifically designed hardware,\nthis paper mainly aims to enhance the deployment efficiency of MoE LLMs by\nintroducing plug-and-play expert-level sparsification techniques. Specifically,\nwe propose, for the first time to our best knowledge, post-training approaches\nfor task-agnostic and task-specific expert pruning and skipping of MoE LLMs,\ntailored to improve deployment efficiency while maintaining model performance\nacross a wide range of tasks. Extensive experiments show that our proposed\nmethods can simultaneously reduce model sizes and increase the inference speed,\nwhile maintaining satisfactory performance. Data and code will be available at\nhttps://github.com/Lucky-Lance/Expert_Sparsity.\n","authors":["Xudong Lu","Qi Liu","Yuhui Xu","Aojun Zhou","Siyuan Huang","Bo Zhang","Junchi Yan","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2402.14800v2.pdf","comment":"Mixture-of-Experts Large Language Models, ACL2024"},{"id":"http://arxiv.org/abs/2403.19546v2","updated":"2024-05-30T16:20:04Z","published":"2024-03-28T16:27:26Z","title":"Croissant: A Metadata Format for ML-Ready Datasets","summary":" Data is a critical resource for Machine Learning (ML), yet working with data\nremains a key friction point. This paper introduces Croissant, a metadata\nformat for datasets that simplifies how data is used by ML tools and\nframeworks. Croissant makes datasets more discoverable, portable and\ninteroperable, thereby addressing significant challenges in ML data management\nand responsible AI. Croissant is already supported by several popular dataset\nrepositories, spanning hundreds of thousands of datasets, ready to be loaded\ninto the most popular ML frameworks.\n","authors":["Mubashara Akhtar","Omar Benjelloun","Costanza Conforti","Pieter Gijsbers","Joan Giner-Miguelez","Nitisha Jain","Michael Kuchnik","Quentin Lhoest","Pierre Marcenac","Manil Maskey","Peter Mattson","Luis Oala","Pierre Ruyssen","Rajat Shinde","Elena Simperl","Goeffry Thomas","Slava Tykhonov","Joaquin Vanschoren","Jos van der Velde","Steffen Vogler","Carole-Jean Wu"],"pdf_url":"https://arxiv.org/pdf/2403.19546v2.pdf","comment":"Published in Proceedings of ACM SIGMOD/PODS'24 Data Management for\n End-to-End Machine Learning (DEEM) Workshop\n https://dl.acm.org/doi/10.1145/3650203.3663326"},{"id":"http://arxiv.org/abs/2405.20216v1","updated":"2024-05-30T16:18:05Z","published":"2024-05-30T16:18:05Z","title":"Boost Your Own Human Image Generation Model via Direct Preference\n Optimization with AI Feedback","summary":" The generation of high-quality human images through text-to-image (T2I)\nmethods is a significant yet challenging task. Distinct from general image\ngeneration, human image synthesis must satisfy stringent criteria related to\nhuman pose, anatomy, and alignment with textual prompts, making it particularly\ndifficult to achieve realistic results. Recent advancements in T2I generation\nbased on diffusion models have shown promise, yet challenges remain in meeting\nhuman-specific preferences. In this paper, we introduce a novel approach\ntailored specifically for human image generation utilizing Direct Preference\nOptimization (DPO). Specifically, we introduce an efficient method for\nconstructing a specialized DPO dataset for training human image generation\nmodels without the need for costly human feedback. We also propose a modified\nloss function that enhances the DPO training process by minimizing artifacts\nand improving image fidelity. Our method demonstrates its versatility and\neffectiveness in generating human images, including personalized text-to-image\ngeneration. Through comprehensive evaluations, we show that our approach\nsignificantly advances the state of human image generation, achieving superior\nresults in terms of natural anatomies, poses, and text-image alignment.\n","authors":["Sanghyeon Na","Yonggyu Kim","Hyunjoon Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20216v1.pdf","comment":"28 pages, 18 figures"},{"id":"http://arxiv.org/abs/2405.20213v1","updated":"2024-05-30T16:16:25Z","published":"2024-05-30T16:16:25Z","title":"PostDoc: Generating Poster from a Long Multimodal Document Using Deep\n Submodular Optimization","summary":" A poster from a long input document can be considered as a one-page\neasy-to-read multimodal (text and images) summary presented on a nice template\nwith good design elements. Automatic transformation of a long document into a\nposter is a very less studied but challenging task. It involves content\nsummarization of the input document followed by template generation and\nharmonization. In this work, we propose a novel deep submodular function which\ncan be trained on ground truth summaries to extract multimodal content from the\ndocument and explicitly ensures good coverage, diversity and alignment of text\nand images. Then, we use an LLM based paraphraser and propose to generate a\ntemplate with various design aspects conditioned on the input content. We show\nthe merits of our approach through extensive automated and human evaluations.\n","authors":["Vijay Jaisankar","Sambaran Bandyopadhyay","Kalp Vyas","Varre Chaitanya","Shwetha Somasundaram"],"pdf_url":"https://arxiv.org/pdf/2405.20213v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14918v2","updated":"2024-05-30T16:04:44Z","published":"2024-05-23T17:13:52Z","title":"AnalogCoder: Analog Circuit Design via Training-Free Code Generation","summary":" Analog circuit design is a significant task in modern chip technology,\nfocusing on the selection of component types, connectivity, and parameters to\nensure proper circuit functionality. Despite advances made by Large Language\nModels (LLMs) in digital circuit design, the complexity and scarcity of data in\nanalog circuitry pose significant challenges. To mitigate these issues, we\nintroduce AnalogCoder, the first training-free LLM agent for designing analog\ncircuits through Python code generation. Firstly, AnalogCoder incorporates a\nfeedback-enhanced flow with tailored domain-specific prompts, enabling the\nautomated and self-correcting design of analog circuits with a high success\nrate. Secondly, it proposes a circuit tool library to archive successful\ndesigns as reusable modular sub-circuits, simplifying composite circuit\ncreation. Thirdly, extensive experiments on a benchmark designed to cover a\nwide range of analog circuit tasks show that AnalogCoder outperforms other\nLLM-based methods. It has successfully designed 20 circuits, 5 more than\nstandard GPT-4o. We believe AnalogCoder can significantly improve the\nlabor-intensive chip design process, enabling non-experts to design analog\ncircuits efficiently.\n","authors":["Yao Lai","Sungyoung Lee","Guojin Chen","Souradip Poddar","Mengkang Hu","David Z. Pan","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2405.14918v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20200v1","updated":"2024-05-30T16:04:35Z","published":"2024-05-30T16:04:35Z","title":"Unified Explanations in Machine Learning Models: A Perturbation Approach","summary":" A high-velocity paradigm shift towards Explainable Artificial Intelligence\n(XAI) has emerged in recent years. Highly complex Machine Learning (ML) models\nhave flourished in many tasks of intelligence, and the questions have started\nto shift away from traditional metrics of validity towards something deeper:\nWhat is this model telling me about my data, and how is it arriving at these\nconclusions? Inconsistencies between XAI and modeling techniques can have the\nundesirable effect of casting doubt upon the efficacy of these explainability\napproaches. To address these problems, we propose a systematic,\nperturbation-based analysis against a popular, model-agnostic method in XAI,\nSHapley Additive exPlanations (Shap). We devise algorithms to generate relative\nfeature importance in settings of dynamic inference amongst a suite of popular\nmachine learning and deep learning methods, and metrics that allow us to\nquantify how well explanations generated under the static case hold. We propose\na taxonomy for feature importance methodology, measure alignment, and observe\nquantifiable similarity amongst explanation models across several datasets.\n","authors":["Jacob Dineen","Don Kridel","Daniel Dolk","David Castillo"],"pdf_url":"https://arxiv.org/pdf/2405.20200v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20194v1","updated":"2024-05-30T15:58:22Z","published":"2024-05-30T15:58:22Z","title":"Occam Gradient Descent","summary":" Deep learning neural network models must be large enough to adapt to their\nproblem domain, while small enough to avoid overfitting training data during\ngradient descent. To balance these competing demands, overprovisioned deep\nlearning models such as transformers are trained for a single epoch on large\ndata sets, and hence inefficient with both computing resources and training\ndata. In response to these inefficiencies, we exploit learning theory to derive\nOccam Gradient Descent, an algorithm that interleaves adaptive reduction of\nmodel size to minimize generalization error, with gradient descent on model\nweights to minimize fitting error. In contrast, traditional gradient descent\ngreedily minimizes fitting error without regard to generalization error. Our\nalgorithm simultaneously descends the space of weights and topological size of\nany neural network without modification, and is effective in our experiments in\noutperforming traditional gradient descent with or without post-train pruning\nin accuracy, compute and model compression.\n","authors":["B. N. Kausik"],"pdf_url":"https://arxiv.org/pdf/2405.20194v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20180v1","updated":"2024-05-30T15:48:04Z","published":"2024-05-30T15:48:04Z","title":"Transformers and Slot Encoding for Sample Efficient Physical World\n Modelling","summary":" World modelling, i.e. building a representation of the rules that govern the\nworld so as to predict its evolution, is an essential ability for any agent\ninteracting with the physical world. Recent applications of the Transformer\narchitecture to the problem of world modelling from video input show notable\nimprovements in sample efficiency. However, existing approaches tend to work\nonly at the image level thus disregarding that the environment is composed of\nobjects interacting with each other. In this paper, we propose an architecture\ncombining Transformers for world modelling with the slot-attention paradigm, an\napproach for learning representations of objects appearing in a scene. We\ndescribe the resulting neural architecture and report experimental results\nshowing an improvement over the existing solutions in terms of sample\nefficiency and a reduction of the variation of the performance over the\ntraining examples. The code for our architecture and experiments is available\nat https://github.com/torchipeppo/transformers-and-slot-encoding-for-wm\n","authors":["Francesco Petri","Luigi Asprino","Aldo Gangemi"],"pdf_url":"https://arxiv.org/pdf/2405.20180v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20178v1","updated":"2024-05-30T15:47:48Z","published":"2024-05-30T15:47:48Z","title":"Non-intrusive data-driven model order reduction for circuits based on\n Hammerstein architectures","summary":" We demonstrate that data-driven system identification techniques can provide\na basis for effective, non-intrusive model order reduction (MOR) for common\ncircuits that are key building blocks in microelectronics. Our approach is\nmotivated by the practical operation of these circuits and utilizes a canonical\nHammerstein architecture. To demonstrate the approach we develop a parsimonious\nHammerstein model for a non-linear CMOS differential amplifier. We train this\nmodel on a combination of direct current (DC) and transient Spice (Xyce)\ncircuit simulation data using a novel sequential strategy to identify the\nstatic nonlinear and linear dynamical parts of the model. Simulation results\nshow that the Hammerstein model is an effective surrogate for the differential\namplifier circuit that accurately and efficiently reproduces its behavior over\na wide range of operating points and input frequencies.\n","authors":["Joshua Hanson","Biliana Paskaleva","Pavel Bochev"],"pdf_url":"https://arxiv.org/pdf/2405.20178v1.pdf","comment":"13 pages, 13 figures; submitted to IEEE Transactions on\n Computer-Aided Design of Integrated Circuits and Systems"},{"id":"http://arxiv.org/abs/2405.20174v1","updated":"2024-05-30T15:45:03Z","published":"2024-05-30T15:45:03Z","title":"Tropical Expressivity of Neural Networks","summary":" We propose an algebraic geometric framework to study the expressivity of\nlinear activation neural networks. A particular quantity that has been actively\nstudied in the field of deep learning is the number of linear regions, which\ngives an estimate of the information capacity of the architecture. To study and\nevaluate information capacity and expressivity, we work in the setting of\ntropical geometry -- a combinatorial and polyhedral variant of algebraic\ngeometry -- where there are known connections between tropical rational maps\nand feedforward neural networks. Our work builds on and expands this connection\nto capitalize on the rich theory of tropical geometry to characterize and study\nvarious architectural aspects of neural networks. Our contributions are\nthreefold: we provide a novel tropical geometric approach to selecting sampling\ndomains among linear regions; an algebraic result allowing for a guided\nrestriction of the sampling domain for network architectures with symmetries;\nand an open source library to analyze neural networks as tropical Puiseux\nrational maps. We provide a comprehensive set of proof-of-concept numerical\nexperiments demonstrating the breadth of neural network architectures to which\ntropical geometric theory can be applied to reveal insights on expressivity\ncharacteristics of a network. Our work provides the foundations for the\nadaptation of both theory and existing software from computational tropical\ngeometry and symbolic computation to deep learning.\n","authors":["Shiv Bhatia","Yueqi Cao","Paul Lezeau","Anthea Monod"],"pdf_url":"https://arxiv.org/pdf/2405.20174v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.15159v3","updated":"2024-05-30T15:44:51Z","published":"2024-02-23T07:43:26Z","title":"Machine Unlearning of Pre-trained Large Language Models","summary":" This study investigates the concept of the `right to be forgotten' within the\ncontext of large language models (LLMs). We explore machine unlearning as a\npivotal solution, with a focus on pre-trained models--a notably\nunder-researched area. Our research delineates a comprehensive framework for\nmachine unlearning in pre-trained LLMs, encompassing a critical analysis of\nseven diverse unlearning methods. Through rigorous evaluation using curated\ndatasets from arXiv, books, and GitHub, we establish a robust benchmark for\nunlearning performance, demonstrating that these methods are over $10^5$ times\nmore computationally efficient than retraining. Our results show that\nintegrating gradient ascent with gradient descent on in-distribution data\nimproves hyperparameter robustness. We also provide detailed guidelines for\nefficient hyperparameter tuning in the unlearning process. Our findings advance\nthe discourse on ethical AI practices, offering substantive insights into the\nmechanics of machine unlearning for pre-trained LLMs and underscoring the\npotential for responsible AI development.\n","authors":["Jin Yao","Eli Chien","Minxin Du","Xinyao Niu","Tianhao Wang","Zezhou Cheng","Xiang Yue"],"pdf_url":"https://arxiv.org/pdf/2402.15159v3.pdf","comment":"ACL 2024 main. Code and data at\n https://github.com/yaojin17/Unlearning_LLM"},{"id":"http://arxiv.org/abs/2405.20172v1","updated":"2024-05-30T15:44:27Z","published":"2024-05-30T15:44:27Z","title":"Iterative Feature Boosting for Explainable Speech Emotion Recognition","summary":" In speech emotion recognition (SER), using predefined features without\nconsidering their practical importance may lead to high dimensional datasets,\nincluding redundant and irrelevant information. Consequently, high-dimensional\nlearning often results in decreasing model accuracy while increasing\ncomputational complexity. Our work underlines the importance of carefully\nconsidering and analyzing features in order to build efficient SER systems. We\npresent a new supervised SER method based on an efficient feature engineering\napproach. We pay particular attention to the explainability of results to\nevaluate feature relevance and refine feature sets. This is performed\niteratively through feature evaluation loop, using Shapley values to boost\nfeature selection and improve overall framework performance. Our approach\nallows thus to balance the benefits between model performance and transparency.\nThe proposed method outperforms human-level performance (HLP) and\nstate-of-the-art machine learning methods in emotion recognition on the TESS\ndataset.\n","authors":["Alaa Nfissi","Wassim Bouachir","Nizar Bouguila","Brian Mishara"],"pdf_url":"https://arxiv.org/pdf/2405.20172v1.pdf","comment":"Published in: 2023 International Conference on Machine Learning and\n Applications (ICMLA)"},{"id":"http://arxiv.org/abs/2405.20165v1","updated":"2024-05-30T15:39:19Z","published":"2024-05-30T15:39:19Z","title":"Randomized Exploration for Reinforcement Learning with Multinomial\n Logistic Function Approximation","summary":" We study reinforcement learning with multinomial logistic (MNL) function\napproximation where the underlying transition probability kernel of the Markov\ndecision processes (MDPs) is parametrized by an unknown transition core with\nfeatures of state and action. For the finite horizon episodic setting with\ninhomogeneous state transitions, we propose provably efficient algorithms with\nrandomized exploration having frequentist regret guarantees. For our first\nalgorithm, $\\texttt{RRL-MNL}$, we adapt optimistic sampling to ensure the\noptimism of the estimated value function with sufficient frequency and\nestablish that $\\texttt{RRL-MNL}$ is both statistically and computationally\nefficient, achieving a $\\tilde{O}(\\kappa^{-1} d^{\\frac{3}{2}} H^{\\frac{3}{2}}\n\\sqrt{T})$ frequentist regret bound with constant-time computational cost per\nepisode. Here, $d$ is the dimension of the transition core, $H$ is the horizon\nlength, $T$ is the total number of steps, and $\\kappa$ is a problem-dependent\nconstant. Despite the simplicity and practicality of $\\texttt{RRL-MNL}$, its\nregret bound scales with $\\kappa^{-1}$, which is potentially large in the worst\ncase. To improve the dependence on $\\kappa^{-1}$, we propose\n$\\texttt{ORRL-MNL}$, which estimates the value function using local gradient\ninformation of the MNL transition model. We show that its frequentist regret\nbound is $\\tilde{O}(d^{\\frac{3}{2}} H^{\\frac{3}{2}} \\sqrt{T} + \\kappa^{-1} d^2\nH^2)$. To the best of our knowledge, these are the first randomized RL\nalgorithms for the MNL transition model that achieve both computational and\nstatistical efficiency. Numerical experiments demonstrate the superior\nperformance of the proposed algorithms.\n","authors":["Wooseong Cho","Taehyun Hwang","Joongkyu Lee","Min-hwan Oh"],"pdf_url":"https://arxiv.org/pdf/2405.20165v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.09983v2","updated":"2024-05-30T15:34:10Z","published":"2024-05-16T11:01:09Z","title":"Zero-Shot Hierarchical Classification on the Common Procurement\n Vocabulary Taxonomy","summary":" Classifying public tenders is a useful task for both companies that are\ninvited to participate and for inspecting fraudulent activities. To facilitate\nthe task for both participants and public administrations, the European Union\npresented a common taxonomy (Common Procurement Vocabulary, CPV) which is\nmandatory for tenders of certain importance; however, the contracts in which a\nCPV label is mandatory are the minority compared to all the Public\nAdministrations activities. Classifying over a real-world taxonomy introduces\nsome difficulties that can not be ignored. First of all, some fine-grained\nclasses have an insufficient (if any) number of observations in the training\nset, while other classes are far more frequent (even thousands of times) than\nthe average. To overcome those difficulties, we present a zero-shot approach,\nbased on a pre-trained language model that relies only on label description and\nrespects the label taxonomy. To train our proposed model, we used industrial\ndata, which comes from contrattipubblici.org, a service by SpazioDati s.r.l.\nthat collects public contracts stipulated in Italy in the last 25 years.\nResults show that the proposed model achieves better performance in classifying\nlow-frequent classes compared to three different baselines, and is also able to\npredict never-seen classes.\n","authors":["Federico Moiraghi","Matteo Palmonari","Davide Allavena","Federico Morando"],"pdf_url":"https://arxiv.org/pdf/2405.09983v2.pdf","comment":"Full-length version of the short paper accepted at COMPSAC 2024"},{"id":"http://arxiv.org/abs/2307.13885v5","updated":"2024-05-30T15:33:55Z","published":"2023-07-26T01:10:29Z","title":"Characterizing Data Point Vulnerability via Average-Case Robustness","summary":" Studying the robustness of machine learning models is important to ensure\nconsistent model behaviour across real-world settings. To this end, adversarial\nrobustness is a standard framework, which views robustness of predictions\nthrough a binary lens: either a worst-case adversarial misclassification exists\nin the local region around an input, or it does not. However, this binary\nperspective does not account for the degrees of vulnerability, as data points\nwith a larger number of misclassified examples in their neighborhoods are more\nvulnerable. In this work, we consider a complementary framework for robustness,\ncalled average-case robustness, which measures the fraction of points in a\nlocal region that provides consistent predictions. However, computing this\nquantity is hard, as standard Monte Carlo approaches are inefficient especially\nfor high-dimensional inputs. In this work, we propose the first analytical\nestimators for average-case robustness for multi-class classifiers. We show\nempirically that our estimators are accurate and efficient for standard deep\nlearning models and demonstrate their usefulness for identifying vulnerable\ndata points, as well as quantifying robustness bias of models. Overall, our\ntools provide a complementary view to robustness, improving our ability to\ncharacterize model behaviour.\n","authors":["Tessa Han","Suraj Srinivas","Himabindu Lakkaraju"],"pdf_url":"https://arxiv.org/pdf/2307.13885v5.pdf","comment":"UAI 2024"},{"id":"http://arxiv.org/abs/2405.02235v2","updated":"2024-05-30T15:18:24Z","published":"2024-05-03T16:45:15Z","title":"Learning Optimal Deterministic Policies with Stochastic Policy Gradients","summary":" Policy gradient (PG) methods are successful approaches to deal with\ncontinuous reinforcement learning (RL) problems. They learn stochastic\nparametric (hyper)policies by either exploring in the space of actions or in\nthe space of parameters. Stochastic controllers, however, are often undesirable\nfrom a practical perspective because of their lack of robustness, safety, and\ntraceability. In common practice, stochastic (hyper)policies are learned only\nto deploy their deterministic version. In this paper, we make a step towards\nthe theoretical understanding of this practice. After introducing a novel\nframework for modeling this scenario, we study the global convergence to the\nbest deterministic policy, under (weak) gradient domination assumptions. Then,\nwe illustrate how to tune the exploration level used for learning to optimize\nthe trade-off between the sample complexity and the performance of the deployed\ndeterministic policy. Finally, we quantitatively compare action-based and\nparameter-based exploration, giving a formal guise to intuitive results.\n","authors":["Alessandro Montenegro","Marco Mussi","Alberto Maria Metelli","Matteo Papini"],"pdf_url":"https://arxiv.org/pdf/2405.02235v2.pdf","comment":"Accepted to ICML 2024"},{"id":"http://arxiv.org/abs/2403.15112v3","updated":"2024-05-30T15:17:55Z","published":"2024-03-22T11:08:48Z","title":"Text clustering with LLM embeddings","summary":" Text clustering is an important approach for organising the growing amount of\ndigital content, helping to structure and find hidden patterns in uncategorised\ndata. However, the effectiveness of text clustering heavily relies on the\nchoice of textual embeddings and clustering algorithms. We argue that recent\nadvances in large language models (LLMs) can potentially improve this task. In\nthis research, we investigated how different textual embeddings -- particularly\nthose used in LLMs -- and clustering algorithms affect how text datasets are\nclustered. A series of experiments were conducted to assess how embeddings\ninfluence clustering results, the role played by dimensionality reduction\nthrough summarisation, and model size adjustment. Findings reveal that LLM\nembeddings excel at capturing subtleties in structured language, while BERT\nleads the lightweight options in performance. In addition, we observe that\nincreasing model dimensionality and employing summarization techniques do not\nconsistently lead to improvements in clustering efficiency, suggesting that\nthese strategies require careful analysis to use in real-life models. These\nresults highlight a complex balance between the need for refined text\nrepresentation and computational feasibility in text clustering applications.\nThis study extends traditional text clustering frameworks by incorporating\nembeddings from LLMs, providing a path for improved methodologies, while\ninforming new avenues for future research in various types of textual analysis.\n","authors":["Alina Petukhova","João P. Matos-Carvalho","Nuno Fachada"],"pdf_url":"https://arxiv.org/pdf/2403.15112v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20139v1","updated":"2024-05-30T15:14:24Z","published":"2024-05-30T15:14:24Z","title":"GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning","summary":" Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form\nof triplets (head, relation, tail), which collectively form a graph. Question\nAnswering over KGs (KGQA) is the task of answering natural questions grounding\nthe reasoning to the information provided by the KG. Large Language Models\n(LLMs) are the state-of-the-art models for QA tasks due to their remarkable\nability to understand natural language. On the other hand, Graph Neural\nNetworks (GNNs) have been widely used for KGQA as they can handle the complex\ngraph information stored in the KG. In this work, we introduce GNN-RAG, a novel\nmethod for combining language understanding abilities of LLMs with the\nreasoning abilities of GNNs in a retrieval-augmented generation (RAG) style.\nFirst, a GNN reasons over a dense KG subgraph to retrieve answer candidates for\na given question. Second, the shortest paths in the KG that connect question\nentities and answer candidates are extracted to represent KG reasoning paths.\nThe extracted paths are verbalized and given as input for LLM reasoning with\nRAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to\nextract useful graph information, while the LLM leverages its natural language\nprocessing ability for ultimate KGQA. Furthermore, we develop a retrieval\naugmentation (RA) technique to further boost KGQA performance with GNN-RAG.\nExperimental results show that GNN-RAG achieves state-of-the-art performance in\ntwo widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching\nGPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop\nand multi-entity questions outperforming competing approaches by 8.9--15.5%\npoints at answer F1.\n","authors":["Costas Mavromatis","George Karypis"],"pdf_url":"https://arxiv.org/pdf/2405.20139v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.15403v4","updated":"2024-05-30T15:11:45Z","published":"2022-05-30T20:00:19Z","title":"Neural Optimal Transport with General Cost Functionals","summary":" We introduce a novel neural network-based algorithm to compute optimal\ntransport (OT) plans for general cost functionals. In contrast to common\nEuclidean costs, i.e., $\\ell^1$ or $\\ell^2$, such functionals provide more\nflexibility and allow using auxiliary information, such as class labels, to\nconstruct the required transport map. Existing methods for general costs are\ndiscrete and have limitations in practice, i.e. they do not provide an\nout-of-sample estimation. We address the challenge of designing a continuous OT\napproach for general costs that generalizes to new data points in\nhigh-dimensional spaces, such as images. Additionally, we provide the\ntheoretical error analysis for our recovered transport plans. As an\napplication, we construct a cost functional to map data distributions while\npreserving the class-wise structure.\n","authors":["Arip Asadulaev","Alexander Korotin","Vage Egiazarian","Petr Mokrov","Evgeny Burnaev"],"pdf_url":"https://arxiv.org/pdf/2205.15403v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20127v1","updated":"2024-05-30T15:07:30Z","published":"2024-05-30T15:07:30Z","title":"SPAM: Stochastic Proximal Point Method with Momentum Variance Reduction\n for Non-convex Cross-Device Federated Learning","summary":" Cross-device training is a crucial subfield of federated learning, where the\nnumber of clients can reach into the billions. Standard approaches and local\nmethods are prone to issues such as client drift and insensitivity to data\nsimilarities. We propose a novel algorithm (SPAM) for cross-device federated\nlearning with non-convex losses, which solves both issues. We provide sharp\nanalysis under second-order (Hessian) similarity, a condition satisfied by a\nvariety of machine learning problems in practice. Additionally, we extend our\nresults to the partial participation setting, where a cohort of selected\nclients communicate with the server at each communication round. Our method is\nthe first in its kind, that does not require the smoothness of the objective\nand provably benefits from clients having similar data.\n","authors":["Avetik Karagulyan","Egor Shulgin","Abdurakhmon Sadiev","Peter Richtárik"],"pdf_url":"https://arxiv.org/pdf/2405.20127v1.pdf","comment":"The main part of the paper is around 9 pages. It contains the\n proposed algorithms, the main theoretical results and the experimental\n setting. The proofs of the main results and other technicalities are deferred\n to the Appendix"},{"id":"http://arxiv.org/abs/2403.07262v2","updated":"2024-05-30T15:04:42Z","published":"2024-03-12T02:43:41Z","title":"A2PO: Towards Effective Offline Reinforcement Learning from an\n Advantage-aware Perspective","summary":" Offline reinforcement learning endeavors to leverage offline datasets to\ncraft effective agent policy without online interaction, which imposes proper\nconservative constraints with the support of behavior policies to tackle the\nout-of-distribution problem. However, existing works often suffer from the\nconstraint conflict issue when offline datasets are collected from multiple\nbehavior policies, i.e., different behavior policies may exhibit inconsistent\nactions with distinct returns across the state space. To remedy this issue,\nrecent advantage-weighted methods prioritize samples with high advantage values\nfor agent training while inevitably ignoring the diversity of behavior policy.\nIn this paper, we introduce a novel Advantage-Aware Policy Optimization (A2PO)\nmethod to explicitly construct advantage-aware policy constraints for offline\nlearning under mixed-quality datasets. Specifically, A2PO employs a conditional\nvariational auto-encoder to disentangle the action distributions of intertwined\nbehavior policies by modeling the advantage values of all training data as\nconditional variables. Then the agent can follow such disentangled action\ndistribution constraints to optimize the advantage-aware policy towards high\nadvantage values. Extensive experiments conducted on both the single-quality\nand mixed-quality datasets of the D4RL benchmark demonstrate that A2PO yields\nresults superior to the counterparts. Our code will be made publicly available.\n","authors":["Yunpeng Qing","Shunyu liu","Jingyuan Cong","Kaixuan Chen","Yihe Zhou","Mingli Song"],"pdf_url":"https://arxiv.org/pdf/2403.07262v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.12880v2","updated":"2024-05-30T15:04:27Z","published":"2022-05-25T15:58:34Z","title":"Trust-based Consensus in Multi-Agent Reinforcement Learning Systems","summary":" An often neglected issue in multi-agent reinforcement learning (MARL) is the\npotential presence of unreliable agents in the environment whose deviations\nfrom expected behavior can prevent a system from accomplishing its intended\ntasks. In particular, consensus is a fundamental underpinning problem of\ncooperative distributed multi-agent systems. Consensus requires different\nagents, situated in a decentralized communication network, to reach an\nagreement out of a set of initial proposals that they put forward.\nLearning-based agents should adopt a protocol that allows them to reach\nconsensus despite having one or more unreliable agents in the system. This\npaper investigates the problem of unreliable agents in MARL, considering\nconsensus as a case study. Echoing established results in the distributed\nsystems literature, our experiments show that even a moderate fraction of such\nagents can greatly impact the ability of reaching consensus in a networked\nenvironment. We propose Reinforcement Learning-based Trusted Consensus (RLTC),\na decentralized trust mechanism, in which agents can independently decide which\nneighbors to communicate with. We empirically demonstrate that our trust\nmechanism is able to handle unreliable agents effectively, as evidenced by\nhigher consensus success rates.\n","authors":["Ho Long Fung","Victor-Alexandru Darvariu","Stephen Hailes","Mirco Musolesi"],"pdf_url":"https://arxiv.org/pdf/2205.12880v2.pdf","comment":"Accepted for publication in proceedings of the first Reinforcement\n Learning Conference (RLC 2024)"},{"id":"http://arxiv.org/abs/2405.14852v2","updated":"2024-05-30T15:01:49Z","published":"2024-05-23T17:57:04Z","title":"PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM\n Compression","summary":" There has been significant interest in \"extreme\" compression of large\nlanguage models (LLMs), i.e., to 1-2 bits per parameter, which allows such\nmodels to be executed efficiently on resource-constrained devices. Existing\nwork focused on improved one-shot quantization techniques and weight\nrepresentations; yet, purely post-training approaches are reaching diminishing\nreturns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art\nquantization methods such as QuIP# and AQLM include fine-tuning (part of) the\ncompressed parameters over a limited amount of calibration data; however, such\nfine-tuning techniques over compressed weights often make exclusive use of\nstraight-through estimators (STE), whose performance is not well-understood in\nthis setting. In this work, we question the use of STE for extreme LLM\ncompression, showing that it can be sub-optimal, and perform a systematic study\nof quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning - a\nrepresentation-agnostic framework that generalizes and improves upon existing\nfine-tuning strategies, and provides convergence guarantees in restricted\ncases. On the practical side, when used for 1-2 bit vector quantization,\nPV-Tuning outperforms prior techniques for highly-performant models such as\nLlama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal\nquantization for Llama 2 family models at 2 bits per parameter.\n","authors":["Vladimir Malinovskii","Denis Mazur","Ivan Ilin","Denis Kuznedelev","Konstantin Burlachenko","Kai Yi","Dan Alistarh","Peter Richtarik"],"pdf_url":"https://arxiv.org/pdf/2405.14852v2.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2405.20124v1","updated":"2024-05-30T15:01:18Z","published":"2024-05-30T15:01:18Z","title":"A Geometric Unification of Distributionally Robust Covariance\n Estimators: Shrinking the Spectrum by Inflating the Ambiguity Set","summary":" The state-of-the-art methods for estimating high-dimensional covariance\nmatrices all shrink the eigenvalues of the sample covariance matrix towards a\ndata-insensitive shrinkage target. The underlying shrinkage transformation is\neither chosen heuristically - without compelling theoretical justification - or\noptimally in view of restrictive distributional assumptions. In this paper, we\npropose a principled approach to construct covariance estimators without\nimposing restrictive assumptions. That is, we study distributionally robust\ncovariance estimation problems that minimize the worst-case Frobenius error\nwith respect to all data distributions close to a nominal distribution, where\nthe proximity of distributions is measured via a divergence on the space of\ncovariance matrices. We identify mild conditions on this divergence under which\nthe resulting minimizers represent shrinkage estimators. We show that the\ncorresponding shrinkage transformations are intimately related to the\ngeometrical properties of the underlying divergence. We also prove that our\nrobust estimators are efficiently computable and asymptotically consistent and\nthat they enjoy finite-sample performance guarantees. We exemplify our general\nmethodology by synthesizing explicit estimators induced by the\nKullback-Leibler, Fisher-Rao, and Wasserstein divergences. Numerical\nexperiments based on synthetic and real data show that our robust estimators\nare competitive with state-of-the-art estimators.\n","authors":["Man-Chung Yue","Yves Rychener","Daniel Kuhn","Viet Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2405.20124v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20114v1","updated":"2024-05-30T14:51:57Z","published":"2024-05-30T14:51:57Z","title":"Near Optimal Decentralized Optimization with Compression and Momentum\n Tracking","summary":" Communication efficiency has garnered significant attention as it is\nconsidered the main bottleneck for large-scale decentralized Machine Learning\napplications in distributed and federated settings. In this regime, clients are\nrestricted to transmitting small amounts of quantized information to their\nneighbors over a communication graph. Numerous endeavors have been made to\naddress this challenging problem by developing algorithms with compressed\ncommunication for decentralized non-convex optimization problems. Despite\nconsiderable efforts, the current results suffer from various issues such as\nnon-scalability with the number of clients, requirements for large batches, or\nbounded gradient assumption. In this paper, we introduce MoTEF, a novel\napproach that integrates communication compression with Momentum Tracking and\nError Feedback. Our analysis demonstrates that MoTEF achieves most of the\ndesired properties, and significantly outperforms existing methods under\narbitrary data heterogeneity. We provide numerical experiments to validate our\ntheoretical findings and confirm the practical superiority of MoTEF.\n","authors":["Rustem Islamov","Yuan Gao","Sebastian U. Stich"],"pdf_url":"https://arxiv.org/pdf/2405.20114v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.01567v2","updated":"2024-05-30T14:49:45Z","published":"2024-02-02T17:00:17Z","title":"Understanding Adam Optimizer via Online Learning of Updates: Adam is\n FTRL in Disguise","summary":" Despite the success of the Adam optimizer in practice, the theoretical\nunderstanding of its algorithmic components still remains limited. In\nparticular, most existing analyses of Adam show the convergence rate that can\nbe simply achieved by non-adative algorithms like SGD. In this work, we provide\na different perspective based on online learning that underscores the\nimportance of Adam's algorithmic components. Inspired by Cutkosky et al.\n(2023), we consider the framework called online learning of updates/increments,\nwhere we choose the updates/increments of an optimizer based on an online\nlearner. With this framework, the design of a good optimizer is reduced to the\ndesign of a good online learner. Our main observation is that Adam corresponds\nto a principled online learning framework called Follow-the-Regularized-Leader\n(FTRL). Building on this observation, we study the benefits of its algorithmic\ncomponents from the online learning perspective.\n","authors":["Kwangjun Ahn","Zhiyu Zhang","Yunbum Kook","Yan Dai"],"pdf_url":"https://arxiv.org/pdf/2402.01567v2.pdf","comment":"Accepted at ICML 2024"},{"id":"http://arxiv.org/abs/2310.12942v5","updated":"2024-05-30T14:49:25Z","published":"2023-10-19T17:39:47Z","title":"On the Representational Capacity of Recurrent Neural Language Models","summary":" This work investigates the computational expressivity of language models\n(LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992)\nfamously showed that RNNs with rational weights and hidden states and unbounded\ncomputation time are Turing complete. However, LMs define weightings over\nstrings in addition to just (unweighted) language membership and the analysis\nof the computational power of RNN LMs (RLMs) should reflect this. We extend the\nTuring completeness result to the probabilistic case, showing how a rationally\nweighted RLM with unbounded computation time can simulate any deterministic\nprobabilistic Turing machine (PTM) with rationally weighted transitions. Since,\nin practice, RLMs work in real-time, processing a symbol at every time step, we\ntreat the above result as an upper bound on the expressivity of RLMs. We also\nprovide a lower bound by showing that under the restriction to real-time\ncomputation, such models can simulate deterministic real-time rational PTMs.\n","authors":["Franz Nowak","Anej Svete","Li Du","Ryan Cotterell"],"pdf_url":"https://arxiv.org/pdf/2310.12942v5.pdf","comment":"Added requirement for non-negative probabilities to definitions 2.3\n and 3.1, fixed typos"},{"id":"http://arxiv.org/abs/2312.07252v2","updated":"2024-05-30T14:48:06Z","published":"2023-12-12T13:28:53Z","title":"Identifying Drivers of Predictive Aleatoric Uncertainty","summary":" Explainability and uncertainty quantification are two pillars of trustable\nartificial intelligence. However, the reasoning behind uncertainty estimates is\ngenerally left unexplained. Identifying the drivers of uncertainty complements\nexplanations of point predictions in recognizing model limitations and enhances\ntrust in decisions and their communication. So far, explanations of\nuncertainties have been rarely studied. The few exceptions rely on Bayesian\nneural networks or technically intricate approaches, such as auxiliary\ngenerative models, thereby hindering their broad adoption. We present a simple\napproach to explain predictive aleatoric uncertainties. We estimate uncertainty\nas predictive variance by adapting a neural network with a Gaussian output\ndistribution. Subsequently, we apply out-of-the-box explainers to the model's\nvariance output. This approach can explain uncertainty influences more reliably\nthan literature baselines, which we evaluate in a synthetic setting with a\nknown data-generating process. We further adapt multiple metrics from\nconventional XAI research to uncertainty explanations. We quantify our findings\nwith a nuanced benchmark analysis that includes real-world datasets. Finally,\nwe apply our approach to an age regression model and discover reasonable\nsources of uncertainty. Overall, we explain uncertainty estimates with little\nmodifications to the model architecture and demonstrate that our approach\ncompetes effectively with more intricate methods.\n","authors":["Pascal Iversen","Simon Witzke","Katharina Baum","Bernhard Y. Renard"],"pdf_url":"https://arxiv.org/pdf/2312.07252v2.pdf","comment":"Simon Witzke and Pascal Iversen contributed equally"},{"id":"http://arxiv.org/abs/2405.20094v1","updated":"2024-05-30T14:32:06Z","published":"2024-05-30T14:32:06Z","title":"Low-dimensional approximations of the conditional law of Volterra\n processes: a non-positive curvature approach","summary":" Predicting the conditional evolution of Volterra processes with stochastic\nvolatility is a crucial challenge in mathematical finance. While deep neural\nnetwork models offer promise in approximating the conditional law of such\nprocesses, their effectiveness is hindered by the curse of dimensionality\ncaused by the infinite dimensionality and non-smooth nature of these problems.\nTo address this, we propose a two-step solution. Firstly, we develop a stable\ndimension reduction technique, projecting the law of a reasonably broad class\nof Volterra process onto a low-dimensional statistical manifold of non-positive\nsectional curvature. Next, we introduce a sequentially deep learning model\ntailored to the manifold's geometry, which we show can approximate the\nprojected conditional law of the Volterra process. Our model leverages an\nauxiliary hypernetwork to dynamically update its internal parameters, allowing\nit to encode non-stationary dynamics of the Volterra process, and it can be\ninterpreted as a gating mechanism in a mixture of expert models where each\nexpert is specialized at a specific point in time. Our hypernetwork further\nallows us to achieve approximation rates that would seemingly only be possible\nwith very large networks.\n","authors":["Reza Arabpour","John Armstrong","Luca Galimberti","Anastasis Kratsios","Giulia Livieri"],"pdf_url":"https://arxiv.org/pdf/2405.20094v1.pdf","comment":"Main body: 25 Pages, Appendices 29 Pages, 14 Tables, 6 Figures"},{"id":"http://arxiv.org/abs/2405.20091v1","updated":"2024-05-30T14:27:40Z","published":"2024-05-30T14:27:40Z","title":"Visual Attention Analysis in Online Learning","summary":" In this paper, we present an approach in the Multimodal Learning Analytics\nfield. Within this approach, we have developed a tool to visualize and analyze\neye movement data collected during learning sessions in online courses. The\ntool is named VAAD (an acronym for Visual Attention Analysis Dashboard). These\neye movement data have been gathered using an eye-tracker and subsequently\nprocessed and visualized for interpretation. The purpose of the tool is to\nconduct a descriptive analysis of the data by facilitating its visualization,\nenabling the identification of differences and learning patterns among various\nlearner populations. Additionally, it integrates a predictive module capable of\nanticipating learner activities during a learning session. Consequently, VAAD\nholds the potential to offer valuable insights into online learning behaviors\nfrom both descriptive and predictive perspectives.\n","authors":["Navarro Miriam","Becerra Álvaro","Daza Roberto","Cobos Ruth","Morales Aythami","Fierrez Julian"],"pdf_url":"https://arxiv.org/pdf/2405.20091v1.pdf","comment":"Accepted in CEDI 2024 (VII Congreso Espa\\~nol de Inform\\'atica), A\n Coru\\~na, Spain"},{"id":"http://arxiv.org/abs/2405.20086v1","updated":"2024-05-30T14:16:32Z","published":"2024-05-30T14:16:32Z","title":"Analysis of a multi-target linear shrinkage covariance estimator","summary":" Multi-target linear shrinkage is an extension of the standard single-target\nlinear shrinkage for covariance estimation. We combine several constant\nmatrices - the targets - with the sample covariance matrix. We derive the\noracle and a \\textit{bona fide} multi-target linear shrinkage estimator with\nexact and empirical mean. In both settings, we proved its convergence towards\nthe oracle under Kolmogorov asymptotics. Finally, we show empirically that it\noutperforms other standard estimators in various situations.\n","authors":["Benoit Oriol"],"pdf_url":"https://arxiv.org/pdf/2405.20086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20085v1","updated":"2024-05-30T14:16:19Z","published":"2024-05-30T14:16:19Z","title":"Soft Partitioning of Latent Space for Semantic Channel Equalization","summary":" Semantic channel equalization has emerged as a solution to address language\nmismatch in multi-user semantic communications. This approach aims to align the\nlatent spaces of an encoder and a decoder which were not jointly trained and it\nrelies on a partition of the semantic (latent) space into atoms based on the\nthe semantic meaning. In this work we explore the role of the semantic space\npartition in scenarios where the task structure involves a one-to-many mapping\nbetween the semantic space and the action space. In such scenarios,\npartitioning based on hard inference results results in loss of information\nwhich degrades the equalization performance. We propose a soft criterion to\nderive the atoms of the partition which leverages the soft decoder's output and\noffers a more comprehensive understanding of the semantic space's structure.\nThrough empirical validation, we demonstrate that soft partitioning yields a\nmore descriptive and regular partition of the space, consequently enhancing the\nperformance of the equalization algorithm.\n","authors":["Tomás Huttebraucker","Mohamed Sana","Emilio Calvanese Strinati"],"pdf_url":"https://arxiv.org/pdf/2405.20085v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15174v3","updated":"2024-05-30T14:15:22Z","published":"2023-05-24T14:06:02Z","title":"Simultaneous identification of models and parameters of scientific\n simulators","summary":" Many scientific models are composed of multiple discrete components, and\nscientists often make heuristic decisions about which components to include.\nBayesian inference provides a mathematical framework for systematically\nselecting model components, but defining prior distributions over model\ncomponents and developing associated inference schemes has been challenging. We\napproach this problem in a simulation-based inference framework: We define\nmodel priors over candidate components and, from model simulations, train\nneural networks to infer joint probability distributions over both model\ncomponents and associated parameters. Our method, simulation-based model\ninference (SBMI), represents distributions over model components as a\nconditional mixture of multivariate binary distributions in the Grassmann\nformalism. SBMI can be applied to any compositional stochastic simulator\nwithout requiring likelihood evaluations. We evaluate SBMI on a simple time\nseries model and on two scientific models from neuroscience, and show that it\ncan discover multiple data-consistent model configurations, and that it reveals\nnon-identifiable model components and parameters. SBMI provides a powerful tool\nfor data-driven scientific inquiry which will allow scientists to identify\nessential model components and make uncertainty-informed modelling decisions.\n","authors":["Cornelius Schröder","Jakob H. Macke"],"pdf_url":"https://arxiv.org/pdf/2305.15174v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.05680v2","updated":"2024-05-30T14:12:54Z","published":"2024-02-08T13:58:16Z","title":"Interpretable classifiers for tabular data via discretization and\n feature selection","summary":" We introduce a method for computing immediately human interpretable yet\naccurate classifiers from tabular data. The classifiers obtained are short\nBoolean formulas, computed via first discretizing the original data and then\nusing feature selection coupled with a very fast algorithm for producing the\nbest possible Boolean classifier for the setting. We demonstrate the approach\nvia 13 experiments, obtaining results with accuracies comparable to ones\nobtained via random forests, XGBoost, and existing results for the same\ndatasets in the literature. In most cases, the accuracy of our method is in\nfact similar to that of the reference methods, even though the main objective\nof our study is the immediate interpretability of our classifiers. We also\nprove a new result on the probability that the classifier we obtain from\nreal-life data corresponds to the ideally best classifier with respect to the\nbackground distribution the data comes from.\n","authors":["Reijo Jaakkola","Tomi Janhunen","Antti Kuusisto","Masood Feyzbakhsh Rankooh","Miikka Vilander"],"pdf_url":"https://arxiv.org/pdf/2402.05680v2.pdf","comment":"Changes in relation to version 1: more thorough and detailed\n experiments, general corrections and refinements"},{"id":"http://arxiv.org/abs/2405.20082v1","updated":"2024-05-30T14:11:29Z","published":"2024-05-30T14:11:29Z","title":"Segment, Shuffle, and Stitch: A Simple Mechanism for Improving\n Time-Series Representations","summary":" Existing approaches for learning representations of time-series keep the\ntemporal arrangement of the time-steps intact with the presumption that the\noriginal order is the most optimal for learning. However, non-adjacent sections\nof real-world time-series may have strong dependencies. Accordingly we raise\nthe question: Is there an alternative arrangement for time-series which could\nenable more effective representation learning? To address this, we propose a\nsimple plug-and-play mechanism called Segment, Shuffle, and Stitch (S3)\ndesigned to improve time-series representation learning of existing models. S3\nworks by creating non-overlapping segments from the original sequence and\nshuffling them in a learned manner that is the most optimal for the task at\nhand. It then re-attaches the shuffled segments back together and performs a\nlearned weighted sum with the original input to capture both the newly shuffled\nsequence along with the original sequence. S3 is modular and can be stacked to\ncreate various degrees of granularity, and can be added to many forms of neural\narchitectures including CNNs or Transformers with negligible computation\noverhead. Through extensive experiments on several datasets and\nstate-of-the-art baselines, we show that incorporating S3 results in\nsignificant improvements for the tasks of time-series classification and\nforecasting, improving performance on certain datasets by up to 68\\%. We also\nshow that S3 makes the learning more stable with a smoother training loss curve\nand loss landscape compared to the original baseline. The code is available at\nhttps://github.com/shivam-grover/S3-TimeSeries .\n","authors":["Shivam Grover","Amin Jalali","Ali Etemad"],"pdf_url":"https://arxiv.org/pdf/2405.20082v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20079v1","updated":"2024-05-30T14:09:43Z","published":"2024-05-30T14:09:43Z","title":"Student Answer Forecasting: Transformer-Driven Answer Choice Prediction\n for Language Learning","summary":" Intelligent Tutoring Systems (ITS) enhance personalized learning by\npredicting student answers to provide immediate and customized instruction.\nHowever, recent research has primarily focused on the correctness of the answer\nrather than the student's performance on specific answer choices, limiting\ninsights into students' thought processes and potential misconceptions. To\naddress this gap, we present MCQStudentBert, an answer forecasting model that\nleverages the capabilities of Large Language Models (LLMs) to integrate\ncontextual understanding of students' answering history along with the text of\nthe questions and answers. By predicting the specific answer choices students\nare likely to make, practitioners can easily extend the model to new answer\nchoices or remove answer choices for the same multiple-choice question (MCQ)\nwithout retraining the model. In particular, we compare MLP, LSTM, BERT, and\nMistral 7B architectures to generate embeddings from students' past\ninteractions, which are then incorporated into a finetuned BERT's\nanswer-forecasting mechanism. We apply our pipeline to a dataset of language\nlearning MCQ, gathered from an ITS with over 10,000 students to explore the\npredictive accuracy of MCQStudentBert, which incorporates student interaction\npatterns, in comparison to correct answer prediction and traditional\nmastery-learning feature-based approaches. This work opens the door to more\npersonalized content, modularization, and granular support.\n","authors":["Elena Grazia Gado","Tommaso Martorella","Luca Zunino","Paola Mejia-Domenzain","Vinitra Swamy","Jibril Frej","Tanja Käser"],"pdf_url":"https://arxiv.org/pdf/2405.20079v1.pdf","comment":"Accepted as a poster paper at EDM 2024: 17th International Conference\n on Educational Data Mining in Atlanta, USA"},{"id":"http://arxiv.org/abs/2402.03271v2","updated":"2024-05-30T14:03:35Z","published":"2024-02-05T18:28:44Z","title":"Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information\n Seeking in Large Language Models","summary":" In the face of uncertainty, the ability to *seek information* is of\nfundamental importance. In many practical applications, such as medical\ndiagnosis and troubleshooting, the information needed to solve the task is not\ninitially given and has to be actively sought by asking follow-up questions\n(for example, a doctor asking a patient for more details about their symptoms).\nIn this work, we introduce Uncertainty of Thoughts (UoT), an algorithm to\naugment large language models with the ability to actively seek information by\nasking effective questions. UoT combines 1) an *uncertainty-aware simulation\napproach* which enables the model to simulate possible future scenarios and how\nlikely they are to occur, 2) *uncertainty-based rewards* motivated by\ninformation gain which incentivizes the model to seek information, and 3) a\n*reward propagation scheme* to select the optimal question to ask in a way that\nmaximizes the expected reward. In experiments on medical diagnosis,\ntroubleshooting, and the `20 Questions` game, UoT achieves an average\nperformance improvement of 38.1% in the rate of successful task completion\nacross multiple LLMs compared with direct prompting and also improves\nefficiency (i.e., the number of questions needed to complete the task). Our\ncode has been released [here](https://github.com/zhiyuanhubj/UoT)\n","authors":["Zhiyuan Hu","Chumin Liu","Xidong Feng","Yilun Zhao","See-Kiong Ng","Anh Tuan Luu","Junxian He","Pang Wei Koh","Bryan Hooi"],"pdf_url":"https://arxiv.org/pdf/2402.03271v2.pdf","comment":"Update Results"},{"id":"http://arxiv.org/abs/2405.20071v1","updated":"2024-05-30T14:01:02Z","published":"2024-05-30T14:01:02Z","title":"A Staged Approach using Machine Learning and Uncertainty Quantification\n to Predict the Risk of Hip Fracture","summary":" Despite advancements in medical care, hip fractures impose a significant\nburden on individuals and healthcare systems. This paper focuses on the\nprediction of hip fracture risk in older and middle-aged adults, where falls\nand compromised bone quality are predominant factors. We propose a novel staged\nmodel that combines advanced imaging and clinical data to improve predictive\nperformance. By using CNNs to extract features from hip DXA images, along with\nclinical variables, shape measurements, and texture features, our method\nprovides a comprehensive framework for assessing fracture risk. A staged\nmachine learning-based model was developed using two ensemble models: Ensemble\n1 (clinical variables only) and Ensemble 2 (clinical variables and DXA imaging\nfeatures). This staged approach used uncertainty quantification from Ensemble 1\nto decide if DXA features are necessary for further prediction. Ensemble 2\nexhibited the highest performance, achieving an AUC of 0.9541, an accuracy of\n0.9195, a sensitivity of 0.8078, and a specificity of 0.9427. The staged model\nalso performed well, with an AUC of 0.8486, an accuracy of 0.8611, a\nsensitivity of 0.5578, and a specificity of 0.9249, outperforming Ensemble 1,\nwhich had an AUC of 0.5549, an accuracy of 0.7239, a sensitivity of 0.1956, and\na specificity of 0.8343. Furthermore, the staged model suggested that 54.49% of\npatients did not require DXA scanning. It effectively balanced accuracy and\nspecificity, offering a robust solution when DXA data acquisition is not always\nfeasible. Statistical tests confirmed significant differences between the\nmodels, highlighting the advantages of the advanced modeling strategies. Our\nstaged approach could identify individuals at risk with a high accuracy but\nreduce the unnecessary DXA scanning. It has great promise to guide\ninterventions to prevent hip fractures with reduced cost and radiation.\n","authors":["Anjum Shaik","Kristoffer Larsen","Nancy E. Lane","Chen Zhao","Kuan-Jui Su","Joyce H. Keyak","Qing Tian","Qiuying Sha","Hui Shen","Hong-Wen Deng","Weihua Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.20071v1.pdf","comment":"29 pages, 5 figures, 6 tables"},{"id":"http://arxiv.org/abs/2306.08970v2","updated":"2024-05-30T13:46:34Z","published":"2023-06-15T09:05:36Z","title":"An Efficient and Multi-private Key Secure Aggregation for Federated\n Learning","summary":" With the emergence of privacy leaks in federated learning, secure aggregation\nprotocols that mainly adopt either homomorphic encryption or threshold secret\nsharing have been widely developed for federated learning to protect the\nprivacy of the local training data of each client. However, these existing\nprotocols suffer from many shortcomings, such as the dependence on a trusted\nthird party, the vulnerability to clients being corrupted, low efficiency, the\ntrade-off between security and fault tolerance, etc. To solve these\ndisadvantages, we propose an efficient and multi-private key secure aggregation\nscheme for federated learning. Specifically, we skillfully modify the variant\nElGamal encryption technique to achieve homomorphic addition operation, which\nhas two important advantages: 1) The server and each client can freely select\npublic and private keys without introducing a trust third party and 2) Compared\nto the variant ElGamal encryption, the plaintext space is relatively large,\nwhich is more suitable for the deep model. Besides, for the high dimensional\ndeep model parameter, we introduce a super-increasing sequence to compress\nmulti-dimensional data into 1-D, which can greatly reduce encryption and\ndecryption times as well as communication for ciphertext transmission. Detailed\nsecurity analyses show that our proposed scheme achieves the semantic security\nof both individual local gradients and the aggregated result while achieving\noptimal robustness in tolerating both client collusion and dropped clients.\nExtensive simulations demonstrate that the accuracy of our scheme is almost the\nsame as the non-private approach, while the efficiency of our scheme is much\nbetter than the state-of-the-art homomorphic encryption-based secure\naggregation schemes. More importantly, the efficiency advantages of our scheme\nwill become increasingly prominent as the number of model parameters increases.\n","authors":["Xue Yang","Zifeng Liu","Xiaohu Tang","Rongxing Lu","Bo Liu"],"pdf_url":"https://arxiv.org/pdf/2306.08970v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15769v2","updated":"2024-05-30T13:43:44Z","published":"2023-09-27T16:41:10Z","title":"Algebraic and Statistical Properties of the Ordinary Least Squares\n Interpolator","summary":" Deep learning research has uncovered the phenomenon of benign overfitting for\noverparameterized statistical models, which has drawn significant theoretical\ninterest in recent years. Given its simplicity and practicality, the ordinary\nleast squares (OLS) interpolator has become essential to gain foundational\ninsights into this phenomenon. While properties of OLS are well established in\nclassical, underparameterized settings, its behavior in high-dimensional,\noverparameterized regimes is less explored (unlike for ridge or lasso\nregression) though significant progress has been made of late. We contribute to\nthis growing literature by providing fundamental algebraic and statistical\nresults for the minimum $\\ell_2$-norm OLS interpolator. In particular, we\nprovide algebraic equivalents of (i) the leave-$k$-out residual formula, (ii)\nCochran's formula, and (iii) the Frisch-Waugh-Lovell theorem in the\noverparameterized regime. These results aid in understanding the OLS\ninterpolator's ability to generalize and have substantive implications for\ncausal inference. Under the Gauss-Markov model, we present statistical results\nsuch as an extension of the Gauss-Markov theorem and an analysis of variance\nestimation under homoskedastic errors for the overparameterized regime. To\nsubstantiate our theoretical contributions, we conduct simulations that further\nexplore the stochastic properties of the OLS interpolator.\n","authors":["Dennis Shen","Dogyoon Song","Peng Ding","Jasjeet S. Sekhon"],"pdf_url":"https://arxiv.org/pdf/2309.15769v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20053v1","updated":"2024-05-30T13:38:52Z","published":"2024-05-30T13:38:52Z","title":"Would I Lie To You? Inference Time Alignment of Language Models using\n Direct Preference Heads","summary":" Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context\nlearning capabilities; however, their behaviors are often difficult to control.\nBy utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible\nto fine-tune unsupervised LMs to follow instructions and produce outputs that\nreflect human preferences. Despite its benefits, RLHF has been shown to\npotentially harm a language model's reasoning capabilities and introduce\nartifacts such as hallucinations where the model may fabricate facts. To\naddress this issue we introduce Direct Preference Heads (DPH), a fine-tuning\nframework that enables LMs to learn human preference signals through an\nauxiliary reward head without directly affecting the output distribution of the\nlanguage modeling head. We perform a theoretical analysis of our objective\nfunction and find strong ties to Conservative Direct Preference Optimization\n(cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All\nevaluation suite and demonstrate that our method produces models which achieve\nhigher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct\nPreference Optimization (DPO) alone.\n","authors":["Avelina Asada Hadji-Kyriacou","Ognjen Arandjelovic"],"pdf_url":"https://arxiv.org/pdf/2405.20053v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20052v1","updated":"2024-05-30T13:38:28Z","published":"2024-05-30T13:38:28Z","title":"A Hardware-Efficient EMG Decoder with an Attractor-based Neural Network\n for Next-Generation Hand Prostheses","summary":" Advancements in neural engineering have enabled the development of Robotic\nProsthetic Hands (RPHs) aimed at restoring hand functionality. Current\ncommercial RPHs offer limited control through basic on/off commands. Recent\nprogresses in machine learning enable finger movement decoding with higher\ndegrees of freedom, yet the high computational complexity of such models limits\ntheir application in portable devices. Future RPH designs must balance\nportability, low power consumption, and high decoding accuracy to be practical\nfor individuals with disabilities. To this end, we introduce a novel\nattractor-based neural network to realize on-chip movement decoding for\nnext-generation portable RPHs. The proposed architecture comprises an encoder,\nan attention layer, an attractor network, and a refinement regressor. We tested\nour model on four healthy subjects and achieved a decoding accuracy of\n80.6\\pm3.3\\%. Our proposed model is over 120 and 50 times more compact compared\nto state-of-the-art LSTM and CNN models, respectively, with comparable (or\nsuperior) decoding accuracy. Therefore, it exhibits minimal hardware complexity\nand can be effectively integrated as a System-on-Chip.\n","authors":["Mohammad Kalbasi","MohammadAli Shaeri","Vincent Alexandre Mendez","Solaiman Shokur","Silvestro Micera","Mahsa Shoaran"],"pdf_url":"https://arxiv.org/pdf/2405.20052v1.pdf","comment":"\\c{opyright} 2024 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2405.20051v1","updated":"2024-05-30T13:37:53Z","published":"2024-05-30T13:37:53Z","title":"Threshold-Independent Fair Matching through Score Calibration","summary":" Entity Matching (EM) is a critical task in numerous fields, such as\nhealthcare, finance, and public administration, as it identifies records that\nrefer to the same entity within or across different databases. EM faces\nconsiderable challenges, particularly with false positives and negatives. These\nare typically addressed by generating matching scores and apply thresholds to\nbalance false positives and negatives in various contexts. However, adjusting\nthese thresholds can affect the fairness of the outcomes, a critical factor\nthat remains largely overlooked in current fair EM research. The existing body\nof research on fair EM tends to concentrate on static thresholds, neglecting\ntheir critical impact on fairness. To address this, we introduce a new approach\nin EM using recent metrics for evaluating biases in score based binary\nclassification, particularly through the lens of distributional parity. This\napproach enables the application of various bias metrics like equalized odds,\nequal opportunity, and demographic parity without depending on threshold\nsettings. Our experiments with leading matching methods reveal potential\nbiases, and by applying a calibration technique for EM scores using Wasserstein\nbarycenters, we not only mitigate these biases but also preserve accuracy\nacross real world datasets. This paper contributes to the field of fairness in\ndata cleaning, especially within EM, which is a central task in data cleaning,\nby promoting a method for generating matching scores that reduce biases across\ndifferent thresholds.\n","authors":["Mohammad Hossein Moslemi","Mostafa Milani"],"pdf_url":"https://arxiv.org/pdf/2405.20051v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20045v1","updated":"2024-05-30T13:27:17Z","published":"2024-05-30T13:27:17Z","title":"Iterative Learning Control of Fast, Nonlinear, Oscillatory Dynamics\n (Preprint)","summary":" The sudden onset of deleterious and oscillatory dynamics (often called\ninstabilities) is a known challenge in many fluid, plasma, and aerospace\nsystems. These dynamics are difficult to address because they are nonlinear,\nchaotic, and are often too fast for active control schemes. In this work, we\ndevelop an alternative active controls system using an iterative,\ntrajectory-optimization and parameter-tuning approach based on Iterative\nLearning Control (ILC), Time-Lagged Phase Portraits (TLPP) and Gaussian Process\nRegression (GPR). The novelty of this approach is that it can control a\nsystem's dynamics despite the controller being much slower than the dynamics.\nWe demonstrate this controller on the Lorenz system of equations where it\niteratively adjusts (tunes) the system's input parameters to successfully\nreproduce a desired oscillatory trajectory or state. Additionally, we\ninvestigate the system's dynamical sensitivity to its control parameters,\nidentify continuous and bounded regions of desired dynamical trajectories, and\ndemonstrate that the controller is robust to missing information and\nuncontrollable parameters as long as certain requirements are met. The\ncontroller presented in this work provides a framework for low-speed control\nfor a variety of fast, nonlinear systems that may aid in instability\nsuppression and mitigation.\n","authors":["John W. Brooks","Christine M. Greve"],"pdf_url":"https://arxiv.org/pdf/2405.20045v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20042v1","updated":"2024-05-30T13:23:02Z","published":"2024-05-30T13:23:02Z","title":"CycleFormer : TSP Solver Based on Language Modeling","summary":" We propose a new transformer model for the Traveling Salesman Problem (TSP)\ncalled CycleFormer. We identified distinctive characteristics that need to be\nconsidered when applying a conventional transformer model to TSP and aimed to\nfully incorporate these elements into the TSP-specific transformer. Unlike the\ntoken sets in typical language models, which are limited and static, the token\n(node) set in TSP is unlimited and dynamic. To exploit this fact to the\nfullest, we equated the encoder output with the decoder linear layer and\ndirectly connected the context vector of the encoder to the decoder encoding.\nAdditionally, we added a positional encoding to the encoder tokens that\nreflects the two-dimensional nature of TSP, and devised a circular positional\nencoding for the decoder tokens that considers the cyclic properties of a tour.\nBy incorporating these ideas, CycleFormer outperforms state-of-the-art (SOTA)\ntransformer models for TSP from TSP-50 to TSP-500. Notably, on TSP-500, the\noptimality gap was reduced by approximately 2.8 times, from 3.09% to 1.10%,\ncompared to the existing SOTA. The code will be made available at\nhttps://github.com/Giventicket/CycleFormer.\n","authors":["Jieun Yook","Junpyo Seo","Joon Huh","Han Joon Byun","Byung-ro Mooon"],"pdf_url":"https://arxiv.org/pdf/2405.20042v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20039v1","updated":"2024-05-30T13:19:49Z","published":"2024-05-30T13:19:49Z","title":"Task-Agnostic Machine Learning-Assisted Inference","summary":" Machine learning (ML) is playing an increasingly important role in scientific\nresearch. In conjunction with classical statistical approaches, ML-assisted\nanalytical strategies have shown great promise in accelerating research\nfindings. This has also opened up a whole new field of methodological research\nfocusing on integrative approaches that leverage both ML and statistics to\ntackle data science challenges. One type of study that has quickly gained\npopularity employs ML to predict unobserved outcomes in massive samples and\nthen uses the predicted outcomes in downstream statistical inference. However,\nexisting methods designed to ensure the validity of this type of\npost-prediction inference are limited to very basic tasks such as linear\nregression analysis. This is because any extension of these approaches to new,\nmore sophisticated statistical tasks requires task-specific algebraic\nderivations and software implementations, which ignores the massive library of\nexisting software tools already developed for complex inference tasks and\nseverely constrains the scope of post-prediction inference in real\napplications. To address this challenge, we propose a novel statistical\nframework for task-agnostic ML-assisted inference. It provides a\npost-prediction inference solution that can be easily plugged into almost any\nestablished data analysis routine. It delivers valid and efficient inference\nthat is robust to arbitrary choices of ML models, while allowing nearly all\nexisting analytical frameworks to be incorporated into the analysis of\nML-predicted outcomes. Through extensive experiments, we showcase the validity,\nversatility, and superiority of our method compared to existing approaches.\n","authors":["Jiacheng Miao","Qiongshi Lu"],"pdf_url":"https://arxiv.org/pdf/2405.20039v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20029v1","updated":"2024-05-30T13:13:48Z","published":"2024-05-30T13:13:48Z","title":"A Random Forest-based Prediction Model for Turning Points in\n Antagonistic event-group Competitions","summary":" At present, most of the prediction studies related to antagonistic\nevent-group competitions focus on the prediction of competition results, and\nless on the prediction of the competition process, which can not provide\nreal-time feedback of the athletes' state information in the actual\ncompetition, and thus can not analyze the changes of the competition situation.\nIn order to solve this problem, this paper proposes a prediction model based on\nRandom Forest for the turning point of the antagonistic event-group. Firstly,\nthe quantitative equation of competitive potential energy is proposed;\nSecondly, the quantitative value of competitive potential energy is obtained by\nusing the dynamic combination of weights method, and the turning point of the\ncompetition situation of the antagonistic event-group is marked according to\nthe quantitative time series graph; Finally, the random forest prediction model\nbased on the optimisation of the KM-SMOTE algorithm and the grid search method\nis established. The experimental analysis shows that: the quantitative equation\nof competitive potential energy can effectively reflect the dynamic situation\nof the competition; The model can effectively predict the turning point of the\ncompetition situation of the antagonistic event-group, and the recall rate of\nthe model in the test set is 86.13%; the model has certain significance for the\nfuture study of the competition situation of the antagonistic event-group.\n","authors":["Zishuo Zhu"],"pdf_url":"https://arxiv.org/pdf/2405.20029v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20028v1","updated":"2024-05-30T13:13:12Z","published":"2024-05-30T13:13:12Z","title":"A Simple and Adaptive Learning Rate for FTRL in Online Learning with\n Minimax Regret of $Θ(T^{2/3})$ and its Application to\n Best-of-Both-Worlds","summary":" Follow-the-Regularized-Leader (FTRL) is a powerful framework for various\nonline learning problems. By designing its regularizer and learning rate to be\nadaptive to past observations, FTRL is known to work adaptively to various\nproperties of an underlying environment. However, most existing adaptive\nlearning rates are for online learning problems with a minimax regret of\n$\\Theta(\\sqrt{T})$ for the number of rounds $T$, and there are only a few\nstudies on adaptive learning rates for problems with a minimax regret of\n$\\Theta(T^{2/3})$, which include several important problems dealing with\nindirect feedback. To address this limitation, we establish a new adaptive\nlearning rate framework for problems with a minimax regret of\n$\\Theta(T^{2/3})$. Our learning rate is designed by matching the stability,\npenalty, and bias terms that naturally appear in regret upper bounds for\nproblems with a minimax regret of $\\Theta(T^{2/3})$. As applications of this\nframework, we consider two major problems dealing with indirect feedback:\npartial monitoring and graph bandits. We show that FTRL with our learning rate\nand the Tsallis entropy regularizer improves existing Best-of-Both-Worlds\n(BOBW) regret upper bounds, which achieve simultaneous optimality in the\nstochastic and adversarial regimes. The resulting learning rate is surprisingly\nsimple compared to the existing learning rates for BOBW algorithms for problems\nwith a minimax regret of $\\Theta(T^{2/3})$.\n","authors":["Taira Tsuchiya","Shinji Ito"],"pdf_url":"https://arxiv.org/pdf/2405.20028v1.pdf","comment":"31 pages"},{"id":"http://arxiv.org/abs/2401.14535v2","updated":"2024-05-30T13:09:47Z","published":"2024-01-25T22:01:07Z","title":"CaRiNG: Learning Temporal Causal Representation under Non-Invertible\n Generation Process","summary":" Identifying the underlying time-delayed latent causal processes in sequential\ndata is vital for grasping temporal dynamics and making downstream reasoning.\nWhile some recent methods can robustly identify these latent causal variables,\nthey rely on strict assumptions about the invertible generation process from\nlatent variables to observed data. However, these assumptions are often hard to\nsatisfy in real-world applications containing information loss. For instance,\nthe visual perception process translates a 3D space into 2D images, or the\nphenomenon of persistence of vision incorporates historical data into current\nperceptions. To address this challenge, we establish an identifiability theory\nthat allows for the recovery of independent latent components even when they\ncome from a nonlinear and non-invertible mix. Using this theory as a\nfoundation, we propose a principled approach, CaRiNG, to learn the CAusal\nRepresentatIon of Non-invertible Generative temporal data with identifiability\nguarantees. Specifically, we utilize temporal context to recover lost latent\ninformation and apply the conditions in our theory to guide the training\nprocess. Through experiments conducted on synthetic datasets, we validate that\nour CaRiNG method reliably identifies the causal process, even when the\ngeneration process is non-invertible. Moreover, we demonstrate that our\napproach considerably improves temporal understanding and reasoning in\npractical applications.\n","authors":["Guangyi Chen","Yifan Shen","Zhenhao Chen","Xiangchen Song","Yuewen Sun","Weiran Yao","Xiao Liu","Kun Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.14535v2.pdf","comment":"To appear at ICML 2024, 24 pages"},{"id":"http://arxiv.org/abs/2402.07865v2","updated":"2024-05-30T13:08:48Z","published":"2024-02-12T18:21:14Z","title":"Prismatic VLMs: Investigating the Design Space of Visually-Conditioned\n Language Models","summary":" Visually-conditioned language models (VLMs) have seen growing adoption in\napplications such as visual dialogue, scene understanding, and robotic task\nplanning; adoption that has fueled a wealth of new models such as LLaVa,\nInstructBLIP, and PaLI-3. Despite the volume of new releases, key design\ndecisions around image preprocessing, architecture, and optimization are\nunder-explored, making it challenging to understand what factors account for\nmodel performance $-$ a challenge further complicated by the lack of objective,\nconsistent evaluations. To address these gaps, we first compile a suite of\nstandardized evaluations spanning visual question answering, object\nlocalization, and challenge sets that probe properties such as hallucination;\nevaluations that provide fine-grained insight VLM capabilities. Second, we\nrigorously investigate VLMs along key design axes, including pretrained visual\nrepresentations and training from base vs. instruct-tuned language models,\namongst others. We couple our analysis with three resource contributions: (1) a\nunified framework for evaluating VLMs, (2) optimized, flexible training code,\nand (3) checkpoints for all models, including a family of VLMs at the 7-13B\nscale that strictly outperform InstructBLIP and LLaVa v1.5, the\nstate-of-the-art in open VLMs.\n","authors":["Siddharth Karamcheti","Suraj Nair","Ashwin Balakrishna","Percy Liang","Thomas Kollar","Dorsa Sadigh"],"pdf_url":"https://arxiv.org/pdf/2402.07865v2.pdf","comment":"Published at ICML 2024. 22 pages, 11 figures. Training code and\n models: https://github.com/TRI-ML/prismatic-vlms. Evaluation code:\n https://github.com/TRI-ML/vlm-evaluation"},{"id":"http://arxiv.org/abs/2310.00841v3","updated":"2024-05-30T13:03:32Z","published":"2023-10-02T01:30:42Z","title":"Drug Discovery with Dynamic Goal-aware Fragments","summary":" Fragment-based drug discovery is an effective strategy for discovering drug\ncandidates in the vast chemical space, and has been widely employed in\nmolecular generative models. However, many existing fragment extraction methods\nin such models do not take the target chemical properties into account or rely\non heuristic rules. Additionally, the existing fragment-based generative models\ncannot update the fragment vocabulary with goal-aware fragments newly\ndiscovered during the generation. To this end, we propose a molecular\ngenerative framework for drug discovery, named Goal-aware fragment Extraction,\nAssembly, and Modification (GEAM). GEAM consists of three modules, each\nresponsible for goal-aware fragment extraction, fragment assembly, and fragment\nmodification. The fragment extraction module identifies important fragments\ncontributing to the desired target properties with the information bottleneck\nprinciple, thereby constructing an effective goal-aware fragment vocabulary.\nMoreover, GEAM can explore beyond the initial vocabulary with the fragment\nmodification module, and the exploration is further enhanced through the\ndynamic goal-aware vocabulary update. We experimentally demonstrate that GEAM\neffectively discovers drug candidates through the generative cycle of the three\nmodules in various drug discovery tasks. Our code is available at\nhttps://github.com/SeulLee05/GEAM.\n","authors":["Seul Lee","Seanie Lee","Kenji Kawaguchi","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2310.00841v3.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2405.20018v1","updated":"2024-05-30T12:57:35Z","published":"2024-05-30T12:57:35Z","title":"Safe Multi-agent Reinforcement Learning with Natural Language\n Constraints","summary":" The role of natural language constraints in Safe Multi-agent Reinforcement\nLearning (MARL) is crucial, yet often overlooked. While Safe MARL has vast\npotential, especially in fields like robotics and autonomous vehicles, its full\npotential is limited by the need to define constraints in pre-designed\nmathematical terms, which requires extensive domain expertise and reinforcement\nlearning knowledge, hindering its broader adoption. To address this limitation\nand make Safe MARL more accessible and adaptable, we propose a novel approach\nnamed Safe Multi-agent Reinforcement Learning with Natural Language constraints\n(SMALL). Our method leverages fine-tuned language models to interpret and\nprocess free-form textual constraints, converting them into semantic embeddings\nthat capture the essence of prohibited states and behaviours. These embeddings\nare then integrated into the multi-agent policy learning process, enabling\nagents to learn policies that minimize constraint violations while optimizing\nrewards. To evaluate the effectiveness of SMALL, we introduce the LaMaSafe, a\nmulti-task benchmark designed to assess the performance of multiple agents in\nadhering to natural language constraints. Empirical evaluations across various\nenvironments demonstrate that SMALL achieves comparable rewards and\nsignificantly fewer constraint violations, highlighting its effectiveness in\nunderstanding and enforcing natural language constraints.\n","authors":["Ziyan Wang","Meng Fang","Tristan Tomilin","Fei Fang","Yali Du"],"pdf_url":"https://arxiv.org/pdf/2405.20018v1.pdf","comment":"23 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.20014v1","updated":"2024-05-30T12:49:34Z","published":"2024-05-30T12:49:34Z","title":"subMFL: Compatiple subModel Generation for Federated Learning in Device\n Heterogenous Environment","summary":" Federated Learning (FL) is commonly used in systems with distributed and\nheterogeneous devices with access to varying amounts of data and diverse\ncomputing and storage capacities. FL training process enables such devices to\nupdate the weights of a shared model locally using their local data and then a\ntrusted central server combines all of those models to generate a global model.\nIn this way, a global model is generated while the data remains local to\ndevices to preserve privacy. However, training large models such as Deep Neural\nNetworks (DNNs) on resource-constrained devices can take a prohibitively long\ntime and consume a large amount of energy. In the current process, the\nlow-capacity devices are excluded from the training process, although they\nmight have access to unseen data. To overcome this challenge, we propose a\nmodel compression approach that enables heterogeneous devices with varying\ncomputing capacities to participate in the FL process. In our approach, the\nserver shares a dense model with all devices to train it: Afterwards, the\ntrained model is gradually compressed to obtain submodels with varying levels\nof sparsity to be used as suitable initial global models for\nresource-constrained devices that were not capable of train the first dense\nmodel. This results in an increased participation rate of resource-constrained\ndevices while the transferred weights from the previous round of training are\npreserved. Our validation experiments show that despite reaching about 50 per\ncent global sparsity, generated submodels maintain their accuracy while can be\nshared to increase participation by around 50 per cent.\n","authors":["Zeyneddin Oz","Ceylan Soygul Oz","Abdollah Malekjafarian","Nima Afraz","Fatemeh Golpayegani"],"pdf_url":"https://arxiv.org/pdf/2405.20014v1.pdf","comment":"12 pages, 7 figures, European Conference on Parallel Processing, pp.\n between 52 and 64, Springer, 2023"},{"id":"http://arxiv.org/abs/2405.20012v1","updated":"2024-05-30T12:48:44Z","published":"2024-05-30T12:48:44Z","title":"FlexiDrop: Theoretical Insights and Practical Advances in Random Dropout\n Method on GNNs","summary":" Graph Neural Networks (GNNs) are powerful tools for handling graph-type data.\nRecently, GNNs have been widely applied in various domains, but they also face\nsome issues, such as overfitting, over-smoothing and non-robustness. The\nexisting research indicates that random dropout methods are an effective way to\naddress these issues. However, random dropout methods in GNNs still face\nunresolved problems. Currently, the choice of dropout rate, often determined by\nheuristic or grid search methods, can increase the generalization error,\ncontradicting the principal aims of dropout. In this paper, we propose a novel\nrandom dropout method for GNNs called FlexiDrop. First, we conduct a\ntheoretical analysis of dropout in GNNs using rademacher complexity and\ndemonstrate that the generalization error of traditional random dropout methods\nis constrained by a function related to the dropout rate. Subsequently, we use\nthis function as a regularizer to unify the dropout rate and empirical loss\nwithin a single loss function, optimizing them simultaneously. Therefore, our\nmethod enables adaptive adjustment of the dropout rate and theoretically\nbalances the trade-off between model complexity and generalization ability.\nFurthermore, extensive experimental results on benchmark datasets show that\nFlexiDrop outperforms traditional random dropout methods in GNNs.\n","authors":["Zhiheng Zhou","Sihao Liu","Weichen Zhao"],"pdf_url":"https://arxiv.org/pdf/2405.20012v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20003v1","updated":"2024-05-30T12:42:05Z","published":"2024-05-30T12:42:05Z","title":"Kernel Language Entropy: Fine-grained Uncertainty Quantification for\n LLMs from Semantic Similarities","summary":" Uncertainty quantification in Large Language Models (LLMs) is crucial for\napplications where safety and reliability are important. In particular,\nuncertainty can be used to improve the trustworthiness of LLMs by detecting\nfactually incorrect model responses, commonly called hallucinations.\nCritically, one should seek to capture the model's semantic uncertainty, i.e.,\nthe uncertainty over the meanings of LLM outputs, rather than uncertainty over\nlexical or syntactic variations that do not affect answer correctness. To\naddress this problem, we propose Kernel Language Entropy (KLE), a novel method\nfor uncertainty estimation in white- and black-box LLMs. KLE defines positive\nsemidefinite unit trace kernels to encode the semantic similarities of LLM\noutputs and quantifies uncertainty using the von Neumann entropy. It considers\npairwise semantic dependencies between answers (or semantic clusters),\nproviding more fine-grained uncertainty estimates than previous methods based\non hard clustering of answers. We theoretically prove that KLE generalizes the\nprevious state-of-the-art method called semantic entropy and empirically\ndemonstrate that it improves uncertainty quantification performance across\nmultiple natural language generation datasets and LLM architectures.\n","authors":["Alexander Nikitin","Jannik Kossen","Yarin Gal","Pekka Marttinen"],"pdf_url":"https://arxiv.org/pdf/2405.20003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16510v3","updated":"2024-05-30T12:40:06Z","published":"2024-05-26T10:33:17Z","title":"Meta-Task Planning for Language Agents","summary":" The rapid advancement of neural language models has sparked a new surge of\nintelligent agent research. Unlike traditional agents, large language\nmodel-based agents (LLM agents) have emerged as a promising paradigm for\nachieving artificial general intelligence (AGI) due to their superior reasoning\nand generalization capabilities. Effective planning is crucial for the success\nof LLM agents in real-world tasks, making it a highly pursued topic in the\ncommunity. Current planning methods typically translate tasks into executable\naction sequences. However, determining a feasible or optimal sequence for\ncomplex tasks at fine granularity, which often requires compositing long chains\nof heterogeneous actions, remains challenging. This paper introduces Meta-Task\nPlanning (MTP), a zero-shot methodology for collaborative LLM-based multi-agent\nsystems that simplifies complex task planning by decomposing it into a\nhierarchy of subordinate tasks, or meta-tasks. Each meta-task is then mapped\ninto executable actions. MTP was assessed on two rigorous benchmarks,\nTravelPlanner and API-Bank. Notably, MTP achieved an average $\\sim40\\%$ success\nrate on TravelPlanner, significantly higher than the state-of-the-art (SOTA)\nbaseline ($2.92\\%$), and outperforming $LLM_{api}$-4 with ReAct on API-Bank by\n$\\sim14\\%$, showing the immense potential of integrating LLM with multi-agent\nsystems.\n","authors":["Cong Zhang","Derrick Goh Xin Deik","Dexun Li","Hao Zhang","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2405.16510v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19995v1","updated":"2024-05-30T12:32:18Z","published":"2024-05-30T12:32:18Z","title":"Symmetries in Overparametrized Neural Networks: A Mean-Field View","summary":" We develop a Mean-Field (MF) view of the learning dynamics of\noverparametrized Artificial Neural Networks (NN) under data symmetric in law\nwrt the action of a general compact group $G$. We consider for this a class of\ngeneralized shallow NNs given by an ensemble of $N$ multi-layer units, jointly\ntrained using stochastic gradient descent (SGD) and possibly\nsymmetry-leveraging (SL) techniques, such as Data Augmentation (DA), Feature\nAveraging (FA) or Equivariant Architectures (EA). We introduce the notions of\nweakly and strongly invariant laws (WI and SI) on the parameter space of each\nsingle unit, corresponding, respectively, to $G$-invariant distributions, and\nto distributions supported on parameters fixed by the group action (which\nencode EA). This allows us to define symmetric models compatible with taking\n$N\\to\\infty$ and give an interpretation of the asymptotic dynamics of DA, FA\nand EA in terms of Wasserstein Gradient Flows describing their MF limits. When\nactivations respect the group action, we show that, for symmetric data, DA, FA\nand freely-trained models obey the exact same MF dynamic, which stays in the\nspace of WI laws and minimizes therein the population risk. We also give a\ncounterexample to the general attainability of an optimum over SI laws. Despite\nthis, quite remarkably, we show that the set of SI laws is also preserved by\nthe MF dynamics even when freely trained. This sharply contrasts the finite-$N$\nsetting, in which EAs are generally not preserved by unconstrained SGD. We\nillustrate the validity of our findings as $N$ gets larger in a teacher-student\nexperimental setting, training a student NN to learn from a WI, SI or arbitrary\nteacher model through various SL schemes. We last deduce a data-driven\nheuristic to discover the largest subspace of parameters supporting SI\ndistributions for a problem, that could be used for designing EA with minimal\ngeneralization error.\n","authors":["Javier Maass Martínez","Joaquin Fontbona"],"pdf_url":"https://arxiv.org/pdf/2405.19995v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19988v1","updated":"2024-05-30T12:18:06Z","published":"2024-05-30T12:18:06Z","title":"Video-Language Critic: Transferable Reward Functions for\n Language-Conditioned Robotics","summary":" Natural language is often the easiest and most convenient modality for humans\nto specify tasks for robots. However, learning to ground language to behavior\ntypically requires impractical amounts of diverse, language-annotated\ndemonstrations collected on each target robot. In this work, we aim to separate\nthe problem of what to accomplish from how to accomplish it, as the former can\nbenefit from substantial amounts of external observation-only data, and only\nthe latter depends on a specific robot embodiment. To this end, we propose\nVideo-Language Critic, a reward model that can be trained on readily available\ncross-embodiment data using contrastive learning and a temporal ranking\nobjective, and use it to score behavior traces from a separate reinforcement\nlearning actor. When trained on Open X-Embodiment data, our reward model\nenables 2x more sample-efficient policy training on Meta-World tasks than a\nsparse reward only, despite a significant domain gap. Using in-domain data but\nin a challenging task generalization setting on Meta-World, we further\ndemonstrate more sample-efficient training than is possible with prior\nlanguage-conditioned reward models that are either trained with binary\nclassification, use static images, or do not leverage the temporal information\npresent in video data.\n","authors":["Minttu Alakuijala","Reginald McLean","Isaac Woungang","Nariman Farsad","Samuel Kaski","Pekka Marttinen","Kai Yuan"],"pdf_url":"https://arxiv.org/pdf/2405.19988v1.pdf","comment":"10 pages in the main text, 16 pages including references and\n supplementary materials. 4 figures and 3 tables in the main text, 1 table in\n supplementary materials"},{"id":"http://arxiv.org/abs/2405.19985v1","updated":"2024-05-30T12:14:25Z","published":"2024-05-30T12:14:25Z","title":"Targeted Sequential Indirect Experiment Design","summary":" Scientific hypotheses typically concern specific aspects of complex,\nimperfectly understood or entirely unknown mechanisms, such as the effect of\ngene expression levels on phenotypes or how microbial communities influence\nenvironmental health. Such queries are inherently causal (rather than purely\nassociational), but in many settings, experiments can not be conducted directly\non the target variables of interest, but are indirect. Therefore, they perturb\nthe target variable, but do not remove potential confounding factors. If,\nadditionally, the resulting experimental measurements are multi-dimensional and\nthe studied mechanisms nonlinear, the query of interest is generally not\nidentified. We develop an adaptive strategy to design indirect experiments that\noptimally inform a targeted query about the ground truth mechanism in terms of\nsequentially narrowing the gap between an upper and lower bound on the query.\nWhile the general formulation consists of a bi-level optimization procedure, we\nderive an efficiently estimable analytical kernel-based estimator of the bounds\nfor the causal effect, a query of key interest, and demonstrate the efficacy of\nour approach in confounded, multivariate, nonlinear synthetic settings.\n","authors":["Elisabeth Ailer","Niclas Dern","Jason Hartford","Niki Kilbertus"],"pdf_url":"https://arxiv.org/pdf/2405.19985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15240v2","updated":"2024-05-30T12:14:05Z","published":"2024-05-24T06:06:41Z","title":"Towards Real World Debiasing: A Fine-grained Analysis On Spurious\n Correlation","summary":" Spurious correlations in training data significantly hinder the\ngeneralization capability of machine learning models when faced with\ndistribution shifts in real-world scenarios. To tackle the problem, numerous\ndebias approaches have been proposed and benchmarked on datasets intentionally\ndesigned with severe biases. However, it remains to be asked: \\textit{1. Do\nexisting benchmarks really capture biases in the real world? 2. Can existing\ndebias methods handle biases in the real world?} To answer the questions, we\nrevisit biased distributions in existing benchmarks and real-world datasets,\nand propose a fine-grained framework for analyzing dataset bias by\ndisentangling it into the magnitude and prevalence of bias. We observe and\ntheoretically demonstrate that existing benchmarks poorly represent real-world\nbiases. We further introduce two novel biased distributions to bridge this gap,\nforming a nuanced evaluation framework for real-world debiasing. Building upon\nthese results, we evaluate existing debias methods with our evaluation\nframework. Results show that existing methods are incapable of handling\nreal-world biases. Through in-depth analysis, we propose a simple yet effective\napproach that can be easily applied to existing debias methods, named Debias in\nDestruction (DiD). Empirical results demonstrate the superiority of DiD,\nimproving the performance of existing methods on all types of biases within the\nproposed evaluation framework.\n","authors":["Zhibo Wang","Peng Kuang","Zhixuan Chu","Jingyi Wang","Kui Ren"],"pdf_url":"https://arxiv.org/pdf/2405.15240v2.pdf","comment":"9 pages of main paper, 10 pages of appendix"},{"id":"http://arxiv.org/abs/2405.04923v2","updated":"2024-05-30T12:04:17Z","published":"2024-05-08T09:45:54Z","title":"DataSP: A Differential All-to-All Shortest Path Algorithm for Learning\n Costs and Predicting Paths with Context","summary":" Learning latent costs of transitions on graphs from trajectories\ndemonstrations under various contextual features is challenging but useful for\npath planning. Yet, existing methods either oversimplify cost assumptions or\nscale poorly with the number of observed trajectories. This paper introduces\nDataSP, a differentiable all-to-all shortest path algorithm to facilitate\nlearning latent costs from trajectories. It allows to learn from a large number\nof trajectories in each learning step without additional computation. Complex\nlatent cost functions from contextual features can be represented in the\nalgorithm through a neural network approximation. We further propose a method\nto sample paths from DataSP in order to reconstruct/mimic observed paths'\ndistributions. We prove that the inferred distribution follows the maximum\nentropy principle. We show that DataSP outperforms state-of-the-art\ndifferentiable combinatorial solver and classical machine learning approaches\nin predicting paths on graphs.\n","authors":["Alan A. Lahoud","Erik Schaffernicht","Johannes A. Stork"],"pdf_url":"https://arxiv.org/pdf/2405.04923v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19978v1","updated":"2024-05-30T12:01:12Z","published":"2024-05-30T12:01:12Z","title":"Domain Adaptation with Cauchy-Schwarz Divergence","summary":" Domain adaptation aims to use training data from one or multiple source\ndomains to learn a hypothesis that can be generalized to a different, but\nrelated, target domain. As such, having a reliable measure for evaluating the\ndiscrepancy of both marginal and conditional distributions is crucial. We\nintroduce Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain\nadaptation (UDA). The CS divergence offers a theoretically tighter\ngeneralization error bound than the popular Kullback-Leibler divergence. This\nholds for the general case of supervised learning, including multi-class\nclassification and regression. Furthermore, we illustrate that the CS\ndivergence enables a simple estimator on the discrepancy of both marginal and\nconditional distributions between source and target domains in the\nrepresentation space, without requiring any distributional assumptions. We\nprovide multiple examples to illustrate how the CS divergence can be\nconveniently used in both distance metric- or adversarial training-based UDA\nframeworks, resulting in compelling performance.\n","authors":["Wenzhe Yin","Shujian Yu","Yicong Lin","Jie Liu","Jan-Jakob Sonke","Efstratios Gavves"],"pdf_url":"https://arxiv.org/pdf/2405.19978v1.pdf","comment":"Accepted by UAI-24"},{"id":"http://arxiv.org/abs/2405.19977v1","updated":"2024-05-30T11:59:58Z","published":"2024-05-30T11:59:58Z","title":"Consistent Submodular Maximization","summary":" Maximizing monotone submodular functions under cardinality constraints is a\nclassic optimization task with several applications in data mining and machine\nlearning. In this paper we study this problem in a dynamic environment with\nconsistency constraints: elements arrive in a streaming fashion and the goal is\nmaintaining a constant approximation to the optimal solution while having a\nstable solution (i.e., the number of changes between two consecutive solutions\nis bounded). We provide algorithms in this setting with different trade-offs\nbetween consistency and approximation quality. We also complement our\ntheoretical results with an experimental analysis showing the effectiveness of\nour algorithms in real-world instances.\n","authors":["Paul Dütting","Federico Fusco","Silvio Lattanzi","Ashkan Norouzi-Fard","Morteza Zadimoghaddam"],"pdf_url":"https://arxiv.org/pdf/2405.19977v1.pdf","comment":"To appear at ICML 24"},{"id":"http://arxiv.org/abs/2403.05300v2","updated":"2024-05-30T11:55:49Z","published":"2024-03-08T13:29:46Z","title":"Unity by Diversity: Improved Representation Learning in Multimodal VAEs","summary":" Variational Autoencoders for multimodal data hold promise for many tasks in\ndata analysis, such as representation learning, conditional generation, and\nimputation. Current architectures either share the encoder output, decoder\ninput, or both across modalities to learn a shared representation. Such\narchitectures impose hard constraints on the model. In this work, we show that\na better latent representation can be obtained by replacing these hard\nconstraints with a soft constraint. We propose a new mixture-of-experts prior,\nsoftly guiding each modality's latent representation towards a shared aggregate\nposterior. This approach results in a superior latent representation and allows\neach encoding to preserve information better from its uncompressed original\nfeatures. In extensive experiments on multiple benchmark datasets and two\nchallenging real-world datasets, we show improved learned latent\nrepresentations and imputation of missing data modalities compared to existing\nmethods.\n","authors":["Thomas M. Sutter","Yang Meng","Andrea Agostini","Daphné Chopard","Norbert Fortin","Julia E. Vogt","Bahbak Shahbaba","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2403.05300v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19971v1","updated":"2024-05-30T11:55:21Z","published":"2024-05-30T11:55:21Z","title":"GasTrace: Detecting Sandwich Attack Malicious Accounts in Ethereum","summary":" The openness and transparency of Ethereum transaction data make it easy to be\nexploited by any entities, executing malicious attacks. The sandwich attack\nmanipulates the Automated Market Maker (AMM) mechanism, profiting from\nmanipulating the market price through front or after-running transactions. To\nidentify and prevent sandwich attacks, we propose a cascade classification\nframework GasTrace. GasTrace analyzes various transaction features to detect\nmalicious accounts, notably through the analysis and modeling of Gas features.\nIn the initial classification, we utilize the Support Vector Machine (SVM) with\nthe Radial Basis Function (RBF) kernel to generate the predicted probabilities\nof accounts, further constructing a detailed transaction network. Subsequently,\nthe behavior features are captured by the Graph Attention Network (GAT)\ntechnique in the second classification. Through cascade classification,\nGasTrace can analyze and classify the sandwich attacks. Our experimental\nresults demonstrate that GasTrace achieves a remarkable detection and\ngeneration capability, performing an accuracy of 96.73\\% and an F1 score of\n95.71\\% for identifying sandwich attack accounts.\n","authors":["Zekai Liu","Xiaoqi Li","Hongli Peng","Wenkai Li"],"pdf_url":"https://arxiv.org/pdf/2405.19971v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19967v1","updated":"2024-05-30T11:46:42Z","published":"2024-05-30T11:46:42Z","title":"Improved Out-of-Scope Intent Classification with Dual Encoding and\n Threshold-based Re-Classification","summary":" Detecting out-of-scope user utterances is essential for task-oriented\ndialogues and intent classification. Current methodologies face difficulties\nwith the unpredictable distribution of outliers and often rely on assumptions\nabout data distributions. We present the Dual Encoder for Threshold-Based\nRe-Classification (DETER) to address these challenges. This end-to-end\nframework efficiently detects out-of-scope intents without requiring\nassumptions on data distributions or additional post-processing steps. The core\nof DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and\nthe Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance\nembeddings, which are classified through a branched neural architecture.\nFurther, DETER generates synthetic outliers using self-supervision and\nincorporates out-of-scope phrases from open-domain datasets. This approach\nensures a comprehensive training set for out-of-scope detection. Additionally,\na threshold-based re-classification mechanism refines the model's initial\npredictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77\ndatasets demonstrate DETER's efficacy. Our model outperforms previous\nbenchmarks, increasing up to 13% and 5% in F1 score for known and unknown\nintents on CLINC-150 and Stackoverflow, and 16% for known and 24% % for unknown\nintents on Banking77. The source code has been released at\nhttps://github.com/Hossam-Mohammed-tech/Intent\\_Classification\\_OOS.\n","authors":["Hossam M. Zawbaa","Wael Rashwan","Sourav Dutta","Haytham Assem"],"pdf_url":"https://arxiv.org/pdf/2405.19967v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.17106v3","updated":"2024-05-30T11:44:40Z","published":"2024-02-27T00:59:32Z","title":"Achievable Fairness on Your Data With Utility Guarantees","summary":" In machine learning fairness, training models that minimize disparity across\ndifferent sensitive groups often leads to diminished accuracy, a phenomenon\nknown as the fairness-accuracy trade-off. The severity of this trade-off\ninherently depends on dataset characteristics such as dataset imbalances or\nbiases and therefore, using a uniform fairness requirement across diverse\ndatasets remains questionable. To address this, we present a computationally\nefficient approach to approximate the fairness-accuracy trade-off curve\ntailored to individual datasets, backed by rigorous statistical guarantees. By\nutilizing the You-Only-Train-Once (YOTO) framework, our approach mitigates the\ncomputational burden of having to train multiple models when approximating the\ntrade-off curve. Crucially, we introduce a novel methodology for quantifying\nuncertainty in our estimates, thereby providing practitioners with a robust\nframework for auditing model fairness while avoiding false conclusions due to\nestimation errors. Our experiments spanning tabular (e.g., Adult), image\n(CelebA), and language (Jigsaw) datasets underscore that our approach not only\nreliably quantifies the optimum achievable trade-offs across various data\nmodalities but also helps detect suboptimality in SOTA fairness methods.\n","authors":["Muhammad Faaiz Taufiq","Jean-Francois Ton","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2402.17106v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.03286v3","updated":"2024-05-30T11:42:15Z","published":"2024-02-05T18:42:34Z","title":"Training-Free Consistent Text-to-Image Generation","summary":" Text-to-image models offer a new level of creative flexibility by allowing\nusers to guide the image generation process through natural language. However,\nusing these models to consistently portray the same subject across diverse\nprompts remains challenging. Existing approaches fine-tune the model to teach\nit new words that describe specific user-provided subjects or add image\nconditioning to the model. These methods require lengthy per-subject\noptimization or large-scale pre-training. Moreover, they struggle to align\ngenerated images with text prompts and face difficulties in portraying multiple\nsubjects. Here, we present ConsiStory, a training-free approach that enables\nconsistent subject generation by sharing the internal activations of the\npretrained model. We introduce a subject-driven shared attention block and\ncorrespondence-based feature injection to promote subject consistency between\nimages. Additionally, we develop strategies to encourage layout diversity while\nmaintaining subject consistency. We compare ConsiStory to a range of baselines,\nand demonstrate state-of-the-art performance on subject consistency and text\nalignment, without requiring a single optimization step. Finally, ConsiStory\ncan naturally extend to multi-subject scenarios, and even enable training-free\npersonalization for common objects.\n","authors":["Yoad Tewel","Omri Kaduri","Rinon Gal","Yoni Kasten","Lior Wolf","Gal Chechik","Yuval Atzmon"],"pdf_url":"https://arxiv.org/pdf/2402.03286v3.pdf","comment":"Accepted to journal track of SIGGRAPH 2024 (TOG). Project page is at\n https://consistory-paper.github.io"},{"id":"http://arxiv.org/abs/2405.19961v1","updated":"2024-05-30T11:32:42Z","published":"2024-05-30T11:32:42Z","title":"Collective Variable Free Transition Path Sampling with Generative Flow\n Network","summary":" Understanding transition paths between meta-stable states in molecular\nsystems is fundamental for material design and drug discovery. However,\nsampling these paths via molecular dynamics simulations is computationally\nprohibitive due to the high-energy barriers between the meta-stable states.\nRecent machine learning approaches are often restricted to simple systems or\nrely on collective variables (CVs) extracted from expensive domain knowledge.\nIn this work, we propose to leverage generative flow networks (GFlowNets) to\nsample transition paths without relying on CVs. We reformulate the problem as\namortized energy-based sampling over molecular trajectories and train a bias\npotential by minimizing the squared log-ratio between the target distribution\nand the generator, derived from the flow matching objective of GFlowNets. Our\nevaluation on three proteins (Alanine Dipeptide, Polyproline, and Chignolin)\ndemonstrates that our approach, called TPS-GFN, generates more realistic and\ndiverse transition paths than the previous CV-free machine learning approach.\n","authors":["Kiyoung Seong","Seonghyun Park","Seonghwan Kim","Woo Youn Kim","Sungsoo Ahn"],"pdf_url":"https://arxiv.org/pdf/2405.19961v1.pdf","comment":"9 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.12728v3","updated":"2024-05-30T11:25:08Z","published":"2023-12-20T02:55:15Z","title":"Lookahead: An Inference Acceleration Framework for Large Language Model\n with Lossless Generation Accuracy","summary":" As Large Language Models (LLMs) have made significant advancements across\nvarious tasks, such as question answering, translation, text summarization, and\ndialogue systems, the need for accuracy in information becomes crucial,\nespecially for serious financial products serving billions of users like\nAlipay. However, for a real-world product serving millions of users, the\ninference speed of LLMs becomes a critical factor compared to a mere\nexperimental model.\n Hence, this paper presents a generic framework for accelerating the inference\nprocess, resulting in a substantial increase in speed and cost reduction for\nour LLM-based scenarios, with lossless generation accuracy. In the traditional\ninference process, each token is generated sequentially by the LLM, leading to\na time consumption proportional to the number of generated tokens. To enhance\nthis process, our framework, named \\textit{lookahead}, introduces a\n\\textit{multi-branch} strategy. Instead of generating a single token at a time,\nwe propose a Trie-based retrieval and verification mechanism to be able to\naccept several tokens at a forward step. Our strategy offers two distinct\nadvantages: (1) it guarantees absolute correctness of the output, avoiding any\napproximation algorithms, and (2) the worst-case performance of our approach is\nequivalent to the conventional process. We conduct extensive experiments to\ndemonstrate the significant improvements achieved by applying our inference\nacceleration framework. Our framework is widely deployed in Alipay since April\n2023, and obtain remarkable 2.66x to 6.26x speedup. Our code is available at\nhttps://github.com/alipay/PainlessInferenceAcceleration.\n","authors":["Yao Zhao","Zhitian Xie","Chen Liang","Chenyi Zhuang","Jinjie Gu"],"pdf_url":"https://arxiv.org/pdf/2312.12728v3.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.16056v2","updated":"2024-05-30T11:20:22Z","published":"2024-05-25T04:51:41Z","title":"FedSheafHN: Personalized Federated Learning on Graph-structured Data","summary":" Personalized subgraph Federated Learning (FL) is a task that customizes Graph\nNeural Networks (GNNs) to individual client needs, accommodating diverse data\ndistributions. However, applying hypernetworks in FL, while aiming to\nfacilitate model personalization, often encounters challenges due to inadequate\nrepresentation of client-specific characteristics. To overcome these\nlimitations, we propose a model called FedSheafHN, using enhanced collaboration\ngraph embedding and efficient personalized model parameter generation.\nSpecifically, our model embeds each client's local subgraph into a\nserver-constructed collaboration graph. We utilize sheaf diffusion in the\ncollaboration graph to learn client representations. Our model improves the\nintegration and interpretation of complex client characteristics. Furthermore,\nour model ensures the generation of personalized models through advanced\nhypernetworks optimized for parallel operations across clients. Empirical\nevaluations demonstrate that FedSheafHN outperforms existing methods in most\nscenarios, in terms of client model performance on various graph-structured\ndatasets. It also has fast model convergence and effective new clients\ngeneralization.\n","authors":["Wenfei Liang","Yanan Zhao","Rui She","Yiming Li","Wee Peng Tay"],"pdf_url":"https://arxiv.org/pdf/2405.16056v2.pdf","comment":"This paper was submitted to ICML 2024 in Feb 2024. You can find a\n record\n here:https://github.com/CarrieWFF/ICML-2024-submission-recording/blob/main/Screenshot%20of%20FedSheafHN%20submission%20to%20ICML%202024.png"},{"id":"http://arxiv.org/abs/2405.19954v1","updated":"2024-05-30T11:18:52Z","published":"2024-05-30T11:18:52Z","title":"GenKubeSec: LLM-Based Kubernetes Misconfiguration Detection,\n Localization, Reasoning, and Remediation","summary":" A key challenge associated with Kubernetes configuration files (KCFs) is that\nthey are often highly complex and error-prone, leading to security\nvulnerabilities and operational setbacks. Rule-based (RB) tools for KCF\nmisconfiguration detection rely on static rule sets, making them inherently\nlimited and unable to detect newly-discovered misconfigurations. RB tools also\nsuffer from misdetection, since mistakes are likely when coding the detection\nrules. Recent methods for detecting and remediating KCF misconfigurations are\nlimited in terms of their scalability and detection coverage, or due to the\nfact that they have high expertise requirements and do not offer automated\nremediation along with misconfiguration detection. Novel approaches that employ\nLLMs in their pipeline rely on API-based, general-purpose, and mainly\ncommercial models. Thus, they pose security challenges, have inconsistent\nclassification performance, and can be costly. In this paper, we propose\nGenKubeSec, a comprehensive and adaptive, LLM-based method, which, in addition\nto detecting a wide variety of KCF misconfigurations, also identifies the exact\nlocation of the misconfigurations and provides detailed reasoning about them,\nalong with suggested remediation. When empirically compared with three\nindustry-standard RB tools, GenKubeSec achieved equivalent precision (0.990)\nand superior recall (0.999). When a random sample of KCFs was examined by a\nKubernetes security expert, GenKubeSec's explanations as to misconfiguration\nlocalization, reasoning and remediation were 100% correct, informative and\nuseful. To facilitate further advancements in this domain, we share the unique\ndataset we collected, a unified misconfiguration index we developed for label\nstandardization, our experimentation code, and GenKubeSec itself as an\nopen-source tool.\n","authors":["Ehud Malul","Yair Meidan","Dudu Mimran","Yuval Elovici","Asaf Shabtai"],"pdf_url":"https://arxiv.org/pdf/2405.19954v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19950v1","updated":"2024-05-30T11:14:01Z","published":"2024-05-30T11:14:01Z","title":"MM-Lego: Modular Biomedical Multimodal Models with Minimal Fine-Tuning","summary":" Learning holistic computational representations in physical, chemical or\nbiological systems requires the ability to process information from different\ndistributions and modalities within the same model. Thus, the demand for\nmultimodal machine learning models has sharply risen for modalities that go\nbeyond vision and language, such as sequences, graphs, time series, or tabular\ndata. While there are many available multimodal fusion and alignment\napproaches, most of them require end-to-end training, scale quadratically with\nthe number of modalities, cannot handle cases of high modality imbalance in the\ntraining set, or are highly topology-specific, making them too restrictive for\nmany biomedical learning tasks. This paper presents Multimodal Lego (MM-Lego),\na modular and general-purpose fusion and model merging framework to turn any\nset of encoders into a competitive multimodal model with no or minimal\nfine-tuning. We achieve this by introducing a wrapper for unimodal encoders\nthat enforces lightweight dimensionality assumptions between modalities and\nharmonises their representations by learning features in the frequency domain\nto enable model merging with little signal interference. We show that MM-Lego\n1) can be used as a model merging method which achieves competitive performance\nwith end-to-end fusion models without any fine-tuning, 2) can operate on any\nunimodal encoder, and 3) is a model fusion method that, with minimal\nfine-tuning, achieves state-of-the-art results on six benchmarked multimodal\nbiomedical tasks.\n","authors":["Konstantin Hemker","Nikola Simidjievski","Mateja Jamnik"],"pdf_url":"https://arxiv.org/pdf/2405.19950v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17823v2","updated":"2024-05-30T11:12:25Z","published":"2024-05-28T04:47:12Z","title":"Spectral Truncation Kernels: Noncommutativity in $C^*$-algebraic Kernel\n Machines","summary":" In this paper, we propose a new class of positive definite kernels based on\nthe spectral truncation, which has been discussed in the fields of\nnoncommutative geometry and $C^*$-algebra. We focus on kernels whose inputs and\noutputs are functions and generalize existing kernels, such as polynomial,\nproduct, and separable kernels, by introducing a truncation parameter $n$ that\ndescribes the noncommutativity of the products appearing in the kernels. When\n$n$ goes to infinity, the proposed kernels tend to the existing commutative\nkernels. If $n$ is finite, they exhibit different behavior, and the\nnoncommutativity induces interactions along the data function domain. We show\nthat the truncation parameter $n$ is a governing factor leading to performance\nenhancement: by setting an appropriate $n$, we can balance the representation\npower and the complexity of the representation space. The flexibility of the\nproposed class of kernels allows us to go beyond previous commutative kernels.\n","authors":["Yuka Hashimoto","Ayoub Hafid","Masahiro Ikeda","Hachem Kadri"],"pdf_url":"https://arxiv.org/pdf/2405.17823v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08595v2","updated":"2024-05-30T11:03:46Z","published":"2023-06-14T15:55:19Z","title":"TensorKrowch: Smooth integration of tensor networks in machine learning","summary":" Tensor networks are factorizations of high-dimensional tensors into networks\nof smaller tensors. They have applications in physics and mathematics, and\nrecently have been proposed as promising machine learning architectures. To\nease the integration of tensor networks in machine learning pipelines, we\nintroduce TensorKrowch, an open source Python library built on top of PyTorch.\nProviding a user-friendly interface, TensorKrowch allows users to construct any\ntensor network, train it, and integrate it as a layer in more intricate deep\nlearning models. In this paper, we describe the main functionality and basic\nusage of TensorKrowch, and provide technical details on its building blocks and\nthe optimizations performed to achieve efficient operation.\n","authors":["José Ramón Pareja Monturiol","David Pérez-García","Alejandro Pozas-Kerstjens"],"pdf_url":"https://arxiv.org/pdf/2306.08595v2.pdf","comment":"20 pages, 2 figures. The TensorKrowch GitHub repository is in\n https://github.com/joserapa98/tensorkrowch and the TensorKrowch documentation\n is in https://joserapa98.github.io/tensorkrowch. V2: Accepted version"},{"id":"http://arxiv.org/abs/2402.02425v2","updated":"2024-05-30T10:53:51Z","published":"2024-02-04T09:45:35Z","title":"DeepLag: Discovering Deep Lagrangian Dynamics for Intuitive Fluid\n Prediction","summary":" Accurately predicting the future fluid is vital to extensive areas such as\nmeteorology, oceanology, and aerodynamics. However, since the fluid is usually\nobserved from an Eulerian perspective, its moving and intricate dynamics are\nseriously obscured and confounded in static grids, bringing thorny challenges\nto the prediction. This paper introduces a new Lagrangian-Eulerian combined\nparadigm to tackle the tanglesome fluid dynamics. Instead of solely predicting\nthe future based on Eulerian observations, we propose DeepLag to discover\nhidden Lagrangian dynamics within the fluid by tracking the movements of\nadaptively sampled key particles. DeepLag utilizes the proposed where the\nLagrangian movement of the tracked particles is inferred from Eulerian\nobservations, and their accumulated Lagrangian dynamics information is\nincorporated into global Eulerian evolving features to guide future prediction\nrespectively. Tracking key particles not only provides a transparent and\ninterpretable clue for fluid dynamics but also makes our model free from\nmodeling complex correlations among massive grids for better efficiency.\nExperimentally, DeepLag excels in three challenging fluid prediction tasks\ncovering 2D and 3D, simulated and real-world fluids.\n","authors":["Qilong Ma","Haixu Wu","Lanxiang Xing","Shangchen Miao","Mingsheng Long"],"pdf_url":"https://arxiv.org/pdf/2402.02425v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19933v1","updated":"2024-05-30T10:49:22Z","published":"2024-05-30T10:49:22Z","title":"Learning Latent Graph Structures and their Uncertainty","summary":" Within a prediction task, Graph Neural Networks (GNNs) use relational\ninformation as an inductive bias to enhance the model's accuracy. As\ntask-relevant relations might be unknown, graph structure learning approaches\nhave been proposed to learn them while solving the downstream prediction task.\nIn this paper, we demonstrate that minimization of a point-prediction loss\nfunction, e.g., the mean absolute error, does not guarantee proper learning of\nthe latent relational information and its associated uncertainty. Conversely,\nwe prove that a suitable loss function on the stochastic model outputs\nsimultaneously grants (i) the unknown adjacency matrix latent distribution and\n(ii) optimal performance on the prediction task. Finally, we propose a\nsampling-based method that solves this joint learning task. Empirical results\nvalidate our theoretical claims and demonstrate the effectiveness of the\nproposed approach.\n","authors":["Alessandro Manenti","Daniele Zambon","Cesare Alippi"],"pdf_url":"https://arxiv.org/pdf/2405.19933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19931v1","updated":"2024-05-30T10:47:48Z","published":"2024-05-30T10:47:48Z","title":"Exploring Diffusion Models' Corruption Stage in Few-Shot Fine-tuning and\n Mitigating with Bayesian Neural Networks","summary":" Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement,\nsignificantly reducing training costs and enabling personalized AI\napplications. However, we explore the training dynamics of DMs and observe an\nunanticipated phenomenon: during the training process, image fidelity initially\nimproves, then unexpectedly deteriorates with the emergence of noisy patterns,\nonly to recover later with severe overfitting. We term the stage with generated\nnoisy patterns as corruption stage. To understand this corruption stage, we\nbegin by theoretically modeling the one-shot fine-tuning scenario, and then\nextend this modeling to more general cases. Through this modeling, we identify\nthe primary cause of this corruption stage: a narrowed learning distribution\ninherent in the nature of few-shot fine-tuning. To tackle this, we apply\nBayesian Neural Networks (BNNs) on DMs with variational inference to implicitly\nbroaden the learned distribution, and present that the learning target of the\nBNNs can be naturally regarded as an expectation of the diffusion loss and a\nfurther regularization with the pretrained DMs. This approach is highly\ncompatible with current few-shot fine-tuning methods in DMs and does not\nintroduce any extra inference costs. Experimental results demonstrate that our\nmethod significantly mitigates corruption, and improves the fidelity, quality\nand diversity of the generated images in both object-driven and subject-driven\ngeneration tasks.\n","authors":["Xiaoyu Wu","Jiaru Zhang","Yang Hua","Bohan Lyu","Hao Wang","Tao Song","Haibing Guan"],"pdf_url":"https://arxiv.org/pdf/2405.19931v1.pdf","comment":"Preprint. Under review"},{"id":"http://arxiv.org/abs/2405.19928v1","updated":"2024-05-30T10:44:45Z","published":"2024-05-30T10:44:45Z","title":"BAN: Detecting Backdoors Activated by Adversarial Neuron Noise","summary":" Backdoor attacks on deep learning represent a recent threat that has gained\nsignificant attention in the research community. Backdoor defenses are mainly\nbased on backdoor inversion, which has been shown to be generic,\nmodel-agnostic, and applicable to practical threat scenarios. State-of-the-art\nbackdoor inversion recovers a mask in the feature space to locate prominent\nbackdoor features, where benign and backdoor features can be disentangled.\nHowever, it suffers from high computational overhead, and we also find that it\noverly relies on prominent backdoor features that are highly distinguishable\nfrom benign features. To tackle these shortcomings, this paper improves\nbackdoor feature inversion for backdoor detection by incorporating extra neuron\nactivation information. In particular, we adversarially increase the loss of\nbackdoored models with respect to weights to activate the backdoor effect,\nbased on which we can easily differentiate backdoored and clean models.\nExperimental results demonstrate our defense, BAN, is 1.37$\\times$ (on\nCIFAR-10) and 5.11$\\times$ (on ImageNet200) more efficient with 9.99% higher\ndetect success rate than the state-of-the-art defense BTI-DBF. Our code and\ntrained models are publicly\navailable.\\url{https://anonymous.4open.science/r/ban-4B32}\n","authors":["Xiaoyun Xu","Zhuoran Liu","Stefanos Koffas","Shujian Yu","Stjepan Picek"],"pdf_url":"https://arxiv.org/pdf/2405.19928v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19919v1","updated":"2024-05-30T10:30:44Z","published":"2024-05-30T10:30:44Z","title":"Unraveling the Impact of Heterophilic Structures on Graph\n Positive-Unlabeled Learning","summary":" While Positive-Unlabeled (PU) learning is vital in many real-world scenarios,\nits application to graph data still remains under-explored. We unveil that a\ncritical challenge for PU learning on graph lies on the edge heterophily, which\ndirectly violates the irreducibility assumption for Class-Prior Estimation\n(class prior is essential for building PU learning algorithms) and degenerates\nthe latent label inference on unlabeled nodes during classifier training. In\nresponse to this challenge, we introduce a new method, named Graph PU Learning\nwith Label Propagation Loss (GPL). Specifically, GPL considers learning from PU\nnodes along with an intermediate heterophily reduction, which helps mitigate\nthe negative impact of the heterophilic structure. We formulate this procedure\nas a bilevel optimization that reduces heterophily in the inner loop and\nefficiently learns a classifier in the outer loop. Extensive experiments across\na variety of datasets have shown that GPL significantly outperforms baseline\nmethods, confirming its effectiveness and superiority.\n","authors":["Yuhao Wu","Jiangchao Yao","Bo Han","Lina Yao","Tongliang Liu"],"pdf_url":"https://arxiv.org/pdf/2405.19919v1.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2405.19912v1","updated":"2024-05-30T10:23:16Z","published":"2024-05-30T10:23:16Z","title":"Robust Kernel Hypothesis Testing under Data Corruption","summary":" We propose two general methods for constructing robust permutation tests\nunder data corruption. The proposed tests effectively control the\nnon-asymptotic type I error under data corruption, and we prove their\nconsistency in power under minimal conditions. This contributes to the\npractical deployment of hypothesis tests for real-world applications with\npotential adversarial attacks. One of our methods inherently ensures\ndifferential privacy, further broadening its applicability to private data\nanalysis. For the two-sample and independence settings, we show that our kernel\nrobust tests are minimax optimal, in the sense that they are guaranteed to be\nnon-asymptotically powerful against alternatives uniformly separated from the\nnull in the kernel MMD and HSIC metrics at some optimal rate (tight with\nmatching lower bound). Finally, we provide publicly available implementations\nand empirically illustrate the practicality of our proposed tests.\n","authors":["Antonin Schrab","Ilmun Kim"],"pdf_url":"https://arxiv.org/pdf/2405.19912v1.pdf","comment":"26 pages, 2 figures, 2 algorithms"},{"id":"http://arxiv.org/abs/2405.19909v1","updated":"2024-05-30T10:20:55Z","published":"2024-05-30T10:20:55Z","title":"Adaptive Advantage-Guided Policy Regularization for Offline\n Reinforcement Learning","summary":" In offline reinforcement learning, the challenge of out-of-distribution (OOD)\nis pronounced. To address this, existing methods often constrain the learned\npolicy through policy regularization. However, these methods often suffer from\nthe issue of unnecessary conservativeness, hampering policy improvement. This\noccurs due to the indiscriminate use of all actions from the behavior policy\nthat generates the offline dataset as constraints. The problem becomes\nparticularly noticeable when the quality of the dataset is suboptimal. Thus, we\npropose Adaptive Advantage-guided Policy Regularization (A2PR), obtaining\nhigh-advantage actions from an augmented behavior policy combined with VAE to\nguide the learned policy. A2PR can select high-advantage actions that differ\nfrom those present in the dataset, while still effectively maintaining\nconservatism from OOD actions. This is achieved by harnessing the VAE capacity\nto generate samples matching the distribution of the data points. We\ntheoretically prove that the improvement of the behavior policy is guaranteed.\nBesides, it effectively mitigates value overestimation with a bounded\nperformance gap. Empirically, we conduct a series of experiments on the D4RL\nbenchmark, where A2PR demonstrates state-of-the-art performance. Furthermore,\nexperimental results on additional suboptimal mixed datasets reveal that A2PR\nexhibits superior performance. Code is available at\nhttps://github.com/ltlhuuu/A2PR.\n","authors":["Tenglong Liu","Yang Li","Yixing Lan","Hao Gao","Wei Pan","Xin Xu"],"pdf_url":"https://arxiv.org/pdf/2405.19909v1.pdf","comment":"ICML 2024, 19 pages"},{"id":"http://arxiv.org/abs/2404.00618v2","updated":"2024-05-30T10:16:04Z","published":"2024-03-31T09:10:32Z","title":"A Multi-Branched Radial Basis Network Approach to Predicting Complex\n Chaotic Behaviours","summary":" In this study, we propose a multi branched network approach to predict the\ndynamics of a physics attractor characterized by intricate and chaotic\nbehavior. We introduce a unique neural network architecture comprised of Radial\nBasis Function (RBF) layers combined with an attention mechanism designed to\neffectively capture nonlinear inter-dependencies inherent in the attractor's\ntemporal evolution. Our results demonstrate successful prediction of the\nattractor's trajectory across 100 predictions made using a real-world dataset\nof 36,700 time-series observations encompassing approximately 28 minutes of\nactivity. To further illustrate the performance of our proposed technique, we\nprovide comprehensive visualizations depicting the attractor's original and\npredicted behaviors alongside quantitative measures comparing observed versus\nestimated outcomes. Overall, this work showcases the potential of advanced\nmachine learning algorithms in elucidating hidden structures in complex\nphysical systems while offering practical applications in various domains\nrequiring accurate short-term forecasting capabilities.\n","authors":["Aarush Sinha"],"pdf_url":"https://arxiv.org/pdf/2404.00618v2.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2402.02407v2","updated":"2024-05-30T10:13:13Z","published":"2024-02-04T08:57:42Z","title":"Defining Neural Network Architecture through Polytope Structures of\n Dataset","summary":" Current theoretical and empirical research in neural networks suggests that\ncomplex datasets require large network architectures for thorough\nclassification, yet the precise nature of this relationship remains unclear.\nThis paper tackles this issue by defining upper and lower bounds for neural\nnetwork widths, which are informed by the polytope structure of the dataset in\nquestion. We also delve into the application of these principles to simplicial\ncomplexes and specific manifold shapes, explaining how the requirement for\nnetwork width varies in accordance with the geometric complexity of the\ndataset. Moreover, we develop an algorithm to investigate a converse situation\nwhere the polytope structure of a dataset can be inferred from its\ncorresponding trained neural networks. Through our algorithm, it is established\nthat popular datasets such as MNIST, Fashion-MNIST, and CIFAR10 can be\nefficiently encapsulated using no more than two polytopes with a small number\nof faces.\n","authors":["Sangmin Lee","Abbas Mammadov","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2402.02407v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19902v1","updated":"2024-05-30T10:06:06Z","published":"2024-05-30T10:06:06Z","title":"Learning Discriminative Dynamics with Label Corruption for Noisy Label\n Detection","summary":" Label noise, commonly found in real-world datasets, has a detrimental impact\non a model's generalization. To effectively detect incorrectly labeled\ninstances, previous works have mostly relied on distinguishable training\nsignals, such as training loss, as indicators to differentiate between clean\nand noisy labels. However, they have limitations in that the training signals\nincompletely reveal the model's behavior and are not effectively generalized to\nvarious noise types, resulting in limited detection accuracy. In this paper, we\npropose DynaCor framework that distinguishes incorrectly labeled instances from\ncorrectly labeled ones based on the dynamics of the training signals. To cope\nwith the absence of supervision for clean and noisy labels, DynaCor first\nintroduces a label corruption strategy that augments the original dataset with\nintentionally corrupted labels, enabling indirect simulation of the model's\nbehavior on noisy labels. Then, DynaCor learns to identify clean and noisy\ninstances by inducing two clearly distinguishable clusters from the latent\nrepresentations of training dynamics. Our comprehensive experiments show that\nDynaCor outperforms the state-of-the-art competitors and shows strong\nrobustness to various noise types and noise rates.\n","authors":["Suyeon Kim","Dongha Lee","SeongKu Kang","Sukang Chae","Sanghwan Jang","Hwanjo Yu"],"pdf_url":"https://arxiv.org/pdf/2405.19902v1.pdf","comment":"Accepted to CVPR 2024"},{"id":"http://arxiv.org/abs/2405.19901v1","updated":"2024-05-30T10:02:53Z","published":"2024-05-30T10:02:53Z","title":"Urban Air Pollution Forecasting: a Machine Learning Approach leveraging\n Satellite Observations and Meteorological Forecasts","summary":" Air pollution poses a significant threat to public health and well-being,\nparticularly in urban areas. This study introduces a series of machine-learning\nmodels that integrate data from the Sentinel-5P satellite, meteorological\nconditions, and topological characteristics to forecast future levels of five\nmajor pollutants. The investigation delineates the process of data collection,\ndetailing the combination of diverse data sources utilized in the study.\nThrough experiments conducted in the Milan metropolitan area, the models\ndemonstrate their efficacy in predicting pollutant levels for the forthcoming\nday, achieving a percentage error of around 30%. The proposed models are\nadvantageous as they are independent of monitoring stations, facilitating their\nuse in areas without existing infrastructure. Additionally, we have released\nthe collected dataset to the public, aiming to stimulate further research in\nthis field. This research contributes to advancing our understanding of urban\nair quality dynamics and emphasizes the importance of amalgamating satellite,\nmeteorological, and topographical data to develop robust pollution forecasting\nmodels.\n","authors":["Giacomo Blanco","Luca Barco","Lorenzo Innocenti","Claudio Rossi"],"pdf_url":"https://arxiv.org/pdf/2405.19901v1.pdf","comment":"5 pages, 2 figures, submitted to IEEE MetroLivEnv 2024"},{"id":"http://arxiv.org/abs/2305.13067v2","updated":"2024-05-30T10:00:14Z","published":"2023-05-22T14:37:05Z","title":"Distilling Robustness into Natural Language Inference Models with\n Domain-Targeted Augmentation","summary":" Knowledge distillation optimises a smaller student model to behave similarly\nto a larger teacher model, retaining some of the performance benefits. While\nthis method can improve results on in-distribution examples, it does not\nnecessarily generalise to out-of-distribution (OOD) settings. We investigate\ntwo complementary methods for improving the robustness of the resulting student\nmodels on OOD domains. The first approach augments the distillation with\ngenerated unlabelled examples that match the target distribution. The second\nmethod upsamples data points among the training set that are similar to the\ntarget distribution. When applied on the task of natural language inference\n(NLI), our experiments on MNLI show that distillation with these modifications\noutperforms previous robustness solutions. We also find that these methods\nimprove performance on OOD domains even beyond the target domain.\n","authors":["Joe Stacey","Marek Rei"],"pdf_url":"https://arxiv.org/pdf/2305.13067v2.pdf","comment":"Accepted at ACL Findings 2024"},{"id":"http://arxiv.org/abs/2205.12961v2","updated":"2024-05-30T09:53:16Z","published":"2022-05-25T14:02:49Z","title":"Position: Tensor Networks are a Valuable Asset for Green AI","summary":" For the first time, this position paper introduces a fundamental link between\ntensor networks (TNs) and Green AI, highlighting their synergistic potential to\nenhance both the inclusivity and sustainability of AI research. We argue that\nTNs are valuable for Green AI due to their strong mathematical backbone and\ninherent logarithmic compression potential. We undertake a comprehensive review\nof the ongoing discussions on Green AI, emphasizing the importance of\nsustainability and inclusivity in AI research to demonstrate the significance\nof establishing the link between Green AI and TNs. To support our position, we\nfirst provide a comprehensive overview of efficiency metrics proposed in Green\nAI literature and then evaluate examples of TNs in the fields of kernel\nmachines and deep learning using the proposed efficiency metrics. This position\npaper aims to incentivize meaningful, constructive discussions by bridging\nfundamental principles of Green AI and TNs. We advocate for researchers to\nseriously evaluate the integration of TNs into their research projects, and in\nalignment with the link established in this paper, we support prior calls\nencouraging researchers to treat Green AI principles as a research priority.\n","authors":["Eva Memmel","Clara Menzen","Jetze Schuurmans","Frederiek Wesel","Kim Batselier"],"pdf_url":"https://arxiv.org/pdf/2205.12961v2.pdf","comment":"This paper has been accepted for presentation at the International\n Conference on Machine Learning (ICML) 2024 and will appear in the conference\n proceedings"},{"id":"http://arxiv.org/abs/2405.19893v1","updated":"2024-05-30T09:50:38Z","published":"2024-05-30T09:50:38Z","title":"Similarity is Not All You Need: Endowing Retrieval Augmented Generation\n with Multi Layered Thoughts","summary":" In recent years, large language models (LLMs) have made remarkable\nachievements in various domains. However, the untimeliness and cost of\nknowledge updates coupled with hallucination issues of LLMs have curtailed\ntheir applications in knowledge intensive tasks, where retrieval augmented\ngeneration (RAG) can be of help. Nevertheless, existing retrieval augmented\nmodels typically use similarity as a bridge between queries and documents and\nfollow a retrieve then read procedure. In this work, we argue that similarity\nis not always the panacea and totally relying on similarity would sometimes\ndegrade the performance of retrieval augmented generation. To this end, we\npropose MetRag, a Multi layEred Thoughts enhanced Retrieval Augmented\nGeneration framework. To begin with, beyond existing similarity oriented\nthought, we embrace a small scale utility model that draws supervision from an\nLLM for utility oriented thought and further come up with a smarter model by\ncomprehensively combining the similarity and utility oriented thoughts.\nFurthermore, given the fact that the retrieved document set tends to be huge\nand using them in isolation makes it difficult to capture the commonalities and\ncharacteristics among them, we propose to make an LLM as a task adaptive\nsummarizer to endow retrieval augmented generation with compactness-oriented\nthought. Finally, with multi layered thoughts from the precedent stages, an LLM\nis called for knowledge augmented generation. Extensive experiments on\nknowledge-intensive tasks have demonstrated the superiority of MetRag.\n","authors":["Chunjing Gan","Dan Yang","Binbin Hu","Hanxiao Zhang","Siyuan Li","Ziqi Liu","Yue Shen","Lin Ju","Zhiqiang Zhang","Jinjie Gu","Lei Liang","Jun Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.19893v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2402.02446v3","updated":"2024-05-30T09:49:47Z","published":"2024-02-04T10:59:52Z","title":"LQER: Low-Rank Quantization Error Reconstruction for LLMs","summary":" Post-training quantization of Large Language Models (LLMs) is challenging. In\nthis work, we introduce Low-rank Quantization Error Reduction (LQER), which\ncombines quantization and low-rank approximation to recover the model\ncapability. LQER leverages an activation-induced scale matrix to drive the\nsingular value distribution of quantization error towards a desirable\ndistribution, which enables nearly-lossless W4A8 quantization on various LLMs\nand downstream tasks without the need for knowledge distillation, grid search,\nor gradient-base iterative optimization. Unlike existing methods, the\ncomputation pattern of LQER eliminates the need for specialized Scatter and\nGather processes to collect high-precision weights from irregular memory\nlocations. Our W4A8 LLMs achieve near-lossless performance on six popular\ndownstream tasks, while using 1.36$\\times$ fewer hardware resources than the\nleading state-of-the-art method. We open-source our framework at\nhttps://github.com/ChengZhang-98/lqer\n","authors":["Cheng Zhang","Jianyi Cheng","George A. Constantinides","Yiren Zhao"],"pdf_url":"https://arxiv.org/pdf/2402.02446v3.pdf","comment":"Accepted at ICML2024"},{"id":"http://arxiv.org/abs/2405.19889v1","updated":"2024-05-30T09:46:59Z","published":"2024-05-30T09:46:59Z","title":"Deep Joint Semantic Coding and Beamforming for Near-Space Airship-Borne\n Massive MIMO Network","summary":" Near-space airship-borne communication network is recognized to be an\nindispensable component of the future integrated ground-air-space network\nthanks to airships' advantage of long-term residency at stratospheric\naltitudes, but it urgently needs reliable and efficient Airship-to-X link. To\nimprove the transmission efficiency and capacity, this paper proposes to\nintegrate semantic communication with massive multiple-input multiple-output\n(MIMO) technology. Specifically, we propose a deep joint semantic coding and\nbeamforming (JSCBF) scheme for airship-based massive MIMO image transmission\nnetwork in space, in which semantics from both source and channel are fused to\njointly design the semantic coding and physical layer beamforming. First, we\ndesign two semantic extraction networks to extract semantics from image source\nand channel state information, respectively. Then, we propose a semantic fusion\nnetwork that can fuse these semantics into complex-valued semantic features for\nsubsequent physical-layer transmission. To efficiently transmit the fused\nsemantic features at the physical layer, we then propose the hybrid data and\nmodel-driven semantic-aware beamforming networks. At the receiver, a semantic\ndecoding network is designed to reconstruct the transmitted images. Finally, we\nperform end-to-end deep learning to jointly train all the modules, using the\nimage reconstruction quality at the receivers as a metric. The proposed deep\nJSCBF scheme fully combines the efficient source compressibility and robust\nerror correction capability of semantic communication with the high spectral\nefficiency of massive MIMO, achieving a significant performance improvement\nover existing approaches.\n","authors":["Minghui Wu","Zhen Gao","Zhaocheng Wang","Dusit Niyato","George K. Karagiannidis","Sheng Chen"],"pdf_url":"https://arxiv.org/pdf/2405.19889v1.pdf","comment":"Major Revision by IEEE JSAC"},{"id":"http://arxiv.org/abs/2405.19888v1","updated":"2024-05-30T09:46:36Z","published":"2024-05-30T09:46:36Z","title":"Parrot: Efficient Serving of LLM-based Applications with Semantic\n Variable","summary":" The rise of large language models (LLMs) has enabled LLM-based applications\n(a.k.a. AI agents or co-pilots), a new software paradigm that combines the\nstrength of LLM and conventional software. Diverse LLM applications from\ndifferent tenants could design complex workflows using multiple LLM requests to\naccomplish one task. However, they have to use the over-simplified\nrequest-level API provided by today's public LLM services, losing essential\napplication-level information. Public LLM services have to blindly optimize\nindividual LLM requests, leading to sub-optimal end-to-end performance of LLM\napplications.\n This paper introduces Parrot, an LLM service system that focuses on the\nend-to-end experience of LLM-based applications. Parrot proposes Semantic\nVariable, a unified abstraction to expose application-level knowledge to public\nLLM services. A Semantic Variable annotates an input/output variable in the\nprompt of a request, and creates the data pipeline when connecting multiple LLM\nrequests, providing a natural way to program LLM applications. Exposing\nSemantic Variables to the public LLM service allows it to perform conventional\ndata flow analysis to uncover the correlation across multiple LLM requests.\nThis correlation opens a brand-new optimization space for the end-to-end\nperformance of LLM-based applications. Extensive evaluations demonstrate that\nParrot can achieve up to an order-of-magnitude improvement for popular and\npractical use cases of LLM applications.\n","authors":["Chaofan Lin","Zhenhua Han","Chengruidong Zhang","Yuqing Yang","Fan Yang","Chen Chen","Lili Qiu"],"pdf_url":"https://arxiv.org/pdf/2405.19888v1.pdf","comment":"To appear on USENIX OSDI 2024"},{"id":"http://arxiv.org/abs/2405.19886v1","updated":"2024-05-30T09:45:18Z","published":"2024-05-30T09:45:18Z","title":"Federated Learning with Multi-resolution Model Broadcast","summary":" In federated learning, a server must periodically broadcast a model to the\nagents. We propose to use multi-resolution coding and modulation (also known as\nnon-uniform modulation) for this purpose. In the simplest instance, broadcast\ntransmission is used, whereby all agents are targeted with one and the same\ntransmission (typically without any particular favored beam direction), which\nis coded using multi-resolution coding/modulation. This enables high-SNR\nagents, with high path gains to the server, to receive a more accurate model\nthan the low-SNR agents do, without consuming more downlink resources. As one\nimplementation, we use transmission with a non-uniform 8-PSK constellation,\nwhere a high-SNR receiver (agent) can separate all 8 constellation points\n(hence receive 3 bits) whereas a low-SNR receiver can only separate 4 points\n(hence receive 2 bits). By encoding the least significant information in the\nthird bit, the high-SNR receivers can obtain the model with higher accuracy,\nwhile the low-SNR receiver can still obtain the model although with reduced\naccuracy, thereby facilitating at least some basic participation of the low-SNR\nreceiver. We show the effectiveness of our proposed scheme via experimentation\nusing federated learning with the MNIST data-set.\n","authors":["Henrik Rydén","Reza Moosavi","Erik G. Larsson"],"pdf_url":"https://arxiv.org/pdf/2405.19886v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19885v1","updated":"2024-05-30T09:43:59Z","published":"2024-05-30T09:43:59Z","title":"Fourier Controller Networks for Real-Time Decision-Making in Embodied\n Learning","summary":" Reinforcement learning is able to obtain generalized low-level robot policies\non diverse robotics datasets in embodied learning scenarios, and Transformer\nhas been widely used to model time-varying features. However, it still suffers\nfrom the issues of low data efficiency and high inference latency. In this\npaper, we propose to investigate the task from a new perspective of the\nfrequency domain. We first observe that the energy density in the frequency\ndomain of a robot's trajectory is mainly concentrated in the low-frequency\npart. Then, we present the Fourier Controller Network (FCNet), a new network\nthat utilizes the Short-Time Fourier Transform (STFT) to extract and encode\ntime-varying features through frequency domain interpolation. We further\nachieve parallel training and efficient recurrent inference by using FFT and\nSliding DFT methods in the model architecture for real-time decision-making.\nComprehensive analyses in both simulated (e.g., D4RL) and real-world\nenvironments (e.g., robot locomotion) demonstrate FCNet's substantial\nefficiency and effectiveness over existing methods such as Transformer, e.g.,\nFCNet outperforms Transformer on multi-environmental robotics datasets of all\ntypes of sizes (from 1.9M to 120M). The project page and code can be found\nhttps://thkkk.github.io/fcnet.\n","authors":["Hengkai Tan","Songming Liu","Kai Ma","Chengyang Ying","Xingxing Zhang","Hang Su","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2405.19885v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19883v1","updated":"2024-05-30T09:42:54Z","published":"2024-05-30T09:42:54Z","title":"From Words to Actions: Unveiling the Theoretical Underpinnings of\n LLM-Driven Autonomous Systems","summary":" In this work, from a theoretical lens, we aim to understand why large\nlanguage model (LLM) empowered agents are able to solve decision-making\nproblems in the physical world. To this end, consider a hierarchical\nreinforcement learning (RL) model where the LLM Planner and the Actor perform\nhigh-level task planning and low-level execution, respectively. Under this\nmodel, the LLM Planner navigates a partially observable Markov decision process\n(POMDP) by iteratively generating language-based subgoals via prompting. Under\nproper assumptions on the pretraining data, we prove that the pretrained LLM\nPlanner effectively performs Bayesian aggregated imitation learning (BAIL)\nthrough in-context learning. Additionally, we highlight the necessity for\nexploration beyond the subgoals derived from BAIL by proving that naively\nexecuting the subgoals returned by LLM leads to a linear regret. As a remedy,\nwe introduce an $\\epsilon$-greedy exploration strategy to BAIL, which is proven\nto incur sublinear regret when the pretraining error is small. Finally, we\nextend our theoretical framework to include scenarios where the LLM Planner\nserves as a world model for inferring the transition model of the environment\nand to multi-agent settings, enabling coordination among multiple Actors.\n","authors":["Jianliang He","Siyu Chen","Fengzhuo Zhang","Zhuoran Yang"],"pdf_url":"https://arxiv.org/pdf/2405.19883v1.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2405.06263v2","updated":"2024-05-30T09:40:02Z","published":"2024-05-10T06:28:42Z","title":"Learning Latent Dynamic Robust Representations for World Models","summary":" Visual Model-Based Reinforcement Learning (MBRL) promises to encapsulate\nagent's knowledge about the underlying dynamics of the environment, enabling\nlearning a world model as a useful planner. However, top MBRL agents such as\nDreamer often struggle with visual pixel-based inputs in the presence of\nexogenous or irrelevant noise in the observation space, due to failure to\ncapture task-specific features while filtering out irrelevant spatio-temporal\ndetails. To tackle this problem, we apply a spatio-temporal masking strategy, a\nbisimulation principle, combined with latent reconstruction, to capture\nendogenous task-specific aspects of the environment for world models,\neffectively eliminating non-essential information. Joint training of\nrepresentations, dynamics, and policy often leads to instabilities. To further\naddress this issue, we develop a Hybrid Recurrent State-Space Model (HRSSM)\nstructure, enhancing state representation robustness for effective policy\nlearning. Our empirical evaluation demonstrates significant performance\nimprovements over existing methods in a range of visually complex control tasks\nsuch as Maniskill \\cite{gu2023maniskill2} with exogenous distractors from the\nMatterport environment. Our code is avaliable at\nhttps://github.com/bit1029public/HRSSM.\n","authors":["Ruixiang Sun","Hongyu Zang","Xin Li","Riashat Islam"],"pdf_url":"https://arxiv.org/pdf/2405.06263v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08732v2","updated":"2024-05-30T09:37:30Z","published":"2023-10-12T21:39:16Z","title":"Provably Robust Cost-Sensitive Learning via Randomized Smoothing","summary":" We study the problem of robust learning against adversarial perturbations\nunder cost-sensitive scenarios, where the potential harm of different types of\nmisclassifications is encoded in a cost matrix. Existing approaches are either\nempirical and cannot certify robustness or suffer from inherent scalability\nissues. In this work, we investigate whether randomized smoothing, a scalable\nframework for robustness certification, can be leveraged to certify and train\nfor cost-sensitive robustness. Built upon the notion of cost-sensitive\ncertified radius, we first illustrate how to adapt the standard certification\nalgorithm of randomized smoothing to produce tight robustness certificates for\nany binary cost matrix, and then develop a robust training method to promote\ncertified cost-sensitive robustness while maintaining the model's overall\naccuracy. Through extensive experiments on image benchmarks, we demonstrate the\nsuperiority of our proposed certification algorithm and training method under\nvarious cost-sensitive scenarios. Our implementation is available as open\nsource code at: https://github.com/TrustMLRG/CS-RS.\n","authors":["Yuan Xin","Michael Backes","Xiao Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08732v2.pdf","comment":"19 pages, 9 tables, 5 figures"},{"id":"http://arxiv.org/abs/2405.19878v1","updated":"2024-05-30T09:34:31Z","published":"2024-05-30T09:34:31Z","title":"Learning from Random Demonstrations: Offline Reinforcement Learning with\n Importance-Sampled Diffusion Models","summary":" Generative models such as diffusion have been employed as world models in\noffline reinforcement learning to generate synthetic data for more effective\nlearning. Existing work either generates diffusion models one-time prior to\ntraining or requires additional interaction data to update it. In this paper,\nwe propose a novel approach for offline reinforcement learning with closed-loop\npolicy evaluation and world-model adaptation. It iteratively leverages a guided\ndiffusion world model to directly evaluate the offline target policy with\nactions drawn from it, and then performs an importance-sampled world model\nupdate to adaptively align the world model with the updated policy. We analyzed\nthe performance of the proposed method and provided an upper bound on the\nreturn gap between our method and the real environment under an optimal policy.\nThe result sheds light on various factors affecting learning performance.\nEvaluations in the D4RL environment show significant improvement over\nstate-of-the-art baselines, especially when only random or medium-expertise\ndemonstrations are available -- thus requiring improved alignment between the\nworld model and offline policy evaluation.\n","authors":["Zeyu Fang","Tian Lan"],"pdf_url":"https://arxiv.org/pdf/2405.19878v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19874v1","updated":"2024-05-30T09:28:56Z","published":"2024-05-30T09:28:56Z","title":"Is In-Context Learning Sufficient for Instruction Following in LLMs?","summary":" In-context learning (ICL) allows LLMs to learn from examples without changing\ntheir weights, which is a particularly promising capability for long-context\nLLMs that can potentially learn from many examples. Recently, Lin et al. (2024)\nproposed URIAL, a method using only three in-context examples to align base\nLLMs, achieving non-trivial instruction following performance. In this work, we\nshow that, while effective, ICL alignment with URIAL still underperforms\ncompared to instruction fine-tuning on established benchmarks such as MT-Bench\nand AlpacaEval 2.0 (LC), especially with more capable base LMs. Unlike for\ntasks such as classification, translation, or summarization, adding more ICL\ndemonstrations for long-context LLMs does not systematically improve\ninstruction following performance. To address this limitation, we derive a\ngreedy selection approach for ICL examples that noticeably improves\nperformance, yet without bridging the gap to instruction fine-tuning. Finally,\nwe provide a series of ablation studies to better understand the reasons behind\nthe remaining gap, and we show how some aspects of ICL depart from the existing\nknowledge and are specific to the instruction tuning setting. Overall, our work\nadvances the understanding of ICL as an alignment technique. We provide our\ncode at https://github.com/tml-epfl/icl-alignment.\n","authors":["Hao Zhao","Maksym Andriushchenko","Francesco Croce","Nicolas Flammarion"],"pdf_url":"https://arxiv.org/pdf/2405.19874v1.pdf","comment":"Preprint. Code at https://github.com/tml-epfl/icl-alignment"},{"id":"http://arxiv.org/abs/2405.19870v1","updated":"2024-05-30T09:23:48Z","published":"2024-05-30T09:23:48Z","title":"On Vessel Location Forecasting and the Effect of Federated Learning","summary":" The wide spread of Automatic Identification System (AIS) has motivated\nseveral maritime analytics operations. Vessel Location Forecasting (VLF) is one\nof the most critical operations for maritime awareness. However, accurate VLF\nis a challenging problem due to the complexity and dynamic nature of maritime\ntraffic conditions. Furthermore, as privacy concerns and restrictions have\ngrown, training data has become increasingly fragmented, resulting in dispersed\ndatabases of several isolated data silos among different organizations, which\nin turn decreases the quality of learning models. In this paper, we propose an\nefficient VLF solution based on LSTM neural networks, in two variants, namely\nNautilus and FedNautilus for the centralized and the federated learning\napproach, respectively. We also demonstrate the superiority of the centralized\napproach with respect to current state of the art and discuss the advantages\nand disadvantages of the federated against the centralized approach.\n","authors":["Andreas Tritsarolis","Nikos Pelekis","Konstantina Bereta","Dimitris Zissis","Yannis Theodoridis"],"pdf_url":"https://arxiv.org/pdf/2405.19870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.03917v3","updated":"2024-05-30T09:15:06Z","published":"2024-02-06T11:35:02Z","title":"Elastic Feature Consolidation for Cold Start Exemplar-Free Incremental\n Learning","summary":" Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a\nsequence of tasks without having access to previous task data. In this paper,\nwe consider the challenging Cold Start scenario in which insufficient data is\navailable in the first task to learn a high-quality backbone. This is\nespecially challenging for EFCIL since it requires high plasticity, which\nresults in feature drift which is difficult to compensate for in the\nexemplar-free setting. To address this problem, we propose a simple and\neffective approach that consolidates feature representations by regularizing\ndrift in directions highly relevant to previous tasks and employs prototypes to\nreduce task-recency bias. Our method, called Elastic Feature Consolidation\n(EFC), exploits a tractable second-order approximation of feature drift based\non an Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in\nfeature space which we use to regularize feature drift in important directions\nand to update Gaussian prototypes used in a novel asymmetric cross entropy loss\nwhich effectively balances prototype rehearsal with data from new tasks.\nExperimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset and\nImageNet-1K demonstrate that Elastic Feature Consolidation is better able to\nlearn new tasks by maintaining model plasticity and significantly outperform\nthe state-of-the-art.\n","authors":["Simone Magistri","Tomaso Trinci","Albin Soutif-Cormerais","Joost van de Weijer","Andrew D. Bagdanov"],"pdf_url":"https://arxiv.org/pdf/2402.03917v3.pdf","comment":"Accepted at Twelfth International Conference on Learning\n Representations (ICLR 2024)"},{"id":"http://arxiv.org/abs/2405.19864v1","updated":"2024-05-30T09:14:01Z","published":"2024-05-30T09:14:01Z","title":"Out-of-distribution Reject Option Method for Dataset Shift Problem in\n Early Disease Onset Prediction","summary":" Machine learning is increasingly used to predict lifestyle-related disease\nonset using health and medical data. However, the prediction effectiveness is\nhindered by dataset shift, which involves discrepancies in data distribution\nbetween the training and testing datasets, misclassifying out-of-distribution\n(OOD) data. To diminish dataset shift effects, this paper proposes the\nout-of-distribution reject option for prediction (ODROP), which integrates OOD\ndetection models to preclude OOD data from the prediction phase. We\ninvestigated the efficacy of five OOD detection methods (variational\nautoencoder, neural network ensemble std, neural network ensemble epistemic,\nneural network energy, and neural network gaussian mixture based energy\nmeasurement) across two datasets, the Hirosaki and Wakayama health checkup\ndata, in the context of three disease onset prediction tasks: diabetes,\ndyslipidemia, and hypertension. To evaluate the ODROP method, we trained\ndisease onset prediction models and OOD detection models on Hirosaki data and\nused AUROC-rejection curve plots from Wakayama data. The variational\nautoencoder method showed superior stability and magnitude of improvement in\nArea Under the Receiver Operating Curve (AUROC) in five cases: AUROC in the\nWakayama data was improved from 0.80 to 0.90 at a 31.1% rejection rate for\ndiabetes onset and from 0.70 to 0.76 at a 34% rejection rate for dyslipidemia.\nWe categorized dataset shifts into two types using SHAP clustering - those that\nconsiderably affect predictions and those that do not. We expect that this\nclassification will help standardize measuring instruments. This study is the\nfirst to apply OOD detection to actual health and medical data, demonstrating\nits potential to substantially improve the accuracy and reliability of disease\nprediction models amidst dataset shift.\n","authors":["Taisei Tosaki","Eiichiro Uchino","Ryosuke Kojima","Yohei Mineharu","Mikio Arita","Nobuyuki Miyai","Yoshinori Tamada","Tatsuya Mikami","Koichi Murashita","Shigeyuki Nakaji","Yasushi Okuno"],"pdf_url":"https://arxiv.org/pdf/2405.19864v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05337v2","updated":"2024-05-30T09:11:51Z","published":"2023-09-11T09:34:44Z","title":"Stochastic Gradient Descent-like relaxation is equivalent to Metropolis\n dynamics in discrete optimization and inference problems","summary":" Is Stochastic Gradient Descent (SGD) substantially different from Metropolis\nMonte Carlo dynamics? This is a fundamental question at the time of\nunderstanding the most used training algorithm in the field of Machine\nLearning, but it received no answer until now. Here we show that in discrete\noptimization and inference problems, the dynamics of an SGD-like algorithm\nresemble very closely that of Metropolis Monte Carlo with a properly chosen\ntemperature, which depends on the mini-batch size. This quantitative matching\nholds both at equilibrium and in the out-of-equilibrium regime, despite the two\nalgorithms having fundamental differences (e.g.\\ SGD does not satisfy detailed\nbalance). Such equivalence allows us to use results about performances and\nlimits of Monte Carlo algorithms to optimize the mini-batch size in the\nSGD-like algorithm and make it efficient at recovering the signal in hard\ninference problems.\n","authors":["Maria Chiara Angelini","Angelo Giorgio Cavaliere","Raffaele Marino","Federico Ricci-Tersenghi"],"pdf_url":"https://arxiv.org/pdf/2309.05337v2.pdf","comment":"19 pages, 9 figures"},{"id":"http://arxiv.org/abs/2309.16733v2","updated":"2024-05-30T09:02:53Z","published":"2023-09-27T19:22:19Z","title":"Resilience of Deep Learning applications: a systematic literature review\n of analysis and hardening techniques","summary":" Machine Learning (ML) is currently being exploited in numerous applications\nbeing one of the most effective Artificial Intelligence (AI) technologies, used\nin diverse fields, such as vision, autonomous systems, and alike. The trend\nmotivated a significant amount of contributions to the analysis and design of\nML applications against faults affecting the underlying hardware. The authors\ninvestigate the existing body of knowledge on Deep Learning (among ML\ntechniques) resilience against hardware faults systematically through a\nthoughtful review in which the strengths and weaknesses of this literature\nstream are presented clearly and then future avenues of research are set out.\nThe review is based on 220 scientific articles published between January 2019\nand March 2024. The authors adopt a classifying framework to interpret and\nhighlight research similarities and peculiarities, based on several parameters,\nstarting from the main scope of the work, the adopted fault and error models,\nto their reproducibility. This framework allows for a comparison of the\ndifferent solutions and the identification of possible synergies. Furthermore,\nsuggestions concerning the future direction of research are proposed in the\nform of open challenges to be addressed.\n","authors":["Cristiana Bolchini","Luca Cassano","Antonio Miele"],"pdf_url":"https://arxiv.org/pdf/2309.16733v2.pdf","comment":"Submitted to Elsevier Computer Science Review on May 9, 2024"},{"id":"http://arxiv.org/abs/2402.08871v2","updated":"2024-05-30T08:52:56Z","published":"2024-02-14T00:35:10Z","title":"Position: Topological Deep Learning is the New Frontier for Relational\n Learning","summary":" Topological deep learning (TDL) is a rapidly evolving field that uses\ntopological features to understand and design deep learning models. This paper\nposits that TDL is the new frontier for relational learning. TDL may complement\ngraph representation learning and geometric deep learning by incorporating\ntopological concepts, and can thus provide a natural choice for various machine\nlearning settings. To this end, this paper discusses open problems in TDL,\nranging from practical benefits to theoretical foundations. For each problem,\nit outlines potential solutions and future research opportunities. At the same\ntime, this paper serves as an invitation to the scientific community to\nactively participate in TDL research to unlock the potential of this emerging\nfield.\n","authors":["Theodore Papamarkou","Tolga Birdal","Michael Bronstein","Gunnar Carlsson","Justin Curry","Yue Gao","Mustafa Hajij","Roland Kwitt","Pietro Liò","Paolo Di Lorenzo","Vasileios Maroulas","Nina Miolane","Farzana Nasrin","Karthikeyan Natesan Ramamurthy","Bastian Rieck","Simone Scardapane","Michael T. Schaub","Petar Veličković","Bei Wang","Yusu Wang","Guo-Wei Wei","Ghada Zamzmi"],"pdf_url":"https://arxiv.org/pdf/2402.08871v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19836v1","updated":"2024-05-30T08:45:45Z","published":"2024-05-30T08:45:45Z","title":"The Merit of River Network Topology for Neural Flood Forecasting","summary":" Climate change exacerbates riverine floods, which occur with higher frequency\nand intensity than ever. The much-needed forecasting systems typically rely on\naccurate river discharge predictions. To this end, the SOTA data-driven\napproaches treat forecasting at spatially distributed gauge stations as\nisolated problems, even within the same river network. However, incorporating\nthe known topology of the river network into the prediction model has the\npotential to leverage the adjacency relationship between gauges. Thus, we model\nriver discharge for a network of gauging stations with GNNs and compare the\nforecasting performance achieved by different adjacency definitions. Our\nresults show that the model fails to benefit from the river network topology\ninformation, both on the entire network and small subgraphs. The learned edge\nweights correlate with neither of the static definitions and exhibit no regular\npattern. Furthermore, the GNNs struggle to predict sudden, narrow discharge\nspikes. Our work hints at a more general underlying phenomenon of neural\nprediction not always benefitting from graphical structure and may inspire a\nsystematic study of the conditions under which this happens.\n","authors":["Nikolas Kirschstein","Yixuan Sun"],"pdf_url":"https://arxiv.org/pdf/2405.19836v1.pdf","comment":"https://openreview.net/forum?id=QE6iC9s6vU"},{"id":"http://arxiv.org/abs/2403.11904v3","updated":"2024-05-30T08:37:45Z","published":"2024-03-18T16:04:55Z","title":"CICLe: Conformal In-Context Learning for Largescale Multi-Class Food\n Risk Classification","summary":" Contaminated or adulterated food poses a substantial risk to human health.\nGiven sets of labeled web texts for training, Machine Learning and Natural\nLanguage Processing can be applied to automatically detect such risks. We\npublish a dataset of 7,546 short texts describing public food recall\nannouncements. Each text is manually labeled, on two granularity levels (coarse\nand fine), for food products and hazards that the recall corresponds to. We\ndescribe the dataset and benchmark naive, traditional, and Transformer models.\nBased on our analysis, Logistic Regression based on a tf-idf representation\noutperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss\ndifferent prompting strategies and present an LLM-in-the-loop framework, based\non Conformal Prediction, which boosts the performance of the base classifier\nwhile reducing energy consumption compared to normal prompting.\n","authors":["Korbinian Randl","John Pavlopoulos","Aron Henriksson","Tony Lindgren"],"pdf_url":"https://arxiv.org/pdf/2403.11904v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19823v1","updated":"2024-05-30T08:31:18Z","published":"2024-05-30T08:31:18Z","title":"Joint Selective State Space Model and Detrending for Robust Time Series\n Anomaly Detection","summary":" Deep learning-based sequence models are extensively employed in Time Series\nAnomaly Detection (TSAD) tasks due to their effective sequential modeling\ncapabilities. However, the ability of TSAD is limited by two key challenges:\n(i) the ability to model long-range dependency and (ii) the generalization\nissue in the presence of non-stationary data. To tackle these challenges, an\nanomaly detector that leverages the selective state space model known for its\nproficiency in capturing long-term dependencies across various domains is\nproposed. Additionally, a multi-stage detrending mechanism is introduced to\nmitigate the prominent trend component in non-stationary data to address the\ngeneralization issue. Extensive experiments conducted on realworld public\ndatasets demonstrate that the proposed methods surpass all 12 compared baseline\nmethods.\n","authors":["Junqi Chen","Xu Tan","Sylwan Rahardja","Jiawei Yang","Susanto Rahardja"],"pdf_url":"https://arxiv.org/pdf/2405.19823v1.pdf","comment":"Submitted to IEEE Signal Processing Letters"},{"id":"http://arxiv.org/abs/2402.17257v3","updated":"2024-05-30T08:24:54Z","published":"2024-02-27T07:03:25Z","title":"RIME: Robust Preference-based Reinforcement Learning with Noisy\n Preferences","summary":" Preference-based Reinforcement Learning (PbRL) circumvents the need for\nreward engineering by harnessing human preferences as the reward signal.\nHowever, current PbRL methods excessively depend on high-quality feedback from\ndomain experts, which results in a lack of robustness. In this paper, we\npresent RIME, a robust PbRL algorithm for effective reward learning from noisy\npreferences. Our method utilizes a sample selection-based discriminator to\ndynamically filter out noise and ensure robust training. To counteract the\ncumulative error stemming from incorrect selection, we suggest a warm start for\nthe reward model, which additionally bridges the performance gap during the\ntransition from pre-training to online training in PbRL. Our experiments on\nrobotic manipulation and locomotion tasks demonstrate that RIME significantly\nenhances the robustness of the state-of-the-art PbRL method. Code is available\nat https://github.com/CJReinforce/RIME_ICML2024.\n","authors":["Jie Cheng","Gang Xiong","Xingyuan Dai","Qinghai Miao","Yisheng Lv","Fei-Yue Wang"],"pdf_url":"https://arxiv.org/pdf/2402.17257v3.pdf","comment":"Accepted by ICML2024"},{"id":"http://arxiv.org/abs/2405.19811v1","updated":"2024-05-30T08:20:34Z","published":"2024-05-30T08:20:34Z","title":"Approximate Global Convergence of Independent Learning in Multi-Agent\n Systems","summary":" Independent learning (IL), despite being a popular approach in practice to\nachieve scalability in large-scale multi-agent systems, usually lacks global\nconvergence guarantees. In this paper, we study two representative algorithms,\nindependent $Q$-learning and independent natural actor-critic, within\nvalue-based and policy-based frameworks, and provide the first finite-sample\nanalysis for approximate global convergence. The results imply a sample\ncomplexity of $\\tilde{\\mathcal{O}}(\\epsilon^{-2})$ up to an error term that\ncaptures the dependence among agents and characterizes the fundamental limit of\nIL in achieving global convergence. To establish the result, we develop a novel\napproach for analyzing IL by constructing a separable Markov decision process\n(MDP) for convergence analysis and then bounding the gap due to model\ndifference between the separable MDP and the original one. Moreover, we conduct\nnumerical experiments using a synthetic MDP and an electric vehicle charging\nexample to verify our theoretical findings and to demonstrate the practical\napplicability of IL.\n","authors":["Ruiyang Jin","Zaiwei Chen","Yiheng Lin","Jie Song","Adam Wierman"],"pdf_url":"https://arxiv.org/pdf/2405.19811v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13047v2","updated":"2024-05-30T08:19:34Z","published":"2023-08-24T19:27:59Z","title":"Federated Causal Inference from Observational Data","summary":" Decentralized data sources are prevalent in real-world applications, posing a\nformidable challenge for causal inference. These sources cannot be consolidated\ninto a single entity owing to privacy constraints. The presence of dissimilar\ndata distributions and missing values within them can potentially introduce\nbias to the causal estimands. In this article, we propose a framework to\nestimate causal effects from decentralized data sources. The proposed framework\navoid exchanging raw data among the sources, thus contributing towards\nprivacy-preserving causal learning. Three instances of the proposed framework\nare introduced to estimate causal effects across a wide range of diverse\nscenarios within a federated setting. (1) FedCI: a Bayesian framework based on\nGaussian processes for estimating causal effects from federated observational\ndata sources. It estimates the posterior distributions of the causal effects to\ncompute the higher-order statistics that capture the uncertainty. (2)\nCausalRFF: an adaptive transfer algorithm that learns the similarities among\nthe data sources by utilizing Random Fourier Features to disentangle the loss\nfunction into multiple components, each of which is associated with a data\nsource. It estimates the similarities among the sources through transfer\ncoefficients, and hence requiring no prior information about the similarity\nmeasures. (3) CausalFI: a new approach for federated causal inference from\nincomplete data, enabling the estimation of causal effects from multiple\ndecentralized and incomplete data sources. It accounts for the missing data\nunder the missing at random assumption, while also estimating higher-order\nstatistics of the causal estimands. The proposed federated framework and its\ninstances are an important step towards a privacy-preserving causal learning\nmodel.\n","authors":["Thanh Vinh Vo","Young lee","Tze-Yun Leong"],"pdf_url":"https://arxiv.org/pdf/2308.13047v2.pdf","comment":"Preprint. arXiv admin note: substantial text overlap with\n arXiv:2301.00346"},{"id":"http://arxiv.org/abs/2405.19807v1","updated":"2024-05-30T08:17:00Z","published":"2024-05-30T08:17:00Z","title":"MetaCURL: Non-stationary Concave Utility Reinforcement Learning","summary":" We explore online learning in episodic loop-free Markov decision processes on\nnon-stationary environments (changing losses and probability transitions). Our\nfocus is on the Concave Utility Reinforcement Learning problem (CURL), an\nextension of classical RL for handling convex performance criteria in\nstate-action distributions induced by agent policies. While various machine\nlearning problems can be written as CURL, its non-linearity invalidates\ntraditional Bellman equations. Despite recent solutions to classical CURL, none\naddress non-stationary MDPs. This paper introduces MetaCURL, the first CURL\nalgorithm for non-stationary MDPs. It employs a meta-algorithm running multiple\nblack-box algorithms instances over different intervals, aggregating outputs\nvia a sleeping expert framework. The key hurdle is partial information due to\nMDP uncertainty. Under partial information on the probability transitions\n(uncertainty and non-stationarity coming only from external noise, independent\nof agent state-action pairs), we achieve optimal dynamic regret without prior\nknowledge of MDP changes. Unlike approaches for RL, MetaCURL handles full\nadversarial losses, not just stochastic ones. We believe our approach for\nmanaging non-stationarity with experts can be of interest to the RL community.\n","authors":["Bianca Marin Moreno","Margaux Brégère","Pierre Gaillard","Nadia Oudjane"],"pdf_url":"https://arxiv.org/pdf/2405.19807v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20291v1","updated":"2024-05-30T17:41:32Z","published":"2024-05-30T17:41:32Z","title":"Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning\n Weight Changes and Backdoor Activeness","summary":" The security threat of backdoor attacks is a central concern for deep neural\nnetworks (DNNs). Recently, without poisoned data, unlearning models with clean\ndata and then learning a pruning mask have contributed to backdoor defense.\nAdditionally, vanilla fine-tuning with those clean data can help recover the\nlost clean accuracy. However, the behavior of clean unlearning is still\nunder-explored, and vanilla fine-tuning unintentionally induces back the\nbackdoor effect. In this work, we first investigate model unlearning from the\nperspective of weight changes and gradient norms, and find two interesting\nobservations in the backdoored model: 1) the weight changes between poison and\nclean unlearning are positively correlated, making it possible for us to\nidentify the backdoored-related neurons without using poisoned data; 2) the\nneurons of the backdoored model are more active (i.e., larger changes in\ngradient norm) than those in the clean model, suggesting the need to suppress\nthe gradient norm during fine-tuning. Then, we propose an effective two-stage\ndefense method. In the first stage, an efficient Neuron Weight Change\n(NWC)-based Backdoor Reinitialization is proposed based on observation 1). In\nthe second stage, based on observation 2), we design an Activeness-Aware\nFine-Tuning to replace the vanilla fine-tuning. Extensive experiments,\ninvolving eight backdoor attacks on three benchmark datasets, demonstrate the\nsuperior performance of our proposed method compared to recent state-of-the-art\nbackdoor defense approaches.\n","authors":["Weilin Lin","Li Liu","Shaokui Wei","Jianze Li","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2405.20291v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2405.20078v1","updated":"2024-05-30T14:08:09Z","published":"2024-05-30T14:08:09Z","title":"NeRF View Synthesis: Subjective Quality Assessment and Objective Metrics\n Evaluation","summary":" Neural radiance fields (NeRF) are a groundbreaking computer vision technology\nthat enables the generation of high-quality, immersive visual content from\nmultiple viewpoints. This capability holds significant advantages for\napplications such as virtual/augmented reality, 3D modelling and content\ncreation for the film and entertainment industry. However, the evaluation of\nNeRF methods poses several challenges, including a lack of comprehensive\ndatasets, reliable assessment methodologies, and objective quality metrics.\nThis paper addresses the problem of NeRF quality assessment thoroughly, by\nconducting a rigorous subjective quality assessment test that considers several\nscene classes and recently proposed NeRF view synthesis methods. Additionally,\nthe performance of a wide range of state-of-the-art conventional and\nlearning-based full-reference 2D image and video quality assessment metrics is\nevaluated against the subjective scores of the subjective study. The\nexperimental results are analyzed in depth, providing a comparative evaluation\nof several NeRF methods and objective quality metrics, across different classes\nof visual scenes, including real and synthetic content for front-face and\n360-degree camera trajectories.\n","authors":["Pedro Martin","Antonio Rodrigues","Joao Ascenso","Maria Paula Queluz"],"pdf_url":"https://arxiv.org/pdf/2405.20078v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20032v1","updated":"2024-05-30T13:16:48Z","published":"2024-05-30T13:16:48Z","title":"Promptus: Can Prompts Streaming Replace Video Streaming with Stable\n Diffusion","summary":" With the exponential growth of video traffic, traditional video streaming\nsystems are approaching their limits in compression efficiency and\ncommunication capacity. To further reduce bitrate while maintaining quality, we\npropose Promptus, a disruptive novel system that streaming prompts instead of\nvideo content with Stable Diffusion, which converts video frames into a series\nof \"prompts\" for delivery. To ensure pixel alignment, a gradient descent-based\nprompt fitting framework is proposed. To achieve adaptive bitrate for prompts,\na low-rank decomposition-based bitrate control algorithm is introduced. For\ninter-frame compression of prompts, a temporal smoothing-based prompt\ninterpolation algorithm is proposed. Evaluations across various video domains\nand real network traces demonstrate Promptus can enhance the perceptual quality\nby 0.111 and 0.092 (in LPIPS) compared to VAE and H.265, respectively, and\ndecreases the ratio of severely distorted frames by 89.3% and 91.7%. Moreover,\nPromptus achieves real-time video generation from prompts at over 150 FPS. To\nthe best of our knowledge, Promptus is the first attempt to replace video\ncodecs with prompt inversion and the first to use prompt streaming instead of\nvideo streaming. Our work opens up a new paradigm for efficient video\ncommunication beyond the Shannon limit.\n","authors":["Jiangkai Wu","Liming Liu","Yunpeng Tan","Junlin Hao","Xinggong Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20032v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19889v1","updated":"2024-05-30T09:46:59Z","published":"2024-05-30T09:46:59Z","title":"Deep Joint Semantic Coding and Beamforming for Near-Space Airship-Borne\n Massive MIMO Network","summary":" Near-space airship-borne communication network is recognized to be an\nindispensable component of the future integrated ground-air-space network\nthanks to airships' advantage of long-term residency at stratospheric\naltitudes, but it urgently needs reliable and efficient Airship-to-X link. To\nimprove the transmission efficiency and capacity, this paper proposes to\nintegrate semantic communication with massive multiple-input multiple-output\n(MIMO) technology. Specifically, we propose a deep joint semantic coding and\nbeamforming (JSCBF) scheme for airship-based massive MIMO image transmission\nnetwork in space, in which semantics from both source and channel are fused to\njointly design the semantic coding and physical layer beamforming. First, we\ndesign two semantic extraction networks to extract semantics from image source\nand channel state information, respectively. Then, we propose a semantic fusion\nnetwork that can fuse these semantics into complex-valued semantic features for\nsubsequent physical-layer transmission. To efficiently transmit the fused\nsemantic features at the physical layer, we then propose the hybrid data and\nmodel-driven semantic-aware beamforming networks. At the receiver, a semantic\ndecoding network is designed to reconstruct the transmitted images. Finally, we\nperform end-to-end deep learning to jointly train all the modules, using the\nimage reconstruction quality at the receivers as a metric. The proposed deep\nJSCBF scheme fully combines the efficient source compressibility and robust\nerror correction capability of semantic communication with the high spectral\nefficiency of massive MIMO, achieving a significant performance improvement\nover existing approaches.\n","authors":["Minghui Wu","Zhen Gao","Zhaocheng Wang","Dusit Niyato","George K. Karagiannidis","Sheng Chen"],"pdf_url":"https://arxiv.org/pdf/2405.19889v1.pdf","comment":"Major Revision by IEEE JSAC"},{"id":"http://arxiv.org/abs/2405.19802v1","updated":"2024-05-30T08:12:08Z","published":"2024-05-30T08:12:08Z","title":"Exploring the Robustness of Decision-Level Through Adversarial Attacks\n on LLM-Based Embodied Models","summary":" Embodied intelligence empowers agents with a profound sense of perception,\nenabling them to respond in a manner closely aligned with real-world\nsituations. Large Language Models (LLMs) delve into language instructions with\ndepth, serving a crucial role in generating plans for intricate tasks. Thus,\nLLM-based embodied models further enhance the agent's capacity to comprehend\nand process information. However, this amalgamation also ushers in new\nchallenges in the pursuit of heightened intelligence. Specifically, attackers\ncan manipulate LLMs to produce irrelevant or even malicious outputs by altering\ntheir prompts. Confronted with this challenge, we observe a notable absence of\nmulti-modal datasets essential for comprehensively evaluating the robustness of\nLLM-based embodied models. Consequently, we construct the Embodied Intelligent\nRobot Attack Dataset (EIRAD), tailored specifically for robustness evaluation.\nAdditionally, two attack strategies are devised, including untargeted attacks\nand targeted attacks, to effectively simulate a range of diverse attack\nscenarios. At the same time, during the attack process, to more accurately\nascertain whether our method is successful in attacking the LLM-based embodied\nmodel, we devise a new attack success evaluation method utilizing the BLIP2\nmodel. Recognizing the time and cost-intensive nature of the GCG algorithm in\nattacks, we devise a scheme for prompt suffix initialization based on various\ntarget tasks, thus expediting the convergence process. Experimental results\ndemonstrate that our method exhibits a superior attack success rate when\ntargeting LLM-based embodied models, indicating a lower level of decision-level\nrobustness in these models.\n","authors":["Shuyuan Liu","Jiawei Chen","Shouwei Ruan","Hang Su","Zhaoxia Yin"],"pdf_url":"https://arxiv.org/pdf/2405.19802v1.pdf","comment":null}]},"2024-05-31T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2405.21075v1","updated":"2024-05-31T17:59:47Z","published":"2024-05-31T17:59:47Z","title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of\n Multi-modal LLMs in Video Analysis","summary":" In the quest for artificial general intelligence, Multi-modal Large Language\nModels (MLLMs) have emerged as a focal point in recent advancements. However,\nthe predominant focus remains on developing their capabilities in static image\nunderstanding. The potential of MLLMs in processing sequential visual data is\nstill insufficiently explored, highlighting the absence of a comprehensive,\nhigh-quality assessment of their performance. In this paper, we introduce\nVideo-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of\nMLLMs in Video analysis. Our work distinguishes from existing benchmarks\nthrough four key features: 1) Diversity in video types, spanning 6 primary\nvisual domains with 30 subfields to ensure broad scenario generalizability; 2)\nDuration in temporal dimension, encompassing both short-, medium-, and\nlong-term videos, ranging from 11 seconds to 1 hour, for robust contextual\ndynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides\nvideo frames, including subtitles and audios, to unveil the all-round\ncapabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual\nlabeling by expert annotators to facilitate precise and reliable model\nassessment. 900 videos with a total of 256 hours are manually selected and\nannotated by repeatedly viewing all the video content, resulting in 2,700\nquestion-answer pairs. With Video-MME, we extensively evaluate various\nstate-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as\nopen-source image models like InternVL-Chat-V1.5 and video models like\nLLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the\nbest-performing commercial model, significantly outperforming the open-source\nmodels. Our dataset along with these findings underscores the need for further\nimprovements in handling longer sequences and multi-modal data. Project Page:\nhttps://video-mme.github.io\n","authors":["Chaoyou Fu","Yuhan Dai","Yondong Luo","Lei Li","Shuhuai Ren","Renrui Zhang","Zihan Wang","Chenyu Zhou","Yunhang Shen","Mengdan Zhang","Peixian Chen","Yanwei Li","Shaohui Lin","Sirui Zhao","Ke Li","Tong Xu","Xiawu Zheng","Enhong Chen","Rongrong Ji","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2405.21075v1.pdf","comment":"Project Page: https://video-mme.github.io"},{"id":"http://arxiv.org/abs/2405.21070v1","updated":"2024-05-31T17:57:24Z","published":"2024-05-31T17:57:24Z","title":"Generalization Beyond Data Imbalance: A Controlled Study on CLIP for\n Transferable Insights","summary":" Severe data imbalance naturally exists among web-scale vision-language\ndatasets. Despite this, we find CLIP pre-trained thereupon exhibits notable\nrobustness to the data imbalance compared to supervised learning, and\ndemonstrates significant effectiveness in learning generalizable\nrepresentations. With an aim to investigate the reasons behind this finding, we\nconduct controlled experiments to study various underlying factors, and reveal\nthat CLIP's pretext task forms a dynamic classification problem wherein only a\nsubset of classes is present in training. This isolates the bias from dominant\nclasses and implicitly balances the learning signal. Furthermore, the\nrobustness and discriminability of CLIP improve with more descriptive language\nsupervision, larger data scale, and broader open-world concepts, which are\ninaccessible to supervised learning. Our study not only uncovers the mechanisms\nbehind CLIP's generalizability beyond data imbalance but also provides\ntransferable insights for the research community. The findings are validated in\nboth supervised and self-supervised learning, enabling models trained on\nimbalanced data to achieve CLIP-level performance on diverse recognition tasks.\nCode will be available at: https://github.com/CVMI-Lab/clip-beyond-tail.\n","authors":["Xin Wen","Bingchen Zhao","Yilun Chen","Jiangmiao Pang","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2405.21070v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21068v1","updated":"2024-05-31T17:56:33Z","published":"2024-05-31T17:56:33Z","title":"Code Pretraining Improves Entity Tracking Abilities of Language Models","summary":" Recent work has provided indirect evidence that pretraining language models\non code improves the ability of models to track state changes of discourse\nentities expressed in natural language. In this work, we systematically test\nthis claim by comparing pairs of language models on their entity tracking\nperformance. Critically, the pairs consist of base models and models trained on\ntop of these base models with additional code data. We extend this analysis to\nadditionally examine the effect of math training, another highly structured\ndata type, and alignment tuning, an important step for enhancing the usability\nof models. We find clear evidence that models additionally trained on large\namounts of code outperform the base models. On the other hand, we find no\nconsistent benefit of additional math training or alignment tuning across\nvarious model families.\n","authors":["Najoung Kim","Sebastian Schuster","Shubham Toshniwal"],"pdf_url":"https://arxiv.org/pdf/2405.21068v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.15938v3","updated":"2024-05-31T17:49:03Z","published":"2024-02-24T23:54:41Z","title":"Generalization or Memorization: Data Contamination and Trustworthy\n Evaluation for Large Language Models","summary":" Recent statements about the impressive capabilities of large language models\n(LLMs) are usually supported by evaluating on open-access benchmarks.\nConsidering the vast size and wide-ranging sources of LLMs' training data, it\ncould explicitly or implicitly include test data, leading to LLMs being more\nsusceptible to data contamination. However, due to the opacity of training\ndata, the black-box access of models, and the rapid growth of synthetic\ntraining data, detecting and mitigating data contamination for LLMs faces\nsignificant challenges. In this paper, we propose CDD, which stands for\nContamination Detection via output Distribution for LLMs. CDD necessitates only\nthe sampled texts to detect data contamination, by identifying the peakedness\nof LLM's output distribution. To mitigate the impact of data contamination in\nevaluation, we also present TED: Trustworthy Evaluation via output\nDistribution, based on the correction of LLM's output distribution. To\nfacilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval,\nfor data contamination detection and contamination mitigation evaluation tasks.\nExtensive experimental results show that CDD achieves the average relative\nimprovements of 21.8\\%-30.2\\% over other contamination detection approaches in\nterms of Accuracy, F1 Score, and AUC metrics, and can effectively detect\nimplicit contamination. TED substantially mitigates performance improvements up\nto 66.9\\% attributed to data contamination across various contamination setups.\nIn real-world applications, we reveal that ChatGPT exhibits a high potential to\nsuffer from data contamination on HumanEval benchmark.\n","authors":["Yihong Dong","Xue Jiang","Huanyu Liu","Zhi Jin","Bin Gu","Mengfei Yang","Ge Li"],"pdf_url":"https://arxiv.org/pdf/2402.15938v3.pdf","comment":"Accepted to ACL"},{"id":"http://arxiv.org/abs/2405.08295v2","updated":"2024-05-31T17:47:40Z","published":"2024-05-14T03:33:31Z","title":"SpeechVerse: A Large-scale Generalizable Audio Language Model","summary":" Large language models (LLMs) have shown incredible proficiency in performing\ntasks that require semantic understanding of natural language instructions.\nRecently, many works have further expanded this capability to perceive\nmultimodal audio and text inputs, but their capabilities are often limited to\nspecific fine-tuned tasks such as automatic speech recognition and translation.\nWe therefore develop SpeechVerse, a robust multi-task training and curriculum\nlearning framework that combines pre-trained speech and text foundation models\nvia a small set of learnable parameters, while keeping the pre-trained models\nfrozen during training. The models are instruction finetuned using continuous\nlatent representations extracted from the speech foundation model to achieve\noptimal zero-shot performance on a diverse range of speech processing tasks\nusing natural language instructions. We perform extensive benchmarking that\nincludes comparing our model performance against traditional baselines across\nseveral datasets and tasks. Furthermore, we evaluate the model's capability for\ngeneralized instruction following by testing on out-of-domain datasets, novel\nprompts, and unseen tasks. Our empirical experiments reveal that our multi-task\nSpeechVerse model is even superior to conventional task-specific baselines on 9\nout of the 11 tasks.\n","authors":["Nilaksh Das","Saket Dingliwal","Srikanth Ronanki","Rohit Paturi","Zhaocheng Huang","Prashant Mathur","Jie Yuan","Dhanush Bekal","Xing Niu","Sai Muralidhar Jayanthi","Xilai Li","Karel Mundnich","Monica Sunkara","Sundararajan Srinivasan","Kyu J Han","Katrin Kirchhoff"],"pdf_url":"https://arxiv.org/pdf/2405.08295v2.pdf","comment":"Single Column, 13 page"},{"id":"http://arxiv.org/abs/2405.21047v1","updated":"2024-05-31T17:39:15Z","published":"2024-05-31T17:39:15Z","title":"Grammar-Aligned Decoding","summary":" Large Language Models (LLMs) struggle with reliably generating highly\nstructured outputs, such as program code, mathematical formulas, or well-formed\nmarkup. Constrained decoding approaches mitigate this problem by greedily\nrestricting what tokens an LLM can output at each step to guarantee that the\noutput matches a given constraint. Specifically, in grammar-constrained\ndecoding (GCD), the LLM's output must follow a given grammar. In this paper we\ndemonstrate that GCD techniques (and in general constrained decoding\ntechniques) can distort the LLM's distribution, leading to outputs that are\ngrammatical but appear with likelihoods that are not proportional to the ones\ngiven by the LLM, and so ultimately are low-quality. We call the problem of\naligning sampling with a grammar constraint, grammar-aligned decoding (GAD),\nand propose adaptive sampling with approximate expected futures (ASAp), a\ndecoding algorithm that guarantees the output to be grammatical while provably\nproducing outputs that match the conditional probability of the LLM's\ndistribution conditioned on the given grammar constraint. Our algorithm uses\nprior sample outputs to soundly overapproximate the future grammaticality of\ndifferent output prefixes. Our evaluation on code generation and structured NLP\ntasks shows how ASAp often produces outputs with higher likelihood (according\nto the LLM's distribution) than existing GCD techniques, while still enforcing\nthe desired grammatical constraints.\n","authors":["Kanghee Park","Jiayu Wang","Taylor Berg-Kirkpatrick","Nadia Polikarpova","Loris D'Antoni"],"pdf_url":"https://arxiv.org/pdf/2405.21047v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21046v1","updated":"2024-05-31T17:39:06Z","published":"2024-05-31T17:39:06Z","title":"Exploratory Preference Optimization: Harnessing Implicit\n Q*-Approximation for Sample-Efficient RLHF","summary":" Reinforcement learning from human feedback (RLHF) has emerged as a central\ntool for language model alignment. We consider online exploration in RLHF,\nwhich exploits interactive access to human or AI feedback by deliberately\nencouraging the model to produce diverse, maximally informative responses. By\nallowing RLHF to confidently stray from the pre-trained model, online\nexploration offers the possibility of novel, potentially super-human\ncapabilities, but its full potential as a paradigm for language model training\nhas yet to be realized, owing to computational and statistical bottlenecks in\ndirectly adapting existing reinforcement learning techniques. We propose a new\nalgorithm for online exploration in RLHF, Exploratory Preference Optimization\n(XPO), which is simple and practical -- a one-line change to (online) Direct\nPreference Optimization (DPO; Rafailov et al., 2023) -- yet enjoys the\nstrongest known provable guarantees and promising empirical performance. XPO\naugments the DPO objective with a novel and principled exploration bonus,\nempowering the algorithm to explore outside the support of the initial model\nand human feedback data. In theory, we show that XPO is provably\nsample-efficient and converges to a near-optimal language model policy under\nnatural exploration conditions, irrespective of whether the initial model has\ngood coverage. Our analysis, which builds on the observation that DPO\nimplicitly performs a form of $Q^{\\star}$-approximation (or, Bellman error\nminimization), combines previously disparate techniques from language modeling\nand theoretical reinforcement learning in a serendipitous fashion through the\nperspective of KL-regularized Markov decision processes. Empirically, we find\nthat XPO is more sample-efficient than non-exploratory DPO variants in a\npreliminary evaluation.\n","authors":["Tengyang Xie","Dylan J. Foster","Akshay Krishnamurthy","Corby Rosset","Ahmed Awadallah","Alexander Rakhlin"],"pdf_url":"https://arxiv.org/pdf/2405.21046v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.18239v2","updated":"2024-05-31T17:38:51Z","published":"2024-04-28T16:31:32Z","title":"SOUL: Unlocking the Power of Second-Order Optimization for LLM\n Unlearning","summary":" Large Language Models (LLMs) have highlighted the necessity of effective\nunlearning mechanisms to comply with data regulations and ethical AI practices.\nLLM unlearning aims at removing undesired data influences and associated model\ncapabilities without compromising utility out of the scope of unlearning. While\ninterest in studying LLM unlearning is growing,the impact of the optimizer\nchoice for LLM unlearning remains under-explored. In this work, we shed light\non the significance of optimizer selection in LLM unlearning for the first\ntime, establishing a clear connection between {second-order optimization} and\ninfluence unlearning (a classical approach using influence functions to update\nthe model for data influence removal). This insight propels us to develop a\nsecond-order unlearning framework, termed SOUL, built upon the second-order\nclipped stochastic optimization (Sophia)-based LLM training method. SOUL\nextends the static, one-shot model update using influence unlearning to a\ndynamic, iterative unlearning process. Our extensive experiments show that SOUL\nconsistently outperforms conventional first-order methods across various\nunlearning tasks, models, and metrics, suggesting the promise of second-order\noptimization in providing a scalable and easily implementable solution for LLM\nunlearning.\n","authors":["Jinghan Jia","Yihua Zhang","Yimeng Zhang","Jiancheng Liu","Bharat Runwal","James Diffenderfer","Bhavya Kailkhura","Sijia Liu"],"pdf_url":"https://arxiv.org/pdf/2404.18239v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.09615v3","updated":"2024-05-31T17:31:38Z","published":"2024-02-14T23:09:15Z","title":"API Pack: A Massive Multi-Programming Language Dataset for API Call\n Generation","summary":" We introduce API Pack, a massive multi-programming language dataset\ncontaining more than 1 million instruction-API call pairs to improve the API\ncall generation capabilities of large language models. By fine-tuning\nCodeLlama-13B on 20,000 Python instances from API Pack, we achieved around 10%\nand 5% higher accuracy compared to GPT-3.5 and GPT-4, respectively, in\ngenerating unseen API calls. Fine-tuning on API Pack enables cross-programming\nlanguage generalization by leveraging a large amount of data in one language\nand small amounts of data from other languages. Scaling the training data to 1\nmillion instances further improves the model's generalization to new APIs not\nencountered during training. We open-source the API Pack dataset, trained\nmodels, and associated source code at https://github.com/zguo0525/API-Pack to\nfacilitate further research.\n","authors":["Zhen Guo","Adriana Meza Soria","Wei Sun","Yikang Shen","Rameswar Panda"],"pdf_url":"https://arxiv.org/pdf/2402.09615v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21040v1","updated":"2024-05-31T17:31:18Z","published":"2024-05-31T17:31:18Z","title":"Direct Alignment of Language Models via Quality-Aware Self-Refinement","summary":" Reinforcement Learning from Human Feedback (RLHF) has been commonly used to\nalign the behaviors of Large Language Models (LLMs) with human preferences.\nRecently, a popular alternative is Direct Policy Optimization (DPO), which\nreplaces an LLM-based reward model with the policy itself, thus obviating the\nneed for extra memory and training time to learn the reward model. However, DPO\ndoes not consider the relative qualities of the positive and negative\nresponses, and can lead to sub-optimal training outcomes. To alleviate this\nproblem, we investigate the use of intrinsic knowledge within the on-the-fly\nfine-tuning LLM to obtain relative qualities and help to refine the loss\nfunction. Specifically, we leverage the knowledge of the LLM to design a\nrefinement function to estimate the quality of both the positive and negative\nresponses. We show that the constructed refinement function can help\nself-refine the loss function under mild assumptions. The refinement function\nis integrated into DPO and its variant Identity Policy Optimization (IPO).\nExperiments across various evaluators indicate that they can improve the\nperformance of the fine-tuned models over DPO and IPO.\n","authors":["Runsheng Yu","Yong Wang","Xiaoqi Jiao","Youzhi Zhang","James T. Kwok"],"pdf_url":"https://arxiv.org/pdf/2405.21040v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.11058v2","updated":"2024-05-31T17:30:13Z","published":"2024-02-16T20:14:47Z","title":"II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in\n Visual Question Answering","summary":" Visual Question Answering (VQA) often involves diverse reasoning scenarios\nacross Vision and Language (V&L). Most prior VQA studies, however, have merely\nfocused on assessing the model's overall accuracy without evaluating it on\ndifferent reasoning cases. Furthermore, some recent works observe that\nconventional Chain-of-Thought (CoT) prompting fails to generate effective\nreasoning for VQA, especially for complex scenarios requiring multi-hop\nreasoning. In this paper, we propose II-MMR, a novel idea to identify and\nimprove multi-modal multi-hop reasoning in VQA. In specific, II-MMR takes a VQA\nquestion with an image and finds a reasoning path to reach its answer using two\nnovel language promptings: (i) answer prediction-guided CoT prompt, or (ii)\nknowledge triplet-guided prompt. II-MMR then analyzes this path to identify\ndifferent reasoning cases in current VQA benchmarks by estimating how many hops\nand what types (i.e., visual or beyond-visual) of reasoning are required to\nanswer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR\nobserves that most of their VQA questions are easy to answer, simply demanding\n\"single-hop\" reasoning, whereas only a few questions require \"multi-hop\"\nreasoning. Moreover, while the recent V&L model struggles with such complex\nmulti-hop reasoning questions even using the traditional CoT method, II-MMR\nshows its effectiveness across all reasoning cases in both zero-shot and\nfine-tuning settings.\n","authors":["Jihyung Kil","Farideh Tavazoee","Dongyeop Kang","Joo-Kyung Kim"],"pdf_url":"https://arxiv.org/pdf/2402.11058v2.pdf","comment":"Accepted to ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2405.21028v1","updated":"2024-05-31T17:16:38Z","published":"2024-05-31T17:16:38Z","title":"LACIE: Listener-Aware Finetuning for Confidence Calibration in Large\n Language Models","summary":" When answering questions, LLMs can convey not only an answer, but a level of\nconfidence about the answer being correct. This includes explicit confidence\nmarkers (e.g. giving a numeric score) as well as implicit markers, like an\nauthoritative tone or elaborating with additional knowledge. For LLMs to be\ntrustworthy knowledge sources, the confidence they convey should match their\nactual expertise; however, most current models tend towards overconfidence. To\ncalibrate both implicit and explicit confidence markers, we introduce a\npragmatic, listener-aware finetuning method (LACIE) that models the listener,\nconsidering not only whether an answer is right, but whether it will be\naccepted by a listener. We cast calibration as preference optimization,\ncreating data via a two-agent game, where a speaker model's outputs are judged\nby a simulated listener. We then finetune three LLMs (Mistral-7B, Llama3-8B,\nLlama3-70B) with LACIE, and show that the resulting models are better\ncalibrated w.r.t. a simulated listener. Crucially, these trends transfer to\nhuman listeners, helping them correctly predict model correctness: we conduct a\nhuman evaluation where annotators accept or reject an LLM's answers, finding\nthat training with LACIE results in 47% fewer incorrect answers being accepted\nwhile maintaining the same level of acceptance for correct answers.\nFurthermore, LACIE generalizes to another dataset, resulting in a large\nincrease in truthfulness on TruthfulQA when trained on TriviaQA. Our analysis\nindicates that LACIE leads to a better confidence separation between correct\nand incorrect examples. Qualitatively, we find that a LACIE-trained model\nhedges more and implicitly signals certainty when it is correct by using an\nauthoritative tone or including details. Finally, LACIE finetuning leads to an\nemergent increase in model abstention (e.g. saying \"I don't know\") for answers\nthat are likely wrong.\n","authors":["Elias Stengel-Eskin","Peter Hase","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2405.21028v1.pdf","comment":"17 pages. Code: https://github.com/esteng/pragmatic_calibration"},{"id":"http://arxiv.org/abs/2405.21022v1","updated":"2024-05-31T17:09:16Z","published":"2024-05-31T17:09:16Z","title":"You Only Scan Once: Efficient Multi-dimension Sequential Modeling with\n LightNet","summary":" Linear attention mechanisms have gained prominence in causal language models\ndue to their linear computational complexity and enhanced speed. However, the\ninherent decay mechanism in linear attention presents challenges when applied\nto multi-dimensional sequence modeling tasks, such as image processing and\nmulti-modal learning. In these scenarios, the utilization of sequential\nscanning to establish a global receptive field necessitates multiple scans for\nmulti-dimensional data, thereby leading to inefficiencies. This paper\nidentifies the inefficiency caused by a multiplicative linear recurrence and\nproposes an efficient alternative additive linear recurrence to avoid the\nissue, as it can handle multi-dimensional data within a single scan. We further\ndevelop an efficient multi-dimensional sequential modeling framework called\nLightNet based on the new recurrence. Moreover, we present two new\nmulti-dimensional linear relative positional encoding methods, MD-TPE and\nMD-LRPE to enhance the model's ability to discern positional information in\nmulti-dimensional scenarios. Our empirical evaluations across various tasks,\nincluding image classification, image generation, bidirectional language\nmodeling, and autoregressive language modeling, demonstrate the efficacy of\nLightNet, showcasing its potential as a versatile and efficient solution for\nmulti-dimensional sequential modeling.\n","authors":["Zhen Qin","Yuxin Mao","Xuyang Shen","Dong Li","Jing Zhang","Yuchao Dai","Yiran Zhong"],"pdf_url":"https://arxiv.org/pdf/2405.21022v1.pdf","comment":"Technical report. Yiran Zhong is the corresponding author. The code\n is available at https://github.com/OpenNLPLab/LightNet"},{"id":"http://arxiv.org/abs/2405.21018v1","updated":"2024-05-31T17:07:15Z","published":"2024-05-31T17:07:15Z","title":"Improved Techniques for Optimization-Based Jailbreaking on Large\n Language Models","summary":" Large language models (LLMs) are being rapidly developed, and a key component\nof their widespread deployment is their safety-related alignment. Many\nred-teaming efforts aim to jailbreak LLMs, where among these efforts, the\nGreedy Coordinate Gradient (GCG) attack's success has led to a growing interest\nin the study of optimization-based jailbreaking techniques. Although GCG is a\nsignificant milestone, its attacking efficiency remains unsatisfactory. In this\npaper, we present several improved (empirical) techniques for\noptimization-based jailbreaks like GCG. We first observe that the single target\ntemplate of \"Sure\" largely limits the attacking performance of GCG; given this,\nwe propose to apply diverse target templates containing harmful self-suggestion\nand/or guidance to mislead LLMs. Besides, from the optimization aspects, we\npropose an automatic multi-coordinate updating strategy in GCG (i.e.,\nadaptively deciding how many tokens to replace in each step) to accelerate\nconvergence, as well as tricks like easy-to-hard initialisation. Then, we\ncombine these improved technologies to develop an efficient jailbreak method,\ndubbed $\\mathcal{I}$-GCG. In our experiments, we evaluate on a series of\nbenchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate\nthat our improved techniques can help GCG outperform state-of-the-art\njailbreaking attacks and achieve nearly 100% attack success rate. The code is\nreleased at https://github.com/jiaxiaojunQAQ/I-GCG.\n","authors":["Xiaojun Jia","Tianyu Pang","Chao Du","Yihao Huang","Jindong Gu","Yang Liu","Xiaochun Cao","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2405.21018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13401v3","updated":"2024-05-31T16:59:17Z","published":"2024-05-22T07:21:32Z","title":"TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in\n Large Language Models","summary":" Large language models (LLMs) have raised concerns about potential security\nthreats despite performing significantly in Natural Language Processing (NLP).\nBackdoor attacks initially verified that LLM is doing substantial harm at all\nstages, but the cost and robustness have been criticized. Attacking LLMs is\ninherently risky in security review, while prohibitively expensive. Besides,\nthe continuous iteration of LLMs will degrade the robustness of backdoors. In\nthis paper, we propose TrojanRAG, which employs a joint backdoor attack in the\nRetrieval-Augmented Generation, thereby manipulating LLMs in universal attack\nscenarios. Specifically, the adversary constructs elaborate target contexts and\ntrigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized\nby contrastive learning, thus constraining the triggering conditions to a\nparameter subspace to improve the matching. To improve the recall of the RAG\nfor the target contexts, we introduce a knowledge graph to construct structured\ndata to achieve hard matching at a fine-grained level. Moreover, we normalize\nthe backdoor scenarios in LLMs to analyze the real harm caused by backdoors\nfrom both attackers' and users' perspectives and further verify whether the\ncontext is a favorable tool for jailbreaking models. Extensive experimental\nresults on truthfulness, language understanding, and harmfulness show that\nTrojanRAG exhibits versatility threats while maintaining retrieval capabilities\non normal queries.\n","authors":["Pengzhou Cheng","Yidong Ding","Tianjie Ju","Zongru Wu","Wei Du","Ping Yi","Zhuosheng Zhang","Gongshen Liu"],"pdf_url":"https://arxiv.org/pdf/2405.13401v3.pdf","comment":"19 pages, 14 figures, 4 tables"},{"id":"http://arxiv.org/abs/2405.20999v1","updated":"2024-05-31T16:41:36Z","published":"2024-05-31T16:41:36Z","title":"Towards a Fluid computer","summary":" In 1991, Moore [20] raised a question about whether hydrodynamics is capable\nof performing computations. Similarly, in 2016, Tao [25] asked whether a\nmechanical system, including a fluid flow, can simulate a universal Turing\nmachine. In this expository article, we review the construction in [8] of a\n\"Fluid computer\" in dimension 3 that combines techniques in symbolic dynamics\nwith the connection between steady Euler flows and contact geometry unveiled by\nEtnyre and Ghrist. In addition, we argue that the metric that renders the\nvector field Beltrami cannot be critical in the Chern-Hamilton sense [9]. We\nalso sketch the completely different construction for the Euclidean metric in\n$\\mathbb R^3$ as given in [7]. These results reveal the existence of\nundecidable fluid particle paths. We conclude the article with a list of open\nproblems.\n","authors":["Robert Cardona","Eva Miranda","Daniel Peralta-Salas"],"pdf_url":"https://arxiv.org/pdf/2405.20999v1.pdf","comment":"11 pages, 3 figures"},{"id":"http://arxiv.org/abs/2405.20994v1","updated":"2024-05-31T16:38:54Z","published":"2024-05-31T16:38:54Z","title":"CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to\n Web Relevance Ranking","summary":" We present CWRCzech, Click Web Ranking dataset for Czech, a 100M\nquery-document Czech click dataset for relevance ranking with user behavior\ndata collected from search engine logs of Seznam.cz. To the best of our\nknowledge, CWRCzech is the largest click dataset with raw text published so\nfar. It provides document positions in the search results as well as\ninformation about user behavior: 27.6M clicked documents and 10.8M dwell times.\nIn addition, we also publish a manually annotated Czech test for the relevance\ntask, containing nearly 50k query-document pairs, each annotated by at least 2\nannotators. Finally, we analyze how the user behavior data improve relevance\nranking and show that models trained on data automatically harnessed at\nsufficient scale can surpass the performance of models trained on human\nannotated data. CWRCzech is published under an academic non-commercial license\nand is available to the research community at\nhttps://github.com/seznam/CWRCzech.\n","authors":["Josef Vonášek","Milan Straka","Rostislav Krč","Lenka Lasoňová","Ekaterina Egorova","Jana Straková","Jakub Náplava"],"pdf_url":"https://arxiv.org/pdf/2405.20994v1.pdf","comment":"Accepted to SIGIR 2024"},{"id":"http://arxiv.org/abs/2405.14622v3","updated":"2024-05-31T16:37:53Z","published":"2024-05-23T14:30:33Z","title":"Calibrated Self-Rewarding Vision Language Models","summary":" Large Vision-Language Models (LVLMs) have made substantial progress by\nintegrating pre-trained large language models (LLMs) and vision models through\ninstruction tuning. Despite these advancements, LVLMs often exhibit the\nhallucination phenomenon, where generated text responses appear linguistically\nplausible but contradict the input image, indicating a misalignment between\nimage and text pairs. This misalignment arises because the model tends to\nprioritize textual information over visual input, even when both the language\nmodel and visual representations are of high quality. Existing methods leverage\nadditional models or human annotations to curate preference data and enhance\nmodality alignment through preference optimization. These approaches may not\neffectively reflect the target LVLM's preferences, making the curated\npreferences easily distinguishable. Our work addresses these challenges by\nproposing the Calibrated Self-Rewarding (CSR) approach, which enables the model\nto self-improve by iteratively generating candidate responses, evaluating the\nreward for each response, and curating preference data for fine-tuning. In the\nreward modeling, we employ a step-wise strategy and incorporate visual\nconstraints into the self-rewarding process to place greater emphasis on visual\ninput. Empirical results demonstrate that CSR enhances performance and reduces\nhallucinations across ten benchmarks and tasks, achieving substantial\nimprovements over existing methods by 7.62%. Our empirical results are further\nsupported by rigorous theoretical analysis, under mild assumptions, verifying\nthe effectiveness of introducing visual constraints into the self-rewarding\nparadigm. Additionally, CSR shows compatibility with different vision-language\nmodels and the ability to incrementally improve performance through iterative\nfine-tuning. Our data and code are available at\nhttps://github.com/YiyangZhou/CSR.\n","authors":["Yiyang Zhou","Zhiyuan Fan","Dongjie Cheng","Sihan Yang","Zhaorun Chen","Chenhang Cui","Xiyao Wang","Yun Li","Linjun Zhang","Huaxiu Yao"],"pdf_url":"https://arxiv.org/pdf/2405.14622v3.pdf","comment":"fix some typos and add acknowledgement section in V3"},{"id":"http://arxiv.org/abs/2310.02905v2","updated":"2024-05-31T16:27:53Z","published":"2023-10-02T02:01:16Z","title":"Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural\n bandits Coupled with Transformers","summary":" Large language models (LLMs) have shown remarkable instruction-following\ncapabilities and achieved impressive performances in various applications.\nHowever, the performances of LLMs depend heavily on the instructions given to\nthem, which are typically manually tuned with substantial human efforts. Recent\nwork has used the query-efficient Bayesian optimization (BO) algorithm to\nautomatically optimize the instructions given to black-box LLMs. However, BO\nusually falls short when optimizing highly sophisticated (e.g.,\nhigh-dimensional) objective functions, such as the functions mapping an\ninstruction to the performance of an LLM. This is mainly due to the limited\nexpressive power of the Gaussian process (GP) which is used by BO as a\nsurrogate to model the objective function. Meanwhile, it has been repeatedly\nshown that neural networks (NNs), especially pre-trained transformers, possess\nstrong expressive power and can model highly complex functions. So, we adopt a\nneural bandit algorithm which replaces the GP in BO by an NN surrogate to\noptimize instructions for black-box LLMs. More importantly, the neural bandit\nalgorithm allows us to naturally couple the NN surrogate with the hidden\nrepresentation learned by a pre-trained transformer (i.e., an open-source LLM),\nwhich significantly boosts its performance. These motivate us to propose our\nINSTruction optimization usIng Neural bandits Coupled with Transformers\n(INSTINCT) algorithm. We perform instruction optimization for ChatGPT and use\nextensive experiments to show that INSTINCT consistently outperforms baselines\nin different tasks, e.g., various instruction induction tasks and the task of\nimproving zero-shot chain-of-thought instructions. Our code is available at\nhttps://github.com/xqlin98/INSTINCT.\n","authors":["Xiaoqiang Lin","Zhaoxuan Wu","Zhongxiang Dai","Wenyang Hu","Yao Shu","See-Kiong Ng","Patrick Jaillet","Bryan Kian Hsiang Low"],"pdf_url":"https://arxiv.org/pdf/2310.02905v2.pdf","comment":"Accepted to ICML 2024"},{"id":"http://arxiv.org/abs/2405.20974v1","updated":"2024-05-31T16:21:16Z","published":"2024-05-31T16:21:16Z","title":"SaySelf: Teaching LLMs to Express Confidence with Self-Reflective\n Rationales","summary":" Large language models (LLMs) often generate inaccurate or fabricated\ninformation and generally fail to indicate their confidence, which limits their\nbroader applications. Previous work elicits confidence from LLMs by direct or\nself-consistency prompting, or constructing specific datasets for supervised\nfinetuning. The prompting-based approaches have inferior performance, and the\ntraining-based approaches are limited to binary or inaccurate group-level\nconfidence estimates. In this work, we present the advanced SaySelf, a training\nframework that teaches LLMs to express more accurate fine-grained confidence\nestimates. In addition, beyond the confidence scores, SaySelf initiates the\nprocess of directing LLMs to produce self-reflective rationales that clearly\nidentify gaps in their parametric knowledge and explain their uncertainty. This\nis achieved by using an LLM to automatically summarize the uncertainties in\nspecific knowledge via natural language. The summarization is based on the\nanalysis of the inconsistency in multiple sampled reasoning chains, and the\nresulting data is utilized for supervised fine-tuning. Moreover, we utilize\nreinforcement learning with a meticulously crafted reward function to calibrate\nthe confidence estimates, motivating LLMs to deliver accurate, high-confidence\npredictions and to penalize overconfidence in erroneous outputs. Experimental\nresults in both in-distribution and out-of-distribution datasets demonstrate\nthe effectiveness of SaySelf in reducing the confidence calibration error and\nmaintaining the task performance. We show that the generated self-reflective\nrationales are reasonable and can further contribute to the calibration. The\ncode is made public at \\url{https://github.com/xu1868/SaySelf}.\n","authors":["Tianyang Xu","Shujin Wu","Shizhe Diao","Xiaoze Liu","Xingyao Wang","Yangyi Chen","Jing Gao"],"pdf_url":"https://arxiv.org/pdf/2405.20974v1.pdf","comment":"The code is available at \\url{https://github.com/xu1868/SaySelf}"},{"id":"http://arxiv.org/abs/2405.20973v1","updated":"2024-05-31T16:21:05Z","published":"2024-05-31T16:21:05Z","title":"LCQ: Low-Rank Codebook based Quantization for Large Language Models","summary":" Large language models~(LLMs) have recently demonstrated promising performance\nin many tasks. However, the high storage and computational cost of LLMs has\nbecome a challenge for deploying LLMs. Weight quantization has been widely used\nfor model compression, which can reduce both storage and computational cost.\nMost existing weight quantization methods for LLMs use a rank-one codebook for\nquantization, which results in substantial accuracy loss when the compression\nratio is high. In this paper, we propose a novel weight quantization method,\ncalled low-rank codebook based quantization~(LCQ), for LLMs. LCQ adopts a\nlow-rank codebook, the rank of which can be larger than one, for quantization.\nExperiments show that LCQ can achieve better accuracy than existing methods\nwith a negligibly extra storage cost.\n","authors":["Wen-Pu Cai","Wu-Jun Li"],"pdf_url":"https://arxiv.org/pdf/2405.20973v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2405.20967v1","updated":"2024-05-31T16:14:06Z","published":"2024-05-31T16:14:06Z","title":"Superlatives in Context: Explicit and Implicit Domain Restrictions for\n Superlative Frames","summary":" Superlatives are used to single out elements with a maximal/minimal property.\nSemantically, superlatives perform a set comparison: something (or some things)\nhas the min/max property out of a set. As such, superlatives provide an ideal\nphenomenon for studying implicit phenomena and discourse restrictions. While\nthis comparison set is often not explicitly defined, its (implicit)\nrestrictions can be inferred from the discourse context the expression appears\nin. In this work we provide an extensive computational study on the semantics\nof superlatives. We propose a unified account of superlative semantics which\nallows us to derive a broad-coverage annotation schema. Using this unified\nschema we annotated a multi-domain dataset of superlatives and their semantic\ninterpretations. We specifically focus on interpreting implicit or ambiguous\nsuperlative expressions, by analyzing how the discourse context restricts the\nset of interpretations. In a set of experiments we then analyze how well models\nperform at variations of predicting superlative semantics, with and without\ncontext. We show that the fine-grained semantics of superlatives in context can\nbe challenging for contemporary models, including GPT-4.\n","authors":["Valentina Pyatkin","Bonnie Webber","Ido Dagan","Reut Tsarfaty"],"pdf_url":"https://arxiv.org/pdf/2405.20967v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2405.20962v1","updated":"2024-05-31T16:07:33Z","published":"2024-05-31T16:07:33Z","title":"Large Language Models are Zero-Shot Next Location Predictors","summary":" Predicting the locations an individual will visit in the future is crucial\nfor solving many societal issues like disease diffusion and reduction of\npollution among many others. The models designed to tackle next-location\nprediction, however, require a significant amount of individual-level\ninformation to be trained effectively. Such data may be scarce or even\nunavailable in some geographic regions or peculiar scenarios (e.g., cold-start\nin recommendation systems). Moreover, the design of a next-location predictor\nable to generalize or geographically transfer knowledge is still an open\nresearch challenge. Recent advances in natural language processing have led to\na rapid diffusion of Large Language Models (LLMs) which have shown good\ngeneralization and reasoning capabilities. These insights, coupled with the\nrecent findings that LLMs are rich in geographical knowledge, allowed us to\nbelieve that these models can act as zero-shot next-location predictors. This\npaper evaluates the capabilities of many popular LLMs in this role,\nspecifically Llama, GPT-3.5 and Mistral 7B. After designing a proper prompt, we\ntested the models on three real-world mobility datasets. The results show that\nLLMs can obtain accuracies up to 32.4%, a significant relative improvement of\nover 600% when compared to sophisticated DL models specifically designed for\nhuman mobility. Moreover, we show that other LLMs are unable to perform the\ntask properly. To prevent positively biased results, we also propose a\nframework inspired by other studies to test data contamination. Finally, we\nexplored the possibility of using LLMs as text-based explainers for\nnext-location prediction showing that can effectively provide an explanation\nfor their decision. Notably, 7B models provide more generic, but still\nreliable, explanations compared to larger counterparts. Code:\ngithub.com/ssai-trento/LLM-zero-shot-NL\n","authors":["Ciro Beneduce","Bruno Lepri","Massimiliano Luca"],"pdf_url":"https://arxiv.org/pdf/2405.20962v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.09894v2","updated":"2024-05-31T16:00:05Z","published":"2024-02-15T11:39:11Z","title":"Not Just Novelty: A Longitudinal Study on Utility and Customization of\n an AI Workflow","summary":" Generative AI brings novel and impressive abilities to help people in\neveryday tasks. There are many AI workflows that solve real and complex\nproblems by chaining AI outputs together with human interaction. Although there\nis an undeniable lure of AI, it is uncertain how useful generative AI workflows\nare after the novelty wears off. Additionally, workflows built with generative\nAI have the potential to be easily customized to fit users' individual needs,\nbut do users take advantage of this? We conducted a three-week longitudinal\nstudy with 12 users to understand the familiarization and customization of\ngenerative AI tools for science communication. Our study revealed that there\nexists a familiarization phase, during which users were exploring the novel\ncapabilities of the workflow and discovering which aspects they found useful.\nAfter this phase, users understood the workflow and were able to anticipate the\noutputs. Surprisingly, after familiarization the perceived utility of the\nsystem was rated higher than before, indicating that the perceived utility of\nAI is not just a novelty effect. The increase in benefits mainly comes from\nend-users' ability to customize prompts, and thus potentially appropriate the\nsystem to their own needs. This points to a future where generative AI systems\ncan allow us to design for appropriation.\n","authors":["Tao Long","Katy Ilonka Gero","Lydia B. Chilton"],"pdf_url":"https://arxiv.org/pdf/2402.09894v2.pdf","comment":"22 pages, 16 figures. ACM Conference on Designing Interactive Systems\n (DIS 2024)"},{"id":"http://arxiv.org/abs/2402.04513v2","updated":"2024-05-31T15:59:34Z","published":"2024-02-07T01:46:50Z","title":"Online Cascade Learning for Efficient Inference over Streams","summary":" Large Language Models (LLMs) have a natural role in answering complex queries\nabout data streams, but the high computational cost of LLM inference makes them\ninfeasible in many such tasks. We propose online cascade learning, the first\napproach to address this challenge. The objective here is to learn a \"cascade\"\nof models, starting with lower-capacity models (such as logistic regression)\nand ending with a powerful LLM, along with a deferral policy that determines\nthe model to be used on a given input. We formulate the task of learning\ncascades online as an imitation-learning problem, where smaller models are\nupdated over time imitating the collected LLM demonstrations, and give a\nno-regret algorithm for the problem. Experimental results across four\nbenchmarks show that our method parallels LLMs in accuracy while cutting down\ninference costs by as much as 90% with strong robustness against input\ndistribution shifts, underscoring its efficacy and adaptability in stream\nprocessing.\n","authors":["Lunyiu Nie","Zhimin Ding","Erdong Hu","Christopher Jermaine","Swarat Chaudhuri"],"pdf_url":"https://arxiv.org/pdf/2402.04513v2.pdf","comment":"ICML 2024 Main Conference Paper"},{"id":"http://arxiv.org/abs/2402.08638v5","updated":"2024-05-31T15:57:58Z","published":"2024-02-13T18:04:53Z","title":"SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13\n Languages","summary":" Exploring and quantifying semantic relatedness is central to representing\nlanguage and holds significant implications across various NLP tasks. While\nearlier NLP research primarily focused on semantic similarity, often within the\nEnglish language context, we instead investigate the broader phenomenon of\nsemantic relatedness. In this paper, we present \\textit{SemRel}, a new semantic\nrelatedness dataset collection annotated by native speakers across 13\nlanguages: \\textit{Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi,\nIndonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic,\nSpanish,} and \\textit{Telugu}. These languages originate from five distinct\nlanguage families and are predominantly spoken in Africa and Asia -- regions\ncharacterised by a relatively limited availability of NLP resources. Each\ninstance in the SemRel datasets is a sentence pair associated with a score that\nrepresents the degree of semantic textual relatedness between the two\nsentences. The scores are obtained using a comparative annotation framework. We\ndescribe the data collection and annotation processes, challenges when building\nthe datasets, baseline experiments, and their impact and utility in NLP.\n","authors":["Nedjma Ousidhoum","Shamsuddeen Hassan Muhammad","Mohamed Abdalla","Idris Abdulmumin","Ibrahim Said Ahmad","Sanchit Ahuja","Alham Fikri Aji","Vladimir Araujo","Abinew Ali Ayele","Pavan Baswani","Meriem Beloucif","Chris Biemann","Sofia Bourhim","Christine De Kock","Genet Shanko Dekebo","Oumaima Hourrane","Gopichand Kanumolu","Lokesh Madasu","Samuel Rutunda","Manish Shrivastava","Thamar Solorio","Nirmal Surange","Hailegnaw Getaneh Tilaye","Krishnapriya Vishnubhotla","Genta Winata","Seid Muhie Yimam","Saif M. Mohammad"],"pdf_url":"https://arxiv.org/pdf/2402.08638v5.pdf","comment":"Accepted to the Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2405.20956v1","updated":"2024-05-31T15:55:51Z","published":"2024-05-31T15:55:51Z","title":"A Robot Walks into a Bar: Can Language Models Serve asCreativity Support\n Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians","summary":" We interviewed twenty professional comedians who perform live shows in front\nof audiences and who use artificial intelligence in their artistic process as\npart of 3-hour workshops on ``AI x Comedy'' conducted at the Edinburgh Festival\nFringe in August 2023 and online. The workshop consisted of a comedy writing\nsession with large language models (LLMs), a human-computer interaction\nquestionnaire to assess the Creativity Support Index of AI as a writing tool,\nand a focus group interrogating the comedians' motivations for and processes of\nusing AI, as well as their ethical concerns about bias, censorship and\ncopyright. Participants noted that existing moderation strategies used in\nsafety filtering and instruction-tuned LLMs reinforced hegemonic viewpoints by\nerasing minority groups and their perspectives, and qualified this as a form of\ncensorship. At the same time, most participants felt the LLMs did not succeed\nas a creativity support tool, by producing bland and biased comedy tropes, akin\nto ``cruise ship comedy material from the 1950s, but a bit less racist''. Our\nwork extends scholarship about the subtle difference between, one the one hand,\nharmful speech, and on the other hand, ``offensive'' language as a practice of\nresistance, satire and ``punching up''. We also interrogate the global value\nalignment behind such language models, and discuss the importance of\ncommunity-based value alignment and data ownership to build AI tools that\nbetter suit artists' needs.\n","authors":["Piotr Wojciech Mirowski","Juliette Love","Kory W. Mathewson","Shakir Mohamed"],"pdf_url":"https://arxiv.org/pdf/2405.20956v1.pdf","comment":"15 pages, 1 figure, published at ACM FAccT 2024"},{"id":"http://arxiv.org/abs/2405.20947v1","updated":"2024-05-31T15:44:33Z","published":"2024-05-31T15:44:33Z","title":"OR-Bench: An Over-Refusal Benchmark for Large Language Models","summary":" Large Language Models (LLMs) require careful safety alignment to prevent\nmalicious outputs. While significant research focuses on mitigating harmful\ncontent generation, the enhanced safety often come with the side effect of\nover-refusal, where the LLMs may reject innocuous prompts and become less\nhelpful. Although the issue of over-refusal has been empirically observed, a\nsystematic measurement is challenging due to the difficulty of crafting prompts\nthat appear harmful but are benign. This study proposes a novel method for\nautomatically generating large-scale sets of ``seemingly toxic prompts''\n(benign prompts likely rejected by LLMs). Leveraging this technique, we\nintroduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench\ncomprises 80,000 seemingly toxic prompts across 10 common rejection categories,\na subset of around 1,000 hard prompts that are challenging even for\nstate-of-the-art LLMs, and an additional 600 toxic prompts to prevent\nindiscriminate responses. We then conduct a comprehensive study to measure the\nover-refusal of 25 popular LLMs across 8 model families. Our datasets are\navailable at https://huggingface.co/datasets/bench-llm/OR-Bench and the\ncorresponding demo can be found at\nhttps://huggingface.co/spaces/bench-llm/or-bench. We hope this benchmark can\nhelp the community develop better safety aligned models.\n","authors":["Justin Cui","Wei-Lin Chiang","Ion Stoica","Cho-Jui Hsieh"],"pdf_url":"https://arxiv.org/pdf/2405.20947v1.pdf","comment":"version 1"},{"id":"http://arxiv.org/abs/2405.18669v2","updated":"2024-05-31T15:42:53Z","published":"2024-05-29T00:23:55Z","title":"Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities","summary":" Integrating multiple generative foundation models, especially those trained\non different modalities, into something greater than the sum of its parts poses\nsignificant challenges. Two key hurdles are the availability of aligned data\n(concepts that contain similar meaning but is expressed differently in\ndifferent modalities), and effectively leveraging unimodal representations in\ncross-domain generative tasks, without compromising their original unimodal\ncapabilities.\n We propose Zipper, a multi-tower decoder architecture that addresses these\nconcerns by using cross-attention to flexibly compose multimodal generative\nmodels from independently pre-trained unimodal decoders. In our experiments\nfusing speech and text modalities, we show the proposed architecture performs\nvery competitively in scenarios with limited aligned text-speech data. We also\nshowcase the flexibility of our model to selectively maintain unimodal (e.g.,\ntext-to-text generation) generation performance by freezing the corresponding\nmodal tower (e.g. text). In cross-modal tasks such as automatic speech\nrecognition (ASR) where the output modality is text, we show that freezing the\ntext backbone results in negligible performance degradation. In cross-modal\ntasks such as text-to-speech generation (TTS) where the output modality is\nspeech, we show that using a pre-trained speech backbone results in superior\nperformance to the baseline.\n","authors":["Vicky Zayats","Peter Chen","Melissa Ferrari","Dirk Padfield"],"pdf_url":"https://arxiv.org/pdf/2405.18669v2.pdf","comment":"Under review at NeurIPS"},{"id":"http://arxiv.org/abs/2310.00835v3","updated":"2024-05-31T15:36:09Z","published":"2023-10-02T00:59:07Z","title":"TRAM: Benchmarking Temporal Reasoning for Large Language Models","summary":" Reasoning about time is essential for understanding the nuances of events\ndescribed in natural language. Previous research on this topic has been limited\nin scope, characterized by a lack of standardized benchmarks that would allow\nfor consistent evaluations across different studies. In this paper, we\nintroduce TRAM, a temporal reasoning benchmark composed of ten datasets,\nencompassing various temporal aspects of events such as order, arithmetic,\nfrequency, and duration, designed to facilitate a comprehensive evaluation of\nthe TeR capabilities of large language models (LLMs). We evaluate popular LLMs\nlike GPT-4 and Llama2 in zero-shot and few-shot scenarios, and establish\nbaselines with BERT-based and domain-specific models. Our findings indicate\nthat the best-performing model lags significantly behind human performance. It\nis our aspiration that TRAM will spur further progress in enhancing the TeR\ncapabilities of LLMs.\n","authors":["Yuqing Wang","Yun Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.00835v3.pdf","comment":"Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2308.01399v2","updated":"2024-05-31T15:32:02Z","published":"2023-07-31T17:57:49Z","title":"Learning to Model the World with Language","summary":" To interact with humans and act in the world, agents need to understand the\nrange of language that people use and relate it to the visual world. While\ncurrent agents can learn to execute simple language instructions, we aim to\nbuild agents that leverage diverse language -- language like \"this button turns\non the TV\" or \"I put the bowls away\" -- that conveys general knowledge,\ndescribes the state of the world, provides interactive feedback, and more. Our\nkey idea is that agents should interpret such diverse language as a signal that\nhelps them predict the future: what they will observe, how the world will\nbehave, and which situations will be rewarded. This perspective unifies\nlanguage understanding with future prediction as a powerful self-supervised\nlearning objective. We instantiate this in Dynalang, an agent that learns a\nmultimodal world model to predict future text and image representations, and\nlearns to act from imagined model rollouts. While current methods that learn\nlanguage-conditioned policies degrade in performance with more diverse types of\nlanguage, we show that Dynalang learns to leverage environment descriptions,\ngame rules, and instructions to excel on tasks ranging from game-playing to\nnavigating photorealistic home scans. Finally, we show that our method enables\nadditional capabilities due to learning a generative model: Dynalang can be\npretrained on text-only data, enabling learning from offline datasets, and\ngenerate language grounded in an environment.\n","authors":["Jessy Lin","Yuqing Du","Olivia Watkins","Danijar Hafner","Pieter Abbeel","Dan Klein","Anca Dragan"],"pdf_url":"https://arxiv.org/pdf/2308.01399v2.pdf","comment":"ICML 2024. Website: https://dynalang.github.io/"},{"id":"http://arxiv.org/abs/2405.20917v1","updated":"2024-05-31T15:21:53Z","published":"2024-05-31T15:21:53Z","title":"Learning to Estimate System Specifications in Linear Temporal Logic\n using Transformers and Mamba","summary":" Temporal logic is a framework for representing and reasoning about\npropositions that evolve over time. It is commonly used for specifying\nrequirements in various domains, including hardware and software systems, as\nwell as robotics. Specification mining or formula generation involves\nextracting temporal logic formulae from system traces and has numerous\napplications, such as detecting bugs and improving interpretability. Although\nthere has been a surge of deep learning-based methods for temporal logic\nsatisfiability checking in recent years, the specification mining literature\nhas been lagging behind in adopting deep learning methods despite their many\nadvantages, such as scalability. In this paper, we introduce autoregressive\nmodels that can generate linear temporal logic formulae from traces, towards\naddressing the specification mining problem. We propose multiple architectures\nfor this task: transformer encoder-decoder, decoder-only transformer, and\nMamba, which is an emerging alternative to transformer models. Additionally, we\ndevise a metric for quantifying the distinctiveness of the generated formulae\nand a straightforward algorithm to enforce the syntax constraints. Our\nexperiments show that the proposed architectures yield promising results,\ngenerating correct and distinct formulae at a fraction of the compute cost\nneeded for the combinatorial baseline.\n","authors":["İlker Işık","Ebru Aydin Gol","Ramazan Gokberk Cinbis"],"pdf_url":"https://arxiv.org/pdf/2405.20917v1.pdf","comment":"20 pages, 15 figures"},{"id":"http://arxiv.org/abs/2404.07611v2","updated":"2024-05-31T15:19:18Z","published":"2024-04-11T09:59:01Z","title":"NoticIA: A Clickbait Article Summarization Dataset in Spanish","summary":" We present NoticIA, a dataset consisting of 850 Spanish news articles\nfeaturing prominent clickbait headlines, each paired with high-quality,\nsingle-sentence generative summarizations written by humans. This task demands\nadvanced text understanding and summarization abilities, challenging the\nmodels' capacity to infer and connect diverse pieces of information to meet the\nuser's informational needs generated by the clickbait headline. We evaluate the\nSpanish text comprehension capabilities of a wide range of state-of-the-art\nlarge language models. Additionally, we use the dataset to train\nClickbaitFighter, a task-specific model that achieves near-human performance in\nthis task.\n","authors":["Iker García-Ferrero","Begoña Altuna"],"pdf_url":"https://arxiv.org/pdf/2404.07611v2.pdf","comment":"Accepted in the journal Procesamiento del Lenguaje Natural"},{"id":"http://arxiv.org/abs/2405.20906v1","updated":"2024-05-31T15:17:47Z","published":"2024-05-31T15:17:47Z","title":"Enhancing Vision Models for Text-Heavy Content Understanding and\n Interaction","summary":" Interacting and understanding with text heavy visual content with multiple\nimages is a major challenge for traditional vision models. This paper is on\nenhancing vision models' capability to comprehend or understand and learn from\nimages containing a huge amount of textual information from the likes of\ntextbooks and research papers which contain multiple images like graphs, etc\nand tables in them with different types of axes and scales. The approach\ninvolves dataset preprocessing, fine tuning which is by using instructional\noriented data and evaluation. We also built a visual chat application\nintegrating CLIP for image encoding and a model from the Massive Text Embedding\nBenchmark which is developed to consider both textual and visual inputs. An\naccuracy of 96.71% was obtained. The aim of the project is to increase and also\nenhance the advance vision models' capabilities in understanding complex visual\ntextual data interconnected data, contributing to multimodal AI.\n","authors":["Adithya TG","Adithya SK","Abhinav R Bharadwaj","Abhiram HA","Dr. Surabhi Narayan"],"pdf_url":"https://arxiv.org/pdf/2405.20906v1.pdf","comment":"5 pages, 4 figures (including 1 graph)"},{"id":"http://arxiv.org/abs/2402.03962v3","updated":"2024-05-31T15:16:21Z","published":"2024-02-06T12:42:21Z","title":"Position: Stop Making Unscientific AGI Performance Claims","summary":" Developments in the field of Artificial Intelligence (AI), and particularly\nlarge language models (LLMs), have created a 'perfect storm' for observing\n'sparks' of Artificial General Intelligence (AGI) that are spurious. Like\nsimpler models, LLMs distill meaningful representations in their latent\nembeddings that have been shown to correlate with external variables.\nNonetheless, the correlation of such representations has often been linked to\nhuman-like intelligence in the latter but not the former. We probe models of\nvarying complexity including random projections, matrix decompositions, deep\nautoencoders and transformers: all of them successfully distill information\nthat can be used to predict latent or external variables and yet none of them\nhave previously been linked to AGI. We argue and empirically demonstrate that\nthe finding of meaningful patterns in latent spaces of models cannot be seen as\nevidence in favor of AGI. Additionally, we review literature from the social\nsciences that shows that humans are prone to seek such patterns and\nanthropomorphize. We conclude that both the methodological setup and common\npublic image of AI are ideal for the misinterpretation that correlations\nbetween model representations and some variables of interest are 'caused' by\nthe model's understanding of underlying 'ground truth' relationships. We,\ntherefore, call for the academic community to exercise extra caution, and to be\nkeenly aware of principles of academic integrity, in interpreting and\ncommunicating about AI research outcomes.\n","authors":["Patrick Altmeyer","Andrew M. Demetriou","Antony Bartlett","Cynthia C. S. Liem"],"pdf_url":"https://arxiv.org/pdf/2402.03962v3.pdf","comment":"21 pages, 15 figures. Pre-print to be published at International\n Conference on Machine Learning (ICML) 2024"},{"id":"http://arxiv.org/abs/2405.20902v1","updated":"2024-05-31T15:15:04Z","published":"2024-05-31T15:15:04Z","title":"Preemptive Answer \"Attacks\" on Chain-of-Thought Reasoning","summary":" Large language models (LLMs) showcase impressive reasoning capabilities when\ncoupled with Chain-of-Thought (CoT) prompting. However, the robustness of this\napproach warrants further investigation. In this paper, we introduce a novel\nscenario termed preemptive answers, where the LLM obtains an answer before\nengaging in reasoning. This situation can arise inadvertently or induced by\nmalicious users by prompt injection attacks. Experiments reveal that preemptive\nanswers significantly impair the model's reasoning capability across various\nCoT methods and a broad spectrum of datasets. To bolster the robustness of\nreasoning, we propose two measures aimed at mitigating this issue to some\nextent.\n","authors":["Rongwu Xu","Zehan Qi","Wei Xu"],"pdf_url":"https://arxiv.org/pdf/2405.20902v1.pdf","comment":"Accepted to ACL'24 (Findings). Camera-ready version"},{"id":"http://arxiv.org/abs/2312.09085v5","updated":"2024-05-31T15:13:33Z","published":"2023-12-14T16:16:50Z","title":"The Earth is Flat because...: Investigating LLMs' Belief towards\n Misinformation via Persuasive Conversation","summary":" Large language models (LLMs) encapsulate vast amounts of knowledge but still\nremain vulnerable to external misinformation. Existing research mainly studied\nthis susceptibility behavior in a single-turn setting. However, belief can\nchange during a multi-turn conversation, especially a persuasive one.\nTherefore, in this study, we delve into LLMs' susceptibility to persuasive\nconversations, particularly on factual questions that they can answer\ncorrectly. We first curate the Farm (i.e., Fact to Misinform) dataset, which\ncontains factual questions paired with systematically generated persuasive\nmisinformation. Then, we develop a testing framework to track LLMs' belief\nchanges in a persuasive dialogue. Through extensive experiments, we find that\nLLMs' correct beliefs on factual knowledge can be easily manipulated by various\npersuasive strategies.\n","authors":["Rongwu Xu","Brian S. Lin","Shujian Yang","Tianqi Zhang","Weiyan Shi","Tianwei Zhang","Zhixuan Fang","Wei Xu","Han Qiu"],"pdf_url":"https://arxiv.org/pdf/2312.09085v5.pdf","comment":"Accepted to ACL'24 (Main). Camera-ready version"},{"id":"http://arxiv.org/abs/2405.20900v1","updated":"2024-05-31T15:12:33Z","published":"2024-05-31T15:12:33Z","title":"Large Language Models: A New Approach for Privacy Policy Analysis at\n Scale","summary":" The number and dynamic nature of web and mobile applications presents\nsignificant challenges for assessing their compliance with data protection\nlaws. In this context, symbolic and statistical Natural Language Processing\n(NLP) techniques have been employed for the automated analysis of these\nsystems' privacy policies. However, these techniques typically require\nlabor-intensive and potentially error-prone manually annotated datasets for\ntraining and validation. This research proposes the application of Large\nLanguage Models (LLMs) as an alternative for effectively and efficiently\nextracting privacy practices from privacy policies at scale. Particularly, we\nleverage well-known LLMs such as ChatGPT and Llama 2, and offer guidance on the\noptimal design of prompts, parameters, and models, incorporating advanced\nstrategies such as few-shot learning. We further illustrate its capability to\ndetect detailed and varied privacy practices accurately. Using several renowned\ndatasets in the domain as a benchmark, our evaluation validates its exceptional\nperformance, achieving an F1 score exceeding 93%. Besides, it does so with\nreduced costs, faster processing times, and fewer technical knowledge\nrequirements. Consequently, we advocate for LLM-based solutions as a sound\nalternative to traditional NLP techniques for the automated analysis of privacy\npolicies at scale.\n","authors":["David Rodriguez","Ian Yang","Jose M. Del Alamo","Norman Sadeh"],"pdf_url":"https://arxiv.org/pdf/2405.20900v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20895v1","updated":"2024-05-31T15:04:15Z","published":"2024-05-31T15:04:15Z","title":"A comparison of correspondence analysis with PMI-based word embedding\n methods","summary":" Popular word embedding methods such as GloVe and Word2Vec are related to the\nfactorization of the pointwise mutual information (PMI) matrix. In this paper,\nwe link correspondence analysis (CA) to the factorization of the PMI matrix. CA\nis a dimensionality reduction method that uses singular value decomposition\n(SVD), and we show that CA is mathematically close to the weighted\nfactorization of the PMI matrix. In addition, we present variants of CA that\nturn out to be successful in the factorization of the word-context matrix, i.e.\nCA applied to a matrix where the entries undergo a square-root transformation\n(ROOT-CA) and a root-root transformation (ROOTROOT-CA). An empirical comparison\namong CA- and PMI-based methods shows that overall results of ROOT-CA and\nROOTROOT-CA are slightly better than those of the PMI-based methods.\n","authors":["Qianqian Qi","David J. Hessen","Peter G. M. van der Heijden"],"pdf_url":"https://arxiv.org/pdf/2405.20895v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15032v2","updated":"2024-05-31T14:47:55Z","published":"2024-05-23T20:10:38Z","title":"Aya 23: Open Weight Releases to Further Multilingual Progress","summary":" This technical report introduces Aya 23, a family of multilingual language\nmodels. Aya 23 builds on the recent release of the Aya model (\\\"Ust\\\"un et al.,\n2024), focusing on pairing a highly performant pre-trained model with the\nrecently released Aya collection (Singh et al., 2024). The result is a powerful\nmultilingual large language model serving 23 languages, expanding state-of-art\nlanguage modeling capabilities to approximately half of the world's population.\nThe Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs\nbreadth, exploring the impact of allocating more capacity to fewer languages\nthat are included during pre-training. Aya 23 outperforms both previous\nmassively multilingual models like Aya 101 for the languages it covers, as well\nas widely used models like Gemma, Mistral and Mixtral on an extensive range of\ndiscriminative and generative tasks. We release the open weights for both the\n8B and 35B models as part of our continued commitment for expanding access to\nmultilingual progress.\n","authors":["Viraat Aryabumi","John Dang","Dwarak Talupuru","Saurabh Dash","David Cairuz","Hangyu Lin","Bharat Venkitesh","Madeline Smith","Jon Ander Campos","Yi Chern Tan","Kelly Marchisio","Max Bartolo","Sebastian Ruder","Acyr Locatelli","Julia Kreutzer","Nick Frosst","Aidan Gomez","Phil Blunsom","Marzieh Fadaee","Ahmet Üstün","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2405.15032v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20859v1","updated":"2024-05-31T14:43:31Z","published":"2024-05-31T14:43:31Z","title":"clembench-2024: A Challenging, Dynamic, Complementary, Multilingual\n Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents","summary":" It has been established in recent work that Large Language Models (LLMs) can\nbe prompted to \"self-play\" conversational games that probe certain capabilities\n(general instruction following, strategic goal orientation, language\nunderstanding abilities), where the resulting interactive game play can be\nautomatically scored. In this paper, we take one of the proposed frameworks for\nsetting up such game-play environments, and further test its usefulness as an\nevaluation instrument, along a number of dimensions: We show that it can easily\nkeep up with new developments while avoiding data contamination, we show that\nthe tests implemented within it are not yet saturated (human performance is\nsubstantially higher than that of even the best models), and we show that it\nlends itself to investigating additional questions, such as the impact of the\nprompting language on performance. We believe that the approach forms a good\nbasis for making decisions on model choice for building applied interactive\nsystems, and perhaps ultimately setting up a closed-loop development\nenvironment of system and simulated evaluator.\n","authors":["Anne Beyer","Kranti Chalamalasetti","Sherzod Hakimov","Brielen Madureira","Philipp Sadler","David Schlangen"],"pdf_url":"https://arxiv.org/pdf/2405.20859v1.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2405.20852v1","updated":"2024-05-31T14:34:23Z","published":"2024-05-31T14:34:23Z","title":"Towards Spoken Language Understanding via Multi-level Multi-grained\n Contrastive Learning","summary":" Spoken language understanding (SLU) is a core task in task-oriented dialogue\nsystems, which aims at understanding the user's current goal through\nconstructing semantic frames. SLU usually consists of two subtasks, including\nintent detection and slot filling. Although there are some SLU frameworks joint\nmodeling the two subtasks and achieving high performance, most of them still\noverlook the inherent relationships between intents and slots and fail to\nachieve mutual guidance between the two subtasks. To solve the problem, we\npropose a multi-level multi-grained SLU framework MMCL to apply contrastive\nlearning at three levels, including utterance level, slot level, and word level\nto enable intent and slot to mutually guide each other. For the utterance\nlevel, our framework implements coarse granularity contrastive learning and\nfine granularity contrastive learning simultaneously. Besides, we also apply\nthe self-distillation method to improve the robustness of the model.\nExperimental results and further analysis demonstrate that our proposed model\nachieves new state-of-the-art results on two public multi-intent SLU datasets,\nobtaining a 2.6 overall accuracy improvement on the MixATIS dataset compared to\nprevious best models.\n","authors":["Xuxin Cheng","Wanshi Xu","Zhihong Zhu","Hongxiang Li","Yuexian Zou"],"pdf_url":"https://arxiv.org/pdf/2405.20852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20850v1","updated":"2024-05-31T14:33:07Z","published":"2024-05-31T14:33:07Z","title":"Improving Reward Models with Synthetic Critiques","summary":" Reward models (RM) play a critical role in aligning language models through\nthe process of reinforcement learning from human feedback. RMs are trained to\npredict a score reflecting human preference, which requires significant time\nand cost for human annotation. Additionally, RMs tend to quickly overfit on\nsuperficial features in the training set, hindering their generalization\nperformance on unseen distributions. We propose a novel approach using\nsynthetic natural language critiques generated by large language models to\nprovide additional feedback, evaluating aspects such as instruction following,\ncorrectness, and style. This offers richer signals and more robust features for\nRMs to assess and score on. We demonstrate that high-quality critiques improve\nthe performance and data efficiency of RMs initialized from different\npretrained models. Conversely, we also show that low-quality critiques\nnegatively impact performance. Furthermore, incorporating critiques enhances\nthe interpretability and robustness of RM training.\n","authors":["Zihuiwen Ye","Fraser Greenlee-Scott","Max Bartolo","Phil Blunsom","Jon Ander Campos","Matthias Gallé"],"pdf_url":"https://arxiv.org/pdf/2405.20850v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20846v1","updated":"2024-05-31T14:31:46Z","published":"2024-05-31T14:31:46Z","title":"Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive\n Multimodal Models","summary":" Image-based advertisements are complex multimodal stimuli that often contain\nunusual visual elements and figurative language. Previous research on automatic\nad understanding has reported impressive zero-shot accuracy of contrastive\nvision-and-language models (VLMs) on an ad-explanation retrieval task. Here, we\nexamine the original task setup and show that contrastive VLMs can solve it by\nexploiting grounding heuristics. To control for this confound, we introduce\nTRADE, a new evaluation test set with adversarial grounded explanations. While\nthese explanations look implausible to humans, we show that they \"fool\" four\ndifferent contrastive VLMs. Our findings highlight the need for an improved\noperationalisation of automatic ad understanding that truly evaluates VLMs'\nmultimodal reasoning abilities. We make our code and TRADE available at\nhttps://github.com/dmg-illc/trade .\n","authors":["A. Bavaresco","A. Testoni","R. Fernández"],"pdf_url":"https://arxiv.org/pdf/2405.20846v1.pdf","comment":"Accepted to the main conference ACL 2024"},{"id":"http://arxiv.org/abs/2204.09140v2","updated":"2024-05-31T14:28:40Z","published":"2022-04-19T21:55:18Z","title":"Multi-hop Question Answering","summary":" The task of Question Answering (QA) has attracted significant research\ninterest for long. Its relevance to language understanding and knowledge\nretrieval tasks, along with the simple setting makes the task of QA crucial for\nstrong AI systems. Recent success on simple QA tasks has shifted the focus to\nmore complex settings. Among these, Multi-Hop QA (MHQA) is one of the most\nresearched tasks over the recent years. In broad terms, MHQA is the task of\nanswering natural language questions that involve extracting and combining\nmultiple pieces of information and doing multiple steps of reasoning. An\nexample of a multi-hop question would be \"The Argentine PGA Championship record\nholder has won how many tournaments worldwide?\". Answering the question would\nneed two pieces of information: \"Who is the record holder for Argentine PGA\nChampionship tournaments?\" and \"How many tournaments did [Answer of Sub Q1]\nwin?\". The ability to answer multi-hop questions and perform multi step\nreasoning can significantly improve the utility of NLP systems. Consequently,\nthe field has seen a surge with high quality datasets, models and evaluation\nstrategies. The notion of 'multiple hops' is somewhat abstract which results in\na large variety of tasks that require multi-hop reasoning. This leads to\ndifferent datasets and models that differ significantly from each other and\nmakes the field challenging to generalize and survey. We aim to provide a\ngeneral and formal definition of the MHQA task, and organize and summarize\nexisting MHQA frameworks. We also outline some best practices for building MHQA\ndatasets. This book provides a systematic and thorough introduction as well as\nthe structuring of the existing attempts to this highly interesting, yet quite\nchallenging task.\n","authors":["Vaibhav Mavi","Anubhav Jangra","Adam Jatowt"],"pdf_url":"https://arxiv.org/pdf/2204.09140v2.pdf","comment":"Published at Foundations and Trends in Information Retrieval"},{"id":"http://arxiv.org/abs/2405.20835v1","updated":"2024-05-31T14:24:33Z","published":"2024-05-31T14:24:33Z","title":"Outliers and Calibration Sets have Diminishing Effect on Quantization of\n Modern LLMs","summary":" Post-Training Quantization (PTQ) enhances the efficiency of Large Language\nModels (LLMs) by enabling faster operation and compatibility with more\naccessible hardware through reduced memory usage, at the cost of small\nperformance drops. We explore the role of calibration sets in PTQ, specifically\ntheir effect on hidden activations in various notable open-source LLMs.\nCalibration sets are crucial for evaluating activation magnitudes and\nidentifying outliers, which can distort the quantization range and negatively\nimpact performance. Our analysis reveals a marked contrast in quantization\neffectiveness across models. The older OPT model, which much of the\nquantization literature is based on, shows significant performance\ndeterioration and high susceptibility to outliers with varying calibration\nsets. In contrast, newer models like Llama-2 7B, Llama-3 8B, Command-R 35B, and\nMistral 7B demonstrate strong robustness, with Mistral 7B showing near-immunity\nto outliers and stable activations. These findings suggest a shift in PTQ\nstrategies might be needed. As advancements in pre-training methods reduce the\nrelevance of outliers, there is an emerging need to reassess the fundamentals\nof current quantization literature. The emphasis should pivot towards\noptimizing inference speed, rather than primarily focusing on outlier\npreservation, to align with the evolving characteristics of state-of-the-art\nLLMs.\n","authors":["Davide Paglieri","Saurabh Dash","Tim Rocktäschel","Jack Parker-Holder"],"pdf_url":"https://arxiv.org/pdf/2405.20835v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20833v1","updated":"2024-05-31T14:23:30Z","published":"2024-05-31T14:23:30Z","title":"That's Optional: A Contemporary Exploration of \"that\" Omission in\n English Subordinate Clauses","summary":" The Uniform Information Density (UID) hypothesis posits that speakers\noptimize the communicative properties of their utterances by avoiding spikes in\ninformation, thereby maintaining a relatively uniform information profile over\ntime. This paper investigates the impact of UID principles on syntactic\nreduction, specifically focusing on the optional omission of the connector\n\"that\" in English subordinate clauses. Building upon previous research, we\nextend our investigation to a larger corpus of written English, utilize\ncontemporary large language models (LLMs) and extend the information-uniformity\nprinciples by the notion of entropy, to estimate the UID manifestations in the\nusecase of syntactic reduction choices.\n","authors":["Ella Rabinovich"],"pdf_url":"https://arxiv.org/pdf/2405.20833v1.pdf","comment":"ACL2024 (main conference), 8 pages"},{"id":"http://arxiv.org/abs/2405.20830v1","updated":"2024-05-31T14:21:04Z","published":"2024-05-31T14:21:04Z","title":"Self-Augmented Preference Optimization: Off-Policy Paradigms for\n Language Model Alignment","summary":" Traditional language model alignment methods, such as Direct Preference\nOptimization (DPO), are limited by their dependence on static, pre-collected\npaired preference data, which hampers their adaptability and practical\napplicability. To overcome this limitation, we introduce Self-Augmented\nPreference Optimization (SAPO), an effective and scalable training paradigm\nthat does not require existing paired data. Building on the self-play concept,\nwhich autonomously generates negative responses, we further incorporate an\noff-policy learning pipeline to enhance data exploration and exploitation.\nSpecifically, we employ an Exponential Moving Average (EMA) model in\nconjunction with a replay buffer to enable dynamic updates of response\nsegments, effectively integrating real-time feedback with insights from\nhistorical data. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B\nmodels across benchmarks, including the Open LLM Leaderboard, IFEval,\nAlpacaEval 2.0, and MT-Bench, demonstrate that SAPO matches or surpasses\nestablished offline contrastive baselines, such as DPO and Odds Ratio\nPreference Optimization, and outperforms offline self-play methods like SPIN.\nOur code is available at https://github.com/yinyueqin/SAPO\n","authors":["Yueqin Yin","Zhendong Wang","Yujia Xie","Weizhu Chen","Mingyuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.20830v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20818v1","updated":"2024-05-31T14:14:01Z","published":"2024-05-31T14:14:01Z","title":"An iterated learning model of language change that mixes supervised and\n unsupervised learning","summary":" The iterated learning model is an agent-based model of language change in\nwhich language is transmitted from a tutor to a pupil which itself becomes a\ntutor to a new pupil, and so on. Languages that are stable, expressive, and\ncompositional arise spontaneously as a consequence of a language transmission\nbottleneck. Previous models have implemented an agent's mapping from signals to\nmeanings using an artificial neural network decoder, but have relied on an\nunrealistic and computationally expensive process of obversion to implement the\nassociated encoder, mapping from meanings to signals. Here, a new model is\npresented in which both decoder and encoder are neural networks, trained\nseparately through supervised learning, and trained together through\nunsupervised learning in the form of an autoencoder. This avoids the\nsubstantial computational burden entailed in obversion and introduces a mixture\nof supervised and unsupervised learning as observed during human development.\n","authors":["Jack Bunyan","Seth Bullock","Conor Houghton"],"pdf_url":"https://arxiv.org/pdf/2405.20818v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20805v1","updated":"2024-05-31T14:05:27Z","published":"2024-05-31T14:05:27Z","title":"Multilingual Text Style Transfer: Datasets & Models for Indian Languages","summary":" Text style transfer (TST) involves altering the linguistic style of a text\nwhile preserving its core content. This paper focuses on sentiment transfer, a\nvital TST subtask (Mukherjee et al., 2022a), across a spectrum of Indian\nlanguages: Hindi, Magahi, Malayalam, Marathi, Punjabi, Odia, Telugu, and Urdu,\nexpanding upon previous work on English-Bangla sentiment transfer (Mukherjee et\nal., 2023). We introduce dedicated datasets of 1,000 positive and 1,000\nnegative style-parallel sentences for each of these eight languages. We then\nevaluate the performance of various benchmark models categorized into parallel,\nnon-parallel, cross-lingual, and shared learning approaches, including the\nLlama2 and GPT-3.5 large language models (LLMs). Our experiments highlight the\nsignificance of parallel data in TST and demonstrate the effectiveness of the\nMasked Style Filling (MSF) approach (Mukherjee et al., 2023) in non-parallel\ntechniques. Moreover, cross-lingual and joint multilingual learning methods\nshow promise, offering insights into selecting optimal models tailored to the\nspecific language and task requirements. To the best of our knowledge, this\nwork represents the first comprehensive exploration of the TST task as\nsentiment transfer across a diverse set of languages.\n","authors":["Sourabrata Mukherjee","Atul Kr. Ojha","Akanksha Bansal","Deepak Alok","John P. McCrae","Ondřej Dušek"],"pdf_url":"https://arxiv.org/pdf/2405.20805v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.16698v2","updated":"2024-05-31T14:03:10Z","published":"2024-04-25T15:59:16Z","title":"Cooperate or Collapse: Emergence of Sustainability Behaviors in a\n Society of LLM Agents","summary":" As AI systems pervade human life, ensuring that large language models (LLMs)\nmake safe decisions is a significant challenge. This paper introduces the\nGovernance of the Commons Simulation (GovSim), a generative simulation platform\ndesigned to study strategic interactions and cooperative decision-making in\nLLMs. Using GovSim, we investigate the dynamics of sustainable resource sharing\nin a society of AI agents. This environment allows us to study the influence of\nethical considerations, strategic planning, and negotiation skills on\ncooperative outcomes for AI agents. We develop an LLM-based agent architecture\ndesigned for these social dilemmas and test it with a variety of LLMs. We find\nthat all but the most powerful LLM agents fail to achieve a sustainable\nequilibrium in GovSim. Ablations reveal that successful multi-agent\ncommunication between agents is critical for achieving cooperation in these\ncases. Furthermore, our analyses show that the failure to achieve sustainable\ncooperation in most LLMs stems from their inability to formulate and analyze\nhypotheses about the long-term effects of their actions on the equilibrium of\nthe group. Finally, we show that agents that leverage\n``Universalization''-based reasoning, a theory of moral thinking, are able to\nachieve significantly greater sustainability. Taken together, GovSim enables us\nto study the mechanisms that underlie sustainable self-government with\nsignificant specificity and scale. We open source the full suite of our\nresearch results, including the simulation environment, agent prompts, and a\ncomprehensive web interface.\n","authors":["Giorgio Piatti","Zhijing Jin","Max Kleiman-Weiner","Bernhard Schölkopf","Mrinmaya Sachan","Rada Mihalcea"],"pdf_url":"https://arxiv.org/pdf/2404.16698v2.pdf","comment":"Revised version"},{"id":"http://arxiv.org/abs/2305.15805v3","updated":"2024-05-31T14:02:24Z","published":"2023-05-25T07:39:41Z","title":"Dynamic Context Pruning for Efficient and Interpretable Autoregressive\n Transformers","summary":" Autoregressive Transformers adopted in Large Language Models (LLMs) are hard\nto scale to long sequences. Despite several works trying to reduce their\ncomputational cost, most of LLMs still adopt attention layers between all pairs\nof tokens in the sequence, thus incurring a quadratic cost. In this study, we\npresent a novel approach that dynamically prunes contextual information while\npreserving the model's expressiveness, resulting in reduced memory and\ncomputational requirements during inference. Our method employs a learnable\nmechanism that determines which uninformative tokens can be dropped from the\ncontext at any point across the generation process. By doing so, our approach\nnot only addresses performance concerns but also enhances interpretability,\nproviding valuable insight into the model's decision-making process. Our\ntechnique can be applied to existing pre-trained models through a\nstraightforward fine-tuning process, and the pruning strength can be specified\nby a sparsity parameter. Notably, our empirical findings demonstrate that we\ncan effectively prune up to 80\\% of the context without significant performance\ndegradation on downstream tasks, offering a valuable tool for mitigating\ninference costs. Our reference implementation achieves up to $2\\times$ increase\nin inference throughput and even greater memory savings.\n","authors":["Sotiris Anagnostidis","Dario Pavllo","Luca Biggio","Lorenzo Noci","Aurelien Lucchi","Thomas Hofmann"],"pdf_url":"https://arxiv.org/pdf/2305.15805v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20797v1","updated":"2024-05-31T13:59:18Z","published":"2024-05-31T13:59:18Z","title":"Ovis: Structural Embedding Alignment for Multimodal Large Language Model","summary":" Current Multimodal Large Language Models (MLLMs) typically integrate a\npre-trained LLM with another pre-trained vision transformer through a\nconnector, such as an MLP, endowing the LLM with visual capabilities. However,\nthe misalignment between two embedding strategies in MLLMs -- the structural\ntextual embeddings based on an embedding look-up table and the continuous\nembeddings generated directly by the vision encoder -- makes challenges for a\nmore seamless fusion of visual and textual information. We propose Ovis, a\nnovel MLLM architecture designed to structurally align visual and textual\nembeddings. Ovis integrates an additional learnable visual embedding table into\nthe visual encoder's process. To capture rich visual semantics, each image\npatch indexes the visual embedding table multiple times, resulting in a final\nvisual embedding that is a probabilistic combination of the indexed embeddings.\nThis structural approach mirrors the method used for generating textual\nembeddings. Empirical evaluations on various multimodal benchmarks demonstrate\nthat Ovis outperforms open-source MLLMs of similar parameter scales and even\nsurpasses the proprietary model Qwen-VL-Plus overall. These results highlight\nthe potential of Ovis' structured visual representation for advancing MLLM\narchitectural design and promoting more effective multimodal learning. Both the\nsource code and the training dataset of Ovis will be made publicly available.\n","authors":["Shiyin Lu","Yang Li","Qing-Guo Chen","Zhao Xu","Weihua Luo","Kaifu Zhang","Han-Jia Ye"],"pdf_url":"https://arxiv.org/pdf/2405.20797v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.10144v2","updated":"2024-05-31T13:11:15Z","published":"2024-03-15T09:43:52Z","title":"NLP Verification: Towards a General Methodology for Certifying\n Robustness","summary":" Deep neural networks have exhibited substantial success in the field of\nNatural Language Processing and ensuring their safety and reliability is\ncrucial: there are safety critical contexts where such models must be robust to\nvariability or attack, and give guarantees over their output. Unlike Computer\nVision, NLP lacks a unified verification methodology and, despite recent\nadvancements in literature, they are often light on the pragmatical issues of\nNLP verification. In this paper, we attempt to distil and evaluate general\ncomponents of an NLP verification pipeline, that emerges from the progress in\nthe field to date. Our contributions are two-fold. Firstly, we give a general\n(i.e. algorithm-independent) characterisation of verifiable subspaces that\nresult from embedding sentences into continuous spaces. We identify, and give\nan effective method to deal with, the technical challenge of semantic\ngeneralisability of verified subspaces; and propose it as a standard metric in\nthe NLP verification pipelines (alongside with the standard metrics of model\naccuracy and model verifiability). Secondly, we propose a general methodology\nto analyse the effect of the embedding gap -- a problem that refers to the\ndiscrepancy between verification of geometric subspaces, and the semantic\nmeaning of sentences which the geometric subspaces are supposed to represent.\nIn extreme cases, poor choices in embedding of sentences may invalidate\nverification results. We propose a number of practical NLP methods that can\nhelp to quantify the effects of the embedding gap; and in particular we propose\nthe metric of falsifiability of semantic subspaces as another fundamental\nmetric to be reported as part of the NLP verification pipeline. We believe that\ntogether these general principles pave the way towards a more consolidated and\neffective development of this new domain.\n","authors":["Marco Casadio","Tanvi Dinkar","Ekaterina Komendantskaya","Luca Arnaboldi","Matthew L. Daggitt","Omri Isac","Guy Katz","Verena Rieser","Oliver Lemon"],"pdf_url":"https://arxiv.org/pdf/2403.10144v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18870v2","updated":"2024-05-31T12:45:50Z","published":"2024-05-29T08:31:16Z","title":"LLMs achieve adult human performance on higher-order theory of mind\n tasks","summary":" This paper examines the extent to which large language models (LLMs) have\ndeveloped higher-order theory of mind (ToM); the human ability to reason about\nmultiple mental and emotional states in a recursive manner (e.g. I think that\nyou believe that she knows). This paper builds on prior work by introducing a\nhandwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to\ncompare the performance of five LLMs to a newly gathered adult human benchmark.\nWe find that GPT-4 and Flan-PaLM reach adult-level and near adult-level\nperformance on ToM tasks overall, and that GPT-4 exceeds adult performance on\n6th order inferences. Our results suggest that there is an interplay between\nmodel size and finetuning for the realisation of ToM abilities, and that the\nbest-performing LLMs have developed a generalised capacity for ToM. Given the\nrole that higher-order ToM plays in a wide range of cooperative and competitive\nhuman behaviours, these findings have significant implications for user-facing\nLLM applications.\n","authors":["Winnie Street","John Oliver Siy","Geoff Keeling","Adrien Baranes","Benjamin Barnett","Michael McKibben","Tatenda Kanyere","Alison Lentz","Blaise Aguera y Arcas","Robin I. M. Dunbar"],"pdf_url":"https://arxiv.org/pdf/2405.18870v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.07043v2","updated":"2024-05-31T12:27:52Z","published":"2024-02-10T21:06:34Z","title":"A Tale of Tails: Model Collapse as a Change of Scaling Laws","summary":" As AI model size grows, neural scaling laws have become a crucial tool to\npredict the improvements of large models when increasing capacity and the size\nof original (human or natural) training data. Yet, the widespread use of\npopular models means that the ecosystem of online data and text will co-evolve\nto progressively contain increased amounts of synthesized data. In this paper\nwe ask: How will the scaling laws change in the inevitable regime where\nsynthetic data makes its way into the training corpus? Will future models,\nstill improve, or be doomed to degenerate up to total (model) collapse? We\ndevelop a theoretical framework of model collapse through the lens of scaling\nlaws. We discover a wide range of decay phenomena, analyzing loss of scaling,\nshifted scaling with number of generations, the ''un-learning\" of skills, and\ngrokking when mixing human and synthesized data. Our theory is validated by\nlarge-scale experiments with a transformer on an arithmetic task and text\ngeneration using the large language model Llama2.\n","authors":["Elvis Dohmatob","Yunzhen Feng","Pu Yang","Francois Charton","Julia Kempe"],"pdf_url":"https://arxiv.org/pdf/2402.07043v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20755v1","updated":"2024-05-31T11:43:31Z","published":"2024-05-31T11:43:31Z","title":"Improving code-mixed hate detection by native sample mixing: A case\n study for Hindi-English code-mixed scenario","summary":" Hate detection has long been a challenging task for the NLP community. The\ntask becomes complex in a code-mixed environment because the models must\nunderstand the context and the hate expressed through language alteration.\nCompared to the monolingual setup, we see very less work on code-mixed hate as\nlarge-scale annotated hate corpora are unavailable to make the study. To\novercome this bottleneck, we propose using native language hate samples. We\nhypothesise that in the era of multilingual language models (MLMs), hate in\ncode-mixed settings can be detected by majorly relying on the native language\nsamples. Even though the NLP literature reports the effectiveness of MLMs on\nhate detection in many cross-lingual settings, their extensive evaluation in a\ncode-mixed scenario is yet to be done. This paper attempts to fill this gap\nthrough rigorous empirical experiments. We considered the Hindi-English\ncode-mixed setup as a case study as we have the linguistic expertise for the\nsame. Some of the interesting observations we got are: (i) adding native hate\nsamples in the code-mixed training set, even in small quantity, improved the\nperformance of MLMs for code-mixed hate detection, (ii) MLMs trained with\nnative samples alone observed to be detecting code-mixed hate to a large\nextent, (iii) The visualisation of attention scores revealed that, when native\nsamples were included in training, MLMs could better focus on the hate emitting\nwords in the code-mixed context, and (iv) finally, when hate is subjective or\nsarcastic, naively mixing native samples doesn't help much to detect code-mixed\nhate. We will release the data and code repository to reproduce the reported\nresults.\n","authors":["Debajyoti Mazumder","Aakash Kumar","Jasabanta Patro"],"pdf_url":"https://arxiv.org/pdf/2405.20755v1.pdf","comment":"Generated from XeLaTeX"},{"id":"http://arxiv.org/abs/2405.20708v1","updated":"2024-05-31T09:00:43Z","published":"2024-05-31T09:00:43Z","title":"FinGen: A Dataset for Argument Generation in Finance","summary":" Thinking about the future is one of the important activities that people do\nin daily life. Futurists also pay a lot of effort into figuring out possible\nscenarios for the future. We argue that the exploration of this direction is\nstill in an early stage in the NLP research. To this end, we propose three\nargument generation tasks in the financial application scenario. Our\nexperimental results show these tasks are still big challenges for\nrepresentative generation models. Based on our empirical results, we further\npoint out several unresolved issues and challenges in this research direction.\n","authors":["Chung-Chi Chen","Hiroya Takamura","Ichiro Kobayashi","Yusuke Miyao"],"pdf_url":"https://arxiv.org/pdf/2405.20708v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20703v1","updated":"2024-05-31T08:57:09Z","published":"2024-05-31T08:57:09Z","title":"It is Simple Sometimes: A Study On Improving Aspect-Based Sentiment\n Analysis Performance","summary":" Aspect-Based Sentiment Analysis (ABSA) involves extracting opinions from\ntextual data about specific entities and their corresponding aspects through\nvarious complementary subtasks. Several prior research has focused on\ndeveloping ad hoc designs of varying complexities for these subtasks. In this\npaper, we present a generative framework extensible to any ABSA subtask. We\nbuild upon the instruction tuned model proposed by Scaria et al. (2023), who\npresent an instruction-based model with task descriptions followed by\nin-context examples on ABSA subtasks. We propose PFInstruct, an extension to\nthis instruction learning paradigm by appending an NLP-related task prefix to\nthe task description. This simple approach leads to improved performance across\nall tested SemEval subtasks, surpassing previous state-of-the-art (SOTA) on the\nATE subtask (Rest14) by +3.28 F1-score, and on the AOOE subtask by an average\nof +5.43 F1-score across SemEval datasets. Furthermore, we explore the impact\nof the prefix-enhanced prompt quality on the ABSA subtasks and find that even a\nnoisy prefix enhances model performance compared to the baseline. Our method\nalso achieves competitive results on a biomedical domain dataset (ERSA).\n","authors":["Laura Cabello","Uchenna Akujuobi"],"pdf_url":"https://arxiv.org/pdf/2405.20703v1.pdf","comment":"Accepted to ACL Findings 2024"},{"id":"http://arxiv.org/abs/2405.19967v2","updated":"2024-05-31T08:54:24Z","published":"2024-05-30T11:46:42Z","title":"Improved Out-of-Scope Intent Classification with Dual Encoding and\n Threshold-based Re-Classification","summary":" Detecting out-of-scope user utterances is essential for task-oriented\ndialogues and intent classification. Current methodologies face difficulties\nwith the unpredictable distribution of outliers and often rely on assumptions\nabout data distributions. We present the Dual Encoder for Threshold-Based\nRe-Classification (DETER) to address these challenges. This end-to-end\nframework efficiently detects out-of-scope intents without requiring\nassumptions on data distributions or additional post-processing steps. The core\nof DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and\nthe Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance\nembeddings, which are classified through a branched neural architecture.\nFurther, DETER generates synthetic outliers using self-supervision and\nincorporates out-of-scope phrases from open-domain datasets. This approach\nensures a comprehensive training set for out-of-scope detection. Additionally,\na threshold-based re-classification mechanism refines the model's initial\npredictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77\ndatasets demonstrate DETER's efficacy. Our model outperforms previous\nbenchmarks, increasing up to 13% and 5% in F1 score for known and unknown\nintents on CLINC-150 and Stackoverflow, and 16% for known and 24% % for unknown\nintents on Banking77. The source code has been released at\nhttps://github.com/Hossam-Mohammed-tech/Intent_Classification_OOS.\n","authors":["Hossam M. Zawbaa","Wael Rashwan","Sourav Dutta","Haytham Assem"],"pdf_url":"https://arxiv.org/pdf/2405.19967v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20701v1","updated":"2024-05-31T08:53:59Z","published":"2024-05-31T08:53:59Z","title":"Unveiling the Lexical Sensitivity of LLMs: Combinatorial Optimization\n for Prompt Enhancement","summary":" Large language models (LLMs) demonstrate exceptional instruct-following\nability to complete various downstream tasks. Although this impressive ability\nmakes LLMs flexible task solvers, their performance in solving tasks also\nheavily relies on instructions. In this paper, we reveal that LLMs are\nover-sensitive to lexical variations in task instructions, even when the\nvariations are imperceptible to humans. By providing models with neighborhood\ninstructions, which are closely situated in the latent representation space and\ndiffer by only one semantically similar word, the performance on downstream\ntasks can be vastly different. Following this property, we propose a black-box\nCombinatorial Optimization framework for Prompt Lexical Enhancement (COPLE).\nCOPLE performs iterative lexical optimization according to the feedback from a\nbatch of proxy tasks, using a search strategy related to word influence.\nExperiments show that even widely-used human-crafted prompts for current\nbenchmarks suffer from the lexical sensitivity of models, and COPLE recovers\nthe declined model ability in both instruct-following and solving downstream\ntasks.\n","authors":["Pengwei Zhan","Zhen Xu","Qian Tan","Jie Song","Ru Xie"],"pdf_url":"https://arxiv.org/pdf/2405.20701v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.15854v2","updated":"2024-05-31T08:37:04Z","published":"2024-01-29T03:05:35Z","title":"LSTM-based Deep Neural Network With A Focus on Sentence Representation\n for Sequential Sentence Classification in Medical Scientific Abstracts","summary":" The Sequential Sentence Classification task within the domain of medical\nabstracts, termed as SSC, involves the categorization of sentences into\npre-defined headings based on their roles in conveying critical information in\nthe abstract. In the SSC task, sentences are sequentially related to each\nother. For this reason, the role of sentence embeddings is crucial for\ncapturing both the semantic information between words in the sentence and the\ncontextual relationship of sentences within the abstract, which then enhances\nthe SSC system performance. In this paper, we propose a LSTM-based deep\nlearning network with a focus on creating comprehensive sentence representation\nat the sentence level. To demonstrate the efficacy of the created sentence\nrepresentation, a system utilizing these sentence embeddings is also developed,\nwhich consists of a Convolutional-Recurrent neural network (C-RNN) at the\nabstract level and a multi-layer perception network (MLP) at the segment level.\nOur proposed system yields highly competitive results compared to\nstate-of-the-art systems and further enhances the F1 scores of the baseline by\n1.0%, 2.8%, and 2.6% on the benchmark datasets PudMed 200K RCT, PudMed 20K RCT\nand NICTA-PIBOSO, respectively. This indicates the significant impact of\nimproving sentence representation on boosting model performance.\n","authors":["Phat Lam","Lam Pham","Tin Nguyen","Hieu Tang","Michael Seidl","Medina Andresel","Alexander Schindler"],"pdf_url":"https://arxiv.org/pdf/2401.15854v2.pdf","comment":"Submitted to FedCSIS 2024"},{"id":"http://arxiv.org/abs/2405.20684v1","updated":"2024-05-31T08:26:47Z","published":"2024-05-31T08:26:47Z","title":"Joint Embeddings for Graph Instruction Tuning","summary":" Large Language Models (LLMs) have achieved impressive performance in text\nunderstanding and have become an essential tool for building smart assistants.\nOriginally focusing on text, they have been enhanced with multimodal\ncapabilities in recent works that successfully built visual instruction\nfollowing assistants. As far as the graph modality goes, however, no such\nassistants have yet been developed. Graph structures are complex in that they\nrepresent relation between different features and are permutation invariant.\nMoreover, representing them in purely textual form does not always lead to good\nLLM performance even for finetuned models. As a result, there is a need to\ndevelop a new method to integrate graphs in LLMs for general graph\nunderstanding. This work explores the integration of the graph modality in LLM\nfor general graph instruction following tasks. It aims at producing a deep\nlearning model that enhances an underlying LLM with graph embeddings and trains\nit to understand them and to produce, given an instruction, an answer grounded\nin the graph representation. The approach performs significantly better than a\ngraph to text approach and remains consistent even for larger graphs.\n","authors":["Vlad Argatu","Aaron Haag","Oliver Lohse"],"pdf_url":"https://arxiv.org/pdf/2405.20684v1.pdf","comment":"Conference Preprint"},{"id":"http://arxiv.org/abs/2405.20680v1","updated":"2024-05-31T08:22:49Z","published":"2024-05-31T08:22:49Z","title":"Unraveling and Mitigating Retriever Inconsistencies in\n Retrieval-Augmented Large Language Models","summary":" Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their\nsuperiority in terms of factuality, they do not consistently outperform the\noriginal retrieval-free Language Models (LMs). Our experiments reveal that this\nexample-level performance inconsistency exists not only between\nretrieval-augmented and retrieval-free LM but also among different retrievers.\nTo understand this phenomenon, we investigate the degeneration behavior of\nRALMs and theoretically decompose it into four categories. Further analysis\nbased on our decomposition reveals that the innate difference in knowledge\nsources and the unpredictable degeneration of the reader model contribute most\nto the inconsistency. Drawing from our analysis, we introduce Ensemble of\nRetrievers (EoR), a trainable framework that can adaptively retrieve from\ndifferent knowledge sources and effectively decrease unpredictable reader\nerrors. Our experiments on Open Domain Question Answering show that EoR\nsubstantially improves performance over the RALM with a single retriever by\nconsiderably reducing inconsistent behaviors.\n","authors":["Mingda Li","Xinyu Li","Yifan Chen","Wenfeng Xuan","Weinan Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20680v1.pdf","comment":"ACL 2024 (findings)"},{"id":"http://arxiv.org/abs/2405.20671v1","updated":"2024-05-31T08:13:35Z","published":"2024-05-31T08:13:35Z","title":"Position Coupling: Leveraging Task Structure for Improved Length\n Generalization of Transformers","summary":" Even for simple arithmetic tasks like integer addition, it is challenging for\nTransformers to generalize to longer sequences than those encountered during\ntraining. To tackle this problem, we propose position coupling, a simple yet\neffective method that directly embeds the structure of the tasks into the\npositional encoding of a (decoder-only) Transformer. Taking a departure from\nthe vanilla absolute position mechanism assigning unique position IDs to each\nof the tokens, we assign the same position IDs to two or more \"relevant\"\ntokens; for integer addition tasks, we regard digits of the same significance\nas in the same position. On the empirical side, we show that with the proposed\nposition coupling, a small (1-layer) Transformer trained on 1 to 30-digit\nadditions can generalize up to 200-digit additions (6.67x of the trained\nlength). On the theoretical side, we prove that a 1-layer Transformer with\ncoupled positions can solve the addition task involving exponentially many\ndigits, whereas any 1-layer Transformer without positional information cannot\nentirely solve it. We also demonstrate that position coupling can be applied to\nother algorithmic tasks such as addition with multiple summands, Nx2\nmultiplication, copy/reverse, and a two-dimensional task.\n","authors":["Hanseul Cho","Jaeyoung Cha","Pranjal Awasthi","Srinadh Bhojanapalli","Anupam Gupta","Chulhee Yun"],"pdf_url":"https://arxiv.org/pdf/2405.20671v1.pdf","comment":"73 pages, 20 figures, 90 tables"},{"id":"http://arxiv.org/abs/2405.19732v2","updated":"2024-05-31T08:13:34Z","published":"2024-05-30T06:24:14Z","title":"Two Optimizers Are Better Than One: LLM Catalyst for Enhancing\n Gradient-Based Optimization","summary":" Learning a skill generally relies on both practical experience by doer and\ninsightful high-level guidance by instructor. Will this strategy also work well\nfor solving complex non-convex optimization problems? Here, a common\ngradient-based optimizer acts like a disciplined doer, making locally optimal\nupdate at each step. Recent methods utilize large language models (LLMs) to\noptimize solutions for concrete problems by inferring from natural language\ninstructions, akin to a high-level instructor. In this paper, we show that\nthese two optimizers are complementary to each other, suggesting a\ncollaborative optimization approach. The gradient-based optimizer and LLM-based\noptimizer are combined in an interleaved manner. We instruct LLMs using task\ndescriptions and timely optimization trajectories recorded during\ngradient-based optimization. Inferred results from LLMs are used as restarting\npoints for the next stage of gradient optimization. By leveraging both the\nlocally rigorous gradient-based optimizer and the high-level deductive\nLLM-based optimizer, our combined optimization method consistently yields\nimprovements over competitive baseline prompt tuning methods. Our results\ndemonstrate the synergistic effect of conventional gradient-based optimization\nand the inference ability of LLMs. The code is released at\nhttps://github.com/guozix/LLM-catalyst.\n","authors":["Zixian Guo","Ming Liu","Zhilong Ji","Jinfeng Bai","Yiwen Guo","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2405.19732v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.01289v2","updated":"2024-05-31T08:07:45Z","published":"2024-03-02T19:01:40Z","title":"Greed is All You Need: An Evaluation of Tokenizer Inference Methods","summary":" While subword tokenizers such as BPE and WordPiece are typically used to\nbuild vocabularies for NLP models, the method of decoding text into a sequence\nof tokens from these vocabularies is often left unspecified, or ill-suited to\nthe method in which they were constructed. We provide a controlled analysis of\nseven tokenizer inference methods across four different algorithms and three\nvocabulary sizes, performed on a novel intrinsic evaluation suite we curated\nfor English, combining measures rooted in morphology, cognition, and\ninformation theory. We show that for the most commonly used tokenizers, greedy\ninference performs surprisingly well; and that SaGe, a recently-introduced\ncontextually-informed tokenizer, outperforms all others on morphological\nalignment.\n","authors":["Omri Uzan","Craig W. Schmidt","Chris Tanner","Yuval Pinter"],"pdf_url":"https://arxiv.org/pdf/2403.01289v2.pdf","comment":"ACL 2024 (main)"},{"id":"http://arxiv.org/abs/2311.11745v2","updated":"2024-05-31T07:57:13Z","published":"2023-11-20T13:13:24Z","title":"ELF: Encoding Speaker-Specific Latent Speech Feature for Speech\n Synthesis","summary":" In this work, we propose a novel method for modeling numerous speakers, which\nenables expressing the overall characteristics of speakers in detail like a\ntrained multi-speaker model without additional training on the target speaker's\ndataset. Although various works with similar purposes have been actively\nstudied, their performance has not yet reached that of trained multi-speaker\nmodels due to their fundamental limitations. To overcome previous limitations,\nwe propose effective methods for feature learning and representing target\nspeakers' speech characteristics by discretizing the features and conditioning\nthem to a speech synthesis model. Our method obtained a significantly higher\nsimilarity mean opinion score (SMOS) in subjective similarity evaluation than\nseen speakers of a high-performance multi-speaker model, even with unseen\nspeakers. The proposed method also outperforms a zero-shot method by\nsignificant margins. Furthermore, our method shows remarkable performance in\ngenerating new artificial speakers. In addition, we demonstrate that the\nencoded latent features are sufficiently informative to reconstruct an original\nspeaker's speech completely. It implies that our method can be used as a\ngeneral methodology to encode and reconstruct speakers' characteristics in\nvarious tasks.\n","authors":["Jungil Kong","Junmo Lee","Jeongmin Kim","Beomjeong Kim","Jihoon Park","Dohee Kong","Changheon Lee","Sangjin Kim"],"pdf_url":"https://arxiv.org/pdf/2311.11745v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2310.18339v2","updated":"2024-05-31T07:56:08Z","published":"2023-10-21T17:18:09Z","title":"When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task\n Medical Applications","summary":" The recent surge in Large Language Models (LLMs) has garnered significant\nattention across numerous fields. Fine-tuning is often required to fit general\nLLMs for a specific domain, like the web-based healthcare system. However, two\nproblems arise during fine-tuning LLMs for medical applications. One is the\ntask variety problem, which involves distinct tasks in real-world medical\nscenarios. The variety often leads to sub-optimal fine-tuning for data\nimbalance and seesaw problems. Besides, the large amount of parameters in LLMs\nleads to huge time and computation consumption by fine-tuning. To address these\ntwo problems, we propose a novel parameter efficient fine-tuning framework for\nmulti-task medical applications, dubbed as MOELoRA. The designed framework aims\nto absorb both the benefits of mixture-of-expert (MOE) for multi-task learning\nand low-rank adaptation (LoRA) for parameter efficient fine-tuning. For\nunifying MOE and LoRA, we devise multiple experts as the trainable parameters,\nwhere each expert consists of a pair of low-rank matrices to retain the small\nsize of trainable parameters. Then, a task-motivated gate function for all\nMOELoRA layers is proposed, which can control the contributions of each expert\nand produce distinct parameters for various tasks. We conduct experiments on a\nmulti-task medical dataset, indicating MOELoRA outperforms the existing\nparameter efficient fine-tuning methods. The code is available online.\n","authors":["Qidong Liu","Xian Wu","Xiangyu Zhao","Yuanshao Zhu","Derong Xu","Feng Tian","Yefeng Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.18339v2.pdf","comment":"accepted by SIGIR'24"},{"id":"http://arxiv.org/abs/2405.20657v1","updated":"2024-05-31T07:51:16Z","published":"2024-05-31T07:51:16Z","title":"DORY: Deliberative Prompt Recovery for LLM","summary":" Prompt recovery in large language models (LLMs) is crucial for understanding\nhow LLMs work and addressing concerns regarding privacy, copyright, etc. The\ntrend towards inference-only APIs complicates this task by restricting access\nto essential outputs for recovery. To tackle this challenge, we extract\nprompt-related information from limited outputs and identify a strong(negative)\ncorrelation between output probability-based uncertainty and the success of\nprompt recovery. This finding led to the development of Deliberative PrOmpt\nRecoverY (DORY), our novel approach that leverages uncertainty to recover\nprompts accurately. DORY involves reconstructing drafts from outputs, refining\nthese with hints, and filtering out noise based on uncertainty. Our evaluation\nacross diverse LLMs and prompt benchmarks shows that DORY outperforms existing\nbaselines, improving performance by approximately 10.82% and establishing a new\nstate-of-the-art record in prompt recovery tasks. Significantly, DORY operates\nusing a single LLM without any external resources or model, offering a\ncost-effective, user-friendly prompt recovery solution.\n","authors":["Lirong Gao","Ru Peng","Yiming Zhang","Junbo Zhao"],"pdf_url":"https://arxiv.org/pdf/2405.20657v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20654v1","updated":"2024-05-31T07:43:42Z","published":"2024-05-31T07:43:42Z","title":"Passage-specific Prompt Tuning for Passage Reranking in Question\n Answering with Large Language Models","summary":" Effective passage retrieval and reranking methods have been widely utilized\nto identify suitable candidates in open-domain question answering tasks, recent\nstudies have resorted to LLMs for reranking the retrieved passages by the\nlog-likelihood of the question conditioned on each passage. Although these\nmethods have demonstrated promising results, the performance is notably\nsensitive to the human-written prompt (or hard prompt), and fine-tuning LLMs\ncan be computationally intensive and time-consuming. Furthermore, this approach\nlimits the leverage of question-passage relevance pairs and passage-specific\nknowledge to enhance the ranking capabilities of LLMs. In this paper, we\npropose passage-specific prompt tuning for reranking in open-domain question\nanswering (PSPT): a parameter-efficient method that fine-tunes learnable\npassage-specific soft prompts, incorporating passage-specific knowledge from a\nlimited set of question-passage relevance pairs. The method involves ranking\nretrieved passages based on the log-likelihood of the model generating the\nquestion conditioned on each passage and the learned soft prompt. We conducted\nextensive experiments utilizing the Llama-2-chat-7B model across three publicly\navailable open-domain question answering datasets and the results demonstrate\nthe effectiveness of the proposed approach.\n","authors":["Xuyang Wu","Zhiyuan Peng","Sravanthi Rajanala","Hsin-Tai Wu","Yi Fang"],"pdf_url":"https://arxiv.org/pdf/2405.20654v1.pdf","comment":"Accepted at Gen-IR@SIGIR24"},{"id":"http://arxiv.org/abs/2403.03031v3","updated":"2024-05-31T07:42:44Z","published":"2024-03-05T15:08:16Z","title":"Learning to Use Tools via Cooperative and Interactive Agents","summary":" Tool learning empowers large language models (LLMs) as agents to use external\ntools to extend their capability. Existing methods employ one single LLM-based\nagent to iteratively select and execute tools, thereafter incorporating the\nresult into the next action prediction. However, they still suffer from\npotential performance degradation when addressing complex tasks due to: (1) the\nlimitation of the inherent capability of a single LLM to perform diverse\nactions, and (2) the struggle to adaptively correct mistakes when the task\nfails. To mitigate these problems, we propose the ConAgents, a Cooperative and\ninteractive Agents framework, which modularizes the workflow of tool learning\ninto Grounding, Execution, and Observing agents. We also introduce an iterative\ncalibration (IterCali) method, enabling the agents to adapt themselves based on\nthe feedback from the tool environment. Experiments conducted on three datasets\ndemonstrate the superiority of our ConAgents (e.g., 6 point improvement over\nthe SOTA baseline). We further provide fine-granularity analysis for the\nefficiency and consistency of our framework.\n","authors":["Zhengliang Shi","Shen Gao","Xiuyi Chen","Lingyong Yan","Haibo Shi","Dawei Yin","Zhumin Chen","Pengjie Ren","Suzan Verberne","Zhaochun Ren"],"pdf_url":"https://arxiv.org/pdf/2403.03031v3.pdf","comment":"working in process, 20 pages"},{"id":"http://arxiv.org/abs/2405.20649v1","updated":"2024-05-31T07:30:34Z","published":"2024-05-31T07:30:34Z","title":"Reward-based Input Construction for Cross-document Relation Extraction","summary":" Relation extraction (RE) is a fundamental task in natural language\nprocessing, aiming to identify relations between target entities in text. While\nmany RE methods are designed for a single sentence or document, cross-document\nRE has emerged to address relations across multiple long documents. Given the\nnature of long documents in cross-document RE, extracting document embeddings\nis challenging due to the length constraints of pre-trained language models.\nTherefore, we propose REward-based Input Construction (REIC), the first\nlearning-based sentence selector for cross-document RE. REIC extracts sentences\nbased on relational evidence, enabling the RE module to effectively infer\nrelations. Since supervision of evidence sentences is generally unavailable, we\ntrain REIC using reinforcement learning with RE prediction scores as rewards.\nExperimental results demonstrate the superiority of our method over heuristic\nmethods for different RE structures and backbones in cross-document RE. Our\ncode is publicly available at https://github.com/aailabkaist/REIC.\n","authors":["Byeonghu Na","Suhyeon Jo","Yeongmin Kim","Il-Chul Moon"],"pdf_url":"https://arxiv.org/pdf/2405.20649v1.pdf","comment":"Accepted at ACL 2024 main conference"},{"id":"http://arxiv.org/abs/2405.20648v1","updated":"2024-05-31T07:30:24Z","published":"2024-05-31T07:30:24Z","title":"Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision\n Models For Video Captioning and Summarization","summary":" Video is an increasingly prominent and information-dense medium, yet it poses\nsubstantial challenges for language models. A typical video consists of a\nsequence of shorter segments, or shots, that collectively form a coherent\nnarrative. Each shot is analogous to a word in a sentence where multiple data\nstreams of information (such as visual and auditory data) must be processed\nsimultaneously. Comprehension of the entire video requires not only\nunderstanding the visual-audio information of each shot but also requires that\nthe model links the ideas between each shot to generate a larger,\nall-encompassing story. Despite significant progress in the field, current\nworks often overlook videos' more granular shot-by-shot semantic information.\nIn this project, we propose a family of efficient large language vision models\n(LLVMs) to boost video summarization and captioning called Shotluck Holmes. By\nleveraging better pretraining and data collection strategies, we extend the\nabilities of existing small LLVMs from being able to understand a picture to\nbeing able to understand a sequence of frames. Specifically, we show that\nShotluck Holmes achieves better performance than state-of-the-art results on\nthe Shot2Story video captioning and summary task with significantly smaller and\nmore computationally efficient models.\n","authors":["Richard Luo","Austin Peng","Adithya Vasudev","Rishabh Jain"],"pdf_url":"https://arxiv.org/pdf/2405.20648v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19831v2","updated":"2024-05-31T07:24:55Z","published":"2024-05-30T08:41:33Z","title":"Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic\n Similarity and Privacy Preservation of Differentially Private Rewritten Text","summary":" The study of Differential Privacy (DP) in Natural Language Processing often\nviews the task of text privatization as a $\\textit{rewriting}$ task, in which\nsensitive input texts are rewritten to hide explicit or implicit private\ninformation. In order to evaluate the privacy-preserving capabilities of a DP\ntext rewriting mechanism, $\\textit{empirical privacy}$ tests are frequently\nemployed. In these tests, an adversary is modeled, who aims to infer sensitive\ninformation (e.g., gender) about the author behind a (privatized) text. Looking\nto improve the empirical protections provided by DP rewriting methods, we\npropose a simple post-processing method based on the goal of aligning rewritten\ntexts with their original counterparts, where DP rewritten texts are rewritten\n$\\textit{again}$. Our results show that such an approach not only produces\noutputs that are more semantically reminiscent of the original inputs, but also\ntexts which score on average better in empirical privacy evaluations.\nTherefore, our approach raises the bar for DP rewriting methods in their\nempirical privacy evaluations, providing an extra layer of protection against\nmalicious adversaries.\n","authors":["Stephen Meisenbacher","Florian Matthes"],"pdf_url":"https://arxiv.org/pdf/2405.19831v2.pdf","comment":"10 pages, 2 figures, 2 tables. Accepted to ARES 2024 (IWAPS)"},{"id":"http://arxiv.org/abs/2405.20646v1","updated":"2024-05-31T07:24:42Z","published":"2024-05-31T07:24:42Z","title":"Large Language Models Enhanced Sequential Recommendation for Long-tail\n User and Item","summary":" Sequential recommendation systems (SRS) serve the purpose of predicting\nusers' subsequent preferences based on their past interactions and have been\napplied across various domains such as e-commerce and social networking\nplatforms. However, practical SRS encounters challenges due to the fact that\nmost users engage with only a limited number of items, while the majority of\nitems are seldom consumed. These challenges, termed as the long-tail user and\nlong-tail item dilemmas, often create obstacles for traditional SRS methods.\nMitigating these challenges is crucial as they can significantly impact user\nsatisfaction and business profitability. While some research endeavors have\nalleviated these issues, they still grapple with issues such as seesaw or noise\nstemming from the scarcity of interactions. The emergence of large language\nmodels (LLMs) presents a promising avenue to address these challenges from a\nsemantic standpoint. In this study, we introduce the Large Language Models\nEnhancement framework for Sequential Recommendation (LLM-ESR), which leverages\nsemantic embeddings from LLMs to enhance SRS performance without increasing\ncomputational overhead. To combat the long-tail item challenge, we propose a\ndual-view modeling approach that fuses semantic information from LLMs with\ncollaborative signals from traditional SRS. To address the long-tail user\nchallenge, we introduce a retrieval augmented self-distillation technique to\nrefine user preference representations by incorporating richer interaction data\nfrom similar users. Through comprehensive experiments conducted on three\nauthentic datasets using three widely used SRS models, our proposed enhancement\nframework demonstrates superior performance compared to existing methodologies.\n","authors":["Qidong Liu","Xian Wu","Xiangyu Zhao","Yejing Wang","Zijian Zhang","Feng Tian","Yefeng Zheng"],"pdf_url":"https://arxiv.org/pdf/2405.20646v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13923v2","updated":"2024-05-31T07:22:45Z","published":"2024-05-22T18:53:25Z","title":"Why Not Transform Chat Large Language Models to Non-English?","summary":" The scarcity of non-English data limits the development of non-English large\nlanguage models (LLMs). Transforming English-centric LLMs to non-English has\nbeen identified as an effective and resource-efficient method. Previous works\nstart from base LLMs and perform knowledge distillation (KD) with data\ngenerated by stronger LLMs, e.g. GPT-4. Compared to base LLMs, chat LLMs are\nfurther optimized for advanced abilities, e.g. multi-turn conversation and\nhuman preference alignment, and thus more powerful in both helpfulness and\nsafety. However, transforming a chat LLM involves two critical issues: (1) How\ncan we effectively transfer advanced abilities without their supervised data?\n(2) How can we prevent the original knowledge from catastrophic forgetting\nduring transformation? We target these issues by introducing a simple framework\ncalled TransLLM. For the first issue, TransLLM divides the transfer problem\ninto some common sub-tasks with the translation chain-of-thought, which uses\nthe translation as the bridge between English and non-English step-by-step. We\nfurther enhance the performance of sub-tasks with publicly available data. For\nthe second issue, we propose a method comprising two synergistic components:\nlow-rank adaptation for training to maintain the original LLM parameters, and\nrecovery KD, which utilizes data generated by the chat LLM itself to recover\nthe original knowledge from the frozen parameters. In the experiments, we\ntransform the LLaMA-2-chat-7B to the Thai language. Our method, using only\nsingle-turn data, outperforms strong baselines and ChatGPT on multi-turn\nbenchmark MT-bench. Furthermore, our method, without safety data, rejects more\nharmful queries of safety benchmark AdvBench than both ChatGPT and GPT-4.\n","authors":["Xiang Geng","Ming Zhu","Jiahuan Li","Zhejian Lai","Wei Zou","Shuaijie She","Jiaxin Guo","Xiaofeng Zhao","Yinglu Li","Yuang Li","Chang Su","Yanqing Zhao","Xinglin Lyu","Min Zhang","Jiajun Chen","Hao Yang","Shujian Huang"],"pdf_url":"https://arxiv.org/pdf/2405.13923v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.10650v6","updated":"2024-05-31T07:22:42Z","published":"2024-05-17T09:25:30Z","title":"SPOR: A Comprehensive and Practical Evaluation Method for Compositional\n Generalization in Data-to-Text Generation","summary":" Compositional generalization is an important ability of language models and\nhas many different manifestations. For data-to-text generation, previous\nresearch on this ability is limited to a single manifestation called\nSystematicity and lacks consideration of large language models (LLMs), which\ncannot fully cover practical application scenarios. In this work, we propose\nSPOR, a comprehensive and practical evaluation method for compositional\ngeneralization in data-to-text generation. SPOR includes four aspects of\nmanifestations (Systematicity, Productivity, Order invariance, and Rule\nlearnability) and allows high-quality evaluation without additional manual\nannotations based on existing datasets. We demonstrate SPOR on two different\ndatasets and evaluate some existing language models including LLMs. We find\nthat the models are deficient in various aspects of the evaluation and need\nfurther improvement. Our work shows the necessity for comprehensive research on\ndifferent manifestations of compositional generalization in data-to-text\ngeneration and provides a framework for evaluation.\n","authors":["Ziyao Xu","Houfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2405.10650v6.pdf","comment":"Accepted to ACL 2024 main conference"},{"id":"http://arxiv.org/abs/2401.00368v3","updated":"2024-05-31T07:22:01Z","published":"2023-12-31T02:13:18Z","title":"Improving Text Embeddings with Large Language Models","summary":" In this paper, we introduce a novel and simple method for obtaining\nhigh-quality text embeddings using only synthetic data and less than 1k\ntraining steps. Unlike existing methods that often depend on multi-stage\nintermediate pre-training with billions of weakly-supervised text pairs,\nfollowed by fine-tuning with a few labeled datasets, our method does not\nrequire building complex training pipelines or relying on manually collected\ndatasets that are often constrained by task diversity and language coverage. We\nleverage proprietary LLMs to generate diverse synthetic data for hundreds of\nthousands of text embedding tasks across 93 languages. We then fine-tune\nopen-source decoder-only LLMs on the synthetic data using standard contrastive\nloss. Experiments demonstrate that our method achieves strong performance on\nhighly competitive text embedding benchmarks without using any labeled data.\nFurthermore, when fine-tuned with a mixture of synthetic and labeled data, our\nmodel sets new state-of-the-art results on the BEIR and MTEB benchmarks.\n","authors":["Liang Wang","Nan Yang","Xiaolong Huang","Linjun Yang","Rangan Majumder","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2401.00368v3.pdf","comment":"Accepted by ACL 2024"},{"id":"http://arxiv.org/abs/2405.20628v1","updated":"2024-05-31T05:40:56Z","published":"2024-05-31T05:40:56Z","title":"ToxVidLLM: A Multimodal LLM-based Framework for Toxicity Detection in\n Code-Mixed Videos","summary":" In an era of rapidly evolving internet technology, the surge in multimodal\ncontent, including videos, has expanded the horizons of online communication.\nHowever, the detection of toxic content in this diverse landscape, particularly\nin low-resource code-mixed languages, remains a critical challenge. While\nsubstantial research has addressed toxic content detection in textual data, the\nrealm of video content, especially in non-English languages, has been\nrelatively underexplored. This paper addresses this research gap by introducing\na benchmark dataset, the first of its kind, consisting of 931 videos with 4021\ncode-mixed Hindi-English utterances collected from YouTube. Each utterance\nwithin this dataset has been meticulously annotated for toxicity, severity, and\nsentiment labels. We have developed an advanced Multimodal Multitask framework\nbuilt for Toxicity detection in Video Content by leveraging Large Language\nModels (LLMs), crafted for the primary objective along with the additional\ntasks of conducting sentiment and severity analysis. ToxVidLLM incorporates\nthree key modules the Encoder module, Cross-Modal Synchronization module, and\nMultitask module crafting a generic multimodal LLM customized for intricate\nvideo classification tasks. Our experiments reveal that incorporating multiple\nmodalities from the videos substantially enhances the performance of toxic\ncontent detection by achieving an Accuracy and Weighted F1 score of 94.29% and\n94.35%, respectively.\n","authors":["Krishanu Maity","A. S. Poornash","Sriparna Saha","Pushpak Bhattacharyya"],"pdf_url":"https://arxiv.org/pdf/2405.20628v1.pdf","comment":"ACL Findings 2024"},{"id":"http://arxiv.org/abs/2405.20624v1","updated":"2024-05-31T05:22:07Z","published":"2024-05-31T05:22:07Z","title":"Leveraging Large Language Models for Entity Matching","summary":" Entity matching (EM) is a critical task in data integration, aiming to\nidentify records across different datasets that refer to the same real-world\nentities. Traditional methods often rely on manually engineered features and\nrule-based systems, which struggle with diverse and unstructured data. The\nemergence of Large Language Models (LLMs) such as GPT-4 offers transformative\npotential for EM, leveraging their advanced semantic understanding and\ncontextual capabilities. This vision paper explores the application of LLMs to\nEM, discussing their advantages, challenges, and future research directions.\nAdditionally, we review related work on applying weak supervision and\nunsupervised approaches to EM, highlighting how LLMs can enhance these methods.\n","authors":["Qianyu Huang","Tongfang Zhao"],"pdf_url":"https://arxiv.org/pdf/2405.20624v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20613v1","updated":"2024-05-31T04:05:09Z","published":"2024-05-31T04:05:09Z","title":"FineRadScore: A Radiology Report Line-by-Line Evaluation Technique\n Generating Corrections with Severity Scores","summary":" The current gold standard for evaluating generated chest x-ray (CXR) reports\nis through radiologist annotations. However, this process can be extremely\ntime-consuming and costly, especially when evaluating large numbers of reports.\nIn this work, we present FineRadScore, a Large Language Model (LLM)-based\nautomated evaluation metric for generated CXR reports. Given a candidate report\nand a ground-truth report, FineRadScore gives the minimum number of\nline-by-line corrections required to go from the candidate to the ground-truth\nreport. Additionally, FineRadScore provides an error severity rating with each\ncorrection and generates comments explaining why the correction was needed. We\ndemonstrate that FineRadScore's corrections and error severity scores align\nwith radiologist opinions. We also show that, when used to judge the quality of\nthe report as a whole, FineRadScore aligns with radiologists as well as current\nstate-of-the-art automated CXR evaluation metrics. Finally, we analyze\nFineRadScore's shortcomings to provide suggestions for future improvements.\n","authors":["Alyssa Huang","Oishi Banerjee","Kay Wu","Eduardo Pontes Reis","Pranav Rajpurkar"],"pdf_url":"https://arxiv.org/pdf/2405.20613v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20612v1","updated":"2024-05-31T03:59:15Z","published":"2024-05-31T03:59:15Z","title":"UniBias: Unveiling and Mitigating LLM Bias through Internal Attention\n and FFN Manipulation","summary":" Large language models (LLMs) have demonstrated impressive capabilities in\nvarious tasks using the in-context learning (ICL) paradigm. However, their\neffectiveness is often compromised by inherent bias, leading to prompt\nbrittleness, i.e., sensitivity to design settings such as example selection,\norder, and prompt formatting. Previous studies have addressed LLM bias through\nexternal adjustment of model outputs, but the internal mechanisms that lead to\nsuch bias remain unexplored. Our work delves into these mechanisms,\nparticularly investigating how feedforward neural networks (FFNs) and attention\nheads result in the bias of LLMs. By Interpreting the contribution of\nindividual FFN vectors and attention heads, we identify the biased LLM\ncomponents that skew LLMs' prediction toward specific labels. To mitigate these\nbiases, we introduce UniBias, an inference-only method that effectively\nidentifies and eliminates biased FFN vectors and attention heads. Extensive\nexperiments across 12 NLP datasets demonstrate that UniBias significantly\nenhances ICL performance and alleviates prompt brittleness of LLMs.\n","authors":["Hanzhang Zhou","Zijian Feng","Zixiao Zhu","Junlang Qian","Kezhi Mao"],"pdf_url":"https://arxiv.org/pdf/2405.20612v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20611v1","updated":"2024-05-31T03:57:19Z","published":"2024-05-31T03:57:19Z","title":"Bi-Directional Transformers vs. word2vec: Discovering Vulnerabilities in\n Lifted Compiled Code","summary":" Detecting vulnerabilities within compiled binaries is challenging due to lost\nhigh-level code structures and other factors such as architectural\ndependencies, compilers, and optimization options. To address these obstacles,\nthis research explores vulnerability detection by using natural language\nprocessing (NLP) embedding techniques with word2vec, BERT, and RoBERTa to learn\nsemantics from intermediate representation (LLVM) code. Long short-term memory\n(LSTM) neural networks were trained on embeddings from encoders created using\napproximately 118k LLVM functions from the Juliet dataset. This study is\npioneering in its comparison of word2vec models with multiple bidirectional\ntransformer (BERT, RoBERTa) embeddings built using LLVM code to train neural\nnetworks to detect vulnerabilities in compiled binaries. word2vec Continuous\nBag of Words (CBOW) models achieved 92.3% validation accuracy in detecting\nvulnerabilities, outperforming word2vec Skip-Gram, BERT, and RoBERTa. This\nsuggests that complex contextual NLP embeddings may not provide advantages over\nsimpler word2vec models for this task when a limited number (e.g. 118K) of data\nsamples are used to train the bidirectional transformer-based models. The\ncomparative results provide novel insights into selecting optimal embeddings\nfor learning compiler-independent semantic code representations to advance\nmachine learning detection of vulnerabilities in compiled binaries.\n","authors":["Gary A. McCully","John D. Hastings","Shengjie Xu","Adam Fortier"],"pdf_url":"https://arxiv.org/pdf/2405.20611v1.pdf","comment":"8 pages, 0 figures, IEEE 4th Cyber Awareness and Research Symposium\n 2024 (CARS'24)"},{"id":"http://arxiv.org/abs/2405.20608v1","updated":"2024-05-31T03:48:00Z","published":"2024-05-31T03:48:00Z","title":"Identifying while Learning for Document Event Causality Identification","summary":" Event Causality Identification (ECI) aims to detect whether there exists a\ncausal relation between two events in a document. Existing studies adopt a kind\nof identifying after learning paradigm, where events' representations are first\nlearned and then used for the identification. Furthermore, they mainly focus on\nthe causality existence, but ignoring causal direction. In this paper, we take\ncare of the causal direction and propose a new identifying while learning mode\nfor the ECI task. We argue that a few causal relations can be easily identified\nwith high confidence, and the directionality and structure of these identified\ncausalities can be utilized to update events' representations for boosting next\nround of causality identification. To this end, this paper designs an\n*iterative learning and identifying framework*: In each iteration, we construct\nan event causality graph, on which events' causal structure representations are\nupdated for boosting causal identification. Experiments on two public datasets\nshow that our approach outperforms the state-of-the-art algorithms in both\nevaluations for causality existence identification and direction\nidentification.\n","authors":["Cheng Liu","Wei Xiang","Bang Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20608v1.pdf","comment":"Accepted at ACL 2024"},{"id":"http://arxiv.org/abs/2402.16714v2","updated":"2024-05-31T03:34:57Z","published":"2024-02-26T16:31:28Z","title":"Quantum linear algebra is all you need for Transformer architectures","summary":" Generative machine learning methods such as large-language models are\nrevolutionizing the creation of text and images. While these models are\npowerful they also harness a large amount of computational resources. The\ntransformer is a key component in large language models that aims to generate a\nsuitable completion of a given partial sequence. In this work, we investigate\ntransformer architectures under the lens of fault-tolerant quantum computing.\nThe input model is one where trained weight matrices are given as block\nencodings and we construct the query, key, and value matrices for the\ntransformer. We show how to prepare a block encoding of the self-attention\nmatrix, with a new subroutine for the row-wise application of the softmax\nfunction. In addition, we combine quantum subroutines to construct important\nbuilding blocks in the transformer, the residual connection and layer\nnormalization, and the feed-forward neural network. Our subroutines prepare an\namplitude encoding of the transformer output, which can be measured to obtain a\nprediction. Based on common open-source large-language models, we provide\ninsights into the behavior of important parameters determining the run time of\nthe quantum algorithm. We discuss the potential and challenges for obtaining a\nquantum advantage.\n","authors":["Naixu Guo","Zhan Yu","Matthew Choi","Aman Agrawal","Kouhei Nakaji","Alán Aspuru-Guzik","Patrick Rebentrost"],"pdf_url":"https://arxiv.org/pdf/2402.16714v2.pdf","comment":"31 pages, 4 figures, 2 tables, comments are welcome"},{"id":"http://arxiv.org/abs/2405.20602v1","updated":"2024-05-31T03:26:42Z","published":"2024-05-31T03:26:42Z","title":"Masked Language Modeling Becomes Conditional Density Estimation for\n Tabular Data Synthesis","summary":" In this paper, our goal is to generate synthetic data for heterogeneous\n(mixed-type) tabular datasets with high machine learning utility (MLu). Given\nthat the MLu performance relies on accurately approximating the conditional\ndistributions, we focus on devising a synthetic data generation method based on\nconditional distribution estimation. We propose a novel synthetic data\ngeneration method, MaCoDE, by redefining the multi-class classification task of\nMasked Language Modeling (MLM) as histogram-based non-parametric conditional\ndensity estimation. Our proposed method enables estimating conditional\ndensities across arbitrary combinations of target and conditional variables.\nFurthermore, we demonstrate that our proposed method bridges the theoretical\ngap between distributional learning and MLM. To validate the effectiveness of\nour proposed model, we conduct synthetic data generation experiments on 10\nreal-world datasets. Given the analogy between predicting masked input tokens\nin MLM and missing data imputation, we also evaluate the performance of\nmultiple imputations on incomplete datasets with various missing data\nmechanisms. Moreover, our proposed model offers the advantage of enabling\nadjustments to data privacy levels without requiring re-training.\n","authors":["Seunghwan An","Gyeongdong Woo","Jaesung Lim","ChangHyun Kim","Sungchul Hong","Jong-June Jeon"],"pdf_url":"https://arxiv.org/pdf/2405.20602v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.12146v3","updated":"2024-05-31T03:25:42Z","published":"2024-02-19T13:57:55Z","title":"Enabling Weak LLMs to Judge Response Reliability via Meta Ranking","summary":" Despite the strong performance of large language models (LLMs) across a wide\nrange of tasks, they still have reliability issues. Previous studies indicate\nthat strong LLMs like GPT-4-turbo excel in evaluating the reliability of\nresponses from LLMs, but face efficiency and local deployment issues. Thus, to\nenable weak LLMs to effectively assess the reliability of LLM responses, we\npropose a novel cross-query-comparison-based method called $\\textit{Meta\nRanking}$ (MR). Unlike previous few-shot methods that solely based on\nin-context learning capabilities in LLMs, MR assesses reliability by pairwisely\nranking the target query-response pair with multiple reference query-response\npairs. We found that MR is highly effective in error detection for LLM\nresponses, where weak LLMs, such as Phi-2, could surpass strong baselines like\nGPT-3.5-turbo, requiring only five reference samples and significantly\nimproving efficiency. We further demonstrate that MR can enhance strong LLMs'\nperformance in two practical applications: model cascading and instruction\ntuning. In model cascading, we combine open- and closed-source LLMs to achieve\nperformance comparable to GPT-4-turbo with lower costs. In instruction tuning,\nwe use MR for iterative training data filtering, significantly reducing data\nprocessing time and enabling LLaMA-7B and Phi-2 to surpass Alpaca-13B with\nfewer training tokens. These results underscore the high potential of MR in\nboth efficiency and effectiveness.\n","authors":["Zijun Liu","Boqun Kou","Peng Li","Ming Yan","Ji Zhang","Fei Huang","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2402.12146v3.pdf","comment":"Preprint, under review. 28 pages"},{"id":"http://arxiv.org/abs/2311.03732v4","updated":"2024-05-31T03:07:23Z","published":"2023-11-07T05:22:11Z","title":"Learning to Learn for Few-shot Continual Active Learning","summary":" Continual learning strives to ensure stability in solving previously seen\ntasks while demonstrating plasticity in a novel domain. Recent advances in\ncontinual learning are mostly confined to a supervised learning setting,\nespecially in NLP domain. In this work, we consider a few-shot continual active\nlearning setting where labeled data are inadequate, and unlabeled data are\nabundant but with a limited annotation budget. We exploit meta-learning and\npropose a method, called Meta-Continual Active Learning. This method\nsequentially queries the most informative examples from a pool of unlabeled\ndata for annotation to enhance task-specific performance and tackle continual\nlearning problems through meta-objective. Specifically, we employ meta-learning\nand experience replay to address inter-task confusion and catastrophic\nforgetting. We further incorporate textual augmentations to avoid memory\nover-fitting caused by experience replay and sample queries, thereby ensuring\ngeneralization. We conduct extensive experiments on benchmark text\nclassification datasets from diverse domains to validate the feasibility and\neffectiveness of meta-continual active learning. We also analyze the impact of\ndifferent active learning strategies on various meta continual learning models.\nThe experimental results demonstrate that introducing randomness into sample\nselection is the best default strategy for maintaining generalization in\nmeta-continual learning framework.\n","authors":["Stella Ho","Ming Liu","Shang Gao","Longxiang Gao"],"pdf_url":"https://arxiv.org/pdf/2311.03732v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19670v2","updated":"2024-05-31T02:56:56Z","published":"2024-05-30T03:44:54Z","title":"One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for\n Retrieval-Augmented Large Language Models","summary":" Retrieval-augmented generation (RAG) is a promising way to improve large\nlanguage models (LLMs) for generating more factual, accurate, and up-to-date\ncontent. Existing methods either optimize prompts to guide LLMs in leveraging\nretrieved information or directly fine-tune the LLMs to adapt to RAG scenarios.\nAlthough fine-tuning can yield better performance, it often compromises the\nLLMs' general generation capabilities by modifying their parameters. This\nlimitation poses challenges in practical applications, especially when LLMs are\nalready deployed, as parameter adjustments may affect their original\nfunctionality. To address this, we propose a novel method that involves\nlearning scalable and pluggable virtual tokens for RAG. By maintaining the\nLLMs' original parameters and fine-tuning only the embeddings of these\npluggable tokens, our approach not only enhances LLMs' performance but also\npreserves their general generation capacities. Furthermore, we design several\ntraining strategies to improve the scalability, flexibility, and\ngeneralizability of our method. Comprehensive experiments across nine\nquestion-answering tasks demonstrate the superiority of our approach.\n","authors":["Yutao Zhu","Zhaoheng Huang","Zhicheng Dou","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2405.19670v2.pdf","comment":"working in progress, repo: https://github.com/DaoD/SPRING/"},{"id":"http://arxiv.org/abs/2405.20588v1","updated":"2024-05-31T02:56:49Z","published":"2024-05-31T02:56:49Z","title":"DAFNet: Dynamic Auxiliary Fusion for Sequential Model Editing in Large\n Language Models","summary":" Recently, while large language models (LLMs) have demonstrated impressive\nresults, they still suffer from hallucination, i.e., the generation of false\ninformation. Model editing is the task of fixing factual mistakes in LLMs; yet,\nmost previous works treat it as a one-time task, paying little attention to\never-emerging mistakes generated by LLMs. We address the task of sequential\nmodel editing (SME) that aims to rectify mistakes continuously. A Dynamic\nAuxiliary Fusion Network (DAFNet) is designed to enhance the semantic\ninteraction among the factual knowledge within the entire sequence, preventing\ncatastrophic forgetting during the editing process of multiple knowledge\ntriples. Specifically, (1) for semantic fusion within a relation triple, we\naggregate the intra-editing attention flow into auto-regressive self-attention\nwith token-level granularity in LLMs. We further leverage multi-layer diagonal\ninter-editing attention flow to update the weighted representations of the\nentire sequence-level granularity. (2) Considering that auxiliary parameters\nare required to store the knowledge for sequential editing, we construct a new\ndataset named \\textbf{DAFSet}, fulfilling recent, popular, long-tail and robust\nproperties to enhance the generality of sequential editing. Experiments show\nDAFNet significantly outperforms strong baselines in single-turn and sequential\nediting. The usage of DAFSet also consistently improves the performance of\nother auxiliary network-based methods in various scenarios\n","authors":["Taolin Zhang","Qizhou Chen","Dongyang Li","Chengyu Wang","Xiaofeng He","Longtao Huang","Hui Xue","Jun Huang"],"pdf_url":"https://arxiv.org/pdf/2405.20588v1.pdf","comment":"ACL2024 findings"},{"id":"http://arxiv.org/abs/2405.20585v1","updated":"2024-05-31T02:53:22Z","published":"2024-05-31T02:53:22Z","title":"GAMedX: Generative AI-based Medical Entity Data Extractor Using Large\n Language Models","summary":" In the rapidly evolving field of healthcare and beyond, the integration of\ngenerative AI in Electronic Health Records (EHRs) represents a pivotal\nadvancement, addressing a critical gap in current information extraction\ntechniques. This paper introduces GAMedX, a Named Entity Recognition (NER)\napproach utilizing Large Language Models (LLMs) to efficiently extract entities\nfrom medical narratives and unstructured text generated throughout various\nphases of the patient hospital visit. By addressing the significant challenge\nof processing unstructured medical text, GAMedX leverages the capabilities of\ngenerative AI and LLMs for improved data extraction. Employing a unified\napproach, the methodology integrates open-source LLMs for NER, utilizing\nchained prompts and Pydantic schemas for structured output to navigate the\ncomplexities of specialized medical jargon. The findings reveal significant\nROUGE F1 score on one of the evaluation datasets with an accuracy of 98\\%. This\ninnovation enhances entity extraction, offering a scalable, cost-effective\nsolution for automated forms filling from unstructured data. As a result,\nGAMedX streamlines the processing of unstructured narratives, and sets a new\nstandard in NER applications, contributing significantly to theoretical and\npractical advancements beyond the medical technology sphere.\n","authors":["Mohammed-Khalil Ghali","Abdelrahman Farrag","Hajar Sakai","Hicham El Baz","Yu Jin","Sarah Lam"],"pdf_url":"https://arxiv.org/pdf/2405.20585v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.08846v2","updated":"2024-05-31T02:37:10Z","published":"2024-04-12T23:27:46Z","title":"Experimental Design for Active Transductive Inference in Large Language\n Models","summary":" One emergent ability of large language models (LLMs) is that query-specific\nexamples can be included in the prompt at inference time. In this work, we use\nactive learning for adaptive prompt design and call it Active In-context Prompt\nDesign (AIPD). We design the LLM prompt by adaptively choosing few-shot\nexamples from a training set to optimize performance on a test set. The\ntraining examples are initially unlabeled and we obtain the label of the most\ninformative ones, which maximally reduces uncertainty in the LLM prediction. We\npropose two algorithms, GO and SAL, which differ in how the few-shot examples\nare chosen. We analyze these algorithms in linear models: first GO and then use\nits equivalence with SAL. We experiment with many different tasks in small,\nmedium-sized, and large language models; and show that GO and SAL outperform\nother methods for choosing few-shot examples in the LLM prompt at inference\ntime.\n","authors":["Subhojyoti Mukherjee","Anusha Lalitha","Aniket Deshmukh","Ge Liu","Yifei Ma","Branislav Kveton"],"pdf_url":"https://arxiv.org/pdf/2404.08846v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20582v1","updated":"2024-05-31T02:28:41Z","published":"2024-05-31T02:28:41Z","title":"The Point of View of a Sentiment: Towards Clinician Bias Detection in\n Psychiatric Notes","summary":" In psychiatry, negative patient descriptions and stigmatizing language can\ncontribute to healthcare disparities in two ways: (1) read by patients they can\nharm their trust and engagement with the medical center; (2) read by future\nproviders they may negatively influence the future perspective of a patient. By\nleveraging large language models, this work aims to identify the sentiment\nexpressed in psychiatric clinical notes based on the reader's point of view.\nExtracting sentences from the Mount Sinai Health System's large and diverse\nclinical notes, we used prompts and in-context learning to adapt three large\nlanguage models (GPT-3.5, Llama 2, Mistral) to classify the sentiment conveyed\nby the sentences according to the provider or non-provider point of view.\nResults showed that GPT-3.5 aligns best to provider point of view, whereas\nMistral aligns best to non-provider point of view.\n","authors":["Alissa A. Valentine","Lauren A. Lepow","Alexander W. Charney","Isotta Landi"],"pdf_url":"https://arxiv.org/pdf/2405.20582v1.pdf","comment":"Oral presentation at NAACL 2024 Queer in AI Workshop"},{"id":"http://arxiv.org/abs/2405.20574v1","updated":"2024-05-31T02:05:45Z","published":"2024-05-31T02:05:45Z","title":"Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with\n Ko-H5 Benchmark","summary":" This paper introduces the Open Ko-LLM Leaderboard and the Ko-H5 Benchmark as\nvital tools for evaluating Large Language Models (LLMs) in Korean.\nIncorporating private test sets while mirroring the English Open LLM\nLeaderboard, we establish a robust evaluation framework that has been well\nintegrated in the Korean LLM community. We perform data leakage analysis that\nshows the benefit of private test sets along with a correlation study within\nthe Ko-H5 benchmark and temporal analyses of the Ko-H5 score. Moreover, we\npresent empirical support for the need to expand beyond set benchmarks. We hope\nthe Open Ko-LLM Leaderboard sets precedent for expanding LLM evaluation to\nfoster more linguistic diversity.\n","authors":["Chanjun Park","Hyeonwoo Kim","Dahyun Kim","Seonghwan Cho","Sanghoon Kim","Sukyung Lee","Yungi Kim","Hwalsuk Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20574v1.pdf","comment":"Accepted at ACL 2024 Main"},{"id":"http://arxiv.org/abs/2405.20172v2","updated":"2024-05-31T01:59:20Z","published":"2024-05-30T15:44:27Z","title":"Iterative Feature Boosting for Explainable Speech Emotion Recognition","summary":" In speech emotion recognition (SER), using predefined features without\nconsidering their practical importance may lead to high dimensional datasets,\nincluding redundant and irrelevant information. Consequently, high-dimensional\nlearning often results in decreasing model accuracy while increasing\ncomputational complexity. Our work underlines the importance of carefully\nconsidering and analyzing features in order to build efficient SER systems. We\npresent a new supervised SER method based on an efficient feature engineering\napproach. We pay particular attention to the explainability of results to\nevaluate feature relevance and refine feature sets. This is performed\niteratively through feature evaluation loop, using Shapley values to boost\nfeature selection and improve overall framework performance. Our approach\nallows thus to balance the benefits between model performance and transparency.\nThe proposed method outperforms human-level performance (HLP) and\nstate-of-the-art machine learning methods in emotion recognition on the TESS\ndataset.\n","authors":["Alaa Nfissi","Wassim Bouachir","Nizar Bouguila","Brian Mishara"],"pdf_url":"https://arxiv.org/pdf/2405.20172v2.pdf","comment":"Published in: 2023 International Conference on Machine Learning and\n Applications (ICMLA)"},{"id":"http://arxiv.org/abs/2405.19325v2","updated":"2024-05-31T01:41:49Z","published":"2024-05-29T17:55:03Z","title":"Nearest Neighbor Speculative Decoding for LLM Generation and Attribution","summary":" Large language models (LLMs) often hallucinate and lack the ability to\nprovide attribution for their generations. Semi-parametric LMs, such as kNN-LM,\napproach these limitations by refining the output of an LM for a given prompt\nusing its nearest neighbor matches in a non-parametric data store. However,\nthese models often exhibit slow inference speeds and produce non-fluent texts.\nIn this paper, we introduce Nearest Neighbor Speculative Decoding (NEST), a\nnovel semi-parametric language modeling approach that is capable of\nincorporating real-world text spans of arbitrary length into the LM generations\nand providing attribution to their sources. NEST performs token-level retrieval\nat each inference step to compute a semi-parametric mixture distribution and\nidentify promising span continuations in a corpus. It then uses an approximate\nspeculative decoding procedure that accepts a prefix of the retrieved span or\ngenerates a new token. NEST significantly enhances the generation quality and\nattribution rate of the base LM across a variety of knowledge-intensive tasks,\nsurpassing the conventional kNN-LM method and performing competitively with\nin-context retrieval augmentation. In addition, NEST substantially improves the\ngeneration speed, achieving a 1.8x speedup in inference time when applied to\nLlama-2-Chat 70B.\n","authors":["Minghan Li","Xilun Chen","Ari Holtzman","Beidi Chen","Jimmy Lin","Wen-tau Yih","Xi Victoria Lin"],"pdf_url":"https://arxiv.org/pdf/2405.19325v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.07989v2","updated":"2024-05-31T01:36:53Z","published":"2024-04-11T17:59:45Z","title":"Any2Point: Empowering Any-modality Large Models for Efficient 3D\n Understanding","summary":" Large foundation models have recently emerged as a prominent focus of\ninterest, attaining superior performance in widespread scenarios. Due to the\nscarcity of 3D data, many efforts have been made to adapt pre-trained\ntransformers from vision to 3D domains. However, such 2D-to-3D approaches are\nstill limited, due to the potential loss of spatial geometries and high\ncomputation cost. More importantly, their frameworks are mainly designed for 2D\nmodels, lacking a general any-to-3D paradigm. In this paper, we introduce\nAny2Point, a parameter-efficient method to empower any-modality large models\n(vision, language, audio) for 3D understanding. Given a frozen transformer from\nany source modality, we propose a 3D-to-any (1D or 2D) virtual projection\nstrategy that correlates the input 3D points to the original 1D or 2D positions\nwithin the source modality. This mechanism enables us to assign each 3D token\nwith a positional encoding paired with the pre-trained model, which avoids 3D\ngeometry loss caused by the true projection and better motivates the\ntransformer for 3D learning with 1D/2D positional priors. Then, within each\ntransformer block, we insert an any-to-3D guided adapter module for\nparameter-efficient fine-tuning. The adapter incorporates prior spatial\nknowledge from the source modality to guide the local feature aggregation of 3D\ntokens, compelling the semantic adaption of any-modality transformers. We\nconduct extensive experiments to showcase the effectiveness and efficiency of\nour method. Code and models are released at\nhttps://github.com/Ivan-Tang-3D/Any2Point.\n","authors":["Yiwen Tang","Ray Zhang","Jiaming Liu","Zoey Guo","Dong Wang","Zhigang Wang","Bin Zhao","Shanghang Zhang","Peng Gao","Hongsheng Li","Xuelong Li"],"pdf_url":"https://arxiv.org/pdf/2404.07989v2.pdf","comment":"Code and models are released at\n https://github.com/Ivan-Tang-3D/Any2Point"},{"id":"http://arxiv.org/abs/2311.09510v3","updated":"2024-05-31T01:32:23Z","published":"2023-11-16T02:25:36Z","title":"Tailoring with Targeted Precision: Edit-Based Agents for Open-Domain\n Procedure Customization","summary":" How-to procedures, such as how to plant a garden, are now used by millions of\nusers, but sometimes need customizing to meet a user's specific needs, e.g.,\nplanting a garden without pesticides. Our goal is to measure and improve an\nLLM's ability to perform such customization. Our approach is to test several\nsimple multi-LLM-agent architectures for customization, as well as an\nend-to-end LLM, using a new evaluation set, called CustomPlans, of over 200\nWikiHow procedures each with a customization need. We find that a simple\narchitecture with two LLM agents used sequentially performs best, one that\nedits a generic how-to procedure and one that verifies its executability,\nsignificantly outperforming (10.5% absolute) an end-to-end prompted LLM. This\nsuggests that LLMs can be configured reasonably effectively for procedure\ncustomization. This also suggests that multi-agent editing architectures may be\nworth exploring further for other customization applications (e.g. coding,\ncreative writing) in the future.\n","authors":["Yash Kumar Lal","Li Zhang","Faeze Brahman","Bodhisattwa Prasad Majumder","Peter Clark","Niket Tandon"],"pdf_url":"https://arxiv.org/pdf/2311.09510v3.pdf","comment":"Camera ready version accepted to Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2305.15255v4","updated":"2024-05-31T01:29:27Z","published":"2023-05-24T15:39:43Z","title":"Spoken Question Answering and Speech Continuation Using\n Spectrogram-Powered LLM","summary":" We present Spectron, a novel approach to adapting pre-trained large language\nmodels (LLMs) to perform spoken question answering (QA) and speech\ncontinuation. By endowing the LLM with a pre-trained speech encoder, our model\nbecomes able to take speech inputs and generate speech outputs. The entire\nsystem is trained end-to-end and operates directly on spectrograms, simplifying\nour architecture. Key to our approach is a training objective that jointly\nsupervises speech recognition, text continuation, and speech synthesis using\nonly paired speech-text pairs, enabling a `cross-modal' chain-of-thought within\na single decoding pass. Our method surpasses existing spoken language models in\nspeaker preservation and semantic coherence. Furthermore, the proposed model\nimproves upon direct initialization in retaining the knowledge of the original\nLLM as demonstrated through spoken QA datasets. We release our audio samples\n(https://michelleramanovich.github.io/spectron/spectron) and spoken QA dataset\n(https://github.com/google-research-datasets/LLAMA1-Test-Set).\n","authors":["Eliya Nachmani","Alon Levkovitch","Roy Hirsch","Julian Salazar","Chulayuth Asawaroengchai","Soroosh Mariooryad","Ehud Rivlin","RJ Skerry-Ryan","Michelle Tadmor Ramanovich"],"pdf_url":"https://arxiv.org/pdf/2305.15255v4.pdf","comment":"ICLR 2024 camera-ready"},{"id":"http://arxiv.org/abs/2405.19787v2","updated":"2024-05-31T01:23:41Z","published":"2024-05-30T07:54:07Z","title":"From Symbolic Tasks to Code Generation: Diversification Yields Better\n Task Performers","summary":" Instruction tuning -- tuning large language models on instruction-output\npairs -- is a promising technique for making models better adapted to the real\nworld. Yet, the key factors driving the model's capability to understand and\nfollow instructions not seen during training remain under-explored. Our\ninvestigation begins with a series of synthetic experiments within the\ntheoretical framework of a Turing-complete algorithm called Markov algorithm,\nwhich allows fine-grained control over the instruction-tuning data.\nGeneralization and robustness with respect to the training distribution emerge\nonce a diverse enough set of tasks is provided, even though very few examples\nare provided for each task. We extend these initial results to a real-world\napplication scenario of code generation and find that a more diverse\ninstruction set, extending beyond code-related tasks, improves the performance\nof code generation. Our observations suggest that a more diverse semantic space\nfor instruction-tuning sets greatly improves the model's ability to follow\ninstructions and perform tasks.\n","authors":["Dylan Zhang","Justin Wang","Francois Charton"],"pdf_url":"https://arxiv.org/pdf/2405.19787v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.04626v2","updated":"2024-05-31T00:12:59Z","published":"2024-03-07T16:11:43Z","title":"MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training\n with Masked Autoencoder","summary":" Within the domain of medical analysis, extensive research has explored the\npotential of mutual learning between Masked Autoencoders(MAEs) and multimodal\ndata. However, the impact of MAEs on intermodality remains a key challenge. We\nintroduce MedFLIP, a Fast Language-Image Pre-training method for Medical\nanalysis. We explore MAEs for zero-shot learning with crossed domains, which\nenhances the model's ability to learn from limited data, a common scenario in\nmedical diagnostics. We verify that masking an image does not affect\ninter-modal learning. Furthermore, we propose the SVD loss to enhance the\nrepresentation learning for characteristics of medical images, aiming to\nimprove classification accuracy by leveraging the structural intricacies of\nsuch data. Our theory posits that masking encourages semantic preservation,\nrobust feature extraction, regularization, domain adaptation, and invariance\nlearning. Lastly, we validate using language will improve the zero-shot\nperformance for the medical image analysis. MedFLIP's scaling of the masking\nprocess marks an advancement in the field, offering a pathway to rapid and\nprecise medical image analysis without the traditional computational\nbottlenecks. Through experiments and validation, MedFLIP demonstrates efficient\nperformance improvements, helps for future research and application in medical\ndiagnostics.\n","authors":["Lei Li","Tianfang Zhang","Xinglin Zhang","Jiaqi Liu","Bingqi Ma","Yan Luo","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2403.04626v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.17509v2","updated":"2024-05-31T00:09:56Z","published":"2024-02-27T13:49:12Z","title":"Extreme Miscalibration and the Illusion of Adversarial Robustness","summary":" Deep learning-based Natural Language Processing (NLP) models are vulnerable\nto adversarial attacks, where small perturbations can cause a model to\nmisclassify. Adversarial Training (AT) is often used to increase model\nrobustness. However, we have discovered an intriguing phenomenon: deliberately\nor accidentally miscalibrating models masks gradients in a way that interferes\nwith adversarial attack search methods, giving rise to an apparent increase in\nrobustness. We show that this observed gain in robustness is an illusion of\nrobustness (IOR), and demonstrate how an adversary can perform various forms of\ntest-time temperature calibration to nullify the aforementioned interference\nand allow the adversarial attack to find adversarial examples. Hence, we urge\nthe NLP community to incorporate test-time temperature scaling into their\nrobustness evaluations to ensure that any observed gains are genuine. Finally,\nwe show how the temperature can be scaled during \\textit{training} to improve\ngenuine robustness.\n","authors":["Vyas Raina","Samson Tan","Volkan Cevher","Aditya Rawal","Sheng Zha","George Karypis"],"pdf_url":"https://arxiv.org/pdf/2402.17509v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.16367v2","updated":"2024-05-31T23:47:15Z","published":"2024-04-25T07:10:29Z","title":"Learning Syntax Without Planting Trees: Understanding When and Why\n Transformers Generalize Hierarchically","summary":" Transformers trained on natural language data have been shown to learn its\nhierarchical structure and generalize to sentences with unseen syntactic\nstructures without explicitly encoding any structural bias. In this work, we\ninvestigate sources of inductive bias in transformer models and their training\nthat could cause such generalization behavior to emerge. We extensively\nexperiment with transformer models trained on multiple synthetic datasets and\nwith different training objectives and show that while other objectives e.g.\nsequence-to-sequence modeling, prefix language modeling, often failed to lead\nto hierarchical generalization, models trained with the language modeling\nobjective consistently learned to generalize hierarchically. We then conduct\npruning experiments to study how transformers trained with the language\nmodeling objective encode hierarchical structure. When pruned, we find joint\nexistence of subnetworks within the model with different generalization\nbehaviors (subnetworks corresponding to hierarchical structure and linear\norder). Finally, we take a Bayesian perspective to further uncover\ntransformers' preference for hierarchical generalization: We establish a\ncorrelation between whether transformers generalize hierarchically on a dataset\nand whether the simplest explanation of that dataset is provided by a\nhierarchical grammar compared to regular grammars exhibiting linear\ngeneralization.\n","authors":["Kabir Ahuja","Vidhisha Balachandran","Madhur Panwar","Tianxing He","Noah A. Smith","Navin Goyal","Yulia Tsvetkov"],"pdf_url":"https://arxiv.org/pdf/2404.16367v2.pdf","comment":"Code now available: https://github.com/kabirahuja2431/transformers-hg"},{"id":"http://arxiv.org/abs/2310.12516v2","updated":"2024-05-31T23:46:24Z","published":"2023-10-19T06:37:32Z","title":"ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large\n Language Models via Transferable Adversarial Attacks","summary":" Despite remarkable advancements in mitigating hallucinations in large\nlanguage models (LLMs) by retrieval augmentation, it remains challenging to\nmeasure the reliability of LLMs using static question-answering (QA) data.\nSpecifically, given the potential of data contamination (e.g., leading to\nmemorization), good static benchmark performance does not ensure that model can\nreliably use the provided evidence for responding, which is essential to avoid\nhallucination when the required knowledge is new or private. Inspired by\nadversarial machine learning, we investigate the feasibility of automatically\nperturbing existing static one for dynamic evaluation. Specifically, this paper\npresents ReEval, an LLM-based framework using prompt chaining to perturb the\noriginal evidence for generating new test cases for evaluating the LLMs'\nreliability in using new evidence for answering.\n We implement ReEval using ChatGPT and evaluate the resulting variants of two\npopular open-domain QA datasets on a collection of LLMs under various prompting\nsettings. Our generated data is human-readable and useful to trigger\nhallucination in LLM. Accurate models on static data are observed to produce\nunsupported answers from the perturbed evidence, with pronounced accuracy drops\nacross LLMs including GPT-4. We find that our adversarial examples are\ntransferable across all considered LLMs. The examples generated by a small\nmodel can be used to evaluate a much larger model, making our approach\ncost-effective.\n","authors":["Xiaodong Yu","Hao Cheng","Xiaodong Liu","Dan Roth","Jianfeng Gao"],"pdf_url":"https://arxiv.org/pdf/2310.12516v2.pdf","comment":"NAACL 2024 Findings"},{"id":"http://arxiv.org/abs/2405.16969v3","updated":"2024-05-31T23:15:55Z","published":"2024-05-27T09:06:24Z","title":"The Multi-Range Theory of Translation Quality Measurement: MQM scoring\n models and Statistical Quality Control","summary":" The year 2024 marks the 10th anniversary of the Multidimensional Quality\nMetrics (MQM) framework for analytic translation quality evaluation. The MQM\nerror typology has been widely used by practitioners in the translation and\nlocalization industry and has served as the basis for many derivative projects.\nThe annual Conference on Machine Translation (WMT) shared tasks on both human\nand automatic translation quality evaluations used the MQM error typology.\n The metric stands on two pillars: error typology and the scoring model. The\nscoring model calculates the quality score from annotation data, detailing how\nto convert error type and severity counts into numeric scores to determine if\nthe content meets specifications. Previously, only the raw scoring model had\nbeen published. This April, the MQM Council published the Linear Calibrated\nScoring Model, officially presented herein, along with the Non-Linear Scoring\nModel, which had not been published before.\n This paper details the latest MQM developments and presents a universal\napproach to translation quality measurement across three sample size ranges. It\nalso explains why Statistical Quality Control should be used for very small\nsample sizes, starting from a single sentence.\n","authors":["Arle Lommel","Serge Gladkoff","Alan Melby","Sue Ellen Wright","Ingemar Strandvik","Katerina Gasova","Angelika Vaasa","Andy Benzo","Romina Marazzato Sparano","Monica Foresi","Johani Innis","Lifeng Han","Goran Nenadic"],"pdf_url":"https://arxiv.org/pdf/2405.16969v3.pdf","comment":"working paper, 20 pages"},{"id":"http://arxiv.org/abs/2308.06795v2","updated":"2024-05-31T22:41:54Z","published":"2023-08-13T15:44:39Z","title":"Robust Infidelity: When Faithfulness Measures on Masked Language Models\n Are Misleading","summary":" A common approach to quantifying neural text classifier interpretability is\nto calculate faithfulness metrics based on iteratively masking salient input\ntokens and measuring changes in the model prediction. We propose that this\nproperty is better described as \"sensitivity to iterative masking\", and\nhighlight pitfalls in using this measure for comparing text classifier\ninterpretability. We show that iterative masking produces large variation in\nfaithfulness scores between otherwise comparable Transformer encoder text\nclassifiers. We then demonstrate that iteratively masked samples produce\nembeddings outside the distribution seen during training, resulting in\nunpredictable behaviour. We further explore task-specific considerations that\nundermine principled comparison of interpretability using iterative masking,\nsuch as an underlying similarity to salience-based adversarial attacks. Our\nfindings give insight into how these behaviours affect neural text classifiers,\nand provide guidance on how sensitivity to iterative masking should be\ninterpreted.\n","authors":["Evan Crothers","Herna Viktor","Nathalie Japkowicz"],"pdf_url":"https://arxiv.org/pdf/2308.06795v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17284v2","updated":"2024-05-31T21:30:44Z","published":"2024-05-27T15:47:46Z","title":"An NLP Crosswalk Between the Common Core State Standards and NAEP Item\n Specifications","summary":" Natural language processing (NLP) is rapidly developing for applications in\neducational assessment. In this paper, I describe an NLP-based procedure that\ncan be used to support subject matter experts in establishing a crosswalk\nbetween item specifications and content standards. This paper extends recent\nwork by proposing and demonstrating the use of multivariate similarity based on\nembedding vectors for sentences or texts. In particular, a hybrid regression\nprocedure is demonstrated for establishing the match of each content standard\nto multiple item specifications. The procedure is used to evaluate the match of\nthe Common Core State Standards (CCSS) for mathematics at grade 4 to the\ncorresponding item specifications for the 2026 National Assessment of\nEducational Progress (NAEP).\n","authors":["Gregory Camilli"],"pdf_url":"https://arxiv.org/pdf/2405.17284v2.pdf","comment":"Deleted repeated sections. Corrected proper nouns. Corrected type in\n CCSS sentences"},{"id":"http://arxiv.org/abs/2310.17086v2","updated":"2024-05-31T20:37:54Z","published":"2023-10-26T01:08:47Z","title":"Transformers Learn Higher-Order Optimization Methods for In-Context\n Learning: A Study with Linear Models","summary":" Transformers excel at in-context learning (ICL) -- learning from\ndemonstrations without parameter updates -- but how they do so remains a\nmystery. Recent work suggests that Transformers may internally run Gradient\nDescent (GD), a first-order optimization method, to perform ICL. In this paper,\nwe instead demonstrate that Transformers learn to approximate higher-order\noptimization methods for ICL. For in-context linear regression, Transformers\nshare a similar convergence rate as Iterative Newton's Method; both are\nexponentially faster than GD. Empirically, predictions from successive\nTransformer layers closely match different iterations of Newton's Method\nlinearly, with each middle layer roughly computing 3 iterations; thus,\nTransformers and Newton's method converge at roughly the same rate. In\ncontrast, Gradient Descent converges exponentially more slowly. We also show\nthat Transformers can learn in-context on ill-conditioned data, a setting where\nGradient Descent struggles but Iterative Newton succeeds. Finally, to\ncorroborate our empirical findings, we prove that Transformers can implement\n$k$ iterations of Newton's method with $k + \\mathcal{O}(1)$ layers.\n","authors":["Deqing Fu","Tian-Qi Chen","Robin Jia","Vatsal Sharan"],"pdf_url":"https://arxiv.org/pdf/2310.17086v2.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2405.21075v1","updated":"2024-05-31T17:59:47Z","published":"2024-05-31T17:59:47Z","title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of\n Multi-modal LLMs in Video Analysis","summary":" In the quest for artificial general intelligence, Multi-modal Large Language\nModels (MLLMs) have emerged as a focal point in recent advancements. However,\nthe predominant focus remains on developing their capabilities in static image\nunderstanding. The potential of MLLMs in processing sequential visual data is\nstill insufficiently explored, highlighting the absence of a comprehensive,\nhigh-quality assessment of their performance. In this paper, we introduce\nVideo-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of\nMLLMs in Video analysis. Our work distinguishes from existing benchmarks\nthrough four key features: 1) Diversity in video types, spanning 6 primary\nvisual domains with 30 subfields to ensure broad scenario generalizability; 2)\nDuration in temporal dimension, encompassing both short-, medium-, and\nlong-term videos, ranging from 11 seconds to 1 hour, for robust contextual\ndynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides\nvideo frames, including subtitles and audios, to unveil the all-round\ncapabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual\nlabeling by expert annotators to facilitate precise and reliable model\nassessment. 900 videos with a total of 256 hours are manually selected and\nannotated by repeatedly viewing all the video content, resulting in 2,700\nquestion-answer pairs. With Video-MME, we extensively evaluate various\nstate-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as\nopen-source image models like InternVL-Chat-V1.5 and video models like\nLLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the\nbest-performing commercial model, significantly outperforming the open-source\nmodels. Our dataset along with these findings underscores the need for further\nimprovements in handling longer sequences and multi-modal data. Project Page:\nhttps://video-mme.github.io\n","authors":["Chaoyou Fu","Yuhan Dai","Yondong Luo","Lei Li","Shuhuai Ren","Renrui Zhang","Zihan Wang","Chenyu Zhou","Yunhang Shen","Mengdan Zhang","Peixian Chen","Yanwei Li","Shaohui Lin","Sirui Zhao","Ke Li","Tong Xu","Xiawu Zheng","Enhong Chen","Rongrong Ji","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2405.21075v1.pdf","comment":"Project Page: https://video-mme.github.io"},{"id":"http://arxiv.org/abs/2405.21074v1","updated":"2024-05-31T17:59:12Z","published":"2024-05-31T17:59:12Z","title":"Latent Intrinsics Emerge from Training to Relight","summary":" Image relighting is the task of showing what a scene from a source image\nwould look like if illuminated differently. Inverse graphics schemes recover an\nexplicit representation of geometry and a set of chosen intrinsics, then\nrelight with some form of renderer. However error control for inverse graphics\nis difficult, and inverse graphics methods can represent only the effects of\nthe chosen intrinsics. This paper describes a relighting method that is\nentirely data-driven, where intrinsics and lighting are each represented as\nlatent variables. Our approach produces SOTA relightings of real scenes, as\nmeasured by standard metrics. We show that albedo can be recovered from our\nlatent intrinsics without using any example albedos, and that the albedos\nrecovered are competitive with SOTA methods.\n","authors":["Xiao Zhang","William Gao","Seemandhar Jain","Michael Maire","David. A. Forsyth","Anand Bhattad"],"pdf_url":"https://arxiv.org/pdf/2405.21074v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21070v1","updated":"2024-05-31T17:57:24Z","published":"2024-05-31T17:57:24Z","title":"Generalization Beyond Data Imbalance: A Controlled Study on CLIP for\n Transferable Insights","summary":" Severe data imbalance naturally exists among web-scale vision-language\ndatasets. Despite this, we find CLIP pre-trained thereupon exhibits notable\nrobustness to the data imbalance compared to supervised learning, and\ndemonstrates significant effectiveness in learning generalizable\nrepresentations. With an aim to investigate the reasons behind this finding, we\nconduct controlled experiments to study various underlying factors, and reveal\nthat CLIP's pretext task forms a dynamic classification problem wherein only a\nsubset of classes is present in training. This isolates the bias from dominant\nclasses and implicitly balances the learning signal. Furthermore, the\nrobustness and discriminability of CLIP improve with more descriptive language\nsupervision, larger data scale, and broader open-world concepts, which are\ninaccessible to supervised learning. Our study not only uncovers the mechanisms\nbehind CLIP's generalizability beyond data imbalance but also provides\ntransferable insights for the research community. The findings are validated in\nboth supervised and self-supervised learning, enabling models trained on\nimbalanced data to achieve CLIP-level performance on diverse recognition tasks.\nCode will be available at: https://github.com/CVMI-Lab/clip-beyond-tail.\n","authors":["Xin Wen","Bingchen Zhao","Yilun Chen","Jiangmiao Pang","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2405.21070v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21066v1","updated":"2024-05-31T17:54:52Z","published":"2024-05-31T17:54:52Z","title":"Mixed Diffusion for 3D Indoor Scene Synthesis","summary":" Realistic conditional 3D scene synthesis significantly enhances and\naccelerates the creation of virtual environments, which can also provide\nextensive training data for computer vision and robotics research among other\napplications. Diffusion models have shown great performance in related\napplications, e.g., making precise arrangements of unordered sets. However,\nthese models have not been fully explored in floor-conditioned scene synthesis\nproblems. We present MiDiffusion, a novel mixed discrete-continuous diffusion\nmodel architecture, designed to synthesize plausible 3D indoor scenes from\ngiven room types, floor plans, and potentially pre-existing objects. We\nrepresent a scene layout by a 2D floor plan and a set of objects, each defined\nby its category, location, size, and orientation. Our approach uniquely\nimplements structured corruption across the mixed discrete semantic and\ncontinuous geometric domains, resulting in a better conditioned problem for the\nreverse denoising step. We evaluate our approach on the 3D-FRONT dataset. Our\nexperimental results demonstrate that MiDiffusion substantially outperforms\nstate-of-the-art autoregressive and diffusion models in floor-conditioned 3D\nscene synthesis. In addition, our models can handle partial object constraints\nvia a corruption-and-masking strategy without task specific training. We show\nMiDiffusion maintains clear advantages over existing approaches in scene\ncompletion and furniture arrangement experiments.\n","authors":["Siyi Hu","Diego Martin Arroyo","Stephanie Debats","Fabian Manhardt","Luca Carlone","Federico Tombari"],"pdf_url":"https://arxiv.org/pdf/2405.21066v1.pdf","comment":"19 pages, 14 figures. Under review. Code to be released at:\n https://github.com/MIT-SPARK/MiDiffusion"},{"id":"http://arxiv.org/abs/2405.21059v1","updated":"2024-05-31T17:49:51Z","published":"2024-05-31T17:49:51Z","title":"Unified Directly Denoising for Both Variance Preserving and Variance\n Exploding Diffusion Models","summary":" Previous work has demonstrated that, in the Variance Preserving (VP)\nscenario, the nascent Directly Denoising Diffusion Models (DDDM) can generate\nhigh-quality images in one step while achieving even better performance in\nmultistep sampling. However, the Pseudo-LPIPS loss used in DDDM leads to\nconcerns about the bias in assessment. Here, we propose a unified DDDM (uDDDM)\nframework that generates images in one-step/multiple steps for both Variance\nPreserving (VP) and Variance Exploding (VE) cases. We provide theoretical\nproofs of the existence and uniqueness of the model's solution paths, as well\nas the non-intersecting property of the sampling paths. Additionally, we\npropose an adaptive Pseudo-Huber loss function to balance the convergence to\nthe true solution and the stability of convergence process.Through a\ncomprehensive evaluation, we demonstrate that uDDDMs achieve FID scores\ncomparable to the best-performing methods available for CIFAR-10 in both VP and\nVE. Specifically, uDDDM achieves one-step generation on CIFAR10 with FID of\n2.63 and 2.53 for VE and VP respectively. By extending the sampling to 1000\nsteps, we further reduce FID score to 1.71 and 1.65 for VE and VP respectively,\nsetting state-of-the-art performance in both cases.\n","authors":["Jingjing Wang","Dan Zhang","Feng Luo"],"pdf_url":"https://arxiv.org/pdf/2405.21059v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21056v1","updated":"2024-05-31T17:47:22Z","published":"2024-05-31T17:47:22Z","title":"An Organic Weed Control Prototype using Directed Energy and Deep\n Learning","summary":" Organic weed control is a vital to improve crop yield with a sustainable\napproach. In this work, a directed energy weed control robot prototype\nspecifically designed for organic farms is proposed. The robot uses a novel\ndistributed array robot (DAR) unit for weed treatment. Soybean and corn\ndatabases are built to train deep learning neural nets to perform weed\nrecognition. The initial deep learning neural nets show a high performance in\nclassifying crops. The robot uses a patented directed energy plant eradication\nrecipe that is completely organic and UV-C free, with no chemical damage or\nphysical disturbance to the soil. The deep learning can classify 8 common weed\nspecies in a soybean field under natural environment with up to 98% accuracy.\n","authors":["Deng Cao","Hongbo Zhang","Rajveer Dhillon"],"pdf_url":"https://arxiv.org/pdf/2405.21056v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21050v1","updated":"2024-05-31T17:43:35Z","published":"2024-05-31T17:43:35Z","title":"Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models","summary":" Adapting large-scale pre-trained generative models in a parameter-efficient\nmanner is gaining traction. Traditional methods like low rank adaptation\nachieve parameter efficiency by imposing constraints but may not be optimal for\ntasks requiring high representation capacity. We propose a novel spectrum-aware\nadaptation framework for generative models. Our method adjusts both singular\nvalues and their basis vectors of pretrained weights. Using the Kronecker\nproduct and efficient Stiefel optimizers, we achieve parameter-efficient\nadaptation of orthogonal matrices. We introduce Spectral Orthogonal\nDecomposition Adaptation (SODA), which balances computational efficiency and\nrepresentation capacity. Extensive evaluations on text-to-image diffusion\nmodels demonstrate SODA's effectiveness, offering a spectrum-aware alternative\nto existing fine-tuning methods.\n","authors":["Xinxi Zhang","Song Wen","Ligong Han","Felix Juefei-Xu","Akash Srivastava","Junzhou Huang","Hao Wang","Molei Tao","Dimitris N. Metaxas"],"pdf_url":"https://arxiv.org/pdf/2405.21050v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21048v1","updated":"2024-05-31T17:41:11Z","published":"2024-05-31T17:41:11Z","title":"Kaleido Diffusion: Improving Conditional Diffusion Models with\n Autoregressive Latent Modeling","summary":" Diffusion models have emerged as a powerful tool for generating high-quality\nimages from textual descriptions. Despite their successes, these models often\nexhibit limited diversity in the sampled images, particularly when sampling\nwith a high classifier-free guidance weight. To address this issue, we present\nKaleido, a novel approach that enhances the diversity of samples by\nincorporating autoregressive latent priors. Kaleido integrates an\nautoregressive language model that encodes the original caption and generates\nlatent variables, serving as abstract and intermediary representations for\nguiding and facilitating the image generation process. In this paper, we\nexplore a variety of discrete latent representations, including textual\ndescriptions, detection bounding boxes, object blobs, and visual tokens. These\nrepresentations diversify and enrich the input conditions to the diffusion\nmodels, enabling more diverse outputs. Our experimental results demonstrate\nthat Kaleido effectively broadens the diversity of the generated image samples\nfrom a given textual description while maintaining high image quality.\nFurthermore, we show that Kaleido adheres closely to the guidance provided by\nthe generated latent variables, demonstrating its capability to effectively\ncontrol and direct the image generation process.\n","authors":["Jiatao Gu","Ying Shen","Shuangfei Zhai","Yizhe Zhang","Navdeep Jaitly","Joshua M. Susskind"],"pdf_url":"https://arxiv.org/pdf/2405.21048v1.pdf","comment":"22 pages, 14 figures"},{"id":"http://arxiv.org/abs/2402.11058v2","updated":"2024-05-31T17:30:13Z","published":"2024-02-16T20:14:47Z","title":"II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in\n Visual Question Answering","summary":" Visual Question Answering (VQA) often involves diverse reasoning scenarios\nacross Vision and Language (V&L). Most prior VQA studies, however, have merely\nfocused on assessing the model's overall accuracy without evaluating it on\ndifferent reasoning cases. Furthermore, some recent works observe that\nconventional Chain-of-Thought (CoT) prompting fails to generate effective\nreasoning for VQA, especially for complex scenarios requiring multi-hop\nreasoning. In this paper, we propose II-MMR, a novel idea to identify and\nimprove multi-modal multi-hop reasoning in VQA. In specific, II-MMR takes a VQA\nquestion with an image and finds a reasoning path to reach its answer using two\nnovel language promptings: (i) answer prediction-guided CoT prompt, or (ii)\nknowledge triplet-guided prompt. II-MMR then analyzes this path to identify\ndifferent reasoning cases in current VQA benchmarks by estimating how many hops\nand what types (i.e., visual or beyond-visual) of reasoning are required to\nanswer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR\nobserves that most of their VQA questions are easy to answer, simply demanding\n\"single-hop\" reasoning, whereas only a few questions require \"multi-hop\"\nreasoning. Moreover, while the recent V&L model struggles with such complex\nmulti-hop reasoning questions even using the traditional CoT method, II-MMR\nshows its effectiveness across all reasoning cases in both zero-shot and\nfine-tuning settings.\n","authors":["Jihyung Kil","Farideh Tavazoee","Dongyeop Kang","Joo-Kyung Kim"],"pdf_url":"https://arxiv.org/pdf/2402.11058v2.pdf","comment":"Accepted to ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2405.21022v1","updated":"2024-05-31T17:09:16Z","published":"2024-05-31T17:09:16Z","title":"You Only Scan Once: Efficient Multi-dimension Sequential Modeling with\n LightNet","summary":" Linear attention mechanisms have gained prominence in causal language models\ndue to their linear computational complexity and enhanced speed. However, the\ninherent decay mechanism in linear attention presents challenges when applied\nto multi-dimensional sequence modeling tasks, such as image processing and\nmulti-modal learning. In these scenarios, the utilization of sequential\nscanning to establish a global receptive field necessitates multiple scans for\nmulti-dimensional data, thereby leading to inefficiencies. This paper\nidentifies the inefficiency caused by a multiplicative linear recurrence and\nproposes an efficient alternative additive linear recurrence to avoid the\nissue, as it can handle multi-dimensional data within a single scan. We further\ndevelop an efficient multi-dimensional sequential modeling framework called\nLightNet based on the new recurrence. Moreover, we present two new\nmulti-dimensional linear relative positional encoding methods, MD-TPE and\nMD-LRPE to enhance the model's ability to discern positional information in\nmulti-dimensional scenarios. Our empirical evaluations across various tasks,\nincluding image classification, image generation, bidirectional language\nmodeling, and autoregressive language modeling, demonstrate the efficacy of\nLightNet, showcasing its potential as a versatile and efficient solution for\nmulti-dimensional sequential modeling.\n","authors":["Zhen Qin","Yuxin Mao","Xuyang Shen","Dong Li","Jing Zhang","Yuchao Dai","Yiran Zhong"],"pdf_url":"https://arxiv.org/pdf/2405.21022v1.pdf","comment":"Technical report. Yiran Zhong is the corresponding author. The code\n is available at https://github.com/OpenNLPLab/LightNet"},{"id":"http://arxiv.org/abs/2405.21016v1","updated":"2024-05-31T17:05:59Z","published":"2024-05-31T17:05:59Z","title":"MpoxSLDNet: A Novel CNN Model for Detecting Monkeypox Lesions and\n Performance Comparison with Pre-trained Models","summary":" Monkeypox virus (MPXV) is a zoonotic virus that poses a significant threat to\npublic health, particularly in remote parts of Central and West Africa. Early\ndetection of monkeypox lesions is crucial for effective treatment. However, due\nto its similarity with other skin diseases, monkeypox lesion detection is a\nchallenging task. To detect monkeypox, many researchers used various\ndeep-learning models such as MobileNetv2, VGG16, ResNet50, InceptionV3,\nDenseNet121, EfficientNetB3, MobileNetV2, and Xception. However, these models\noften require high storage space due to their large size. This study aims to\nimprove the existing challenges by introducing a CNN model named MpoxSLDNet\n(Monkeypox Skin Lesion Detector Network) to facilitate early detection and\ncategorization of Monkeypox lesions and Non-Monkeypox lesions in digital\nimages. Our model represents a significant advancement in the field of\nmonkeypox lesion detection by offering superior performance metrics, including\nprecision, recall, F1-score, accuracy, and AUC, compared to traditional\npre-trained models such as VGG16, ResNet50, and DenseNet121. The key novelty of\nour approach lies in MpoxSLDNet's ability to achieve high detection accuracy\nwhile requiring significantly less storage space than existing models. By\naddressing the challenge of high storage requirements, MpoxSLDNet presents a\npractical solution for early detection and categorization of monkeypox lesions\nin resource-constrained healthcare settings. In this study, we have used\n\"Monkeypox Skin Lesion Dataset\" comprising 1428 skin images of monkeypox\nlesions and 1764 skin images of Non-Monkeypox lesions. Dataset's limitations\ncould potentially impact the model's ability to generalize to unseen cases.\nHowever, the MpoxSLDNet model achieved a validation accuracy of 94.56%,\ncompared to 86.25%, 84.38%, and 67.19% for VGG16, DenseNet121, and ResNet50,\nrespectively.\n","authors":["Fatema Jannat Dihan","Saydul Akbar Murad","Abu Jafar Md Muzahid","K. M. Aslam Uddin","Mohammed J. F. Alenazi","Anupam Kumar Bairagi","Sujit Biswas"],"pdf_url":"https://arxiv.org/pdf/2405.21016v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18418v2","updated":"2024-05-31T17:03:00Z","published":"2024-05-28T17:57:23Z","title":"Hierarchical World Models as Visual Whole-Body Humanoid Controllers","summary":" Whole-body control for humanoids is challenging due to the high-dimensional\nnature of the problem, coupled with the inherent instability of a bipedal\nmorphology. Learning from visual observations further exacerbates this\ndifficulty. In this work, we explore highly data-driven approaches to visual\nwhole-body humanoid control based on reinforcement learning, without any\nsimplifying assumptions, reward design, or skill primitives. Specifically, we\npropose a hierarchical world model in which a high-level agent generates\ncommands based on visual observations for a low-level agent to execute, both of\nwhich are trained with rewards. Our approach produces highly performant control\npolicies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing\nmotions that are broadly preferred by humans. Code and videos:\nhttps://nicklashansen.com/rlpuppeteer\n","authors":["Nicklas Hansen","Jyothir S V","Vlad Sobal","Yann LeCun","Xiaolong Wang","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2405.18418v2.pdf","comment":"Code and videos at https://nicklashansen.com/rlpuppeteer"},{"id":"http://arxiv.org/abs/2405.00515v3","updated":"2024-05-31T16:55:20Z","published":"2024-05-01T13:51:39Z","title":"GAD-Generative Learning for HD Map-Free Autonomous Driving","summary":" Deep-learning-based techniques have been widely adopted for autonomous\ndriving software stacks for mass production in recent years, focusing primarily\non perception modules, with some work extending this method to prediction\nmodules. However, the downstream planning and control modules are still\ndesigned with hefty handcrafted rules, dominated by optimization-based methods\nsuch as quadratic programming or model predictive control. This results in a\nperformance bottleneck for autonomous driving systems in that corner cases\nsimply cannot be solved by enumerating hand-crafted rules. We present a\ndeep-learning-based approach that brings prediction, decision, and planning\nmodules together with the attempt to overcome the rule-based methods'\ndeficiency in real-world applications of autonomous driving, especially for\nurban scenes. The DNN model we proposed is solely trained with 10 hours of\nhuman driver data, and it supports all mass-production ADAS features available\non the market to date. This method is deployed onto a Jiyue test car with no\nmodification to its factory-ready sensor set and compute platform. the\nfeasibility, usability, and commercial potential are demonstrated in this\narticle.\n","authors":["Weijian Sun","Yanbo Jia","Qi Zeng","Zihao Liu","Jiang Liao","Yue Li","Xianfeng Li"],"pdf_url":"https://arxiv.org/pdf/2405.00515v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21013v1","updated":"2024-05-31T16:55:04Z","published":"2024-05-31T16:55:04Z","title":"StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image\n Perception, Comprehension, and Beyond","summary":" Text-rich images have significant and extensive value, deeply integrated into\nvarious aspects of human life. Notably, both visual cues and linguistic symbols\nin text-rich images play crucial roles in information transmission but are\naccompanied by diverse challenges. Therefore, the efficient and effective\nunderstanding of text-rich images is a crucial litmus test for the capability\nof Vision-Language Models. We have crafted an efficient vision-language model,\nStrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.\nThe significant design of StrucTexTv3 is presented in the following aspects:\nFirstly, we adopt a combination of an effective multi-scale reduced visual\ntransformer and a multi-granularity token sampler (MG-Sampler) as a visual\ntoken generator, successfully solving the challenges of high-resolution input\nand complex representation learning for text-rich images. Secondly, we enhance\nthe perception and comprehension abilities of StrucTexTv3 through instruction\nlearning, seamlessly integrating various text-oriented tasks into a unified\nframework. Thirdly, we have curated a comprehensive collection of high-quality\ntext-rich images, abbreviated as TIM-30M, encompassing diverse scenarios like\nincidental scenes, office documents, web pages, and screenshots, thereby\nimproving the robustness of our model. Our method achieved SOTA results in\ntext-rich image perception tasks, and significantly improved performance in\ncomprehension tasks. Among multimodal models with LLM decoder of approximately\n1.8B parameters, it stands out as a leader, which also makes the deployment of\nedge devices feasible. In summary, the StrucTexTv3 model, featuring efficient\nstructural design, outstanding performance, and broad adaptability, offers\nrobust support for diverse intelligent application tasks involving text-rich\nimages, thus exhibiting immense potential for widespread application.\n","authors":["Pengyuan Lyu","Yulin Li","Hao Zhou","Weihong Ma","Xingyu Wan","Qunyi Xie","Liang Wu","Chengquan Zhang","Kun Yao","Errui Ding","Jingdong Wang"],"pdf_url":"https://arxiv.org/pdf/2405.21013v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14622v3","updated":"2024-05-31T16:37:53Z","published":"2024-05-23T14:30:33Z","title":"Calibrated Self-Rewarding Vision Language Models","summary":" Large Vision-Language Models (LVLMs) have made substantial progress by\nintegrating pre-trained large language models (LLMs) and vision models through\ninstruction tuning. Despite these advancements, LVLMs often exhibit the\nhallucination phenomenon, where generated text responses appear linguistically\nplausible but contradict the input image, indicating a misalignment between\nimage and text pairs. This misalignment arises because the model tends to\nprioritize textual information over visual input, even when both the language\nmodel and visual representations are of high quality. Existing methods leverage\nadditional models or human annotations to curate preference data and enhance\nmodality alignment through preference optimization. These approaches may not\neffectively reflect the target LVLM's preferences, making the curated\npreferences easily distinguishable. Our work addresses these challenges by\nproposing the Calibrated Self-Rewarding (CSR) approach, which enables the model\nto self-improve by iteratively generating candidate responses, evaluating the\nreward for each response, and curating preference data for fine-tuning. In the\nreward modeling, we employ a step-wise strategy and incorporate visual\nconstraints into the self-rewarding process to place greater emphasis on visual\ninput. Empirical results demonstrate that CSR enhances performance and reduces\nhallucinations across ten benchmarks and tasks, achieving substantial\nimprovements over existing methods by 7.62%. Our empirical results are further\nsupported by rigorous theoretical analysis, under mild assumptions, verifying\nthe effectiveness of introducing visual constraints into the self-rewarding\nparadigm. Additionally, CSR shows compatibility with different vision-language\nmodels and the ability to incrementally improve performance through iterative\nfine-tuning. Our data and code are available at\nhttps://github.com/YiyangZhou/CSR.\n","authors":["Yiyang Zhou","Zhiyuan Fan","Dongjie Cheng","Sihan Yang","Zhaorun Chen","Chenhang Cui","Xiyao Wang","Yun Li","Linjun Zhang","Huaxiu Yao"],"pdf_url":"https://arxiv.org/pdf/2405.14622v3.pdf","comment":"fix some typos and add acknowledgement section in V3"},{"id":"http://arxiv.org/abs/2405.20991v1","updated":"2024-05-31T16:35:41Z","published":"2024-05-31T16:35:41Z","title":"Hard Cases Detection in Motion Prediction by Vision-Language Foundation\n Models","summary":" Addressing hard cases in autonomous driving, such as anomalous road users,\nextreme weather conditions, and complex traffic interactions, presents\nsignificant challenges. To ensure safety, it is crucial to detect and manage\nthese scenarios effectively for autonomous driving systems. However, the rarity\nand high-risk nature of these cases demand extensive, diverse datasets for\ntraining robust models. Vision-Language Foundation Models (VLMs) have shown\nremarkable zero-shot capabilities as being trained on extensive datasets. This\nwork explores the potential of VLMs in detecting hard cases in autonomous\ndriving. We demonstrate the capability of VLMs such as GPT-4v in detecting hard\ncases in traffic participant motion prediction on both agent and scenario\nlevels. We introduce a feasible pipeline where VLMs, fed with sequential image\nframes with designed prompts, effectively identify challenging agents or\nscenarios, which are verified by existing prediction models. Moreover, by\ntaking advantage of this detection of hard cases by VLMs, we further improve\nthe training efficiency of the existing motion prediction pipeline by\nperforming data selection for the training samples suggested by GPT. We show\nthe effectiveness and feasibility of our pipeline incorporating VLMs with\nstate-of-the-art methods on NuScenes datasets. The code is accessible at\nhttps://github.com/KTH-RPL/Detect_VLM.\n","authors":["Yi Yang","Qingwen Zhang","Kei Ikemura","Nazre Batool","John Folkesson"],"pdf_url":"https://arxiv.org/pdf/2405.20991v1.pdf","comment":"IEEE Intelligent Vehicles Symposium (IV) 2024"},{"id":"http://arxiv.org/abs/2405.20987v1","updated":"2024-05-31T16:33:20Z","published":"2024-05-31T16:33:20Z","title":"Early Stopping Criteria for Training Generative Adversarial Networks in\n Biomedical Imaging","summary":" Generative Adversarial Networks (GANs) have high computational costs to train\ntheir complex architectures. Throughout the training process, GANs' output is\nanalyzed qualitatively based on the loss and synthetic images' diversity and\nquality. Based on this qualitative analysis, training is manually halted once\nthe desired synthetic images are generated. By utilizing an early stopping\ncriterion, the computational cost and dependence on manual oversight can be\nreduced yet impacted by training problems such as mode collapse,\nnon-convergence, and instability. This is particularly prevalent in biomedical\nimagery, where training problems degrade the diversity and quality of synthetic\nimages, and the high computational cost associated with training makes complex\narchitectures increasingly inaccessible. This work proposes a novel early\nstopping criteria to quantitatively detect training problems, halt training,\nand reduce the computational costs associated with synthesizing biomedical\nimages. Firstly, the range of generator and discriminator loss values is\ninvestigated to assess whether mode collapse, non-convergence, and instability\noccur sequentially, concurrently, or interchangeably throughout the training of\nGANs. Secondly, utilizing these occurrences in conjunction with the Mean\nStructural Similarity Index (MS-SSIM) and Fr\\'echet Inception Distance (FID)\nscores of synthetic images forms the basis of the proposed early stopping\ncriteria. This work helps identify the occurrence of training problems in GANs\nusing low-resource computational cost and reduces training time to generate\ndiversified and high-quality synthetic images.\n","authors":["Muhammad Muneeb Saad","Mubashir Husain Rehmani","Ruairi O'Reilly"],"pdf_url":"https://arxiv.org/pdf/2405.20987v1.pdf","comment":"This paper is accepted at the 35th IEEE Irish Signals and Systems\n Conference (ISSC 2024)"},{"id":"http://arxiv.org/abs/2405.20986v1","updated":"2024-05-31T16:32:46Z","published":"2024-05-31T16:32:46Z","title":"Uncertainty Quantification for Bird's Eye View Semantic Segmentation:\n Methods and Benchmarks","summary":" The fusion of raw features from multiple sensors on an autonomous vehicle to\ncreate a Bird's Eye View (BEV) representation is crucial for planning and\ncontrol systems. There is growing interest in using deep learning models for\nBEV semantic segmentation. Anticipating segmentation errors and improving the\nexplainability of DNNs is essential for autonomous driving, yet it is\nunder-studied. This paper introduces a benchmark for predictive uncertainty\nquantification in BEV segmentation. The benchmark assesses various approaches\nacross three popular datasets using two representative backbones and focuses on\nthe effectiveness of predicted uncertainty in identifying misclassified and\nout-of-distribution (OOD) pixels, as well as calibration. Empirical findings\nhighlight the challenges in uncertainty quantification. Our results find that\nevidential deep learning based approaches show the most promise by efficiently\nquantifying aleatoric and epistemic uncertainty. We propose the\nUncertainty-Focal-Cross-Entropy (UFCE) loss, designed for highly imbalanced\ndata, which consistently improves the segmentation quality and calibration.\nAdditionally, we introduce a vacuity-scaled regularization term that enhances\nthe model's focus on high uncertainty pixels, improving epistemic uncertainty\nquantification.\n","authors":["Linlin Yu","Bowen Yang","Tianhao Wang","Kangshuo Li","Feng Chen"],"pdf_url":"https://arxiv.org/pdf/2405.20986v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20985v1","updated":"2024-05-31T16:31:38Z","published":"2024-05-31T16:31:38Z","title":"DeCo: Decoupling Token Compression from Semantic Abstraction in\n Multimodal Large Language Models","summary":" The visual projector, which bridges the vision and language modalities and\nfacilitates cross-modal alignment, serves as a crucial component in MLLMs.\nHowever, measuring the effectiveness of projectors in vision-language alignment\nremains under-explored, which currently can only be inferred from the\nperformance of MLLMs on downstream tasks. Motivated by the problem, this study\nexamines the projector module by interpreting the vision-language semantic flow\nwithin MLLMs. Specifically, we trace back the semantic relevance flow from\ngenerated language tokens to raw visual encoder patches and the intermediate\noutputs produced by projectors. Our findings reveal that compressive projectors\n(e.g., QFormer), abstract visual patches into a limited set of semantic\nconcepts, such as objects or attributes, resulting in a 'double abstraction'\nphenomenon. This involves a first visual semantic abstraction by the projector\nreferring to pre-defined query tokens, and a second extraction by the LLM based\non text instructions. The double abstraction is inefficient in training and\nwill result in cumulative vision semantics deficiency. To mitigate this issue,\nwe propose the key insight of 'Decouple Compression from Abstraction (DeCo),\nthat is compressing the visual token number at the patch level by projectors\nand allowing the LLM to handle visual semantic abstraction entirely.\nConsequently, we adopt a simple compressor, i.e., 2D Adaptive Pooling, to\ndownsample visual patches in a parameter-free manner. Empirical evaluation\ndemonstrates that DeCo surpasses traditional compressive projectors regarding\nboth performance and efficiency. It achieves performance gains of 0.9%, 7.1%,\nand 2.9% across the MLLM Benchmarks, Visual Localization, and Open-ended VQA\ntasks with fewer trainable parameters and faster convergence speed.\n","authors":["Linli Yao","Lei Li","Shuhuai Ren","Lean Wang","Yuanxin Liu","Xu Sun","Lu Hou"],"pdf_url":"https://arxiv.org/pdf/2405.20985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20981v1","updated":"2024-05-31T16:26:30Z","published":"2024-05-31T16:26:30Z","title":"Generative Adversarial Networks in Ultrasound Imaging: Extending Field\n of View Beyond Conventional Limits","summary":" Transthoracic Echocardiography (TTE) is a fundamental, non-invasive\ndiagnostic tool in cardiovascular medicine, enabling detailed visualization of\ncardiac structures crucial for diagnosing various heart conditions. Despite its\nwidespread use, TTE ultrasound imaging faces inherent limitations, notably the\ntrade-off between field of view (FoV) and resolution. This paper introduces a\nnovel application of conditional Generative Adversarial Networks (cGANs),\nspecifically designed to extend the FoV in TTE ultrasound imaging while\nmaintaining high resolution. Our proposed cGAN architecture, termed echoGAN,\ndemonstrates the capability to generate realistic anatomical structures through\noutpainting, effectively broadening the viewable area in medical imaging. This\nadvancement has the potential to enhance both automatic and manual ultrasound\nnavigation, offering a more comprehensive view that could significantly reduce\nthe learning curve associated with ultrasound imaging and aid in more accurate\ndiagnoses. The results confirm that echoGAN reliably reproduce detailed cardiac\nfeatures, thereby promising a significant step forward in the field of\nnon-invasive cardiac naviagation and diagnostics.\n","authors":["Matej Gazda","Samuel Kadoury","Jakub Gazda","Peter Drotar"],"pdf_url":"https://arxiv.org/pdf/2405.20981v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20980v1","updated":"2024-05-31T16:26:08Z","published":"2024-05-31T16:26:08Z","title":"Neural Gaussian Scale-Space Fields","summary":" Gaussian scale spaces are a cornerstone of signal representation and\nprocessing, with applications in filtering, multiscale analysis, anti-aliasing,\nand many more. However, obtaining such a scale space is costly and cumbersome,\nin particular for continuous representations such as neural fields. We present\nan efficient and lightweight method to learn the fully continuous, anisotropic\nGaussian scale space of an arbitrary signal. Based on Fourier feature\nmodulation and Lipschitz bounding, our approach is trained self-supervised,\ni.e., training does not require any manual filtering. Our neural Gaussian\nscale-space fields faithfully capture multiscale representations across a broad\nrange of modalities, and support a diverse set of applications. These include\nimages, geometry, light-stage data, texture anti-aliasing, and multiscale\noptimization.\n","authors":["Felix Mujkanovic","Ntumba Elie Nsampi","Christian Theobalt","Hans-Peter Seidel","Thomas Leimkühler"],"pdf_url":"https://arxiv.org/pdf/2405.20980v1.pdf","comment":"15 pages; SIGGRAPH 2024; project page at\n https://neural-gaussian-scale-space-fields.mpi-inf.mpg.de"},{"id":"http://arxiv.org/abs/2405.20971v1","updated":"2024-05-31T16:18:46Z","published":"2024-05-31T16:18:46Z","title":"Amortizing intractable inference in diffusion models for vision,\n language, and control","summary":" Diffusion models have emerged as effective distribution estimators in vision,\nlanguage, and reinforcement learning, but their use as priors in downstream\ntasks poses an intractable posterior inference problem. This paper studies\namortized sampling of the posterior over data, $\\mathbf{x}\\sim p^{\\rm\npost}(\\mathbf{x})\\propto p(\\mathbf{x})r(\\mathbf{x})$, in a model that consists\nof a diffusion generative model prior $p(\\mathbf{x})$ and a black-box\nconstraint or likelihood function $r(\\mathbf{x})$. We state and prove the\nasymptotic correctness of a data-free learning objective, relative trajectory\nbalance, for training a diffusion model that samples from this posterior, a\nproblem that existing methods solve only approximately or in restricted cases.\nRelative trajectory balance arises from the generative flow network perspective\non diffusion models, which allows the use of deep reinforcement learning\ntechniques to improve mode coverage. Experiments illustrate the broad potential\nof unbiased inference of arbitrary posteriors under diffusion priors: in vision\n(classifier guidance), language (infilling under a discrete diffusion LLM), and\nmultimodal data (text-to-image generation). Beyond generative modeling, we\napply relative trajectory balance to the problem of continuous control with a\nscore-based behavior prior, achieving state-of-the-art results on benchmarks in\noffline reinforcement learning.\n","authors":["Siddarth Venkatraman","Moksh Jain","Luca Scimeca","Minsu Kim","Marcin Sendera","Mohsin Hasan","Luke Rowe","Sarthak Mittal","Pablo Lemos","Emmanuel Bengio","Alexandre Adam","Jarrid Rector-Brooks","Yoshua Bengio","Glen Berseth","Nikolay Malkin"],"pdf_url":"https://arxiv.org/pdf/2405.20971v1.pdf","comment":"Code: https://github.com/GFNOrg/diffusion-finetuning"},{"id":"http://arxiv.org/abs/2311.10879v3","updated":"2024-05-31T16:15:01Z","published":"2023-11-17T21:48:41Z","title":"Pre- to Post-Contrast Breast MRI Synthesis for Enhanced Tumour\n Segmentation","summary":" Despite its benefits for tumour detection and treatment, the administration\nof contrast agents in dynamic contrast-enhanced MRI (DCE-MRI) is associated\nwith a range of issues, including their invasiveness, bioaccumulation, and a\nrisk of nephrogenic systemic fibrosis. This study explores the feasibility of\nproducing synthetic contrast enhancements by translating pre-contrast\nT1-weighted fat-saturated breast MRI to their corresponding first DCE-MRI\nsequence leveraging the capabilities of a generative adversarial network (GAN).\nAdditionally, we introduce a Scaled Aggregate Measure (SAMe) designed for\nquantitatively evaluating the quality of synthetic data in a principled manner\nand serving as a basis for selecting the optimal generative model. We assess\nthe generated DCE-MRI data using quantitative image quality metrics and apply\nthem to the downstream task of 3D breast tumour segmentation. Our results\nhighlight the potential of post-contrast DCE-MRI synthesis in enhancing the\nrobustness of breast tumour segmentation models via data augmentation. Our code\nis available at https://github.com/RichardObi/pre_post_synthesis.\n","authors":["Richard Osuala","Smriti Joshi","Apostolia Tsirikoglou","Lidia Garrucho","Walter H. L. Pinaya","Oliver Diaz","Karim Lekadir"],"pdf_url":"https://arxiv.org/pdf/2311.10879v3.pdf","comment":"Accepted as oral presentation at SPIE Medical Imaging 2024 (Image\n Processing)"},{"id":"http://arxiv.org/abs/2207.11860v5","updated":"2024-05-31T16:04:07Z","published":"2022-07-25T00:42:38Z","title":"Behind Every Domain There is a Shift: Adapting Distortion-aware Vision\n Transformers for Panoramic Semantic Segmentation","summary":" In this paper, we address panoramic semantic segmentation which is\nunder-explored due to two critical challenges: (1) image distortions and object\ndeformations on panoramas; (2) lack of semantic annotations in the 360{\\deg}\nimagery. To tackle these problems, first, we propose the upgraded Transformer\nfor Panoramic Semantic Segmentation, i.e., Trans4PASS+, equipped with\nDeformable Patch Embedding (DPE) and Deformable MLP (DMLPv2) modules for\nhandling object deformations and image distortions whenever (before or after\nadaptation) and wherever (shallow or deep levels). Second, we enhance the\nMutual Prototypical Adaptation (MPA) strategy via pseudo-label rectification\nfor unsupervised domain adaptive panoramic segmentation. Third, aside from\nPinhole-to-Panoramic (Pin2Pan) adaptation, we create a new dataset (SynPASS)\nwith 9,080 panoramic images, facilitating Synthetic-to-Real (Syn2Real)\nadaptation scheme in 360{\\deg} imagery. Extensive experiments are conducted,\nwhich cover indoor and outdoor scenarios, and each of them is investigated with\nPin2Pan and Syn2Real regimens. Trans4PASS+ achieves state-of-the-art\nperformances on four domain adaptive panoramic semantic segmentation\nbenchmarks. Code is available at https://github.com/jamycheung/Trans4PASS.\n","authors":["Jiaming Zhang","Kailun Yang","Hao Shi","Simon Reiß","Kunyu Peng","Chaoxiang Ma","Haodong Fu","Philip H. S. Torr","Kaiwei Wang","Rainer Stiefelhagen"],"pdf_url":"https://arxiv.org/pdf/2207.11860v5.pdf","comment":"Accepted to IEEE Transactions on Pattern Analysis and Machine\n Intelligence (TPAMI). Extended version of CVPR 2022 paper arXiv:2203.01452.\n Code is available at https://github.com/jamycheung/Trans4PASS"},{"id":"http://arxiv.org/abs/2307.16565v2","updated":"2024-05-31T16:00:18Z","published":"2023-07-31T10:55:15Z","title":"Towards Imbalanced Motion: Part-Decoupling Network for Video Portrait\n Segmentation","summary":" Video portrait segmentation (VPS), aiming at segmenting prominent foreground\nportraits from video frames, has received much attention in recent years.\nHowever, simplicity of existing VPS datasets leads to a limitation on extensive\nresearch of the task. In this work, we propose a new intricate large-scale\nMulti-scene Video Portrait Segmentation dataset MVPS consisting of 101 video\nclips in 7 scenario categories, in which 10,843 sampled frames are finely\nannotated at pixel level. The dataset has diverse scenes and complicated\nbackground environments, which is the most complex dataset in VPS to our best\nknowledge. Through the observation of a large number of videos with portraits\nduring dataset construction, we find that due to the joint structure of human\nbody, motion of portraits is part-associated, which leads that different parts\nare relatively independent in motion. That is, motion of different parts of the\nportraits is imbalanced. Towards this imbalance, an intuitive and reasonable\nidea is that different motion states in portraits can be better exploited by\ndecoupling the portraits into parts. To achieve this, we propose a\nPart-Decoupling Network (PDNet) for video portrait segmentation. Specifically,\nan Inter-frame Part-Discriminated Attention (IPDA) module is proposed which\nunsupervisedly segments portrait into parts and utilizes different\nattentiveness on discriminative features specified to each different part. In\nthis way, appropriate attention can be imposed to portrait parts with\nimbalanced motion to extract part-discriminated correlations, so that the\nportraits can be segmented more accurately. Experimental results demonstrate\nthat our method achieves leading performance with the comparison to\nstate-of-the-art methods.\n","authors":["Tianshu Yu","Changqun Xia","Jia Li"],"pdf_url":"https://arxiv.org/pdf/2307.16565v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.11265v2","updated":"2024-05-31T15:59:32Z","published":"2024-04-17T11:15:58Z","title":"The Victim and The Beneficiary: Exploiting a Poisoned Model to Train a\n Clean Model on Poisoned Data","summary":" Recently, backdoor attacks have posed a serious security threat to the\ntraining process of deep neural networks (DNNs). The attacked model behaves\nnormally on benign samples but outputs a specific result when the trigger is\npresent. However, compared with the rocketing progress of backdoor attacks,\nexisting defenses are difficult to deal with these threats effectively or\nrequire benign samples to work, which may be unavailable in real scenarios. In\nthis paper, we find that the poisoned samples and benign samples can be\ndistinguished with prediction entropy. This inspires us to propose a novel\ndual-network training framework: The Victim and The Beneficiary (V&B), which\nexploits a poisoned model to train a clean model without extra benign samples.\nFirstly, we sacrifice the Victim network to be a powerful poisoned sample\ndetector by training on suspicious samples. Secondly, we train the Beneficiary\nnetwork on the credible samples selected by the Victim to inhibit backdoor\ninjection. Thirdly, a semi-supervised suppression strategy is adopted for\nerasing potential backdoors and improving model performance. Furthermore, to\nbetter inhibit missed poisoned samples, we propose a strong data augmentation\nmethod, AttentionMix, which works well with our proposed V&B framework.\nExtensive experiments on two widely used datasets against 6 state-of-the-art\nattacks demonstrate that our framework is effective in preventing backdoor\ninjection and robust to various attacks while maintaining the performance on\nbenign samples. Our code is available at https://github.com/Zixuan-Zhu/VaB.\n","authors":["Zixuan Zhu","Rui Wang","Cong Zou","Lihua Jing"],"pdf_url":"https://arxiv.org/pdf/2404.11265v2.pdf","comment":"13 pages, 6 figures, published to ICCV"},{"id":"http://arxiv.org/abs/2405.19751v2","updated":"2024-05-31T15:48:05Z","published":"2024-05-30T06:56:11Z","title":"HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization","summary":" Diffusion Transformers (DiTs) have recently gained substantial attention in\nboth industrial and academic fields for their superior visual generation\ncapabilities, outperforming traditional diffusion models that use U-Net.\nHowever,the enhanced performance of DiTs also comes with high parameter counts\nand implementation costs, seriously restricting their use on resource-limited\ndevices such as mobile phones. To address these challenges, we introduce the\nHybrid Floating-point Quantization for DiT(HQ-DiT), an efficient post-training\nquantization method that utilizes 4-bit floating-point (FP) precision on both\nweights and activations for DiT inference. Compared to fixed-point quantization\n(e.g., INT8), FP quantization, complemented by our proposed clipping range\nselection mechanism, naturally aligns with the data distribution within DiT,\nresulting in a minimal quantization error. Furthermore, HQ-DiT also implements\na universal identity mathematical transform to mitigate the serious\nquantization error caused by the outliers. The experimental results demonstrate\nthat DiT can achieve extremely low-precision quantization (i.e., 4 bits) with\nnegligible impact on performance. Our approach marks the first instance where\nboth weights and activations in DiTs are quantized to just 4 bits, with only a\n0.12 increase in sFID on ImageNet.\n","authors":["Wenxuan Liu","Sai Qian Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.19751v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20310v2","updated":"2024-05-31T15:27:52Z","published":"2024-05-30T17:52:52Z","title":"A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D\n Reconstruction","summary":" Learning 3D scene representation from a single-view image is a long-standing\nfundamental problem in computer vision, with the inherent ambiguity in\npredicting contents unseen from the input view. Built on the recently proposed\n3D Gaussian Splatting (3DGS), the Splatter Image method has made promising\nprogress on fast single-image novel view synthesis via learning a single 3D\nGaussian for each pixel based on the U-Net feature map of an input image.\nHowever, it has limited expressive power to represent occluded components that\nare not observable in the input view. To address this problem, this paper\npresents a Hierarchical Splatter Image method in which a pixel is worth more\nthan one 3D Gaussians. Specifically, each pixel is represented by a parent 3D\nGaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are\nlearned as done in the vanilla Splatter Image. Child 3D Gaussians are learned\nvia a lightweight Multi-Layer Perceptron (MLP) which takes as input the\nprojected image features of a parent 3D Gaussian and the embedding of a target\ncamera view. Both parent and child 3D Gaussians are learned end-to-end in a\nstage-wise way. The joint condition of input image features from eyes of the\nparent Gaussians and the target camera position facilitates learning to\nallocate child Gaussians to ``see the unseen'', recovering the occluded details\nthat are often missed by parent Gaussians.\n In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D\ndatasets with state-of-the-art performance obtained, especially showing\npromising capabilities of reconstructing occluded contents in the input view.\n","authors":["Jianghao Shen","Xue Nan","Tianfu Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20310v2.pdf","comment":"preprint, under review"},{"id":"http://arxiv.org/abs/2402.05861v2","updated":"2024-05-31T15:22:58Z","published":"2024-02-08T17:50:22Z","title":"Memory Consolidation Enables Long-Context Video Understanding","summary":" Most transformer-based video encoders are limited to short temporal contexts\ndue to their quadratic complexity. While various attempts have been made to\nextend this context, this has often come at the cost of both conceptual and\ncomputational complexity. We propose to instead re-purpose existing pre-trained\nvideo transformers by simply fine-tuning them to attend to memories derived\nnon-parametrically from past activations. By leveraging redundancy reduction,\nour memory-consolidated vision transformer (MC-ViT) effortlessly extends its\ncontext far into the past and exhibits excellent scaling behavior when learning\nfrom longer videos. In doing so, MC-ViT sets a new state-of-the-art in\nlong-context video understanding on EgoSchema, Perception Test, and Diving48,\noutperforming methods that benefit from orders of magnitude more parameters.\n","authors":["Ivana Balažević","Yuge Shi","Pinelopi Papalampidi","Rahma Chaabouni","Skanda Koppula","Olivier J. Hénaff"],"pdf_url":"https://arxiv.org/pdf/2402.05861v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20915v1","updated":"2024-05-31T15:21:44Z","published":"2024-05-31T15:21:44Z","title":"Fast yet Safe: Early-Exiting with Risk Control","summary":" Scaling machine learning models significantly improves their performance.\nHowever, such gains come at the cost of inference being slow and\nresource-intensive. Early-exit neural networks (EENNs) offer a promising\nsolution: they accelerate inference by allowing intermediate layers to exit and\nproduce a prediction early. Yet a fundamental issue with EENNs is how to\ndetermine when to exit without severely degrading performance. In other words,\nwhen is it 'safe' for an EENN to go 'fast'? To address this issue, we\ninvestigate how to adapt frameworks of risk control to EENNs. Risk control\noffers a distribution-free, post-hoc solution that tunes the EENN's exiting\nmechanism so that exits only occur when the output is of sufficient quality. We\nempirically validate our insights on a range of vision and language tasks,\ndemonstrating that risk control can produce substantial computational savings,\nall the while preserving user-specified performance goals.\n","authors":["Metod Jazbec","Alexander Timans","Tin Hadži Veljković","Kaspar Sakmann","Dan Zhang","Christian A. Naesseth","Eric Nalisnick"],"pdf_url":"https://arxiv.org/pdf/2405.20915v1.pdf","comment":"25 pages, 11 figures, 4 tables (incl. appendix)"},{"id":"http://arxiv.org/abs/2405.20910v1","updated":"2024-05-31T15:21:06Z","published":"2024-05-31T15:21:06Z","title":"Predicting ptychography probe positions using single-shot phase\n retrieval neural network","summary":" Ptychography is a powerful imaging technique that is used in a variety of\nfields, including materials science, biology, and nanotechnology. However, the\naccuracy of the reconstructed ptychography image is highly dependent on the\naccuracy of the recorded probe positions which often contain errors. These\nerrors are typically corrected jointly with phase retrieval through numerical\noptimization approaches. When the error accumulates along the scan path or when\nthe error magnitude is large, these approaches may not converge with\nsatisfactory result. We propose a fundamentally new approach for ptychography\nprobe position prediction for data with large position errors, where a neural\nnetwork is used to make single-shot phase retrieval on individual diffraction\npatterns, yielding the object image at each scan point. The pairwise offsets\namong these images are then found using a robust image registration method, and\nthe results are combined to yield the complete scan path by constructing and\nsolving a linear equation. We show that our method can achieve good position\nprediction accuracy for data with large and accumulating errors on the order of\n$10^2$ pixels, a magnitude that often makes optimization-based algorithms fail\nto converge. For ptychography instruments without sophisticated position\ncontrol equipment such as interferometers, our method is of significant\npractical potential.\n","authors":["Ming Du","Tao Zhou","Junjing Deng","Daniel J. Ching","Steven Henke","Mathew J. Cherukara"],"pdf_url":"https://arxiv.org/pdf/2405.20910v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20906v1","updated":"2024-05-31T15:17:47Z","published":"2024-05-31T15:17:47Z","title":"Enhancing Vision Models for Text-Heavy Content Understanding and\n Interaction","summary":" Interacting and understanding with text heavy visual content with multiple\nimages is a major challenge for traditional vision models. This paper is on\nenhancing vision models' capability to comprehend or understand and learn from\nimages containing a huge amount of textual information from the likes of\ntextbooks and research papers which contain multiple images like graphs, etc\nand tables in them with different types of axes and scales. The approach\ninvolves dataset preprocessing, fine tuning which is by using instructional\noriented data and evaluation. We also built a visual chat application\nintegrating CLIP for image encoding and a model from the Massive Text Embedding\nBenchmark which is developed to consider both textual and visual inputs. An\naccuracy of 96.71% was obtained. The aim of the project is to increase and also\nenhance the advance vision models' capabilities in understanding complex visual\ntextual data interconnected data, contributing to multimodal AI.\n","authors":["Adithya TG","Adithya SK","Abhinav R Bharadwaj","Abhiram HA","Dr. Surabhi Narayan"],"pdf_url":"https://arxiv.org/pdf/2405.20906v1.pdf","comment":"5 pages, 4 figures (including 1 graph)"},{"id":"http://arxiv.org/abs/2405.07801v3","updated":"2024-05-31T15:11:51Z","published":"2024-05-13T14:44:22Z","title":"Deep Learning-Based Object Pose Estimation: A Comprehensive Survey","summary":" Object pose estimation is a fundamental computer vision problem with broad\napplications in augmented reality and robotics. Over the past decade, deep\nlearning models, due to their superior accuracy and robustness, have\nincreasingly supplanted conventional algorithms reliant on engineered point\npair features. Nevertheless, several challenges persist in contemporary\nmethods, including their dependency on labeled training data, model\ncompactness, robustness under challenging conditions, and their ability to\ngeneralize to novel unseen objects. A recent survey discussing the progress\nmade on different aspects of this area, outstanding challenges, and promising\nfuture directions, is missing. To fill this gap, we discuss the recent advances\nin deep learning-based object pose estimation, covering all three formulations\nof the problem, \\emph{i.e.}, instance-level, category-level, and unseen object\npose estimation. Our survey also covers multiple input data modalities,\ndegrees-of-freedom of output poses, object properties, and downstream tasks,\nproviding the readers with a holistic understanding of this field.\nAdditionally, it discusses training paradigms of different domains, inference\nmodes, application areas, evaluation metrics, and benchmark datasets, as well\nas reports the performance of current state-of-the-art methods on these\nbenchmarks, thereby facilitating the readers in selecting the most suitable\nmethod for their application. Finally, the survey identifies key challenges,\nreviews the prevailing trends along with their pros and cons, and identifies\npromising directions for future research. We also keep tracing the latest works\nat https://github.com/CNJianLiu/Awesome-Object-Pose-Estimation.\n","authors":["Jian Liu","Wei Sun","Hui Yang","Zhiwen Zeng","Chongpei Liu","Jin Zheng","Xingyu Liu","Hossein Rahmani","Nicu Sebe","Ajmal Mian"],"pdf_url":"https://arxiv.org/pdf/2405.07801v3.pdf","comment":"27 pages, 7 figures"},{"id":"http://arxiv.org/abs/2212.00394v3","updated":"2024-05-31T15:08:21Z","published":"2022-12-01T09:42:55Z","title":"From CNNs to Shift-Invariant Twin Models Based on Complex Wavelets","summary":" We propose a novel method to increase shift invariance and prediction\naccuracy in convolutional neural networks. Specifically, we replace the\nfirst-layer combination \"real-valued convolutions + max pooling\" (RMax) by\n\"complex-valued convolutions + modulus\" (CMod), which is stable to\ntranslations, or shifts. To justify our approach, we claim that CMod and RMax\nproduce comparable outputs when the convolution kernel is band-pass and\noriented (Gabor-like filter). In this context, CMod can therefore be considered\nas a stable alternative to RMax. To enforce this property, we constrain the\nconvolution kernels to adopt such a Gabor-like structure. The corresponding\narchitecture is called mathematical twin, because it employs a well-defined\nmathematical operator to mimic the behavior of the original, freely-trained\nmodel. Our approach achieves superior accuracy on ImageNet and CIFAR-10\nclassification tasks, compared to prior methods based on low-pass filtering.\nArguably, our approach's emphasis on retaining high-frequency details\ncontributes to a better balance between shift invariance and information\npreservation, resulting in improved performance. Furthermore, it has a lower\ncomputational cost and memory footprint than concurrent work, making it a\npromising solution for practical implementation.\n","authors":["Hubert Leterme","Kévin Polisano","Valérie Perrier","Karteek Alahari"],"pdf_url":"https://arxiv.org/pdf/2212.00394v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.00110v3","updated":"2024-05-31T15:07:31Z","published":"2023-11-30T18:19:47Z","title":"CLIP-QDA: An Explainable Concept Bottleneck Model","summary":" In this paper, we introduce an explainable algorithm designed from a\nmulti-modal foundation model, that performs fast and explainable image\nclassification. Drawing inspiration from CLIP-based Concept Bottleneck Models\n(CBMs), our method creates a latent space where each neuron is linked to a\nspecific word. Observing that this latent space can be modeled with simple\ndistributions, we use a Mixture of Gaussians (MoG) formalism to enhance the\ninterpretability of this latent space. Then, we introduce CLIP-QDA, a\nclassifier that only uses statistical values to infer labels from the concepts.\nIn addition, this formalism allows for both local and global explanations.\nThese explanations come from the inner design of our architecture, our work is\npart of a new family of greybox models, combining performances of opaque\nfoundation models and the interpretability of transparent models. Our empirical\nfindings show that in instances where the MoG assumption holds, CLIP-QDA\nachieves similar accuracy with state-of-the-art methods CBMs. Our explanations\ncompete with existing XAI methods while being faster to compute.\n","authors":["Rémi Kazmierczak","Eloïse Berthier","Goran Frehse","Gianni Franchi"],"pdf_url":"https://arxiv.org/pdf/2312.00110v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20892v1","updated":"2024-05-31T15:03:35Z","published":"2024-05-31T15:03:35Z","title":"MALT: Multi-scale Action Learning Transformer for Online Action\n Detection","summary":" Online action detection (OAD) aims to identify ongoing actions from streaming\nvideo in real-time, without access to future frames. Since these actions\nmanifest at varying scales of granularity, ranging from coarse to fine,\nprojecting an entire set of action frames to a single latent encoding may\nresult in a lack of local information, necessitating the acquisition of action\nfeatures across multiple scales. In this paper, we propose a multi-scale action\nlearning transformer (MALT), which includes a novel recurrent decoder (used for\nfeature fusion) that includes fewer parameters and can be trained more\nefficiently. A hierarchical encoder with multiple encoding branches is further\nproposed to capture multi-scale action features. The output from the preceding\nbranch is then incrementally input to the subsequent branch as part of a\ncross-attention calculation. In this way, output features transition from\ncoarse to fine as the branches deepen. We also introduce an explicit frame\nscoring mechanism employing sparse attention, which filters irrelevant frames\nmore efficiently, without requiring an additional network. The proposed method\nachieved state-of-the-art performance on two benchmark datasets (THUMOS'14 and\nTVSeries), outperforming all existing models used for comparison, with an mAP\nof 0.2% for THUMOS'14 and an mcAP of 0.1% for TVseries.\n","authors":["Zhipeng Yang","Ruoyu Wang","Yang Tan","Liping Xie"],"pdf_url":"https://arxiv.org/pdf/2405.20892v1.pdf","comment":"8 pages, 3 figures"},{"id":"http://arxiv.org/abs/2405.20881v1","updated":"2024-05-31T14:55:31Z","published":"2024-05-31T14:55:31Z","title":"S4Fusion: Saliency-aware Selective State Space Model for Infrared\n Visible Image Fusion","summary":" As one of the tasks in Image Fusion, Infrared and Visible Image Fusion aims\nto integrate complementary information captured by sensors of different\nmodalities into a single image. The Selective State Space Model (SSSM), known\nfor its ability to capture long-range dependencies, has demonstrated its\npotential in the field of computer vision. However, in image fusion, current\nmethods underestimate the potential of SSSM in capturing the global spatial\ninformation of both modalities. This limitation prevents the simultaneous\nconsideration of the global spatial information from both modalities during\ninteraction, leading to a lack of comprehensive perception of salient targets.\nConsequently, the fusion results tend to bias towards one modality instead of\nadaptively preserving salient targets. To address this issue, we propose the\nSaliency-aware Selective State Space Fusion Model (S4Fusion). In our S4Fusion,\nthe designed Cross-Modal Spatial Awareness Module (CMSA) can simultaneously\nfocus on global spatial information from both modalities while facilitating\ntheir interaction, thereby comprehensively capturing complementary information.\nAdditionally, S4Fusion leverages a pre-trained network to perceive uncertainty\nin the fused images. By minimizing this uncertainty, S4Fusion adaptively\nhighlights salient targets from both images. Extensive experiments demonstrate\nthat our approach produces high-quality images and enhances performance in\ndownstream tasks.\n","authors":["Haolong Ma","Hui Li","Chunyang Cheng","Gaoang Wang","Xiaoning Song","Xiaojun Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20881v1.pdf","comment":"NurIPS, Under review"},{"id":"http://arxiv.org/abs/2405.20876v1","updated":"2024-05-31T14:52:49Z","published":"2024-05-31T14:52:49Z","title":"Investigating Calibration and Corruption Robustness of Post-hoc Pruned\n Perception CNNs: An Image Classification Benchmark Study","summary":" Convolutional Neural Networks (CNNs) have achieved state-of-the-art\nperformance in many computer vision tasks. However, high computational and\nstorage demands hinder their deployment into resource-constrained environments,\nsuch as embedded devices. Model pruning helps to meet these restrictions by\nreducing the model size, while maintaining superior performance. Meanwhile,\nsafety-critical applications pose more than just resource and performance\nconstraints. In particular, predictions must not be overly confident, i.e.,\nprovide properly calibrated uncertainty estimations (proper uncertainty\ncalibration), and CNNs must be robust against corruptions like naturally\noccurring input perturbations (natural corruption robustness). This work\ninvestigates the important trade-off between uncertainty calibration, natural\ncorruption robustness, and performance for current state-of-research post-hoc\nCNN pruning techniques in the context of image classification tasks. Our study\nreveals that post-hoc pruning substantially improves the model's uncertainty\ncalibration, performance, and natural corruption robustness, sparking hope for\nsafe and robust embedded CNNs.Furthermore, uncertainty calibration and natural\ncorruption robustness are not mutually exclusive targets under pruning, as\nevidenced by the improved safety aspects obtained by post-hoc unstructured\npruning with increasing compression.\n","authors":["Pallavi Mitra","Gesina Schwalbe","Nadja Klein"],"pdf_url":"https://arxiv.org/pdf/2405.20876v1.pdf","comment":"11 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.18953v2","updated":"2024-05-31T14:51:58Z","published":"2023-10-29T09:54:03Z","title":"TIC-TAC: A Framework for Improved Covariance Estimation in Deep\n Heteroscedastic Regression","summary":" Deep heteroscedastic regression involves jointly optimizing the mean and\ncovariance of the predicted distribution using the negative log-likelihood.\nHowever, recent works show that this may result in sub-optimal convergence due\nto the challenges associated with covariance estimation. While the literature\naddresses this by proposing alternate formulations to mitigate the impact of\nthe predicted covariance, we focus on improving the predicted covariance\nitself. We study two questions: (1) Does the predicted covariance truly capture\nthe randomness of the predicted mean? (2) In the absence of supervision, how\ncan we quantify the accuracy of covariance estimation? We address (1) with a\nTaylor Induced Covariance (TIC), which captures the randomness of the predicted\nmean by incorporating its gradient and curvature through the second order\nTaylor polynomial. Furthermore, we tackle (2) by introducing a Task Agnostic\nCorrelations (TAC) metric, which combines the notion of correlations and\nabsolute error to evaluate the covariance. We evaluate TIC-TAC across multiple\nexperiments spanning synthetic and real-world datasets. Our results show that\nnot only does TIC accurately learn the covariance, it additionally facilitates\nan improved convergence of the negative log-likelihood. Our code is available\nat https://github.com/vita-epfl/TIC-TAC\n","authors":["Megh Shukla","Mathieu Salzmann","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2310.18953v2.pdf","comment":"ICML 2024. Please feel free to provide feedback!"},{"id":"http://arxiv.org/abs/2405.20868v1","updated":"2024-05-31T14:47:27Z","published":"2024-05-31T14:47:27Z","title":"Responsible AI for Earth Observation","summary":" The convergence of artificial intelligence (AI) and Earth observation (EO)\ntechnologies has brought geoscience and remote sensing into an era of\nunparalleled capabilities. AI's transformative impact on data analysis,\nparticularly derived from EO platforms, holds great promise in addressing\nglobal challenges such as environmental monitoring, disaster response and\nclimate change analysis. However, the rapid integration of AI necessitates a\ncareful examination of the responsible dimensions inherent in its application\nwithin these domains. In this paper, we represent a pioneering effort to\nsystematically define the intersection of AI and EO, with a central focus on\nresponsible AI practices. Specifically, we identify several critical components\nguiding this exploration from both academia and industry perspectives within\nthe EO field: AI and EO for social good, mitigating unfair biases, AI security\nin EO, geo-privacy and privacy-preserving measures, as well as maintaining\nscientific excellence, open data, and guiding AI usage based on ethical\nprinciples. Furthermore, the paper explores potential opportunities and\nemerging trends, providing valuable insights for future research endeavors.\n","authors":["Pedram Ghamisi","Weikang Yu","Andrea Marinoni","Caroline M. Gevaert","Claudio Persello","Sivasakthy Selvakumaran","Manuela Girotto","Benjamin P. Horton","Philippe Rufin","Patrick Hostert","Fabio Pacifici","Peter M. Atkinson"],"pdf_url":"https://arxiv.org/pdf/2405.20868v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20867v1","updated":"2024-05-31T14:47:20Z","published":"2024-05-31T14:47:20Z","title":"Automatic Channel Pruning for Multi-Head Attention","summary":" Despite the strong performance of Transformers, their quadratic computation\ncomplexity presents challenges in applying them to vision tasks. Automatic\npruning is one of effective methods for reducing computation complexity without\nheuristic approaches. However, directly applying it to multi-head attention is\nnot straightforward due to channel misalignment. In this paper, we propose an\nautomatic channel pruning method to take into account the multi-head attention\nmechanism. First, we incorporate channel similarity-based weights into the\npruning indicator to preserve more informative channels in each head. Then, we\nadjust pruning indicator to enforce removal of channels in equal proportions\nacross all heads, preventing the channel misalignment. We also add a reweight\nmodule to compensate for information loss resulting from channel removal, and\nan effective initialization step for pruning indicator based on difference of\nattention between original structure and each channel. Our proposed method can\nbe used to not only original attention, but also linear attention, which is\nmore efficient as linear complexity with respect to the number of tokens. On\nImageNet-1K, applying our pruning method to the FLattenTransformer, which\nincludes both attention mechanisms, shows outperformed accuracy for several\nMACs compared with previous state-of-the-art efficient models and pruned\nmethods. Code will be available soon.\n","authors":["Eunho Lee","Youngbae Hwang"],"pdf_url":"https://arxiv.org/pdf/2405.20867v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.17389v3","updated":"2024-05-31T14:38:08Z","published":"2023-11-29T06:42:12Z","title":"360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization\n with Cross-device Queries","summary":" Portable 360$^\\circ$ cameras are becoming a cheap and efficient tool to\nestablish large visual databases. By capturing omnidirectional views of a\nscene, these cameras could expedite building environment models that are\nessential for visual localization. However, such an advantage is often\noverlooked due to the lack of valuable datasets. This paper introduces a new\nbenchmark dataset, 360Loc, composed of 360$^\\circ$ images with ground truth\nposes for visual localization. We present a practical implementation of\n360$^\\circ$ mapping combining 360$^\\circ$ images with lidar data to generate\nthe ground truth 6DoF poses. 360Loc is the first dataset and benchmark that\nexplores the challenge of cross-device visual positioning, involving\n360$^\\circ$ reference frames, and query frames from pinhole, ultra-wide FoV\nfisheye, and 360$^\\circ$ cameras. We propose a virtual camera approach to\ngenerate lower-FoV query frames from 360$^\\circ$ images, which ensures a fair\ncomparison of performance among different query types in visual localization\ntasks. We also extend this virtual camera approach to feature matching-based\nand pose regression-based methods to alleviate the performance loss caused by\nthe cross-device domain gap, and evaluate its effectiveness against\nstate-of-the-art baselines. We demonstrate that omnidirectional visual\nlocalization is more robust in challenging large-scale scenes with symmetries\nand repetitive structures. These results provide new insights into 360-camera\nmapping and omnidirectional visual localization with cross-device queries.\n","authors":["Huajian Huang","Changkun Liu","Yipeng Zhu","Hui Cheng","Tristan Braud","Sai-Kit Yeung"],"pdf_url":"https://arxiv.org/pdf/2311.17389v3.pdf","comment":"CVPR 2024. Project Page: https://huajianup.github.io/research/360Loc/"},{"id":"http://arxiv.org/abs/2405.20853v1","updated":"2024-05-31T14:35:35Z","published":"2024-05-31T14:35:35Z","title":"MeshXL: Neural Coordinate Field for Generative 3D Foundation Models","summary":" The polygon mesh representation of 3D data exhibits great flexibility, fast\nrendering speed, and storage efficiency, which is widely preferred in various\napplications. However, given its unstructured graph representation, the direct\ngeneration of high-fidelity 3D meshes is challenging. Fortunately, with a\npre-defined ordering strategy, 3D meshes can be represented as sequences, and\nthe generation process can be seamlessly treated as an auto-regressive problem.\nIn this paper, we validate the Neural Coordinate Field (NeurCF), an explicit\ncoordinate representation with implicit neural embeddings, is a\nsimple-yet-effective representation for large-scale sequential mesh modeling.\nAfter that, we present MeshXL, a family of generative pre-trained\nauto-regressive models, which addresses the process of 3D mesh generation with\nmodern large language model approaches. Extensive experiments show that MeshXL\nis able to generate high-quality 3D meshes, and can also serve as foundation\nmodels for various down-stream applications.\n","authors":["Sijin Chen","Xin Chen","Anqi Pang","Xianfang Zeng","Wei Cheng","Yijun Fu","Fukun Yin","Yanru Wang","Zhibin Wang","Chi Zhang","Jingyi Yu","Gang Yu","Bin Fu","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2405.20853v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20851v1","updated":"2024-05-31T14:33:13Z","published":"2024-05-31T14:33:13Z","title":"MegActor: Harness the Power of Raw Video for Vivid Portrait Animation","summary":" Despite raw driving videos contain richer information on facial expressions\nthan intermediate representations such as landmarks in the field of portrait\nanimation, they are seldom the subject of research. This is due to two\nchallenges inherent in portrait animation driven with raw videos: 1)\nsignificant identity leakage; 2) Irrelevant background and facial details such\nas wrinkles degrade performance. To harnesses the power of the raw videos for\nvivid portrait animation, we proposed a pioneering conditional diffusion model\nnamed as MegActor. First, we introduced a synthetic data generation framework\nfor creating videos with consistent motion and expressions but inconsistent IDs\nto mitigate the issue of ID leakage. Second, we segmented the foreground and\nbackground of the reference image and employed CLIP to encode the background\ndetails. This encoded information is then integrated into the network via a\ntext embedding module, thereby ensuring the stability of the background.\nFinally, we further style transfer the appearance of the reference image to the\ndriving video to eliminate the influence of facial details in the driving\nvideos. Our final model was trained solely on public datasets, achieving\nresults comparable to commercial models. We hope this will help the open-source\ncommunity.The code is available at\nhttps://github.com/megvii-research/MegFaceAnimate.\n","authors":["Shurong Yang","Huadong Li","Juhao Wu","Minhao Jing","Linze Li","Renhe Ji","Jiajun Liang","Haoqiang Fan"],"pdf_url":"https://arxiv.org/pdf/2405.20851v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00766v4","updated":"2024-05-31T14:29:13Z","published":"2024-01-01T14:14:35Z","title":"Exposure Bracketing is All You Need for Unifying Image Restoration and\n Enhancement Tasks","summary":" It is highly desired but challenging to acquire high-quality photos with\nclear content in low-light environments. Although multi-image processing\nmethods (using burst, dual-exposure, or multi-exposure images) have made\nsignificant progress in addressing this issue, they typically focus on specific\nrestoration or enhancement problems, and do not fully explore the potential of\nutilizing multiple images. Motivated by the fact that multi-exposure images are\ncomplementary in denoising, deblurring, high dynamic range imaging, and\nsuper-resolution, we propose to utilize exposure bracketing photography to\nunify image restoration and enhancement tasks in this work. Due to the\ndifficulty in collecting real-world pairs, we suggest a solution that first\npre-trains the model with synthetic paired data and then adapts it to\nreal-world unlabeled images. In particular, a temporally modulated recurrent\nnetwork (TMRNet) and self-supervised adaptation method are proposed. Moreover,\nwe construct a data simulation pipeline to synthesize pairs and collect\nreal-world images from 200 nighttime scenarios. Experiments on both datasets\nshow that our method performs favorably against the state-of-the-art\nmulti-image processing ones. The dataset, code, and pre-trained models are\navailable at https://github.com/cszhilu1998/BracketIRE.\n","authors":["Zhilu Zhang","Shuohao Zhang","Renlong Wu","Zifei Yan","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2401.00766v4.pdf","comment":"21 pages"},{"id":"http://arxiv.org/abs/2405.20838v1","updated":"2024-05-31T14:25:45Z","published":"2024-05-31T14:25:45Z","title":"einspace: Searching for Neural Architectures from Fundamental Operations","summary":" Neural architecture search (NAS) finds high performing networks for a given\ntask. Yet the results of NAS are fairly prosaic; they did not e.g. create a\nshift from convolutional structures to transformers. This is not least because\nthe search spaces in NAS often aren't diverse enough to include such\ntransformations a priori. Instead, for NAS to provide greater potential for\nfundamental design shifts, we need a novel expressive search space design which\nis built from more fundamental operations. To this end, we introduce einspace,\na search space based on a parameterised probabilistic context-free grammar. Our\nspace is versatile, supporting architectures of various sizes and complexities,\nwhile also containing diverse network operations which allow it to model\nconvolutions, attention components and more. It contains many existing\ncompetitive architectures, and provides flexibility for discovering new ones.\nUsing this search space, we perform experiments to find novel architectures as\nwell as improvements on existing ones on the diverse Unseen NAS datasets. We\nshow that competitive architectures can be obtained by searching from scratch,\nand we consistently find large improvements when initialising the search with\nstrong baselines. We believe that this work is an important advancement towards\na transformative NAS paradigm where search space expressivity and strategic\nsearch initialisation play key roles.\n","authors":["Linus Ericsson","Miguel Espinosa","Chenhongyi Yang","Antreas Antoniou","Amos Storkey","Shay B. Cohen","Steven McDonagh","Elliot J. Crowley"],"pdf_url":"https://arxiv.org/pdf/2405.20838v1.pdf","comment":"Project page at https://linusericsson.github.io/einspace/"},{"id":"http://arxiv.org/abs/2405.20834v1","updated":"2024-05-31T14:23:49Z","published":"2024-05-31T14:23:49Z","title":"Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits\n Multimodal Reasoning","summary":" Large language models equipped with retrieval-augmented generation (RAG)\nrepresent a burgeoning field aimed at enhancing answering capabilities by\nleveraging external knowledge bases. Although the application of RAG with\nlanguage-only models has been extensively explored, its adaptation into\nmultimodal vision-language models remains nascent. Going beyond mere answer\ngeneration, the primary goal of multimodal RAG is to cultivate the models'\nability to reason in response to relevant queries. To this end, we introduce a\nnovel multimodal RAG framework named RMR (Retrieval Meets Reasoning). The RMR\nframework employs a bi-modal retrieval module to identify the most relevant\nquestion-answer pairs, which then serve as scaffolds for the multimodal\nreasoning process. This training-free approach not only encourages the model to\nengage deeply with the reasoning processes inherent in the retrieved content\nbut also facilitates the generation of answers that are precise and richly\ninterpretable. Surprisingly, utilizing solely the ScienceQA dataset, collected\nfrom elementary and high school science curricula, RMR significantly boosts the\nperformance of various vision-language models across a spectrum of benchmark\ndatasets, including A-OKVQA, MMBench, and SEED. These outcomes highlight the\nsubstantial potential of our multimodal retrieval and reasoning mechanism to\nimprove the reasoning capabilities of vision-language models.\n","authors":["Cheng Tan","Jingxuan Wei","Linzhuang Sun","Zhangyang Gao","Siyuan Li","Bihui Yu","Ruifeng Guo","Stan Z. Li"],"pdf_url":"https://arxiv.org/pdf/2405.20834v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2404.07217v2","updated":"2024-05-31T14:23:09Z","published":"2024-02-23T10:08:45Z","title":"Attention-aware Semantic Communications for Collaborative Inference","summary":" We propose a communication-efficient collaborative inference framework in the\ndomain of edge inference, focusing on the efficient use of vision transformer\n(ViT) models. The partitioning strategy of conventional collaborative inference\nfails to reduce communication cost because of the inherent architecture of ViTs\nmaintaining consistent layer dimensions across the entire transformer encoder.\nTherefore, instead of employing the partitioning strategy, our framework\nutilizes a lightweight ViT model on the edge device, with the server deploying\na complicated ViT model. To enhance communication efficiency and achieve the\nclassification accuracy of the server model, we propose two strategies: 1)\nattention-aware patch selection and 2) entropy-aware image transmission.\nAttention-aware patch selection leverages the attention scores generated by the\nedge device's transformer encoder to identify and select the image patches\ncritical for classification. This strategy enables the edge device to transmit\nonly the essential patches to the server, significantly improving communication\nefficiency. Entropy-aware image transmission uses min-entropy as a metric to\naccurately determine whether to depend on the lightweight model on the edge\ndevice or to request the inference from the server model. In our framework, the\nlightweight ViT model on the edge device acts as a semantic encoder,\nefficiently identifying and selecting the crucial image information required\nfor the classification task. Our experiments demonstrate that the proposed\ncollaborative inference framework can reduce communication overhead by 68% with\nonly a minimal loss in accuracy compared to the server model on the ImageNet\ndataset.\n","authors":["Jiwoong Im","Nayoung Kwon","Taewoo Park","Jiheon Woo","Jaeho Lee","Yongjune Kim"],"pdf_url":"https://arxiv.org/pdf/2404.07217v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20829v1","updated":"2024-05-31T14:21:00Z","published":"2024-05-31T14:21:00Z","title":"Rethinking Open-World Semi-Supervised Learning: Distribution Mismatch\n and Inductive Inference","summary":" Open-world semi-supervised learning (OWSSL) extends conventional\nsemi-supervised learning to open-world scenarios by taking account of novel\ncategories in unlabeled datasets. Despite the recent advancements in OWSSL, the\nsuccess often relies on the assumptions that 1) labeled and unlabeled datasets\nshare the same balanced class prior distribution, which does not generally hold\nin real-world applications, and 2) unlabeled training datasets are utilized for\nevaluation, where such transductive inference might not adequately address\nchallenges in the wild. In this paper, we aim to generalize OWSSL by addressing\nthem. Our work suggests that practical OWSSL may require different training\nsettings, evaluation methods, and learning strategies compared to those\nprevalent in the existing literature.\n","authors":["Seongheon Park","Hyuk Kwon","Kwanghoon Sohn","Kibok Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20829v1.pdf","comment":"CVPR Workshop on Computer Vision in the Wild (CVinW), 2024"},{"id":"http://arxiv.org/abs/2405.20810v1","updated":"2024-05-31T14:07:39Z","published":"2024-05-31T14:07:39Z","title":"Context-aware Difference Distilling for Multi-change Captioning","summary":" Multi-change captioning aims to describe complex and coupled changes within\nan image pair in natural language. Compared with single-change captioning, this\ntask requires the model to have higher-level cognition ability to reason an\narbitrary number of changes. In this paper, we propose a novel context-aware\ndifference distilling (CARD) network to capture all genuine changes for\nyielding sentences. Given an image pair, CARD first decouples context features\nthat aggregate all similar/dissimilar semantics, termed common/difference\ncontext features. Then, the consistency and independence constraints are\ndesigned to guarantee the alignment/discrepancy of common/difference context\nfeatures. Further, the common context features guide the model to mine locally\nunchanged features, which are subtracted from the pair to distill locally\ndifference features. Next, the difference context features augment the locally\ndifference features to ensure that all changes are distilled. In this way, we\nobtain an omni-representation of all changes, which is translated into\nlinguistic sentences by a transformer decoder. Extensive experiments on three\npublic datasets show CARD performs favourably against state-of-the-art\nmethods.The code is available at https://github.com/tuyunbin/CARD.\n","authors":["Yunbin Tu","Liang Li","Li Su","Zheng-Jun Zha","Chenggang Yan","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2405.20810v1.pdf","comment":"Accepted by ACL 2024 main conference (long paper)"},{"id":"http://arxiv.org/abs/2402.12550v2","updated":"2024-05-31T14:04:05Z","published":"2024-02-19T21:20:22Z","title":"Multilinear Mixture of Experts: Scalable Expert Specialization through\n Factorization","summary":" The Mixture of Experts (MoE) paradigm provides a powerful way to decompose\ndense layers into smaller, modular computations often more amenable to human\ninterpretation, debugging, and editability. However, a major challenge lies in\nthe computational cost of scaling the number of experts high enough to achieve\nfine-grained specialization. In this paper, we propose the Multilinear Mixture\nof Experts ($\\mu$MoE) layer to address this, focusing on vision models.\n$\\mu$MoE layers enable scalable expert specialization by performing an implicit\ncomputation on prohibitively large weight tensors entirely in factorized form.\nConsequently, $\\mu$MoEs (1) avoid the restrictively high inference-time costs\nof 'soft' MoEs, yet (2) do not inherit the training issues of the popular\n'sparse' MoEs' discrete (non-differentiable) expert routing. We present both\nqualitative and quantitative evidence that scaling $\\mu$MoE layers when\nfine-tuning foundation models for vision tasks leads to more specialized\nexperts at the class-level, further enabling manual bias correction in CelebA\nattribute classification. Finally, we show qualitative results demonstrating\nthe expert specialism achieved when pre-training large GPT2 and MLP-Mixer\nmodels with parameter-matched $\\mu$MoE blocks at every layer, maintaining\ncomparable accuracy. Our code is available at:\nhttps://github.com/james-oldfield/muMoE.\n","authors":["James Oldfield","Markos Georgopoulos","Grigorios G. Chrysos","Christos Tzelepis","Yannis Panagakis","Mihalis A. Nicolaou","Jiankang Deng","Ioannis Patras"],"pdf_url":"https://arxiv.org/pdf/2402.12550v2.pdf","comment":"Github: https://github.com/james-oldfield/muMoE. Project page:\n https://james-oldfield.github.io/muMoE/"},{"id":"http://arxiv.org/abs/2405.18839v2","updated":"2024-05-31T14:03:07Z","published":"2024-05-29T07:40:31Z","title":"MEGA: Masked Generative Autoencoder for Human Mesh Recovery","summary":" Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous\nproblem, as similar 2D projections can correspond to multiple 3D\ninterpretations. Nevertheless, most HMR methods overlook this ambiguity and\nmake a single prediction without accounting for the associated uncertainty. A\nfew approaches generate a distribution of human meshes, enabling the sampling\nof multiple predictions; however, none of them is competitive with the latest\nsingle-output model when making a single prediction. This work proposes a new\napproach based on masked generative modeling. By tokenizing the human pose and\nshape, we formulate the HMR task as generating a sequence of discrete tokens\nconditioned on an input image. We introduce MEGA, a MaskEd Generative\nAutoencoder trained to recover human meshes from images and partial human mesh\ntoken sequences. Given an image, our flexible generation scheme allows us to\npredict a single human mesh in deterministic mode or to generate multiple human\nmeshes in stochastic mode. MEGA enables us to propose multiple outputs and to\nevaluate the uncertainty of the predictions. Experiments on in-the-wild\nbenchmarks show that MEGA achieves state-of-the-art performance in\ndeterministic and stochastic modes, outperforming single-output and\nmulti-output approaches.\n","authors":["Guénolé Fiche","Simon Leglaive","Xavier Alameda-Pineda","Francesc Moreno-Noguer"],"pdf_url":"https://arxiv.org/pdf/2405.18839v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20797v1","updated":"2024-05-31T13:59:18Z","published":"2024-05-31T13:59:18Z","title":"Ovis: Structural Embedding Alignment for Multimodal Large Language Model","summary":" Current Multimodal Large Language Models (MLLMs) typically integrate a\npre-trained LLM with another pre-trained vision transformer through a\nconnector, such as an MLP, endowing the LLM with visual capabilities. However,\nthe misalignment between two embedding strategies in MLLMs -- the structural\ntextual embeddings based on an embedding look-up table and the continuous\nembeddings generated directly by the vision encoder -- makes challenges for a\nmore seamless fusion of visual and textual information. We propose Ovis, a\nnovel MLLM architecture designed to structurally align visual and textual\nembeddings. Ovis integrates an additional learnable visual embedding table into\nthe visual encoder's process. To capture rich visual semantics, each image\npatch indexes the visual embedding table multiple times, resulting in a final\nvisual embedding that is a probabilistic combination of the indexed embeddings.\nThis structural approach mirrors the method used for generating textual\nembeddings. Empirical evaluations on various multimodal benchmarks demonstrate\nthat Ovis outperforms open-source MLLMs of similar parameter scales and even\nsurpasses the proprietary model Qwen-VL-Plus overall. These results highlight\nthe potential of Ovis' structured visual representation for advancing MLLM\narchitectural design and promoting more effective multimodal learning. Both the\nsource code and the training dataset of Ovis will be made publicly available.\n","authors":["Shiyin Lu","Yang Li","Qing-Guo Chen","Zhao Xu","Weihua Luo","Kaifu Zhang","Han-Jia Ye"],"pdf_url":"https://arxiv.org/pdf/2405.20797v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20795v1","updated":"2024-05-31T13:56:55Z","published":"2024-05-31T13:56:55Z","title":"InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced\n Visual Understanding","summary":" Accurate visual understanding is imperative for advancing autonomous systems\nand intelligent robots. Despite the powerful capabilities of vision-language\nmodels (VLMs) in processing complex visual scenes, precisely recognizing\nobscured or ambiguously presented visual elements remains challenging. To\ntackle such issues, this paper proposes InsightSee, a multi-agent framework to\nenhance VLMs' interpretative capabilities in handling complex visual\nunderstanding scenarios. The framework comprises a description agent, two\nreasoning agents, and a decision agent, which are integrated to refine the\nprocess of visual information interpretation. The design of these agents and\nthe mechanisms by which they can be enhanced in visual information processing\nare presented. Experimental results demonstrate that the InsightSee framework\nnot only boosts performance on specific visual tasks but also retains the\noriginal models' strength. The proposed framework outperforms state-of-the-art\nalgorithms in 6 out of 9 benchmark tests, with a substantial advancement in\nmultimodal understanding.\n","authors":["Huaxiang Zhang","Yaojia Mu","Guo-Niu Zhu","Zhongxue Gan"],"pdf_url":"https://arxiv.org/pdf/2405.20795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.19473v5","updated":"2024-05-31T13:56:39Z","published":"2024-02-29T18:59:01Z","title":"Retrieval-Augmented Generation for AI-Generated Content: A Survey","summary":" Advancements in model algorithms, the growth of foundational models, and\naccess to high-quality datasets have propelled the evolution of Artificial\nIntelligence Generated Content (AIGC). Despite its notable successes, AIGC\nstill faces hurdles such as updating knowledge, handling long-tail data,\nmitigating data leakage, and managing high training and inference costs.\nRetrieval-Augmented Generation (RAG) has recently emerged as a paradigm to\naddress such challenges. In particular, RAG introduces the information\nretrieval process, which enhances the generation process by retrieving relevant\nobjects from available data stores, leading to higher accuracy and better\nrobustness. In this paper, we comprehensively review existing efforts that\nintegrate RAG technique into AIGC scenarios. We first classify RAG foundations\naccording to how the retriever augments the generator, distilling the\nfundamental abstractions of the augmentation methodologies for various\nretrievers and generators. This unified perspective encompasses all RAG\nscenarios, illuminating advancements and pivotal technologies that help with\npotential future progress. We also summarize additional enhancements methods\nfor RAG, facilitating effective engineering and implementation of RAG systems.\nThen from another view, we survey on practical applications of RAG across\ndifferent modalities and tasks, offering valuable references for researchers\nand practitioners. Furthermore, we introduce the benchmarks for RAG, discuss\nthe limitations of current RAG systems, and suggest potential directions for\nfuture research. Github: https://github.com/PKU-DAIR/RAG-Survey.\n","authors":["Penghao Zhao","Hailin Zhang","Qinhan Yu","Zhengren Wang","Yunteng Geng","Fangcheng Fu","Ling Yang","Wentao Zhang","Jie Jiang","Bin Cui"],"pdf_url":"https://arxiv.org/pdf/2402.19473v5.pdf","comment":"Citing 334 papers, 21 pages, 1 table, 12 figures. Project:\n https://github.com/PKU-DAIR/RAG-Survey"},{"id":"http://arxiv.org/abs/2405.20791v1","updated":"2024-05-31T13:48:54Z","published":"2024-05-31T13:48:54Z","title":"GS-Phong: Meta-Learned 3D Gaussians for Relightable Novel View Synthesis","summary":" Decoupling the illumination in 3D scenes is crucial for novel view synthesis\nand relighting. In this paper, we propose a novel method for representing a\nscene illuminated by a point light using a set of relightable 3D Gaussian\npoints. Inspired by the Blinn-Phong model, our approach decomposes the scene\ninto ambient, diffuse, and specular components, enabling the synthesis of\nrealistic lighting effects. To facilitate the decomposition of geometric\ninformation independent of lighting conditions, we introduce a novel bilevel\noptimization-based meta-learning framework. The fundamental idea is to view the\nrendering tasks under various lighting positions as a multi-task learning\nproblem, which our meta-learning approach effectively addresses by generalizing\nthe learned Gaussian geometries not only across different viewpoints but also\nacross diverse light positions. Experimental results demonstrate the\neffectiveness of our approach in terms of training efficiency and rendering\nquality compared to existing methods for free-viewpoint relighting.\n","authors":["Yumeng He","Yunbo Wang","Xiaokang Yang"],"pdf_url":"https://arxiv.org/pdf/2405.20791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.15356v2","updated":"2024-05-31T12:53:36Z","published":"2023-11-26T17:17:28Z","title":"Having Second Thoughts? Let's hear it","summary":" Deep learning models loosely mimic bottom-up signal pathways from low-order\nsensory areas to high-order cognitive areas. After training, DL models can\noutperform humans on some domain-specific tasks, but their decision-making\nprocess has been known to be easily disrupted. Since the human brain consists\nof multiple functional areas highly connected to one another and relies on\nintricate interplays between bottom-up and top-down (from high-order to\nlow-order areas) processing, we hypothesize that incorporating top-down signal\nprocessing may make DL models more robust. To address this hypothesis, we\npropose a certification process mimicking selective attention and test if it\ncould make DL models more robust. Our empirical evaluations suggest that this\nnewly proposed certification can improve DL models' accuracy and help us build\nsafety measures to alleviate their vulnerabilities with both artificial and\nnatural adversarial examples.\n","authors":["Jung H. Lee","Sujith Vijayan"],"pdf_url":"https://arxiv.org/pdf/2311.15356v2.pdf","comment":"10 pages, 6 figures, 3 table and Append/Supplementary materials.\n Section 3 has been substantially revised"},{"id":"http://arxiv.org/abs/2405.20764v1","updated":"2024-05-31T12:35:06Z","published":"2024-05-31T12:35:06Z","title":"CoMoFusion: Fast and High-quality Fusion of Infrared and Visible Image\n with Consistency Model","summary":" Generative models are widely utilized to model the distribution of fused\nimages in the field of infrared and visible image fusion. However, current\ngenerative models based fusion methods often suffer from unstable training and\nslow inference speed. To tackle this problem, a novel fusion method based on\nconsistency model is proposed, termed as CoMoFusion, which can generate the\nhigh-quality images and achieve fast image inference speed. In specific, the\nconsistency model is used to construct multi-modal joint features in the latent\nspace with the forward and reverse process. Then, the infrared and visible\nfeatures extracted by the trained consistency model are fed into fusion module\nto generate the final fused image. In order to enhance the texture and salient\ninformation of fused images, a novel loss based on pixel value selection is\nalso designed. Extensive experiments on public datasets illustrate that our\nmethod obtains the SOTA fusion performance compared with the existing fusion\nmethods.\n","authors":["Zhiming Meng","Hui Li","Zeyang Zhang","Zhongwei Shen","Yunlong Yu","Xiaoning Song","Xiaojun Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20759v1","updated":"2024-05-31T12:20:02Z","published":"2024-05-31T12:20:02Z","title":"Information Theoretic Text-to-Image Alignment","summary":" Diffusion models for Text-to-Image (T2I) conditional generation have seen\ntremendous success recently. Despite their success, accurately capturing user\nintentions with these models still requires a laborious trial and error\nprocess. This challenge is commonly identified as a model alignment problem, an\nissue that has attracted considerable attention by the research community.\nInstead of relying on fine-grained linguistic analyses of prompts, human\nannotation, or auxiliary vision-language models to steer image generation, in\nthis work we present a novel method that relies on an information-theoretic\nalignment measure. In a nutshell, our method uses self-supervised fine-tuning\nand relies on point-wise mutual information between prompts and images to\ndefine a synthetic training set to induce model alignment. Our comparative\nanalysis shows that our method is on-par or superior to the state-of-the-art,\nyet requires nothing but a pre-trained denoising network to estimate MI and a\nlightweight fine-tuning strategy.\n","authors":["Chao Wang","Giulio Franzese","Alessandro Finamore","Massimo Gallo","Pietro Michiardi"],"pdf_url":"https://arxiv.org/pdf/2405.20759v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19996v2","updated":"2024-05-31T11:39:33Z","published":"2024-05-30T12:32:35Z","title":"DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in\n the Wild","summary":" Image quality assessment (IQA) plays a critical role in selecting\nhigh-quality images and guiding compression and enhancement methods in a series\nof applications. The blind IQA, which assesses the quality of in-the-wild\nimages containing complex authentic distortions without reference images, poses\ngreater challenges. Existing methods are limited to modeling a uniform\ndistribution with local patches and are bothered by the gap between low and\nhigh-level visions (caused by widely adopted pre-trained classification\nnetworks). In this paper, we propose a novel IQA method called diffusion\npriors-based IQA (DP-IQA), which leverages the prior knowledge from the\npre-trained diffusion model with its excellent powers to bridge semantic gaps\nin the perception of the visual quality of images. Specifically, we use\npre-trained stable diffusion as the backbone, extract multi-level features from\nthe denoising U-Net during the upsampling process at a specified timestep, and\ndecode them to estimate the image quality score. The text and image adapters\nare adopted to mitigate the domain gap for downstream tasks and correct the\ninformation loss caused by the variational autoencoder bottleneck. Finally, we\ndistill the knowledge in the above model into a CNN-based student model,\nsignificantly reducing the parameter to enhance applicability, with the student\nmodel performing similarly or even better than the teacher model surprisingly.\nExperimental results demonstrate that our DP-IQA achieves state-of-the-art\nresults on various in-the-wild datasets with better generalization capability,\nwhich shows the superiority of our method in global modeling and utilizing the\nhierarchical feature clues of diffusion for evaluating image quality.\n","authors":["Honghao Fu","Yufei Wang","Wenhan Yang","Bihan Wen"],"pdf_url":"https://arxiv.org/pdf/2405.19996v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20750v1","updated":"2024-05-31T11:14:12Z","published":"2024-05-31T11:14:12Z","title":"Diffusion Models Are Innate One-Step Generators","summary":" Diffusion Models (DMs) have achieved great success in image generation and\nother fields. By fine sampling through the trajectory defined by the SDE/ODE\nsolver based on a well-trained score model, DMs can generate remarkable\nhigh-quality results. However, this precise sampling often requires multiple\nsteps and is computationally demanding. To address this problem, instance-based\ndistillation methods have been proposed to distill a one-step generator from a\nDM by having a simpler student model mimic a more complex teacher model. Yet,\nour research reveals an inherent limitations in these methods: the teacher\nmodel, with more steps and more parameters, occupies different local minima\ncompared to the student model, leading to suboptimal performance when the\nstudent model attempts to replicate the teacher. To avoid this problem, we\nintroduce a novel distributional distillation method, which uses an exclusive\ndistributional loss. This method exceeds state-of-the-art (SOTA) results while\nrequiring significantly fewer training images. Additionally, we show that DMs'\nlayers are activated differently at different time steps, leading to an\ninherent capability to generate images in a single step. Freezing most of the\nconvolutional layers in a DM during distributional distillation leads to\nfurther performance improvements. Our method achieves the SOTA results on\nCIFAR-10 (FID 1.54), AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and\nImageNet 64x64 (FID 1.16) with great efficiency. Most of those results are\nobtained with only 5 million training images within 6 hours on 8 A100 GPUs.\nThis breakthrough not only enhances the understanding of efficient image\ngeneration models but also offers a scalable framework for advancing the state\nof the art in various applications.\n","authors":["Bowen Zheng","Tianming Yang"],"pdf_url":"https://arxiv.org/pdf/2405.20750v1.pdf","comment":"9 pages, 4 figures and 4 tables on the main contents"},{"id":"http://arxiv.org/abs/2405.14200v2","updated":"2024-05-31T11:09:59Z","published":"2024-05-23T05:58:10Z","title":"Awesome Multi-modal Object Tracking","summary":" Multi-modal object tracking (MMOT) is an emerging field that combines data\nfrom various modalities, \\eg vision (RGB), depth, thermal infrared, event,\nlanguage and audio, to estimate the state of an arbitrary object in a video\nsequence. It is of great significance for many applications such as autonomous\ndriving and intelligent surveillance. In recent years, MMOT has received more\nand more attention. However, existing MMOT algorithms mainly focus on two\nmodalities (\\eg RGB+depth, RGB+thermal infrared, and RGB+language). To leverage\nmore modalities, some recent efforts have been made to learn a unified visual\nobject tracking model for any modality. Additionally, some large-scale\nmulti-modal tracking benchmarks have been established by simultaneously\nproviding more than two modalities, such as vision-language-audio (\\eg\nWebUAV-3M) and vision-depth-language (\\eg UniMod1K). To track the latest\nprogress in MMOT, we conduct a comprehensive investigation in this report.\nSpecifically, we first divide existing MMOT tasks into five main categories,\n\\ie RGBL tracking, RGBE tracking, RGBD tracking, RGBT tracking, and\nmiscellaneous (RGB+X), where X can be any modality, such as language, depth,\nand event. Then, we analyze and summarize each MMOT task, focusing on widely\nused datasets and mainstream tracking algorithms based on their technical\nparadigms (\\eg self-supervised learning, prompt learning, knowledge\ndistillation, generative models, and state space models). Finally, we maintain\na continuously updated paper list for MMOT at\nhttps://github.com/983632847/Awesome-Multimodal-Object-Tracking.\n","authors":["Chunhui Zhang","Li Liu","Hao Wen","Xi Zhou","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2405.14200v2.pdf","comment":"A continuously updated project to track the latest progress in\n multi-modal object tracking"},{"id":"http://arxiv.org/abs/2405.20330v2","updated":"2024-05-31T10:52:56Z","published":"2024-05-30T17:59:02Z","title":"4DHands: Reconstructing Interactive Hands in 4D with Transformers","summary":" In this paper, we introduce 4DHands, a robust approach to recovering\ninteractive hand meshes and their relative movement from monocular inputs. Our\napproach addresses two major limitations of previous methods: lacking a unified\nsolution for handling various hand image inputs and neglecting the positional\nrelationship of two hands within images. To overcome these challenges, we\ndevelop a transformer-based architecture with novel tokenization and feature\nfusion strategies. Specifically, we propose a Relation-aware Two-Hand\nTokenization (RAT) method to embed positional relation information into the\nhand tokens. In this way, our network can handle both single-hand and two-hand\ninputs and explicitly leverage relative hand positions, facilitating the\nreconstruction of intricate hand interactions in real-world scenarios. As such\ntokenization indicates the relative relationship of two hands, it also supports\nmore effective feature fusion. To this end, we further develop a\nSpatio-temporal Interaction Reasoning (SIR) module to fuse hand tokens in 4D\nwith attention and decode them into 3D hand meshes and relative temporal\nmovements. The efficacy of our approach is validated on several benchmark\ndatasets. The results on in-the-wild videos and real-world scenarios\ndemonstrate the superior performances of our approach for interactive hand\nreconstruction. More video results can be found on the project page:\nhttps://4dhands.github.io.\n","authors":["Dixuan Lin","Yuxiang Zhang","Mengcheng Li","Yebin Liu","Wei Jing","Qi Yan","Qianying Wang","Hongwen Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20330v2.pdf","comment":"More demo videos can be seen at our project page:\n https://4dhands.github.io"},{"id":"http://arxiv.org/abs/2405.20743v1","updated":"2024-05-31T10:13:17Z","published":"2024-05-31T10:13:17Z","title":"Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent\n Codes","summary":" Trajectory forecasting is crucial for video surveillance analytics, as it\nenables the anticipation of future movements for a set of agents, e.g.\nbasketball players engaged in intricate interactions with long-term intentions.\nDeep generative models offer a natural learning approach for trajectory\nforecasting, yet they encounter difficulties in achieving an optimal balance\nbetween sampling fidelity and diversity. We address this challenge by\nleveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a\ndiscrete latent space to tackle the issue of posterior collapse. Specifically,\nwe introduce an instance-based codebook that allows tailored latent\nrepresentations for each example. In a nutshell, the rows of the codebook are\ndynamically adjusted to reflect contextual information (i.e., past motion\npatterns extracted from the observed trajectories). In this way, the\ndiscretization process gains flexibility, leading to improved reconstructions.\nNotably, instance-level dynamics are injected into the codebook through\nlow-rank updates, which restrict the customization of the codebook to a lower\ndimension space. The resulting discrete space serves as the basis of the\nsubsequent step, which regards the training of a diffusion-based predictive\nmodel. We show that such a two-fold framework, augmented with instance-level\ndiscretization, leads to accurate and diverse forecasts, yielding\nstate-of-the-art performance on three established benchmarks.\n","authors":["Riccardo Benaglia","Angelo Porrello","Pietro Buzzega","Simone Calderara","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2405.20743v1.pdf","comment":"15 pages, 3 figures, 5 tables"},{"id":"http://arxiv.org/abs/2405.20735v1","updated":"2024-05-31T09:59:11Z","published":"2024-05-31T09:59:11Z","title":"Language Augmentation in CLIP for Improved Anatomy Detection on\n Multi-modal Medical Images","summary":" Vision-language models have emerged as a powerful tool for previously\nchallenging multi-modal classification problem in the medical domain. This\ndevelopment has led to the exploration of automated image description\ngeneration for multi-modal clinical scans, particularly for radiology report\ngeneration. Existing research has focused on clinical descriptions for specific\nmodalities or body regions, leaving a gap for a model providing entire-body\nmulti-modal descriptions. In this paper, we address this gap by automating the\ngeneration of standardized body station(s) and list of organ(s) across the\nwhole body in multi-modal MR and CT radiological images. Leveraging the\nversatility of the Contrastive Language-Image Pre-training (CLIP), we refine\nand augment the existing approach through multiple experiments, including\nbaseline model fine-tuning, adding station(s) as a superset for better\ncorrelation between organs, along with image and language augmentations. Our\nproposed approach demonstrates 47.6% performance improvement over baseline\nPubMedCLIP.\n","authors":["Mansi Kakkar","Dattesh Shanbhag","Chandan Aladahalli","Gurunath Reddy M"],"pdf_url":"https://arxiv.org/pdf/2405.20735v1.pdf","comment":"$\\copyright$ 2024 IEEE. Accepted in 46th Annual International\n Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)\n 2024"},{"id":"http://arxiv.org/abs/2405.20729v1","updated":"2024-05-31T09:37:39Z","published":"2024-05-31T09:37:39Z","title":"Extreme Point Supervised Instance Segmentation","summary":" This paper introduces a novel approach to learning instance segmentation\nusing extreme points, i.e., the topmost, leftmost, bottommost, and rightmost\npoints, of each object. These points are readily available in the modern\nbounding box annotation process while offering strong clues for precise\nsegmentation, and thus allows to improve performance at the same annotation\ncost with box-supervised methods. Our work considers extreme points as a part\nof the true instance mask and propagates them to identify potential foreground\nand background points, which are all together used for training a pseudo label\ngenerator. Then pseudo labels given by the generator are in turn used for\nsupervised learning of our final model. On three public benchmarks, our method\nsignificantly outperforms existing box-supervised methods, further narrowing\nthe gap with its fully supervised counterpart. In particular, our model\ngenerates high-quality masks when a target object is separated into multiple\nparts, where previous box-supervised methods often fail.\n","authors":["Hyeonjun Lee","Sehyun Hwang","Suha Kwak"],"pdf_url":"https://arxiv.org/pdf/2405.20729v1.pdf","comment":"CVPR 2024 Accepted"},{"id":"http://arxiv.org/abs/2405.20091v2","updated":"2024-05-31T09:35:36Z","published":"2024-05-30T14:27:40Z","title":"Visual Attention Analysis in Online Learning","summary":" In this paper, we present an approach in the Multimodal Learning Analytics\nfield. Within this approach, we have developed a tool to visualize and analyze\neye movement data collected during learning sessions in online courses. The\ntool is named VAAD (an acronym for Visual Attention Analysis Dashboard). These\neye movement data have been gathered using an eye-tracker and subsequently\nprocessed and visualized for interpretation. The purpose of the tool is to\nconduct a descriptive analysis of the data by facilitating its visualization,\nenabling the identification of differences and learning patterns among various\nlearner populations. Additionally, it integrates a predictive module capable of\nanticipating learner activities during a learning session. Consequently, VAAD\nholds the potential to offer valuable insights into online learning behaviors\nfrom both descriptive and predictive perspectives.\n","authors":["Miriam Navarro","Álvaro Becerra","Roberto Daza","Ruth Cobos","Aythami Morales","Julian Fierrez"],"pdf_url":"https://arxiv.org/pdf/2405.20091v2.pdf","comment":"Accepted in CEDI 2024 (VII Congreso Espa\\~nol de Inform\\'atica), A\n Coru\\~na, Spain"},{"id":"http://arxiv.org/abs/2405.20725v1","updated":"2024-05-31T09:29:43Z","published":"2024-05-31T09:29:43Z","title":"GI-NAS: Boosting Gradient Inversion Attacks through Adaptive Neural\n Architecture Search","summary":" Gradient Inversion Attacks invert the transmitted gradients in Federated\nLearning (FL) systems to reconstruct the sensitive data of local clients and\nhave raised considerable privacy concerns. A majority of gradient inversion\nmethods rely heavily on explicit prior knowledge (e.g., a well pre-trained\ngenerative model), which is often unavailable in realistic scenarios. To\nalleviate this issue, researchers have proposed to leverage the implicit prior\nknowledge of an over-parameterized network. However, they only utilize a fixed\nneural architecture for all the attack settings. This would hinder the adaptive\nuse of implicit architectural priors and consequently limit the\ngeneralizability. In this paper, we further exploit such implicit prior\nknowledge by proposing Gradient Inversion via Neural Architecture Search\n(GI-NAS), which adaptively searches the network and captures the implicit\npriors behind neural architectures. Extensive experiments verify that our\nproposed GI-NAS can achieve superior attack performance compared to\nstate-of-the-art gradient inversion methods, even under more practical settings\nwith high-resolution images, large-sized batches, and advanced defense\nstrategies.\n","authors":["Wenbo Yu","Hao Fang","Bin Chen","Xiaohang Sui","Chuan Chen","Hao Wu","Shu-Tao Xia","Ke Xu"],"pdf_url":"https://arxiv.org/pdf/2405.20725v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20721v1","updated":"2024-05-31T09:23:39Z","published":"2024-05-31T09:23:39Z","title":"ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model","summary":" Recently, 3D Gaussian Splatting (3DGS) has become a promising framework for\nnovel view synthesis, offering fast rendering speeds and high fidelity.\nHowever, the large number of Gaussians and their associated attributes require\neffective compression techniques. Existing methods primarily compress neural\nGaussians individually and independently, i.e., coding all the neural Gaussians\nat the same time, with little design for their interactions and spatial\ndependence. Inspired by the effectiveness of the context model in image\ncompression, we propose the first autoregressive model at the anchor level for\n3DGS compression in this work. We divide anchors into different levels and the\nanchors that are not coded yet can be predicted based on the already coded ones\nin all the coarser levels, leading to more accurate modeling and higher coding\nefficiency. To further improve the efficiency of entropy coding, e.g., to code\nthe coarsest level with no already coded anchors, we propose to introduce a\nlow-dimensional quantized feature as the hyperprior for each anchor, which can\nbe effectively compressed. Our work pioneers the context model in the anchor\nlevel for 3DGS representation, yielding an impressive size reduction of over\n100 times compared to vanilla 3DGS and 15 times compared to the most recent\nstate-of-the-art work Scaffold-GS, while achieving comparable or even higher\nrendering quality.\n","authors":["Yufei Wang","Zhihao Li","Lanqing Guo","Wenhan Yang","Alex C. Kot","Bihan Wen"],"pdf_url":"https://arxiv.org/pdf/2405.20721v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20720v1","updated":"2024-05-31T09:23:25Z","published":"2024-05-31T09:23:25Z","title":"Power of Cooperative Supervision: Multiple Teachers Framework for\n Enhanced 3D Semi-Supervised Object Detection","summary":" To ensure safe urban driving for autonomous platforms, it is crucial not only\nto develop high-performance object detection techniques but also to establish a\ndiverse and representative dataset that captures various urban environments and\nobject characteristics. To address these two issues, we have constructed a\nmulti-class 3D LiDAR dataset reflecting diverse urban environments and object\ncharacteristics, and developed a robust 3D semi-supervised object detection\n(SSOD) based on a multiple teachers framework. This SSOD framework categorizes\nsimilar classes and assigns specialized teachers to each category. Through\ncollaborative supervision among these category-specialized teachers, the\nstudent network becomes increasingly proficient, leading to a highly effective\nobject detector. We propose a simple yet effective augmentation technique,\nPie-based Point Compensating Augmentation (PieAug), to enable the teacher\nnetwork to generate high-quality pseudo-labels. Extensive experiments on the\nWOD, KITTI, and our datasets validate the effectiveness of our proposed method\nand the quality of our dataset. Experimental results demonstrate that our\napproach consistently outperforms existing state-of-the-art 3D semi-supervised\nobject detection methods across all datasets. We plan to release our\nmulti-class LiDAR dataset and the source code available on our Github\nrepository in the near future.\n","authors":["Jin-Hee Lee","Jae-Keun Lee","Je-Seok Kim","Soon Kwon"],"pdf_url":"https://arxiv.org/pdf/2405.20720v1.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2405.20719v1","updated":"2024-05-31T09:20:33Z","published":"2024-05-31T09:20:33Z","title":"Climate Variable Downscaling with Conditional Normalizing Flows","summary":" Predictions of global climate models typically operate on coarse spatial\nscales due to the large computational costs of climate simulations. This has\nled to a considerable interest in methods for statistical downscaling, a\nsimilar process to super-resolution in the computer vision context, to provide\nmore local and regional climate information. In this work, we apply conditional\nnormalizing flows to the task of climate variable downscaling. We showcase its\nsuccessful performance on an ERA5 water content dataset for different\nupsampling factors. Additionally, we show that the method allows us to assess\nthe predictive uncertainty in terms of standard deviation from the fitted\nconditional distribution mean.\n","authors":["Christina Winkler","Paula Harder","David Rolnick"],"pdf_url":"https://arxiv.org/pdf/2405.20719v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20717v1","updated":"2024-05-31T09:14:36Z","published":"2024-05-31T09:14:36Z","title":"Cyclic image generation using chaotic dynamics","summary":" Successive image generation using cyclic transformations is demonstrated by\nextending the CycleGAN model to transform images among three different\ncategories. Repeated application of the trained generators produces sequences\nof images that transition among the different categories. The generated image\nsequences occupy a more limited region of the image space compared with the\noriginal training dataset. Quantitative evaluation using precision and recall\nmetrics indicates that the generated images have high quality but reduced\ndiversity relative to the training dataset. Such successive generation\nprocesses are characterized as chaotic dynamics in terms of dynamical system\ntheory. Positive Lyapunov exponents estimated from the generated trajectories\nconfirm the presence of chaotic dynamics, with the Lyapunov dimension of the\nattractor found to be comparable to the intrinsic dimension of the training\ndata manifold. The results suggest that chaotic dynamics in the image space\ndefined by the deep generative model contribute to the diversity of the\ngenerated images, constituting a novel approach for multi-class image\ngeneration. This model can be interpreted as an extension of classical\nassociative memory to perform hetero-association among image categories.\n","authors":["Takaya Tanaka","Yutaka Yamaguti"],"pdf_url":"https://arxiv.org/pdf/2405.20717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20711v1","updated":"2024-05-31T09:07:15Z","published":"2024-05-31T09:07:15Z","title":"Revisiting Mutual Information Maximization for Generalized Category\n Discovery","summary":" Generalized category discovery presents a challenge in a realistic scenario,\nwhich requires the model's generalization ability to recognize unlabeled\nsamples from known and unknown categories. This paper revisits the challenge of\ngeneralized category discovery through the lens of information maximization\n(InfoMax) with a probabilistic parametric classifier. Our findings reveal that\nensuring independence between known and unknown classes while concurrently\nassuming a uniform probability distribution across all classes, yields an\nenlarged margin among known and unknown classes that promotes the model's\nperformance. To achieve the aforementioned independence, we propose a novel\nInfoMax-based method, Regularized Parametric InfoMax (RPIM), which adopts\npseudo labels to supervise unlabeled samples during InfoMax, while proposing a\nregularization to ensure the quality of the pseudo labels. Additionally, we\nintroduce novel semantic-bias transformation to refine the features from the\npre-trained model instead of direct fine-tuning to rescue the computational\ncosts. Extensive experiments on six benchmark datasets validate the\neffectiveness of our method. RPIM significantly improves the performance\nregarding unknown classes, surpassing the state-of-the-art method by an average\nmargin of 3.5%.\n","authors":["Zhaorui Tan","Chengrui Zhang","Xi Yang","Jie Sun","Kaizhu Huang"],"pdf_url":"https://arxiv.org/pdf/2405.20711v1.pdf","comment":"Preprint version"},{"id":"http://arxiv.org/abs/2405.11129v2","updated":"2024-05-31T08:56:29Z","published":"2024-05-18T00:47:29Z","title":"MotionGS : Compact Gaussian Splatting SLAM by Motion Filter","summary":" With their high-fidelity scene representation capability, the attention of\nSLAM field is deeply attracted by the Neural Radiation Field (NeRF) and 3D\nGaussian Splatting (3DGS). Recently, there has been a surge in NeRF-based SLAM,\nwhile 3DGS-based SLAM is sparse. A novel 3DGS-based SLAM approach with a fusion\nof deep visual feature, dual keyframe selection and 3DGS is presented in this\npaper. Compared with the existing methods, the proposed tracking is achieved by\nfeature extraction and motion filter on each frame. The joint optimization of\nposes and 3D Gaussians runs through the entire mapping process. Additionally,\nthe coarse-to-fine pose estimation and compact Gaussian scene representation\nare implemented by dual keyframe selection and novel loss functions.\nExperimental results demonstrate that the proposed algorithm not only\noutperforms the existing methods in tracking and mapping, but also has less\nmemory usage.\n","authors":["Xinli Guo","Weidong Zhang","Ruonan Liu","Peng Han","Hongtian Chen"],"pdf_url":"https://arxiv.org/pdf/2405.11129v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2103.03636v2","updated":"2024-05-31T08:50:18Z","published":"2021-03-05T12:44:22Z","title":"CoDeGAN: Contrastive Disentanglement for Generative Adversarial Network","summary":" Disentanglement, a critical concern in interpretable machine learning, has\nalso garnered significant attention from the computer vision community. Many\nexisting GAN-based class disentanglement (unsupervised) approaches, such as\nInfoGAN and its variants, primarily aim to maximize the mutual information (MI)\nbetween the generated image and its latent codes. However, this focus may lead\nto a tendency for the network to generate highly similar images when presented\nwith the same latent class factor, potentially resulting in mode collapse or\nmode dropping. To alleviate this problem, we propose \\texttt{CoDeGAN}\n(Contrastive Disentanglement for Generative Adversarial Networks), where we\nrelax similarity constraints for disentanglement from the image domain to the\nfeature domain. This modification not only enhances the stability of GAN\ntraining but also improves their disentangling capabilities. Moreover, we\nintegrate self-supervised pre-training into CoDeGAN to learn semantic\nrepresentations, significantly facilitating unsupervised disentanglement.\nExtensive experimental results demonstrate the superiority of our method over\nstate-of-the-art approaches across multiple benchmarks. The code is available\nat https://github.com/learninginvision/CoDeGAN.\n","authors":["Jiangwei Zhao","Zejia Liu","Xiaohan Guo","Lili Pan"],"pdf_url":"https://arxiv.org/pdf/2103.03636v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.08291v3","updated":"2024-05-31T08:48:46Z","published":"2023-12-13T17:08:38Z","title":"VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent\n Space","summary":" Previous works on Human Pose and Shape Estimation (HPSE) from RGB images can\nbe broadly categorized into two main groups: parametric and non-parametric\napproaches. Parametric techniques leverage a low-dimensional statistical body\nmodel for realistic results, whereas recent non-parametric methods achieve\nhigher precision by directly regressing the 3D coordinates of the human body\nmesh. This work introduces a novel paradigm to address the HPSE problem,\ninvolving a low-dimensional discrete latent representation of the human mesh\nand framing HPSE as a classification task. Instead of predicting body model\nparameters or 3D vertex coordinates, we focus on predicting the proposed\ndiscrete latent representation, which can be decoded into a registered human\nmesh. This innovative paradigm offers two key advantages. Firstly, predicting a\nlow-dimensional discrete representation confines our predictions to the space\nof anthropomorphic poses and shapes even when little training data is\navailable. Secondly, by framing the problem as a classification task, we can\nharness the discriminative power inherent in neural networks. The proposed\nmodel, VQ-HPS, predicts the discrete latent representation of the mesh. The\nexperimental results demonstrate that VQ-HPS outperforms the current\nstate-of-the-art non-parametric approaches while yielding results as realistic\nas those produced by parametric methods when trained with little data. VQ-HPS\nalso shows promising results when training on large-scale datasets,\nhighlighting the significant potential of the classification approach for HPSE.\nSee the project page at https://g-fiche.github.io/research-pages/vqhps/\n","authors":["Guénolé Fiche","Simon Leglaive","Xavier Alameda-Pineda","Antonio Agudo","Francesc Moreno-Noguer"],"pdf_url":"https://arxiv.org/pdf/2312.08291v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20693v1","updated":"2024-05-31T08:39:02Z","published":"2024-05-31T08:39:02Z","title":"R$^2$-Gaussian: Rectifying Radiative Gaussian Splatting for Tomographic\n Reconstruction","summary":" 3D Gaussian splatting (3DGS) has shown promising results in image rendering\nand surface reconstruction. However, its potential in volumetric reconstruction\ntasks, such as X-ray computed tomography, remains under-explored. This paper\nintroduces R2-Gaussian, the first 3DGS-based framework for sparse-view\ntomographic reconstruction. By carefully deriving X-ray rasterization\nfunctions, we discover a previously unknown integration bias in the standard\n3DGS formulation, which hampers accurate volume retrieval. To address this\nissue, we propose a novel rectification technique via refactoring the\nprojection from 3D to 2D Gaussians. Our new method presents three key\ninnovations: (1) introducing tailored Gaussian kernels, (2) extending\nrasterization to X-ray imaging, and (3) developing a CUDA-based differentiable\nvoxelizer. Extensive experiments demonstrate that our method outperforms\nstate-of-the-art approaches by 0.93 dB in PSNR and 0.014 in SSIM. Crucially, it\ndelivers high-quality results in 3 minutes, which is 12x faster than NeRF-based\nmethods and on par with traditional algorithms. The superior performance and\nrapid convergence of our method highlight its practical value.\n","authors":["Ruyi Zha","Tao Jun Lin","Yuanhao Cai","Jiwen Cao","Yanhao Zhang","Hongdong Li"],"pdf_url":"https://arxiv.org/pdf/2405.20693v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20687v1","updated":"2024-05-31T08:31:26Z","published":"2024-05-31T08:31:26Z","title":"Conditioning GAN Without Training Dataset","summary":" Deep learning algorithms have a large number of trainable parameters often\nwith sizes of hundreds of thousands or more. Training this algorithm requires a\nlarge amount of training data and generating a sufficiently large dataset for\nthese algorithms is costly\\cite{noguchi2019image}.\n GANs are generative neural networks that use two deep learning networks that\nare competing with each other. The networks are generator and discriminator\nnetworks. The generator tries to generate realistic images which resemble the\nactual training dataset by approximating the training data distribution and the\ndiscriminator is trained to classify images as real or\nfake(generated)\\cite{goodfellow2016nips}. Training these GAN algorithms also\nrequires a large amount of training dataset\\cite{noguchi2019image}.\n In this study, the aim is to address the question, \"Given an unconditioned\npretrained generator network and a pretrained classifier, is it feasible to\ndevelop a conditioned generator without relying on any training dataset?\"\n The paper begins with a general introduction to the problem. The subsequent\nsections are structured as follows: Section 2 provides background information\non the problem. Section 3 reviews relevant literature on the topic. Section 4\noutlines the methodology employed in this study. Section 5 presents the\nexperimental results. Section 6 discusses the findings and proposes potential\nfuture research directions. Finally, Section 7 offers concluding remarks.\n The implementation can be accessed\n\\href{https://github.com/kidist-amde/BigGAN-PyTorch}{here}.\n","authors":["Kidist Amde Mekonnen"],"pdf_url":"https://arxiv.org/pdf/2405.20687v1.pdf","comment":"5 pages, 2 figures, Part of my MSc project course, School Project\n Course 2022"},{"id":"http://arxiv.org/abs/2405.20685v1","updated":"2024-05-31T08:26:53Z","published":"2024-05-31T08:26:53Z","title":"Enhancing Counterfactual Image Generation Using Mahalanobis Distance\n with Distribution Preferences in Feature Space","summary":" In the realm of Artificial Intelligence (AI), the importance of Explainable\nArtificial Intelligence (XAI) is increasingly recognized, particularly as AI\nmodels become more integral to our lives. One notable single-instance XAI\napproach is counterfactual explanation, which aids users in comprehending a\nmodel's decisions and offers guidance on altering these decisions. Specifically\nin the context of image classification models, effective image counterfactual\nexplanations can significantly enhance user understanding. This paper\nintroduces a novel method for computing feature importance within the feature\nspace of a black-box model. By employing information fusion techniques, our\nmethod maximizes the use of data to address feature counterfactual explanations\nin the feature space. Subsequently, we utilize an image generation model to\ntransform these feature counterfactual explanations into image counterfactual\nexplanations. Our experiments demonstrate that the counterfactual explanations\ngenerated by our method closely resemble the original images in both pixel and\nfeature spaces. Additionally, our method outperforms established baselines,\nachieving impressive experimental results.\n","authors":["Yukai Zhang","Ao Xu","Zihao Li","Tieru Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16539v3","updated":"2024-05-31T08:24:43Z","published":"2024-03-25T08:31:14Z","title":"Data-Efficient 3D Visual Grounding via Order-Aware Referring","summary":" 3D visual grounding aims to identify the target object within a 3D point\ncloud scene referred to by a natural language description. Previous works\nusually require significant data relating to point color and their descriptions\nto exploit the corresponding complicated verbo-visual relations. In our work,\nwe introduce Vigor, a novel Data-Efficient 3D Visual Grounding framework via\nOrder-aware Referring. Vigor leverages LLM to produce a desirable referential\norder from the input description for 3D visual grounding. With the proposed\nstacked object-referring blocks, the predicted anchor objects in the above\norder allow one to locate the target object progressively without supervision\non the identities of anchor objects or exact relations between anchor/target\nobjects. In addition, we present an order-aware warm-up training strategy,\nwhich augments referential orders for pre-training the visual grounding\nframework. This allows us to better capture the complex verbo-visual relations\nand benefit the desirable data-efficient learning scheme. Experimental results\non the NR3D and ScanRefer datasets demonstrate our superiority in low-resource\nscenarios. In particular, Vigor surpasses current state-of-the-art frameworks\nby 9.3% and 7.6% grounding accuracy under 1% data and 10% data settings on the\nNR3D dataset, respectively.\n","authors":["Tung-Yu Wu","Sheng-Yu Huang","Yu-Chiang Frank Wang"],"pdf_url":"https://arxiv.org/pdf/2403.16539v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20675v1","updated":"2024-05-31T08:19:44Z","published":"2024-05-31T08:19:44Z","title":"Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling","summary":" Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of\ndeep generative models, achieving remarkable performance in image synthesis\ntasks. However, these models face challenges in terms of widespread adoption\ndue to their reliance on sequential denoising steps during sample generation.\nThis dependence leads to substantial computational requirements, making them\nunsuitable for resource-constrained or real-time processing systems. To address\nthese challenges, we propose a novel method that integrates denoising phases\ndirectly into the model's architecture, thereby reducing the need for\nresource-intensive computations. Our approach combines diffusion models with\ngenerative adversarial networks (GANs) through knowledge distillation, enabling\nmore efficient training and evaluation. By utilizing a pre-trained diffusion\nmodel as a teacher model, we train a student model through adversarial\nlearning, employing layerwise transformations for denoising and submodules for\npredicting the teacher model's output at various points in time. This\nintegration significantly reduces the number of parameters and denoising steps\nrequired, leading to improved sampling speed at test time. We validate our\nmethod with extensive experiments, demonstrating comparable performance with\nreduced computational requirements compared to existing approaches. By enabling\nthe deployment of diffusion models on resource-constrained devices, our\nresearch mitigates their computational burden and paves the way for wider\naccessibility and practical use across the research community and end-users.\n Our code is publicly available at https://github.com/kidist-amde/Adv-KD\n","authors":["Kidist Amde Mekonnen","Nicola Dall'Asen","Paolo Rota"],"pdf_url":"https://arxiv.org/pdf/2405.20675v1.pdf","comment":"7 pages, 11 figures, ELLIS Doctoral Symposium 2023 in Helsinki,\n Finland"},{"id":"http://arxiv.org/abs/2402.17502v2","updated":"2024-05-31T08:19:08Z","published":"2024-02-27T13:41:32Z","title":"FedLPPA: Learning Personalized Prompt and Aggregation for Federated\n Weakly-supervised Medical Image Segmentation","summary":" Federated learning (FL) effectively mitigates the data silo challenge brought\nabout by policies and privacy concerns, implicitly harnessing more data for\ndeep model training. However, traditional centralized FL models grapple with\ndiverse multi-center data, especially in the face of significant data\nheterogeneity, notably in medical contexts. In the realm of medical image\nsegmentation, the growing imperative to curtail annotation costs has amplified\nthe importance of weakly-supervised techniques which utilize sparse annotations\nsuch as points, scribbles, etc. A pragmatic FL paradigm shall accommodate\ndiverse annotation formats across different sites, which research topic remains\nunder-investigated. In such context, we propose a novel personalized FL\nframework with learnable prompt and aggregation (FedLPPA) to uniformly leverage\nheterogeneous weak supervision for medical image segmentation. In FedLPPA, a\nlearnable universal knowledge prompt is maintained, complemented by multiple\nlearnable personalized data distribution prompts and prompts representing the\nsupervision sparsity. Integrated with sample features through a dual-attention\nmechanism, those prompts empower each local task decoder to adeptly adjust to\nboth the local distribution and the supervision form. Concurrently, a\ndual-decoder strategy, predicated on prompt similarity, is introduced for\nenhancing the generation of pseudo-labels in weakly-supervised learning,\nalleviating overfitting and noise accumulation inherent to local data, while an\nadaptable aggregation method is employed to customize the task decoder on a\nparameter-wise basis. Extensive experiments on four distinct medical image\nsegmentation tasks involving different modalities underscore the superiority of\nFedLPPA, with its efficacy closely parallels that of fully supervised\ncentralized training. Our code and data will be available.\n","authors":["Li Lin","Yixiang Liu","Jiewei Wu","Pujin Cheng","Zhiyuan Cai","Kenneth K. Y. Wong","Xiaoying Tang"],"pdf_url":"https://arxiv.org/pdf/2402.17502v2.pdf","comment":"12 pages, 10 figures"},{"id":"http://arxiv.org/abs/2405.20674v1","updated":"2024-05-31T08:18:39Z","published":"2024-05-31T08:18:39Z","title":"4Diffusion: Multi-view Video Diffusion Model for 4D Generation","summary":" Current 4D generation methods have achieved noteworthy efficacy with the aid\nof advanced diffusion generative models. However, these methods lack multi-view\nspatial-temporal modeling and encounter challenges in integrating diverse prior\nknowledge from multiple diffusion models, resulting in inconsistent temporal\nappearance and flickers. In this paper, we propose a novel 4D generation\npipeline, namely 4Diffusion aimed at generating spatial-temporally consistent\n4D content from a monocular video. We first design a unified diffusion model\ntailored for multi-view video generation by incorporating a learnable motion\nmodule into a frozen 3D-aware diffusion model to capture multi-view\nspatial-temporal correlations. After training on a curated dataset, our\ndiffusion model acquires reasonable temporal consistency and inherently\npreserves the generalizability and spatial consistency of the 3D-aware\ndiffusion model. Subsequently, we propose 4D-aware Score Distillation Sampling\nloss, which is based on our multi-view video diffusion model, to optimize 4D\nrepresentation parameterized by dynamic NeRF. This aims to eliminate\ndiscrepancies arising from multiple diffusion models, allowing for generating\nspatial-temporally consistent 4D content. Moreover, we devise an anchor loss to\nenhance the appearance details and facilitate the learning of dynamic NeRF.\nExtensive qualitative and quantitative experiments demonstrate that our method\nachieves superior performance compared to previous methods.\n","authors":["Haiyu Zhang","Xinyuan Chen","Yaohui Wang","Xihui Liu","Yunhong Wang","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2405.20674v1.pdf","comment":"Project Page: https://aejion.github.io/4diffusion/"},{"id":"http://arxiv.org/abs/2405.20672v1","updated":"2024-05-31T08:14:44Z","published":"2024-05-31T08:14:44Z","title":"Investigating and unmasking feature-level vulnerabilities of CNNs to\n adversarial perturbations","summary":" This study explores the impact of adversarial perturbations on Convolutional\nNeural Networks (CNNs) with the aim of enhancing the understanding of their\nunderlying mechanisms. Despite numerous defense methods proposed in the\nliterature, there is still an incomplete understanding of this phenomenon.\nInstead of treating the entire model as vulnerable, we propose that specific\nfeature maps learned during training contribute to the overall vulnerability.\nTo investigate how the hidden representations learned by a CNN affect its\nvulnerability, we introduce the Adversarial Intervention framework. Experiments\nwere conducted on models trained on three well-known computer vision datasets,\nsubjecting them to attacks of different nature. Our focus centers on the\neffects that adversarial perturbations to a model's initial layer have on the\noverall behavior of the model. Empirical results revealed compelling insights:\na) perturbing selected channel combinations in shallow layers causes\nsignificant disruptions; b) the channel combinations most responsible for the\ndisruptions are common among different types of attacks; c) despite shared\nvulnerable combinations of channels, different attacks affect hidden\nrepresentations with varying magnitudes; d) there exists a positive correlation\nbetween a kernel's magnitude and its vulnerability. In conclusion, this work\nintroduces a novel framework to study the vulnerability of a CNN model to\nadversarial perturbations, revealing insights that contribute to a deeper\nunderstanding of the phenomenon. The identified properties pave the way for the\ndevelopment of efficient ad-hoc defense mechanisms in future applications.\n","authors":["Davide Coppola","Hwee Kuan Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20672v1.pdf","comment":"22 pages, 15 figures (including appendix)"},{"id":"http://arxiv.org/abs/2405.19732v2","updated":"2024-05-31T08:13:34Z","published":"2024-05-30T06:24:14Z","title":"Two Optimizers Are Better Than One: LLM Catalyst for Enhancing\n Gradient-Based Optimization","summary":" Learning a skill generally relies on both practical experience by doer and\ninsightful high-level guidance by instructor. Will this strategy also work well\nfor solving complex non-convex optimization problems? Here, a common\ngradient-based optimizer acts like a disciplined doer, making locally optimal\nupdate at each step. Recent methods utilize large language models (LLMs) to\noptimize solutions for concrete problems by inferring from natural language\ninstructions, akin to a high-level instructor. In this paper, we show that\nthese two optimizers are complementary to each other, suggesting a\ncollaborative optimization approach. The gradient-based optimizer and LLM-based\noptimizer are combined in an interleaved manner. We instruct LLMs using task\ndescriptions and timely optimization trajectories recorded during\ngradient-based optimization. Inferred results from LLMs are used as restarting\npoints for the next stage of gradient optimization. By leveraging both the\nlocally rigorous gradient-based optimizer and the high-level deductive\nLLM-based optimizer, our combined optimization method consistently yields\nimprovements over competitive baseline prompt tuning methods. Our results\ndemonstrate the synergistic effect of conventional gradient-based optimization\nand the inference ability of LLMs. The code is released at\nhttps://github.com/guozix/LLM-catalyst.\n","authors":["Zixian Guo","Ming Liu","Zhilong Ji","Jinfeng Bai","Yiwen Guo","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2405.19732v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20669v1","updated":"2024-05-31T08:11:25Z","published":"2024-05-31T08:11:25Z","title":"Fourier123: One Image to High-Quality 3D Object Generation with Hybrid\n Fourier Score Distillation","summary":" Single image-to-3D generation is pivotal for crafting controllable 3D assets.\nGiven its underconstrained nature, we leverage geometric priors from a 3D novel\nview generation diffusion model and appearance priors from a 2D image\ngeneration method to guide the optimization process. We note that a disparity\nexists between the training datasets of 2D and 3D diffusion models, leading to\ntheir outputs showing marked differences in appearance. Specifically, 2D models\ntend to deliver more detailed visuals, whereas 3D models produce consistent yet\nover-smooth results across different views. Hence, we optimize a set of 3D\nGaussians using 3D priors in spatial domain to ensure geometric consistency,\nwhile exploiting 2D priors in the frequency domain through Fourier transform\nfor higher visual quality. This 2D-3D hybrid Fourier Score Distillation\nobjective function (dubbed hy-FSD), can be integrated into existing 3D\ngeneration methods, yielding significant performance improvements. With this\ntechnique, we further develop an image-to-3D generation pipeline to create\nhigh-quality 3D objects within one minute, named Fourier123. Extensive\nexperiments demonstrate that Fourier123 excels in efficient generation with\nrapid convergence speed and visual-friendly generation results.\n","authors":["Shuzhou Yang","Yu Wang","Haijie Li","Jiarui Meng","Xiandong Meng","Jian Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20669v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20067v2","updated":"2024-05-31T08:11:24Z","published":"2024-05-30T13:56:58Z","title":"N-Dimensional Gaussians for Fitting of High Dimensional Functions","summary":" In the wake of many new ML-inspired approaches for reconstructing and\nrepresenting high-quality 3D content, recent hybrid and explicitly learned\nrepresentations exhibit promising performance and quality characteristics.\nHowever, their scaling to higher dimensions is challenging, e.g. when\naccounting for dynamic content with respect to additional parameters such as\nmaterial properties, illumination, or time. In this paper, we tackle these\nchallenges for an explicit representations based on Gaussian mixture models.\nWith our solutions, we arrive at efficient fitting of compact N-dimensional\nGaussian mixtures and enable efficient evaluation at render time: For fast\nfitting and evaluation, we introduce a high-dimensional culling scheme that\nefficiently bounds N-D Gaussians, inspired by Locality Sensitive Hashing. For\nadaptive refinement yet compact representation, we introduce a loss-adaptive\ndensity control scheme that incrementally guides the use of additional capacity\ntowards missing details. With these tools we can for the first time represent\ncomplex appearance that depends on many input dimensions beyond position or\nviewing angle within a compact, explicit representation optimized in minutes\nand rendered in milliseconds.\n","authors":["Stavros Diolatzis","Tobias Zirr","Alexandr Kuznetsov","Georgios Kopanas","Anton Kaplanyan"],"pdf_url":"https://arxiv.org/pdf/2405.20067v2.pdf","comment":"https://www.sdiolatz.info/ndg-fitting/"},{"id":"http://arxiv.org/abs/2405.15465v2","updated":"2024-05-31T08:08:23Z","published":"2024-05-24T11:40:22Z","title":"Scale-Invariant Feature Disentanglement via Adversarial Learning for\n UAV-based Object Detection","summary":" Detecting objects from Unmanned Aerial Vehicles (UAV) is often hindered by a\nlarge number of small objects, resulting in low detection accuracy. To address\nthis issue, mainstream approaches typically utilize multi-stage inferences.\nDespite their remarkable detecting accuracies, real-time efficiency is\nsacrificed, making them less practical to handle real applications. To this\nend, we propose to improve the single-stage inference accuracy through learning\nscale-invariant features. Specifically, a Scale-Invariant Feature Disentangling\nmodule is designed to disentangle scale-related and scale-invariant features.\nThen an Adversarial Feature Learning scheme is employed to enhance\ndisentanglement. Finally, scale-invariant features are leveraged for robust\nUAV-based object detection. Furthermore, we construct a multi-modal UAV object\ndetection dataset, State-Air, which incorporates annotated UAV state\nparameters. We apply our approach to three state-of-the-art lightweight\ndetection frameworks on three benchmark datasets, including State-Air.\nExtensive experiments demonstrate that our approach can effectively improve\nmodel accuracy. Our code and dataset are provided in Supplementary Materials\nand will be publicly available once the paper is accepted.\n","authors":["Fan Liu","Liang Yao","Chuanyi Zhang","Ting Wu","Xinlei Zhang","Xiruo Jiang","Jun Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.15465v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20666v1","updated":"2024-05-31T08:06:05Z","published":"2024-05-31T08:06:05Z","title":"MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign\n Language Recognition","summary":" Sign language recognition (SLR) has long been plagued by insufficient model\nrepresentation capabilities. Although current pre-training approaches have\nalleviated this dilemma to some extent and yielded promising performance by\nemploying various pretext tasks on sign pose data, these methods still suffer\nfrom two primary limitations: 1) Explicit motion information is usually\ndisregarded in previous pretext tasks, leading to partial information loss and\nlimited representation capability. 2) Previous methods focus on the local\ncontext of a sign pose sequence, without incorporating the guidance of the\nglobal meaning of lexical signs. To this end, we propose a Motion-Aware masked\nautoencoder with Semantic Alignment (MASA) that integrates rich motion cues and\nglobal semantic information in a self-supervised learning paradigm for SLR. Our\nframework contains two crucial components, i.e., a motion-aware masked\nautoencoder (MA) and a momentum semantic alignment module (SA). Specifically,\nin MA, we introduce an autoencoder architecture with a motion-aware masked\nstrategy to reconstruct motion residuals of masked frames, thereby explicitly\nexploring dynamic motion cues among sign pose sequences. Moreover, in SA, we\nembed our framework with global semantic awareness by aligning the embeddings\nof different augmented samples from the input sequence in the shared latent\nspace. In this way, our framework can simultaneously learn local motion cues\nand global semantic features for comprehensive sign language representation.\nFurthermore, we conduct extensive experiments to validate the effectiveness of\nour method, achieving new state-of-the-art performance on four public\nbenchmarks.\n","authors":["Weichao Zhao","Hezhen Hu","Wengang Zhou","Yunyao Mao","Min Wang","Houqiang Li"],"pdf_url":"https://arxiv.org/pdf/2405.20666v1.pdf","comment":"Accepted by TCSVT 2024"},{"id":"http://arxiv.org/abs/2405.19092v3","updated":"2024-05-31T07:56:37Z","published":"2024-05-29T13:54:12Z","title":"Benchmarking and Improving Detail Image Caption","summary":" Image captioning has long been regarded as a fundamental task in visual\nunderstanding. Recently, however, few large vision-language model (LVLM)\nresearch discusses model's image captioning performance because of the outdated\nshort-caption benchmarks and unreliable evaluation metrics. In this work, we\npropose to benchmark detail image caption task by curating high-quality\nevaluation datasets annotated by human experts, GPT-4V and Gemini-1.5-Pro. We\nalso design a more reliable caption evaluation metric called CAPTURE (CAPtion\nevaluation by exTracting and coUpling coRE information). CAPTURE extracts\nvisual elements, e.g., objects, attributes and relations from captions, and\nthen matches these elements through three stages, achieving the highest\nconsistency with expert judgements over other rule-based or model-based caption\nmetrics. The proposed benchmark and metric provide reliable evaluation for\nLVLM's detailed image captioning ability. Guided by this evaluation, we further\nexplore to unleash LVLM's detail caption capabilities by synthesizing\nhigh-quality data through a five-stage data construction pipeline. Our pipeline\nonly uses a given LVLM itself and other open-source tools, without any human or\nGPT-4V annotation in the loop. Experiments show that the proposed data\nconstruction strategy significantly improves model-generated detail caption\ndata quality for LVLMs with leading performance, and the data quality can be\nfurther improved in a self-looping paradigm. All code and dataset will be\npublicly available at https://github.com/foundation-multimodal-models/CAPTURE.\n","authors":["Hongyuan Dong","Jiawen Li","Bohong Wu","Jiacong Wang","Yuan Zhang","Haoyuan Guo"],"pdf_url":"https://arxiv.org/pdf/2405.19092v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19620v2","updated":"2024-05-31T07:40:55Z","published":"2024-05-30T02:13:56Z","title":"SparseDrive: End-to-End Autonomous Driving via Sparse Scene\n Representation","summary":" The well-established modular autonomous driving system is decoupled into\ndifferent standalone tasks, e.g. perception, prediction and planning, suffering\nfrom information loss and error accumulation across modules. In contrast,\nend-to-end paradigms unify multi-tasks into a fully differentiable framework,\nallowing for optimization in a planning-oriented spirit. Despite the great\npotential of end-to-end paradigms, both the performance and efficiency of\nexisting methods are not satisfactory, particularly in terms of planning\nsafety. We attribute this to the computationally expensive BEV (bird's eye\nview) features and the straightforward design for prediction and planning. To\nthis end, we explore the sparse representation and review the task design for\nend-to-end autonomous driving, proposing a new paradigm named SparseDrive.\nConcretely, SparseDrive consists of a symmetric sparse perception module and a\nparallel motion planner. The sparse perception module unifies detection,\ntracking and online mapping with a symmetric model architecture, learning a\nfully sparse representation of the driving scene. For motion prediction and\nplanning, we review the great similarity between these two tasks, leading to a\nparallel design for motion planner. Based on this parallel design, which models\nplanning as a multi-modal problem, we propose a hierarchical planning selection\nstrategy , which incorporates a collision-aware rescore module, to select a\nrational and safe trajectory as the final planning output. With such effective\ndesigns, SparseDrive surpasses previous state-of-the-arts by a large margin in\nperformance of all tasks, while achieving much higher training and inference\nefficiency. Code will be avaliable at https://github.com/swc-17/SparseDrive for\nfacilitating future research.\n","authors":["Wenchao Sun","Xuewu Lin","Yining Shi","Chuang Zhang","Haoran Wu","Sifa Zheng"],"pdf_url":"https://arxiv.org/pdf/2405.19620v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20081v2","updated":"2024-05-31T07:40:04Z","published":"2024-05-30T14:11:27Z","title":"NoiseBoost: Alleviating Hallucination with Noise Perturbation for\n Multimodal Large Language Models","summary":" Multimodal large language models (MLLMs) contribute a powerful mechanism to\nunderstanding visual information building on large language models. However,\nMLLMs are notorious for suffering from hallucinations, especially when\ngenerating lengthy, detailed descriptions for images. Our analysis reveals that\nhallucinations stem from the inherent summarization mechanism of large language\nmodels, leading to excessive dependence on linguistic tokens while neglecting\nvision information. In this paper, we propose NoiseBoost, a broadly applicable\nand simple method for alleviating hallucinations for MLLMs through the\nintegration of noise feature perturbations. Noise perturbation acts as a\nregularizer, facilitating a balanced distribution of attention weights among\nvisual and linguistic tokens. Despite its simplicity, NoiseBoost consistently\nenhances the performance of MLLMs across common training strategies, including\nsupervised fine-tuning and reinforcement learning. Further, NoiseBoost\npioneerly enables semi-supervised learning for MLLMs, unleashing the power of\nunlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves\ndense caption accuracy by 8.1% with human evaluation and achieves comparable\nresults with 50% of the data by mining unlabeled data. Code and models are\navailable at https://kaiwu5.github.io/noiseboost.\n","authors":["Kai Wu","Boyuan Jiang","Zhengkai Jiang","Qingdong He","Donghao Luo","Shengzhi Wang","Qingwen Liu","Chengjie Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20081v2.pdf","comment":"14 pages, 5 figures with supplementary material"},{"id":"http://arxiv.org/abs/2405.20650v1","updated":"2024-05-31T07:32:31Z","published":"2024-05-31T07:32:31Z","title":"GenMix: Combining Generative and Mixture Data Augmentation for Medical\n Image Classification","summary":" In this paper, we propose a novel data augmentation technique called GenMix,\nwhich combines generative and mixture approaches to leverage the strengths of\nboth methods. While generative models excel at creating new data patterns, they\nface challenges such as mode collapse in GANs and difficulties in training\ndiffusion models, especially with limited medical imaging data. On the other\nhand, mixture models enhance class boundary regions but tend to favor the major\nclass in scenarios with class imbalance. To address these limitations, GenMix\nintegrates both approaches to complement each other. GenMix operates in two\nstages: (1) training a generative model to produce synthetic images, and (2)\nperforming mixup between synthetic and real data. This process improves the\nquality and diversity of synthetic data while simultaneously benefiting from\nthe new pattern learning of generative models and the boundary enhancement of\nmixture models. We validate the effectiveness of our method on the task of\nclassifying focal liver lesions (FLLs) in CT images. Our results demonstrate\nthat GenMix enhances the performance of various generative models, including\nDCGAN, StyleGAN, Textual Inversion, and Diffusion Models. Notably, the proposed\nmethod with Textual Inversion outperforms other methods without fine-tuning\ndiffusion model on the FLL dataset.\n","authors":["Hansang Lee","Haeil Lee","Helen Hong"],"pdf_url":"https://arxiv.org/pdf/2405.20650v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20648v1","updated":"2024-05-31T07:30:24Z","published":"2024-05-31T07:30:24Z","title":"Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision\n Models For Video Captioning and Summarization","summary":" Video is an increasingly prominent and information-dense medium, yet it poses\nsubstantial challenges for language models. A typical video consists of a\nsequence of shorter segments, or shots, that collectively form a coherent\nnarrative. Each shot is analogous to a word in a sentence where multiple data\nstreams of information (such as visual and auditory data) must be processed\nsimultaneously. Comprehension of the entire video requires not only\nunderstanding the visual-audio information of each shot but also requires that\nthe model links the ideas between each shot to generate a larger,\nall-encompassing story. Despite significant progress in the field, current\nworks often overlook videos' more granular shot-by-shot semantic information.\nIn this project, we propose a family of efficient large language vision models\n(LLVMs) to boost video summarization and captioning called Shotluck Holmes. By\nleveraging better pretraining and data collection strategies, we extend the\nabilities of existing small LLVMs from being able to understand a picture to\nbeing able to understand a sequence of frames. Specifically, we show that\nShotluck Holmes achieves better performance than state-of-the-art results on\nthe Shot2Story video captioning and summary task with significantly smaller and\nmore computationally efficient models.\n","authors":["Richard Luo","Austin Peng","Adithya Vasudev","Rishabh Jain"],"pdf_url":"https://arxiv.org/pdf/2405.20648v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08970v3","updated":"2024-05-31T07:29:20Z","published":"2023-06-15T09:05:36Z","title":"An Efficient and Multi-private Key Secure Aggregation for Federated\n Learning","summary":" With the emergence of privacy leaks in federated learning, secure aggregation\nprotocols that mainly adopt either homomorphic encryption or threshold secret\nsharing have been widely developed for federated learning to protect the\nprivacy of the local training data of each client. However, these existing\nprotocols suffer from many shortcomings, such as the dependence on a trusted\nthird party, the vulnerability to clients being corrupted, low efficiency, the\ntrade-off between security and fault tolerance, etc. To solve these\ndisadvantages, we propose an efficient and multi-private key secure aggregation\nscheme for federated learning. Specifically, we skillfully modify the variant\nElGamal encryption technique to achieve homomorphic addition operation, which\nhas two important advantages: 1) The server and each client can freely select\npublic and private keys without introducing a trust third party and 2) Compared\nto the variant ElGamal encryption, the plaintext space is relatively large,\nwhich is more suitable for the deep model. Besides, for the high dimensional\ndeep model parameter, we introduce a super-increasing sequence to compress\nmulti-dimensional data into 1-D, which can greatly reduce encryption and\ndecryption times as well as communication for ciphertext transmission. Detailed\nsecurity analyses show that our proposed scheme achieves the semantic security\nof both individual local gradients and the aggregated result while achieving\noptimal robustness in tolerating both client collusion and dropped clients.\nExtensive simulations demonstrate that the accuracy of our scheme is almost the\nsame as the non-private approach, while the efficiency of our scheme is much\nbetter than the state-of-the-art homomorphic encryption-based secure\naggregation schemes. More importantly, the efficiency advantages of our scheme\nwill become increasingly prominent as the number of model parameters increases.\n","authors":["Xue Yang","Zifeng Liu","Xiaohu Tang","Rongxing Lu","Bo Liu"],"pdf_url":"https://arxiv.org/pdf/2306.08970v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.11190v2","updated":"2024-05-31T07:24:55Z","published":"2024-05-18T06:03:42Z","title":"ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing","summary":" Instruction-based image editing focuses on equipping a generative model with\nthe capacity to adhere to human-written instructions for editing images.\nCurrent approaches typically comprehend explicit and specific instructions.\nHowever, they often exhibit a deficiency in executing active reasoning\ncapacities required to comprehend instructions that are implicit or\ninsufficiently defined. To enhance active reasoning capabilities and impart\nintelligence to the editing model, we introduce ReasonPix2Pix, a comprehensive\nreasoning-attentive instruction editing dataset. The dataset is characterized\nby 1) reasoning instruction, 2) more realistic images from fine-grained\ncategories, and 3) increased variances between input and edited images. When\nfine-tuned with our dataset under supervised conditions, the model demonstrates\nsuperior performance in instructional editing tasks, independent of whether the\ntasks require reasoning or not. The code will be available at\nhttps://github.com/Jin-Ying/ReasonPix2Pix.\n","authors":["Ying Jin","Pengyang Ling","Xiaoyi Dong","Pan Zhang","Jiaqi Wang","Dahua Lin"],"pdf_url":"https://arxiv.org/pdf/2405.11190v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20643v1","updated":"2024-05-31T07:07:54Z","published":"2024-05-31T07:07:54Z","title":"Learning Gaze-aware Compositional GAN","summary":" Gaze-annotated facial data is crucial for training deep neural networks\n(DNNs) for gaze estimation. However, obtaining these data is labor-intensive\nand requires specialized equipment due to the challenge of accurately\nannotating the gaze direction of a subject. In this work, we present a\ngenerative framework to create annotated gaze data by leveraging the benefits\nof labeled and unlabeled data sources. We propose a Gaze-aware Compositional\nGAN that learns to generate annotated facial images from a limited labeled\ndataset. Then we transfer this model to an unlabeled data domain to take\nadvantage of the diversity it provides. Experiments demonstrate our approach's\neffectiveness in generating within-domain image augmentations in the ETH-XGaze\ndataset and cross-domain augmentations in the CelebAMask-HQ dataset domain for\ngaze estimation DNN training. We also show additional applications of our work,\nwhich include facial image editing and gaze redirection.\n","authors":["Nerea Aranjuelo","Siyu Huang","Ignacio Arganda-Carreras","Luis Unzueta","Oihana Otaegui","Hanspeter Pfister","Donglai Wei"],"pdf_url":"https://arxiv.org/pdf/2405.20643v1.pdf","comment":"Accepted by ETRA 2024 as Full paper, and as journal paper in\n Proceedings of the ACM on Computer Graphics and Interactive Techniques"},{"id":"http://arxiv.org/abs/2405.20299v2","updated":"2024-05-31T06:56:51Z","published":"2024-05-30T17:46:23Z","title":"Scaling White-Box Transformers for Vision","summary":" CRATE, a white-box transformer architecture designed to learn compressed and\nsparse representations, offers an intriguing alternative to standard vision\ntransformers (ViTs) due to its inherent mathematical interpretability. Despite\nextensive investigations into the scaling behaviors of language and vision\ntransformers, the scalability of CRATE remains an open question which this\npaper aims to address. Specifically, we propose CRATE-$\\alpha$, featuring\nstrategic yet minimal modifications to the sparse coding block in the CRATE\narchitecture design, and a light training recipe designed to improve the\nscalability of CRATE. Through extensive experiments, we demonstrate that\nCRATE-$\\alpha$ can effectively scale with larger model sizes and datasets. For\nexample, our CRATE-$\\alpha$-B substantially outperforms the prior best CRATE-B\nmodel accuracy on ImageNet classification by 3.7%, achieving an accuracy of\n83.2%. Meanwhile, when scaling further, our CRATE-$\\alpha$-L obtains an\nImageNet classification accuracy of 85.1%. More notably, these model\nperformance improvements are achieved while preserving, and potentially even\nenhancing the interpretability of learned CRATE models, as we demonstrate\nthrough showing that the learned token representations of increasingly larger\ntrained CRATE-$\\alpha$ models yield increasingly higher-quality unsupervised\nobject segmentation of images. The project page is\nhttps://rayjryang.github.io/CRATE-alpha/.\n","authors":["Jinrui Yang","Xianhang Li","Druv Pai","Yuyin Zhou","Yi Ma","Yaodong Yu","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2405.20299v2.pdf","comment":"project page: https://rayjryang.github.io/CRATE-alpha/"},{"id":"http://arxiv.org/abs/2404.18426v2","updated":"2024-05-31T06:54:44Z","published":"2024-04-29T04:56:52Z","title":"Efficient Meta-Learning Enabled Lightweight Multiscale Few-Shot Object\n Detection in Remote Sensing Images","summary":" Presently, the task of few-shot object detection (FSOD) in remote sensing\nimages (RSIs) has become a focal point of attention. Numerous few-shot\ndetectors, particularly those based on two-stage detectors, face challenges\nwhen dealing with the multiscale complexities inherent in RSIs. Moreover, these\ndetectors present impractical characteristics in real-world applications,\nmainly due to their unwieldy model parameters when handling large amount of\ndata. In contrast, we recognize the advantages of one-stage detectors,\nincluding high detection speed and a global receptive field. Consequently, we\nchoose the YOLOv7 one-stage detector as a baseline and subject it to a novel\nmeta-learning training framework. This transformation allows the detector to\nadeptly address FSOD tasks while capitalizing on its inherent advantage of\nlightweight. Additionally, we thoroughly investigate the samples generated by\nthe meta-learning strategy and introduce a novel meta-sampling approach to\nretain samples produced by our designed meta-detection head. Coupled with our\ndevised meta-cross loss, we deliberately utilize \"negative samples\" that are\noften overlooked to extract valuable knowledge from them. This approach serves\nto enhance detection accuracy and efficiently refine the overall meta-learning\nstrategy. To validate the effectiveness of our proposed detector, we conducted\nperformance comparisons with current state-of-the-art detectors using the DIOR\nand NWPU VHR-10.v2 datasets, yielding satisfactory results.\n","authors":["Wenbin Guan","Zijiu Yang","Xiaohong Wu","Liqiong Chen","Feng Huang","Xiaohai He","Honggang Chen"],"pdf_url":"https://arxiv.org/pdf/2404.18426v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20633v1","updated":"2024-05-31T05:49:37Z","published":"2024-05-31T05:49:37Z","title":"Action-OOD: An End-to-End Skeleton-Based Model for Robust\n Out-of-Distribution Human Action Detection","summary":" Human action recognition is a crucial task in computer vision systems.\nHowever, in real-world scenarios, human actions often fall outside the\ndistribution of training data, requiring a model to both recognize\nin-distribution (ID) actions and reject out-of-distribution (OOD) ones. Despite\nits importance, there has been limited research on OOD detection in human\nactions. Existing works on OOD detection mainly focus on image data with RGB\nstructure, and many methods are post-hoc in nature. While these methods are\nconvenient and computationally efficient, they often lack sufficient accuracy\nand fail to consider the presence of OOD samples. To address these challenges,\nwe propose a novel end-to-end skeleton-based model called Action-OOD,\nspecifically designed for OOD human action detection. Unlike some existing\napproaches that may require prior knowledge of existing OOD data distribution,\nour model solely utilizes in-distribution (ID) data during the training stage,\neffectively mitigating the overconfidence issue prevalent in OOD detection. We\nintroduce an attention-based feature fusion block, which enhances the model's\ncapability to recognize unknown classes while preserving classification\naccuracy for known classes. Further, we present a novel energy-based loss\nfunction and successfully integrate it with the traditional cross-entropy loss\nto maximize the separation of data distributions between ID and OOD. Through\nextensive experiments conducted on NTU-RGB+D 60, NTU-RGB+D 120, and\nKinetics-400 datasets, we demonstrate the superior performance of our proposed\napproach compared to state-of-the-art methods. Our findings underscore the\neffectiveness of classic OOD detection techniques in the context of\nskeleton-based action recognition tasks, offering promising avenues for future\nresearch in this field. Code will be available at:\nhttps://github.com/YilliaJing/Action-OOD.git.\n","authors":["Jing Xu","Anqi Zhu","Jingyu Lin","Qiuhong Ke","Cunjian Chen"],"pdf_url":"https://arxiv.org/pdf/2405.20633v1.pdf","comment":"Under consideration at Computer Vision and Image Understanding"},{"id":"http://arxiv.org/abs/2405.20628v1","updated":"2024-05-31T05:40:56Z","published":"2024-05-31T05:40:56Z","title":"ToxVidLLM: A Multimodal LLM-based Framework for Toxicity Detection in\n Code-Mixed Videos","summary":" In an era of rapidly evolving internet technology, the surge in multimodal\ncontent, including videos, has expanded the horizons of online communication.\nHowever, the detection of toxic content in this diverse landscape, particularly\nin low-resource code-mixed languages, remains a critical challenge. While\nsubstantial research has addressed toxic content detection in textual data, the\nrealm of video content, especially in non-English languages, has been\nrelatively underexplored. This paper addresses this research gap by introducing\na benchmark dataset, the first of its kind, consisting of 931 videos with 4021\ncode-mixed Hindi-English utterances collected from YouTube. Each utterance\nwithin this dataset has been meticulously annotated for toxicity, severity, and\nsentiment labels. We have developed an advanced Multimodal Multitask framework\nbuilt for Toxicity detection in Video Content by leveraging Large Language\nModels (LLMs), crafted for the primary objective along with the additional\ntasks of conducting sentiment and severity analysis. ToxVidLLM incorporates\nthree key modules the Encoder module, Cross-Modal Synchronization module, and\nMultitask module crafting a generic multimodal LLM customized for intricate\nvideo classification tasks. Our experiments reveal that incorporating multiple\nmodalities from the videos substantially enhances the performance of toxic\ncontent detection by achieving an Accuracy and Weighted F1 score of 94.29% and\n94.35%, respectively.\n","authors":["Krishanu Maity","A. S. Poornash","Sriparna Saha","Pushpak Bhattacharyya"],"pdf_url":"https://arxiv.org/pdf/2405.20628v1.pdf","comment":"ACL Findings 2024"},{"id":"http://arxiv.org/abs/2405.19917v2","updated":"2024-05-31T05:29:13Z","published":"2024-05-30T10:30:07Z","title":"Multimodal Cross-Domain Few-Shot Learning for Egocentric Action\n Recognition","summary":" We address a novel cross-domain few-shot learning task (CD-FSL) with\nmultimodal input and unlabeled target data for egocentric action recognition.\nThis paper simultaneously tackles two critical challenges associated with\negocentric action recognition in CD-FSL settings: (1) the extreme domain gap in\negocentric videos (\\eg, daily life vs. industrial domain) and (2) the\ncomputational cost for real-world applications. We propose MM-CDFSL, a\ndomain-adaptive and computationally efficient approach designed to enhance\nadaptability to the target domain and improve inference speed. To address the\nfirst challenge, we propose the incorporation of multimodal distillation into\nthe student RGB model using teacher models. Each teacher model is trained\nindependently on source and target data for its respective modality. Leveraging\nonly unlabeled target data during multimodal distillation enhances the student\nmodel's adaptability to the target domain. We further introduce ensemble masked\ninference, a technique that reduces the number of input tokens through masking.\nIn this approach, ensemble prediction mitigates the performance degradation\ncaused by masking, effectively addressing the second issue. Our approach\noutperformed the state-of-the-art CD-FSL approaches with a substantial margin\non multiple egocentric datasets, improving by an average of 6.12/6.10 points\nfor 1-shot/5-shot settings while achieving $2.2$ times faster inference speed.\nProject page: https://masashi-hatano.github.io/MM-CDFSL/\n","authors":["Masashi Hatano","Ryo Hachiuma","Ryo Fujii","Hideo Saito"],"pdf_url":"https://arxiv.org/pdf/2405.19917v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13540v2","updated":"2024-05-31T05:15:40Z","published":"2024-05-22T11:20:32Z","title":"Directly Denoising Diffusion Models","summary":" In this paper, we present the Directly Denoising Diffusion Model (DDDM): a\nsimple and generic approach for generating realistic images with few-step\nsampling, while multistep sampling is still preserved for better performance.\nDDDMs require no delicately designed samplers nor distillation on pre-trained\ndistillation models. DDDMs train the diffusion model conditioned on an\nestimated target that was generated from previous training iterations of its\nown. To generate images, samples generated from the previous time step are also\ntaken into consideration, guiding the generation process iteratively. We\nfurther propose Pseudo-LPIPS, a novel metric loss that is more robust to\nvarious values of hyperparameter. Despite its simplicity, the proposed approach\ncan achieve strong performance in benchmark datasets. Our model achieves FID\nscores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling\nrespectively, surpassing those obtained from GANs and distillation-based\nmodels. By extending the sampling to 1000 steps, we further reduce FID score to\n1.79, aligning with state-of-the-art methods in the literature. For ImageNet\n64x64, our approach stands as a competitive contender against leading models.\n","authors":["Dan Zhang","Jingjing Wang","Feng Luo"],"pdf_url":"https://arxiv.org/pdf/2405.13540v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20319v2","updated":"2024-05-31T04:09:41Z","published":"2024-05-30T17:55:46Z","title":"ParSEL: Parameterized Shape Editing with Language","summary":" The ability to edit 3D assets from natural language presents a compelling\nparadigm to aid in the democratization of 3D content creation. However, while\nnatural language is often effective at communicating general intent, it is\npoorly suited for specifying precise manipulation. To address this gap, we\nintroduce ParSEL, a system that enables controllable editing of high-quality 3D\nassets from natural language. Given a segmented 3D mesh and an editing request,\nParSEL produces a parameterized editing program. Adjusting the program\nparameters allows users to explore shape variations with a precise control over\nthe magnitudes of edits. To infer editing programs which align with an input\nedit request, we leverage the abilities of large-language models (LLMs).\nHowever, while we find that LLMs excel at identifying initial edit operations,\nthey often fail to infer complete editing programs, and produce outputs that\nviolate shape semantics. To overcome this issue, we introduce Analytical Edit\nPropagation (AEP), an algorithm which extends a seed edit with additional\noperations until a complete editing program has been formed. Unlike prior\nmethods, AEP searches for analytical editing operations compatible with a range\nof possible user edits through the integration of computer algebra systems for\ngeometric analysis. Experimentally we demonstrate ParSEL's effectiveness in\nenabling controllable editing of 3D objects through natural language requests\nover alternative system designs.\n","authors":["Aditya Ganeshan","Ryan Y. Huang","Xianghao Xu","R. Kenny Jones","Daniel Ritchie"],"pdf_url":"https://arxiv.org/pdf/2405.20319v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20614v1","updated":"2024-05-31T04:06:11Z","published":"2024-05-31T04:06:11Z","title":"EPIDetect: Video-based convulsive seizure detection in chronic epilepsy\n mouse model for anti-epilepsy drug screening","summary":" In the preclinical translational studies, drug candidates with remarkable\nanti-epileptic efficacy demonstrate long-term suppression of spontaneous\nrecurrent seizures (SRSs), particularly convulsive seizures (CSs), in mouse\nmodels of chronic epilepsy. However, the current methods for monitoring CSs\nhave limitations in terms of invasiveness, specific laboratory settings, high\ncost, and complex operation, which hinder drug screening efforts. In this\nstudy, a camera-based system for automated detection of CSs in chronically\nepileptic mice is first established to screen potential anti-epilepsy drugs.\n","authors":["Junming Ren","Zhoujian Xiao","Yujia Zhang","Yujie Yang","Ling He","Ezra Yoon","Stephen Temitayo Bello","Xi Chen","Dapeng Wu","Micky Tortorella","Jufang He"],"pdf_url":"https://arxiv.org/pdf/2405.20614v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20610v1","updated":"2024-05-31T03:54:59Z","published":"2024-05-31T03:54:59Z","title":"Revisiting and Maximizing Temporal Knowledge in Semi-supervised Semantic\n Segmentation","summary":" In semi-supervised semantic segmentation, the Mean Teacher- and\nco-training-based approaches are employed to mitigate confirmation bias and\ncoupling problems. However, despite their high performance, these approaches\nfrequently involve complex training pipelines and a substantial computational\nburden, limiting the scalability and compatibility of these methods. In this\npaper, we propose a PrevMatch framework that effectively mitigates the\naforementioned limitations by maximizing the utilization of the temporal\nknowledge obtained during the training process. The PrevMatch framework relies\non two core strategies: (1) we reconsider the use of temporal knowledge and\nthus directly utilize previous models obtained during training to generate\nadditional pseudo-label guidance, referred to as previous guidance. (2) we\ndesign a highly randomized ensemble strategy to maximize the effectiveness of\nthe previous guidance. Experimental results on four benchmark semantic\nsegmentation datasets confirm that the proposed method consistently outperforms\nexisting methods across various evaluation protocols. In particular, with\nDeepLabV3+ and ResNet-101 network settings, PrevMatch outperforms the existing\nstate-of-the-art method, Diverse Co-training, by +1.6 mIoU on Pascal VOC with\nonly 92 annotated images, while achieving 2.4 times faster training.\nFurthermore, the results indicate that PrevMatch induces stable optimization,\nparticularly in benefiting classes that exhibit poor performance. Code is\navailable at https://github.com/wooseok-shin/PrevMatch\n","authors":["Wooseok Shin","Hyun Joon Park","Jin Sob Kim","Sung Won Han"],"pdf_url":"https://arxiv.org/pdf/2405.20610v1.pdf","comment":"14 pages, 5 figures, submitted to IEEE TPAMI. This work has been\n submitted to the IEEE for possible publication. Copyright may be transferred\n without notice, after which this version may no longer be accessible"},{"id":"http://arxiv.org/abs/2405.20607v1","updated":"2024-05-31T03:47:44Z","published":"2024-05-31T03:47:44Z","title":"Textual Inversion and Self-supervised Refinement for Radiology Report\n Generation","summary":" Existing mainstream approaches follow the encoder-decoder paradigm for\ngenerating radiology reports. They focus on improving the network structure of\nencoders and decoders, which leads to two shortcomings: overlooking the\nmodality gap and ignoring report content constraints. In this paper, we\nproposed Textual Inversion and Self-supervised Refinement (TISR) to address the\nabove two issues. Specifically, textual inversion can project text and image\ninto the same space by representing images as pseudo words to eliminate the\ncross-modeling gap. Subsequently, self-supervised refinement refines these\npseudo words through contrastive loss computation between images and texts,\nenhancing the fidelity of generated reports to images. Notably, TISR is\northogonal to most existing methods, plug-and-play. We conduct experiments on\ntwo widely-used public datasets and achieve significant improvements on various\nbaselines, which demonstrates the effectiveness and generalization of TISR. The\ncode will be available soon.\n","authors":["Yuanjiang Luo","Hongxiang Li","Xuan Wu","Meng Cao","Xiaoshuang Huang","Zhihong Zhu","Peixi Liao","Hu Chen","Yi Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20606v1","updated":"2024-05-31T03:40:15Z","published":"2024-05-31T03:40:15Z","title":"Vision-Language Meets the Skeleton: Progressively Distillation with\n Cross-Modal Knowledge for 3D Action Representation Learning","summary":" Supervised and self-supervised learning are two main training paradigms for\nskeleton-based human action recognition. However, the former one-hot\nclassification requires labor-intensive predefined action categories\nannotations, while the latter involves skeleton transformations (e.g.,\ncropping) in the pretext tasks that may impair the skeleton structure. To\naddress these challenges, we introduce a novel skeleton-based training\nframework (C$^2$VL) based on Cross-modal Contrastive learning that uses the\nprogressive distillation to learn task-agnostic human skeleton action\nrepresentation from the Vision-Language knowledge prompts. Specifically, we\nestablish the vision-language action concept space through vision-language\nknowledge prompts generated by pre-trained large multimodal models (LMMs),\nwhich enrich the fine-grained details that the skeleton action space lacks.\nMoreover, we propose the intra-modal self-similarity and inter-modal\ncross-consistency softened targets in the cross-modal contrastive process to\nprogressively control and guide the degree of pulling vision-language knowledge\nprompts and corresponding skeletons closer. These soft instance discrimination\nand self-knowledge distillation strategies contribute to the learning of better\nskeleton-based action representations from the noisy skeleton-vision-language\npairs. During the inference phase, our method requires only the skeleton data\nas the input for action recognition and no longer for vision-language prompts.\nExtensive experiments show that our method achieves state-of-the-art results on\nNTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The code will be available\nin the future.\n","authors":["Yang Chen","Tian He","Junfeng Fu","Ling Wang","Jingcai Guo","Hong Cheng"],"pdf_url":"https://arxiv.org/pdf/2405.20606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20605v1","updated":"2024-05-31T03:39:26Z","published":"2024-05-31T03:39:26Z","title":"Searching for internal symbols underlying deep learning","summary":" Deep learning (DL) enables deep neural networks (DNNs) to automatically learn\ncomplex tasks or rules from given examples without instructions or guiding\nprinciples. As we do not engineer DNNs' functions, it is extremely difficult to\ndiagnose their decisions, and multiple lines of studies proposed to explain\nprinciples of DNNs/DL operations. Notably, one line of studies suggests that\nDNNs may learn concepts, the high level features recognizable to humans. Thus,\nwe hypothesized that DNNs develop abstract codes, not necessarily recognizable\nto humans, which can be used to augment DNNs' decision-making. To address this\nhypothesis, we combined foundation segmentation models and unsupervised\nlearning to extract internal codes and identify potential use of abstract codes\nto make DL's decision-making more reliable and safer.\n","authors":["Jung H. Lee","Sujith Vijayan"],"pdf_url":"https://arxiv.org/pdf/2405.20605v1.pdf","comment":"10 pages, 7 figures, 3 tables and Appendix"},{"id":"http://arxiv.org/abs/2405.20596v1","updated":"2024-05-31T03:13:45Z","published":"2024-05-31T03:13:45Z","title":"Generalized Semi-Supervised Learning via Self-Supervised Feature\n Adaptation","summary":" Traditional semi-supervised learning (SSL) assumes that the feature\ndistributions of labeled and unlabeled data are consistent which rarely holds\nin realistic scenarios. In this paper, we propose a novel SSL setting, where\nunlabeled samples are drawn from a mixed distribution that deviates from the\nfeature distribution of labeled samples. Under this setting, previous SSL\nmethods tend to predict wrong pseudo-labels with the model fitted on labeled\ndata, resulting in noise accumulation. To tackle this issue, we propose\nSelf-Supervised Feature Adaptation (SSFA), a generic framework for improving\nSSL performance when labeled and unlabeled data come from different\ndistributions. SSFA decouples the prediction of pseudo-labels from the current\nmodel to improve the quality of pseudo-labels. Particularly, SSFA incorporates\na self-supervised task into the SSL framework and uses it to adapt the feature\nextractor of the model to the unlabeled data. In this way, the extracted\nfeatures better fit the distribution of unlabeled data, thereby generating\nhigh-quality pseudo-labels. Extensive experiments show that our proposed SSFA\nis applicable to various pseudo-label-based SSL learners and significantly\nimproves performance in labeled, unlabeled, and even unseen distributions.\n","authors":["Jiachen Liang","Ruibing Hou","Hong Chang","Bingpeng Ma","Shiguang Shan","Xilin Chen"],"pdf_url":"https://arxiv.org/pdf/2405.20596v1.pdf","comment":"10 pages; Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2405.20584v1","updated":"2024-05-31T02:45:31Z","published":"2024-05-31T02:45:31Z","title":"Disrupting Diffusion: Token-Level Attention Erasure Attack against\n Diffusion-based Customization","summary":" With the development of diffusion-based customization methods like\nDreamBooth, individuals now have access to train the models that can generate\ntheir personalized images. Despite the convenience, malicious users have\nmisused these techniques to create fake images, thereby triggering a privacy\nsecurity crisis. In light of this, proactive adversarial attacks are proposed\nto protect users against customization. The adversarial examples are trained to\ndistort the customization model's outputs and thus block the misuse. In this\npaper, we propose DisDiff (Disrupting Diffusion), a novel adversarial attack\nmethod to disrupt the diffusion model outputs. We first delve into the\nintrinsic image-text relationships, well-known as cross-attention, and\nempirically find that the subject-identifier token plays an important role in\nguiding image generation. Thus, we propose the Cross-Attention Erasure module\nto explicitly \"erase\" the indicated attention maps and disrupt the text\nguidance. Besides,we analyze the influence of the sampling process of the\ndiffusion model on Projected Gradient Descent (PGD) attack and introduce a\nnovel Merit Sampling Scheduler to adaptively modulate the perturbation updating\namplitude in a step-aware manner. Our DisDiff outperforms the state-of-the-art\nmethods by 12.75% of FDFR scores and 7.25% of ISM scores across two facial\nbenchmarks and two commonly used prompts on average.\n","authors":["Yisu Liu","Jinyang An","Wanqian Zhang","Dayan Wu","Jingzi Gu","Zheng Lin","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20584v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2402.06497v3","updated":"2024-05-31T01:58:44Z","published":"2024-02-09T16:08:16Z","title":"Iris-SAM: Iris Segmentation Using a Foundation Model","summary":" Iris segmentation is a critical component of an iris biometric system and it\ninvolves extracting the annular iris region from an ocular image. In this work,\nwe develop a pixel-level iris segmentation model from a foundational model,\nviz., Segment Anything Model (SAM), that has been successfully used for\nsegmenting arbitrary objects. The primary contribution of this work lies in the\nintegration of different loss functions during the fine-tuning of SAM on ocular\nimages. In particular, the importance of Focal Loss is borne out in the\nfine-tuning process since it strategically addresses the class imbalance\nproblem (i.e., iris versus non-iris pixels). Experiments on ND-IRIS-0405,\nCASIA-Iris-Interval-v3, and IIT-Delhi-Iris datasets convey the efficacy of the\ntrained model for the task of iris segmentation. For instance, on the\nND-IRIS-0405 dataset, an average segmentation accuracy of 99.58% was achieved,\ncompared to the best baseline performance of 89.75%.\n","authors":["Parisa Farmanifard","Arun Ross"],"pdf_url":"https://arxiv.org/pdf/2402.06497v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07248v4","updated":"2024-05-31T01:37:03Z","published":"2023-04-14T16:53:06Z","title":"The University of California San Francisco Brain Metastases Stereotactic\n Radiosurgery (UCSF-BMSR) MRI Dataset","summary":" The University of California San Francisco Brain Metastases Stereotactic\nRadiosurgery (UCSF-BMSR) dataset is a public, clinical, multimodal brain MRI\ndataset consisting of 560 brain MRIs from 412 patients with expert annotations\nof 5136 brain metastases. Data consists of registered and skull stripped T1\npost-contrast, T1 pre-contrast, FLAIR and subtraction (T1 pre-contrast - T1\npost-contrast) images and voxelwise segmentations of enhancing brain metastases\nin NifTI format. The dataset also includes patient demographics, surgical\nstatus and primary cancer types. The UCSF-BSMR has been made publicly available\nin the hopes that researchers will use these data to push the boundaries of AI\napplications for brain metastases. The dataset is freely available for\nnon-commercial use at https://imagingdatasets.ucsf.edu/dataset/1\n","authors":["Jeffrey D. Rudie","Rachit Saluja","David A. Weiss","Pierre Nedelec","Evan Calabrese","John B. Colby","Benjamin Laguna","John Mongan","Steve Braunstein","Christopher P. Hess","Andreas M. Rauschecker","Leo P. Sugrue","Javier E. Villanueva-Meyer"],"pdf_url":"https://arxiv.org/pdf/2304.07248v4.pdf","comment":"15 pages, 2 tables, 2 figures"},{"id":"http://arxiv.org/abs/2404.07989v2","updated":"2024-05-31T01:36:53Z","published":"2024-04-11T17:59:45Z","title":"Any2Point: Empowering Any-modality Large Models for Efficient 3D\n Understanding","summary":" Large foundation models have recently emerged as a prominent focus of\ninterest, attaining superior performance in widespread scenarios. Due to the\nscarcity of 3D data, many efforts have been made to adapt pre-trained\ntransformers from vision to 3D domains. However, such 2D-to-3D approaches are\nstill limited, due to the potential loss of spatial geometries and high\ncomputation cost. More importantly, their frameworks are mainly designed for 2D\nmodels, lacking a general any-to-3D paradigm. In this paper, we introduce\nAny2Point, a parameter-efficient method to empower any-modality large models\n(vision, language, audio) for 3D understanding. Given a frozen transformer from\nany source modality, we propose a 3D-to-any (1D or 2D) virtual projection\nstrategy that correlates the input 3D points to the original 1D or 2D positions\nwithin the source modality. This mechanism enables us to assign each 3D token\nwith a positional encoding paired with the pre-trained model, which avoids 3D\ngeometry loss caused by the true projection and better motivates the\ntransformer for 3D learning with 1D/2D positional priors. Then, within each\ntransformer block, we insert an any-to-3D guided adapter module for\nparameter-efficient fine-tuning. The adapter incorporates prior spatial\nknowledge from the source modality to guide the local feature aggregation of 3D\ntokens, compelling the semantic adaption of any-modality transformers. We\nconduct extensive experiments to showcase the effectiveness and efficiency of\nour method. Code and models are released at\nhttps://github.com/Ivan-Tang-3D/Any2Point.\n","authors":["Yiwen Tang","Ray Zhang","Jiaming Liu","Zoey Guo","Dong Wang","Zhigang Wang","Bin Zhao","Shanghang Zhang","Peng Gao","Hongsheng Li","Xuelong Li"],"pdf_url":"https://arxiv.org/pdf/2404.07989v2.pdf","comment":"Code and models are released at\n https://github.com/Ivan-Tang-3D/Any2Point"},{"id":"http://arxiv.org/abs/2405.20247v2","updated":"2024-05-31T01:33:45Z","published":"2024-05-30T16:58:34Z","title":"KerasCV and KerasNLP: Vision and Language Power-Ups","summary":" We present the Keras domain packages KerasCV and KerasNLP, extensions of the\nKeras API for Computer Vision and Natural Language Processing workflows,\ncapable of running on either JAX, TensorFlow, or PyTorch. These domain packages\nare designed to enable fast experimentation, with a focus on ease-of-use and\nperformance. We adopt a modular, layered design: at the library's lowest level\nof abstraction, we provide building blocks for creating models and data\npreprocessing pipelines, and at the library's highest level of abstraction, we\nprovide pretrained ``task\" models for popular architectures such as Stable\nDiffusion, YOLOv8, GPT2, BERT, Mistral, CLIP, Gemma, T5, etc. Task models have\nbuilt-in preprocessing, pretrained weights, and can be fine-tuned on raw\ninputs. To enable efficient training, we support XLA compilation for all\nmodels, and run all preprocessing via a compiled graph of TensorFlow operations\nusing the tf.data API. The libraries are fully open-source (Apache 2.0 license)\nand available on GitHub.\n","authors":["Matthew Watson","Divyashree Shivakumar Sreepathihalli","Francois Chollet","Martin Gorner","Kiranbir Sodhia","Ramesh Sampath","Tirth Patel","Haifeng Jin","Neel Kovelamudi","Gabriel Rasskin","Samaneh Saadat","Luke Wood","Chen Qian","Jonathan Bischof","Ian Stenbit","Abheesht Sharma","Anshuman Mishra"],"pdf_url":"https://arxiv.org/pdf/2405.20247v2.pdf","comment":"Submitted to Journal of Machine Learning Open Source Software"},{"id":"http://arxiv.org/abs/2401.03922v3","updated":"2024-05-31T01:10:42Z","published":"2024-01-08T14:33:57Z","title":"SNeurodCNN: Structure-focused Neurodegeneration Convolutional Neural\n Network for Modelling and Classification of Alzheimer's Disease","summary":" Alzheimer's disease (AD), the predominant form of dementia, is a growing\nglobal challenge, emphasizing the urgent need for accurate and early diagnosis.\nCurrent clinical diagnoses rely on radiologist expert interpretation, which is\nprone to human error. Deep learning has thus far shown promise for early AD\ndiagnosis. However, existing methods often overlook focal structural atrophy\ncritical for enhanced understanding of the cerebral cortex neurodegeneration.\nThis paper proposes a deep learning framework that includes a novel\nstructure-focused neurodegeneration CNN architecture named SNeurodCNN and an\nimage brightness enhancement preprocessor using gamma correction. The\nSNeurodCNN architecture takes as input the focal structural atrophy features\nresulting from segmentation of brain structures captured through magnetic\nresonance imaging (MRI). As a result, the architecture considers only necessary\nCNN components, which comprises of two downsampling convolutional blocks and\ntwo fully connected layers, for achieving the desired classification task, and\nutilises regularisation techniques to regularise learnable parameters.\nLeveraging mid-sagittal and para-sagittal brain image viewpoints from the\nAlzheimer's Disease Neuroimaging Initiative (ADNI) dataset, our framework\ndemonstrated exceptional performance. The para-sagittal viewpoint achieved\n97.8% accuracy, 97.0% specificity, and 98.5% sensitivity, while the\nmid-sagittal viewpoint offered deeper insights with 98.1% accuracy, 97.2%\nspecificity, and 99.0% sensitivity. Model analysis revealed the ability of\nSNeurodCNN to capture the structural dynamics of mild cognitive impairment\n(MCI) and AD in the frontal lobe, occipital lobe, cerebellum, temporal, and\nparietal lobe, suggesting its potential as a brain structural change\ndigi-biomarker for early AD diagnosis. This work can be reproduced using code\nwe made available on GitHub.\n","authors":["Simisola Odimayo","Chollette C. Olisah","Khadija Mohammed"],"pdf_url":"https://arxiv.org/pdf/2401.03922v3.pdf","comment":"36 Pages, 10 figures, 4 tables"},{"id":"http://arxiv.org/abs/2405.20559v1","updated":"2024-05-31T00:57:58Z","published":"2024-05-31T00:57:58Z","title":"Universal evaluation and design of imaging systems using information\n estimation","summary":" Information theory, which describes the transmission of signals in the\npresence of noise, has enabled the development of reliable communication\nsystems that underlie the modern world. Imaging systems can also be viewed as a\nform of communication, in which information about the object is \"transmitted\"\nthrough images. However, the application of information theory to imaging\nsystems has been limited by the challenges of accounting for their physical\nconstraints. Here, we introduce a framework that addresses these limitations by\nmodeling the probabilistic relationship between objects and their measurements.\nUsing this framework, we develop a method to estimate information using only a\ndataset of noisy measurements, without making any assumptions about the image\nformation process. We demonstrate that these estimates comprehensively quantify\nmeasurement quality across a diverse range of imaging systems and applications.\nFurthermore, we introduce Information-Driven Encoder Analysis Learning (IDEAL),\na technique to optimize the design of imaging hardware for maximum information\ncapture. This work provides new insights into the fundamental performance\nlimits of imaging systems and offers powerful new tools for their analysis and\ndesign.\n","authors":["Henry Pinkard","Leyla Kabuli","Eric Markley","Tiffany Chien","Jiantao Jiao","Laura Waller"],"pdf_url":"https://arxiv.org/pdf/2405.20559v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.11386v2","updated":"2024-05-31T00:48:18Z","published":"2024-05-18T20:22:22Z","title":"Liver Fat Quantification Network with Body Shape","summary":" It is critically important to detect the content of liver fat as it is\nrelated to cardiac complications and cardiovascular disease mortality. However,\nexisting methods are either associated with high cost and/or medical\ncomplications (e.g., liver biopsy, imaging technology) or only roughly estimate\nthe grades of steatosis. In this paper, we propose a deep neural network to\nestimate the percentage of liver fat using only body shapes. The proposed is\ncomposed of a flexible baseline network and a lightweight Attention module. The\nattention module is trained to generate discriminative and diverse features\nwhich significant improve the performance. In order to validate the method, we\nperform extensive tests on the public medical dataset. The results verify that\nour proposed method yields state-of-the-art performance with Root mean squared\nerror (RMSE) of 5.26% and R-Squared value over 0.8. It offers an accurate and\nmore accessible assessment of hepatic steatosis.\n","authors":["Qiyue Wang","Wu Xue","Xiaoke Zhang","Fang Jin","James Hahn"],"pdf_url":"https://arxiv.org/pdf/2405.11386v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19687v2","updated":"2024-05-31T00:35:31Z","published":"2024-05-30T04:57:54Z","title":"Autonomous Driving with Spiking Neural Networks","summary":" Autonomous driving demands an integrated approach that encompasses\nperception, prediction, and planning, all while operating under strict energy\nconstraints to enhance scalability and environmental sustainability. We present\nSpiking Autonomous Driving (SAD), the first unified Spiking Neural Network\n(SNN) to address the energy challenges faced by autonomous driving systems\nthrough its event-driven and energy-efficient nature. SAD is trained end-to-end\nand consists of three main modules: perception, which processes inputs from\nmulti-view cameras to construct a spatiotemporal bird's eye view; prediction,\nwhich utilizes a novel dual-pathway with spiking neurons to forecast future\nstates; and planning, which generates safe trajectories considering predicted\noccupancy, traffic rules, and ride comfort. Evaluated on the nuScenes dataset,\nSAD achieves competitive performance in perception, prediction, and planning\ntasks, while drawing upon the energy efficiency of SNNs. This work highlights\nthe potential of neuromorphic computing to be applied to energy-efficient\nautonomous driving, a critical step toward sustainable and safety-critical\nautomotive technology. Our code is available at\n\\url{https://github.com/ridgerchu/SAD}.\n","authors":["Rui-Jie Zhu","Ziqing Wang","Leilani Gilpin","Jason K. Eshraghian"],"pdf_url":"https://arxiv.org/pdf/2405.19687v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.04626v2","updated":"2024-05-31T00:12:59Z","published":"2024-03-07T16:11:43Z","title":"MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training\n with Masked Autoencoder","summary":" Within the domain of medical analysis, extensive research has explored the\npotential of mutual learning between Masked Autoencoders(MAEs) and multimodal\ndata. However, the impact of MAEs on intermodality remains a key challenge. We\nintroduce MedFLIP, a Fast Language-Image Pre-training method for Medical\nanalysis. We explore MAEs for zero-shot learning with crossed domains, which\nenhances the model's ability to learn from limited data, a common scenario in\nmedical diagnostics. We verify that masking an image does not affect\ninter-modal learning. Furthermore, we propose the SVD loss to enhance the\nrepresentation learning for characteristics of medical images, aiming to\nimprove classification accuracy by leveraging the structural intricacies of\nsuch data. Our theory posits that masking encourages semantic preservation,\nrobust feature extraction, regularization, domain adaptation, and invariance\nlearning. Lastly, we validate using language will improve the zero-shot\nperformance for the medical image analysis. MedFLIP's scaling of the masking\nprocess marks an advancement in the field, offering a pathway to rapid and\nprecise medical image analysis without the traditional computational\nbottlenecks. Through experiments and validation, MedFLIP demonstrates efficient\nperformance improvements, helps for future research and application in medical\ndiagnostics.\n","authors":["Lei Li","Tianfang Zhang","Xinglin Zhang","Jiaqi Liu","Bingqi Ma","Yan Luo","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2403.04626v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.10373v3","updated":"2024-05-31T23:45:57Z","published":"2023-08-20T21:47:54Z","title":"HoSNN: Adversarially-Robust Homeostatic Spiking Neural Networks with\n Adaptive Firing Thresholds","summary":" While spiking neural networks (SNNs) offer a promising neurally-inspired\nmodel of computation, they are vulnerable to adversarial attacks. We present\nthe first study that draws inspiration from neural homeostasis to design a\nthreshold-adapting leaky integrate-and-fire (TA-LIF) neuron model and utilize\nTA-LIF neurons to construct the adversarially robust homeostatic SNNs (HoSNNs)\nfor improved robustness. The TA-LIF model incorporates a self-stabilizing\ndynamic thresholding mechanism, offering a local feedback control solution to\nthe minimization of each neuron's membrane potential error caused by\nadversarial disturbance. Theoretical analysis demonstrates favorable dynamic\nproperties of TA-LIF neurons in terms of the bounded-input bounded-output\nstability and suppressed time growth of membrane potential error, underscoring\ntheir superior robustness compared with the standard LIF neurons. When trained\nwith weak FGSM attacks (attack budget = 2/255) and tested with much stronger\nPGD attacks (attack budget = 8/255), our HoSNNs significantly improve model\naccuracy on several datasets: from 30.54% to 74.91% on FashionMNIST, from 0.44%\nto 35.06% on SVHN, from 0.56% to 42.63% on CIFAR10, from 0.04% to 16.66% on\nCIFAR100, over the conventional LIF-based SNNs.\n","authors":["Hejia Geng","Peng Li"],"pdf_url":"https://arxiv.org/pdf/2308.10373v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.16589v2","updated":"2024-05-31T21:26:39Z","published":"2023-11-28T08:15:27Z","title":"HD Maps are Lane Detection Generalizers: A Novel Generative Framework\n for Single-Source Domain Generalization","summary":" Lane detection is a vital task for vehicles to navigate and localize their\nposition on the road. To ensure reliable driving, lane detection models must\nhave robust generalization performance in various road environments. However,\ndespite the advanced performance in the trained domain, their generalization\nperformance still falls short of expectations due to the domain discrepancy. To\nbridge this gap, we propose a novel generative framework using HD Maps for\nSingle-Source Domain Generalization (SSDG) in lane detection. We first generate\nnumerous front-view images from lane markings of HD Maps. Next, we\nstrategically select a core subset among the generated images using (i) lane\nstructure and (ii) road surrounding criteria to maximize their diversity. In\nthe end, utilizing this core set, we train lane detection models to boost their\ngeneralization performance. We validate that our generative framework from HD\nMaps outperforms the Domain Adaptation model MLDA with +3.01%p accuracy\nimprovement, even though we do not access the target domain images.\n","authors":["Daeun Lee","Minhyeok Heo","Jiwon Kim"],"pdf_url":"https://arxiv.org/pdf/2311.16589v2.pdf","comment":"Accepted by CVPR Data-Driven Autonomous Driving Simulation Workshop,\n 2024"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2405.20994v1","updated":"2024-05-31T16:38:54Z","published":"2024-05-31T16:38:54Z","title":"CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to\n Web Relevance Ranking","summary":" We present CWRCzech, Click Web Ranking dataset for Czech, a 100M\nquery-document Czech click dataset for relevance ranking with user behavior\ndata collected from search engine logs of Seznam.cz. To the best of our\nknowledge, CWRCzech is the largest click dataset with raw text published so\nfar. It provides document positions in the search results as well as\ninformation about user behavior: 27.6M clicked documents and 10.8M dwell times.\nIn addition, we also publish a manually annotated Czech test for the relevance\ntask, containing nearly 50k query-document pairs, each annotated by at least 2\nannotators. Finally, we analyze how the user behavior data improve relevance\nranking and show that models trained on data automatically harnessed at\nsufficient scale can surpass the performance of models trained on human\nannotated data. CWRCzech is published under an academic non-commercial license\nand is available to the research community at\nhttps://github.com/seznam/CWRCzech.\n","authors":["Josef Vonášek","Milan Straka","Rostislav Krč","Lenka Lasoňová","Ekaterina Egorova","Jana Straková","Jakub Náplava"],"pdf_url":"https://arxiv.org/pdf/2405.20994v1.pdf","comment":"Accepted to SIGIR 2024"},{"id":"http://arxiv.org/abs/2405.20878v1","updated":"2024-05-31T14:53:12Z","published":"2024-05-31T14:53:12Z","title":"SelfGNN: Self-Supervised Graph Neural Networks for Sequential\n Recommendation","summary":" Sequential recommendation effectively addresses information overload by\nmodeling users' temporal and sequential interaction patterns. To overcome the\nlimitations of supervision signals, recent approaches have adopted\nself-supervised learning techniques in recommender systems. However, there are\nstill two critical challenges that remain unsolved. Firstly, existing\nsequential models primarily focus on long-term modeling of individual\ninteraction sequences, overlooking the valuable short-term collaborative\nrelationships among the behaviors of different users. Secondly, real-world data\noften contain noise, particularly in users' short-term behaviors, which can\narise from temporary intents or misclicks. Such noise negatively impacts the\naccuracy of both graph and sequence models, further complicating the modeling\nprocess. To address these challenges, we propose a novel framework called\nSelf-Supervised Graph Neural Network (SelfGNN) for sequential recommendation.\nThe SelfGNN framework encodes short-term graphs based on time intervals and\nutilizes Graph Neural Networks (GNNs) to learn short-term collaborative\nrelationships. It captures long-term user and item representations at multiple\ngranularity levels through interval fusion and dynamic behavior modeling.\nImportantly, our personalized self-augmented learning structure enhances model\nrobustness by mitigating noise in short-term graphs based on long-term user\ninterests and personal stability. Extensive experiments conducted on four\nreal-world datasets demonstrate that SelfGNN outperforms various\nstate-of-the-art baselines. Our model implementation codes are available at\nhttps://github.com/HKUDS/SelfGNN.\n","authors":["Yuxi Liu","Lianghao Xia","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2405.20878v1.pdf","comment":"Accepted by SIGIR'24"},{"id":"http://arxiv.org/abs/2204.09140v2","updated":"2024-05-31T14:28:40Z","published":"2022-04-19T21:55:18Z","title":"Multi-hop Question Answering","summary":" The task of Question Answering (QA) has attracted significant research\ninterest for long. Its relevance to language understanding and knowledge\nretrieval tasks, along with the simple setting makes the task of QA crucial for\nstrong AI systems. Recent success on simple QA tasks has shifted the focus to\nmore complex settings. Among these, Multi-Hop QA (MHQA) is one of the most\nresearched tasks over the recent years. In broad terms, MHQA is the task of\nanswering natural language questions that involve extracting and combining\nmultiple pieces of information and doing multiple steps of reasoning. An\nexample of a multi-hop question would be \"The Argentine PGA Championship record\nholder has won how many tournaments worldwide?\". Answering the question would\nneed two pieces of information: \"Who is the record holder for Argentine PGA\nChampionship tournaments?\" and \"How many tournaments did [Answer of Sub Q1]\nwin?\". The ability to answer multi-hop questions and perform multi step\nreasoning can significantly improve the utility of NLP systems. Consequently,\nthe field has seen a surge with high quality datasets, models and evaluation\nstrategies. The notion of 'multiple hops' is somewhat abstract which results in\na large variety of tasks that require multi-hop reasoning. This leads to\ndifferent datasets and models that differ significantly from each other and\nmakes the field challenging to generalize and survey. We aim to provide a\ngeneral and formal definition of the MHQA task, and organize and summarize\nexisting MHQA frameworks. We also outline some best practices for building MHQA\ndatasets. This book provides a systematic and thorough introduction as well as\nthe structuring of the existing attempts to this highly interesting, yet quite\nchallenging task.\n","authors":["Vaibhav Mavi","Anubhav Jangra","Adam Jatowt"],"pdf_url":"https://arxiv.org/pdf/2204.09140v2.pdf","comment":"Published at Foundations and Trends in Information Retrieval"},{"id":"http://arxiv.org/abs/2405.20718v1","updated":"2024-05-31T09:14:48Z","published":"2024-05-31T09:14:48Z","title":"Popularity-Aware Alignment and Contrast for Mitigating Popularity Bias","summary":" Collaborative Filtering (CF) typically suffers from the significant challenge\nof popularity bias due to the uneven distribution of items in real-world\ndatasets. This bias leads to a significant accuracy gap between popular and\nunpopular items. It not only hinders accurate user preference understanding but\nalso exacerbates the Matthew effect in recommendation systems. To alleviate\npopularity bias, existing efforts focus on emphasizing unpopular items or\nseparating the correlation between item representations and their popularity.\nDespite the effectiveness, existing works still face two persistent challenges:\n(1) how to extract common supervision signals from popular items to improve the\nunpopular item representations, and (2) how to alleviate the representation\nseparation caused by popularity bias. In this work, we conduct an empirical\nanalysis of popularity bias and propose Popularity-Aware Alignment and Contrast\n(PAAC) to address two challenges. Specifically, we use the common supervisory\nsignals modeled in popular item representations and propose a novel\npopularity-aware supervised alignment module to learn unpopular item\nrepresentations. Additionally, we suggest re-weighting the contrastive learning\nloss to mitigate the representation separation from a popularity-centric\nperspective. Finally, we validate the effectiveness and rationale of PAAC in\nmitigating popularity bias through extensive experiments on three real-world\ndatasets. Our code is available at\nhttps://github.com/miaomiao-cai2/KDD2024-PAAC.\n","authors":["Miaomiao Cai","Lei Chen","Yifan Wang","Haoyue Bai","Peijie Sun","Le Wu","Min Zhang","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2405.20718v1.pdf","comment":"Accepted by KDD 2024"},{"id":"http://arxiv.org/abs/2405.20710v1","updated":"2024-05-31T09:07:03Z","published":"2024-05-31T09:07:03Z","title":"Information Maximization via Variational Autoencoders for Cross-Domain\n Recommendation","summary":" Cross-Domain Sequential Recommendation (CDSR) methods aim to address the data\nsparsity and cold-start problems present in Single-Domain Sequential\nRecommendation (SDSR). Existing CDSR methods typically rely on overlapping\nusers, designing complex cross-domain modules to capture users' latent\ninterests that can propagate across different domains. However, their\npropagated informative information is limited to the overlapping users and the\nusers who have rich historical behavior records. As a result, these methods\noften underperform in real-world scenarios, where most users are\nnon-overlapping (cold-start) and long-tailed. In this research, we introduce a\nnew CDSR framework named Information Maximization Variational Autoencoder\n(\\textbf{\\texttt{IM-VAE}}). Here, we suggest using a Pseudo-Sequence Generator\nto enhance the user's interaction history input for downstream fine-grained\nCDSR models to alleviate the cold-start issues. We also propose a Generative\nRecommendation Framework combined with three regularizers inspired by the\nmutual information maximization (MIM) theory \\cite{mcgill1954multivariate} to\ncapture the semantic differences between a user's interests shared across\ndomains and those specific to certain domains, as well as address the\ninformational gap between a user's actual interaction sequences and the\npseudo-sequences generated. To the best of our knowledge, this paper is the\nfirst CDSR work that considers the information disentanglement and denoising of\npseudo-sequences in the open-world recommendation scenario. Empirical\nexperiments illustrate that \\texttt{IM-VAE} outperforms the state-of-the-art\napproaches on two real-world cross-domain datasets on all sorts of users,\nincluding cold-start and tailed users, demonstrating the effectiveness of\n\\texttt{IM-VAE} in open-world recommendation.\n","authors":["Xuying Ning","Wujiang Xu","Xiaolei Liu","Mingming Ha","Qiongxu Ma","Youru Li","Linxun Chen","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20710v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.16969v2","updated":"2024-05-31T08:37:12Z","published":"2024-01-30T12:50:38Z","title":"Taxonomy of Mathematical Plagiarism","summary":" Plagiarism is a pressing concern, even more so with the availability of large\nlanguage models. Existing plagiarism detection systems reliably find copied and\nmoderately reworded text but fail for idea plagiarism, especially in\nmathematical science, which heavily uses formal mathematical notation. We make\ntwo contributions. First, we establish a taxonomy of mathematical content reuse\nby annotating potentially plagiarised 122 scientific document pairs. Second, we\nanalyze the best-performing approaches to detect plagiarism and mathematical\ncontent similarity on the newly established taxonomy. We found that the\nbest-performing methods for plagiarism and math content similarity achieve an\noverall detection score (PlagDet) of 0.06 and 0.16, respectively. The\nbest-performing methods failed to detect most cases from all seven newly\nestablished math similarity types. Outlined contributions will benefit research\nin plagiarism detection systems, recommender systems, question-answering\nsystems, and search engines. We make our experiment's code and annotated\ndataset available to the community:\nhttps://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism\n","authors":["Ankit Satpute","Andre Greiner-Petter","Noah Gießing","Isabel Beckenbach","Moritz Schubotz","Olaf Teschke","Akiko Aizawa","Bela Gipp"],"pdf_url":"https://arxiv.org/pdf/2401.16969v2.pdf","comment":"46th European Conference on Information Retrieval (ECIR)"},{"id":"http://arxiv.org/abs/2405.20654v1","updated":"2024-05-31T07:43:42Z","published":"2024-05-31T07:43:42Z","title":"Passage-specific Prompt Tuning for Passage Reranking in Question\n Answering with Large Language Models","summary":" Effective passage retrieval and reranking methods have been widely utilized\nto identify suitable candidates in open-domain question answering tasks, recent\nstudies have resorted to LLMs for reranking the retrieved passages by the\nlog-likelihood of the question conditioned on each passage. Although these\nmethods have demonstrated promising results, the performance is notably\nsensitive to the human-written prompt (or hard prompt), and fine-tuning LLMs\ncan be computationally intensive and time-consuming. Furthermore, this approach\nlimits the leverage of question-passage relevance pairs and passage-specific\nknowledge to enhance the ranking capabilities of LLMs. In this paper, we\npropose passage-specific prompt tuning for reranking in open-domain question\nanswering (PSPT): a parameter-efficient method that fine-tunes learnable\npassage-specific soft prompts, incorporating passage-specific knowledge from a\nlimited set of question-passage relevance pairs. The method involves ranking\nretrieved passages based on the log-likelihood of the model generating the\nquestion conditioned on each passage and the learned soft prompt. We conducted\nextensive experiments utilizing the Llama-2-chat-7B model across three publicly\navailable open-domain question answering datasets and the results demonstrate\nthe effectiveness of the proposed approach.\n","authors":["Xuyang Wu","Zhiyuan Peng","Sravanthi Rajanala","Hsin-Tai Wu","Yi Fang"],"pdf_url":"https://arxiv.org/pdf/2405.20654v1.pdf","comment":"Accepted at Gen-IR@SIGIR24"},{"id":"http://arxiv.org/abs/2405.20646v1","updated":"2024-05-31T07:24:42Z","published":"2024-05-31T07:24:42Z","title":"Large Language Models Enhanced Sequential Recommendation for Long-tail\n User and Item","summary":" Sequential recommendation systems (SRS) serve the purpose of predicting\nusers' subsequent preferences based on their past interactions and have been\napplied across various domains such as e-commerce and social networking\nplatforms. However, practical SRS encounters challenges due to the fact that\nmost users engage with only a limited number of items, while the majority of\nitems are seldom consumed. These challenges, termed as the long-tail user and\nlong-tail item dilemmas, often create obstacles for traditional SRS methods.\nMitigating these challenges is crucial as they can significantly impact user\nsatisfaction and business profitability. While some research endeavors have\nalleviated these issues, they still grapple with issues such as seesaw or noise\nstemming from the scarcity of interactions. The emergence of large language\nmodels (LLMs) presents a promising avenue to address these challenges from a\nsemantic standpoint. In this study, we introduce the Large Language Models\nEnhancement framework for Sequential Recommendation (LLM-ESR), which leverages\nsemantic embeddings from LLMs to enhance SRS performance without increasing\ncomputational overhead. To combat the long-tail item challenge, we propose a\ndual-view modeling approach that fuses semantic information from LLMs with\ncollaborative signals from traditional SRS. To address the long-tail user\nchallenge, we introduce a retrieval augmented self-distillation technique to\nrefine user preference representations by incorporating richer interaction data\nfrom similar users. Through comprehensive experiments conducted on three\nauthentic datasets using three widely used SRS models, our proposed enhancement\nframework demonstrates superior performance compared to existing methodologies.\n","authors":["Qidong Liu","Xian Wu","Xiangyu Zhao","Yejing Wang","Zijian Zhang","Feng Tian","Yefeng Zheng"],"pdf_url":"https://arxiv.org/pdf/2405.20646v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00368v3","updated":"2024-05-31T07:22:01Z","published":"2023-12-31T02:13:18Z","title":"Improving Text Embeddings with Large Language Models","summary":" In this paper, we introduce a novel and simple method for obtaining\nhigh-quality text embeddings using only synthetic data and less than 1k\ntraining steps. Unlike existing methods that often depend on multi-stage\nintermediate pre-training with billions of weakly-supervised text pairs,\nfollowed by fine-tuning with a few labeled datasets, our method does not\nrequire building complex training pipelines or relying on manually collected\ndatasets that are often constrained by task diversity and language coverage. We\nleverage proprietary LLMs to generate diverse synthetic data for hundreds of\nthousands of text embedding tasks across 93 languages. We then fine-tune\nopen-source decoder-only LLMs on the synthetic data using standard contrastive\nloss. Experiments demonstrate that our method achieves strong performance on\nhighly competitive text embedding benchmarks without using any labeled data.\nFurthermore, when fine-tuned with a mixture of synthetic and labeled data, our\nmodel sets new state-of-the-art results on the BEIR and MTEB benchmarks.\n","authors":["Liang Wang","Nan Yang","Xiaolong Huang","Linjun Yang","Rangan Majumder","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2401.00368v3.pdf","comment":"Accepted by ACL 2024"},{"id":"http://arxiv.org/abs/2405.20626v1","updated":"2024-05-31T05:31:00Z","published":"2024-05-31T05:31:00Z","title":"Causal Distillation for Alleviating Performance Heterogeneity in\n Recommender Systems","summary":" Recommendation performance usually exhibits a long-tail distribution over\nusers -- a small portion of head users enjoy much more accurate recommendation\nservices than the others. We reveal two sources of this performance\nheterogeneity problem: the uneven distribution of historical interactions (a\nnatural source); and the biased training of recommender models (a model\nsource). As addressing this problem cannot sacrifice the overall performance, a\nwise choice is to eliminate the model bias while maintaining the natural\nheterogeneity. The key to debiased training lies in eliminating the effect of\nconfounders that influence both the user's historical behaviors and the next\nbehavior. The emerging causal recommendation methods achieve this by modeling\nthe causal effect between user behaviors, however potentially neglect\nunobserved confounders (\\eg, friend suggestions) that are hard to measure in\npractice. To address unobserved confounders, we resort to the front-door\nadjustment (FDA) in causal theory and propose a causal multi-teacher\ndistillation framework (CausalD). FDA requires proper mediators in order to\nestimate the causal effects of historical behaviors on the next behavior. To\nachieve this, we equip CausalD with multiple heterogeneous recommendation\nmodels to model the mediator distribution. Then, the causal effect estimated by\nFDA is the expectation of recommendation prediction over the mediator\ndistribution and the prior distribution of historical behaviors, which is\ntechnically achieved by multi-teacher ensemble. To pursue efficient inference,\nCausalD further distills multiple teachers into one student model to directly\ninfer the causal effect for making recommendations.\n","authors":["Shengyu Zhang","Ziqi Jiang","Jiangchao Yao","Fuli Feng","Kun Kuang","Zhou Zhao","Shuo Li","Hongxia Yang","Tat-Seng Chua","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20626v1.pdf","comment":"TKDE 2023"},{"id":"http://arxiv.org/abs/2405.20565v1","updated":"2024-05-31T01:07:37Z","published":"2024-05-31T01:07:37Z","title":"Knowledge Enhanced Multi-intent Transformer Network for Recommendation","summary":" Incorporating Knowledge Graphs into Recommendation has attracted growing\nattention in industry, due to the great potential of KG in providing abundant\nsupplementary information and interpretability for the underlying models.\nHowever, simply integrating KG into recommendation usually brings in negative\nfeedback in industry, due to the ignorance of the following two factors: i)\nusers' multiple intents, which involve diverse nodes in KG. For example, in\ne-commerce scenarios, users may exhibit preferences for specific styles,\nbrands, or colors. ii) knowledge noise, which is a prevalent issue in Knowledge\nEnhanced Recommendation (KGR) and even more severe in industry scenarios. The\nirrelevant knowledge properties of items may result in inferior model\nperformance compared to approaches that do not incorporate knowledge. To tackle\nthese challenges, we propose a novel approach named Knowledge Enhanced\nMulti-intent Transformer Network for Recommendation (KGTN), comprising two\nprimary modules: Global Intents Modeling with Graph Transformer, and Knowledge\nContrastive Denoising under Intents. Specifically, Global Intents with Graph\nTransformer focuses on capturing learnable user intents, by incorporating\nglobal signals from user-item-relation-entity interactions with a graph\ntransformer, meanwhile learning intent-aware user/item representations.\nKnowledge Contrastive Denoising under Intents is dedicated to learning precise\nand robust representations. It leverages intent-aware representations to sample\nrelevant knowledge, and proposes a local-global contrastive mechanism to\nenhance noise-irrelevant representation learning. Extensive experiments\nconducted on benchmark datasets show the superior performance of our proposed\nmethod over the state-of-the-arts. And online A/B testing results on Alibaba\nlarge-scale industrial recommendation platform also indicate the real-scenario\neffectiveness of KGTN.\n","authors":["Ding Zou","Wei Wei","Feida Zhu","Chuanyu Xu","Tao Zhang","Chengfu Huo"],"pdf_url":"https://arxiv.org/pdf/2405.20565v1.pdf","comment":"Accept By The Web Conf 2024 (WWW 2024) Industry Track. arXiv admin\n note: text overlap with arXiv:2204.08807"},{"id":"http://arxiv.org/abs/2406.00231v1","updated":"2024-05-31T23:29:42Z","published":"2024-05-31T23:29:42Z","title":"LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking","summary":" Ranking passages by prompting a large language model (LLM) can achieve\npromising performance in modern information retrieval (IR) systems. A common\napproach is to sort the ranking list by prompting LLMs for pairwise comparison.\nHowever, sorting-based methods require consistent comparisons to correctly sort\nthe passages, which we show that LLMs often violate. We identify two kinds of\nintrinsic inconsistency in LLM-based pairwise comparisons: order inconsistency\nwhich leads to conflicting results when switching the passage order, and\ntransitive inconsistency which leads to non-transitive triads among all\npreference pairs. In this paper, we propose LLM-RankFusion, an LLM-based\nranking framework that mitigates these inconsistencies and produces a robust\nranking list. LLM-RankFusion mitigates order inconsistency using in-context\nlearning (ICL) to demonstrate order-agnostic comparisons and calibration to\nestimate the underlying preference probability between two passages. We then\naddress transitive inconsistency by aggregating the ranking results from\nmultiple rankers. In our experiments, we empirically show that LLM-RankFusion\ncan significantly reduce inconsistent pairwise comparison results, and improve\nthe ranking quality by making the final ranking list more robust.\n","authors":["Yifan Zeng","Ojas Tendolkar","Raymond Baartmans","Qingyun Wu","Huazheng Wang","Lizhong Chen"],"pdf_url":"https://arxiv.org/pdf/2406.00231v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00198v1","updated":"2024-05-31T21:19:41Z","published":"2024-05-31T21:19:41Z","title":"ImplicitSLIM and How it Improves Embedding-based Collaborative Filtering","summary":" We present ImplicitSLIM, a novel unsupervised learning approach for sparse\nhigh-dimensional data, with applications to collaborative filtering. Sparse\nlinear methods (SLIM) and their variations show outstanding performance, but\nthey are memory-intensive and hard to scale. ImplicitSLIM improves\nembedding-based models by extracting embeddings from SLIM-like models in a\ncomputationally cheap and memory-efficient way, without explicit learning of\nheavy SLIM-like models. We show that ImplicitSLIM improves performance and\nspeeds up convergence for both state of the art and classical collaborative\nfiltering methods. The source code for ImplicitSLIM, related models, and\napplications is available at https://github.com/ilya-shenbin/ImplicitSLIM.\n","authors":["Ilya Shenbin","Sergey Nikolenko"],"pdf_url":"https://arxiv.org/pdf/2406.00198v1.pdf","comment":"Published as a conference paper at ICLR 2024; authors' version"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2402.07131v2","updated":"2024-05-31T17:59:36Z","published":"2024-02-11T08:59:02Z","title":"Resampling methods for Private Statistical Inference","summary":" We consider the task of constructing confidence intervals with differential\nprivacy. We propose two private variants of the non-parametric bootstrap, which\nprivately compute the median of the results of multiple \"little\" bootstraps run\non partitions of the data and give asymptotic bounds on the coverage error of\nthe resulting confidence intervals. For a fixed differential privacy parameter\n$\\epsilon$, our methods enjoy the same error rates as that of the non-private\nbootstrap to within logarithmic factors in the sample size $n$. We empirically\nvalidate the performance of our methods for mean estimation, median estimation,\nand logistic regression with both real and synthetic data. Our methods achieve\nsimilar coverage accuracy to existing methods (and non-private baselines) while\nproviding notably shorter ($\\gtrsim 10$ times) confidence intervals than\nprevious approaches.\n","authors":["Karan Chadha","John Duchi","Rohith Kuditipudi"],"pdf_url":"https://arxiv.org/pdf/2402.07131v2.pdf","comment":"45 pages"},{"id":"http://arxiv.org/abs/2405.21070v1","updated":"2024-05-31T17:57:24Z","published":"2024-05-31T17:57:24Z","title":"Generalization Beyond Data Imbalance: A Controlled Study on CLIP for\n Transferable Insights","summary":" Severe data imbalance naturally exists among web-scale vision-language\ndatasets. Despite this, we find CLIP pre-trained thereupon exhibits notable\nrobustness to the data imbalance compared to supervised learning, and\ndemonstrates significant effectiveness in learning generalizable\nrepresentations. With an aim to investigate the reasons behind this finding, we\nconduct controlled experiments to study various underlying factors, and reveal\nthat CLIP's pretext task forms a dynamic classification problem wherein only a\nsubset of classes is present in training. This isolates the bias from dominant\nclasses and implicitly balances the learning signal. Furthermore, the\nrobustness and discriminability of CLIP improve with more descriptive language\nsupervision, larger data scale, and broader open-world concepts, which are\ninaccessible to supervised learning. Our study not only uncovers the mechanisms\nbehind CLIP's generalizability beyond data imbalance but also provides\ntransferable insights for the research community. The findings are validated in\nboth supervised and self-supervised learning, enabling models trained on\nimbalanced data to achieve CLIP-level performance on diverse recognition tasks.\nCode will be available at: https://github.com/CVMI-Lab/clip-beyond-tail.\n","authors":["Xin Wen","Bingchen Zhao","Yilun Chen","Jiangmiao Pang","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2405.21070v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.00752v2","updated":"2024-05-31T17:55:27Z","published":"2023-12-01T18:01:34Z","title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","summary":" Foundation models, now powering most of the exciting applications in deep\nlearning, are almost universally based on the Transformer architecture and its\ncore attention module. Many subquadratic-time architectures such as linear\nattention, gated convolution and recurrent models, and structured state space\nmodels (SSMs) have been developed to address Transformers' computational\ninefficiency on long sequences, but they have not performed as well as\nattention on important modalities such as language. We identify that a key\nweakness of such models is their inability to perform content-based reasoning,\nand make several improvements. First, simply letting the SSM parameters be\nfunctions of the input addresses their weakness with discrete modalities,\nallowing the model to selectively propagate or forget information along the\nsequence length dimension depending on the current token. Second, even though\nthis change prevents the use of efficient convolutions, we design a\nhardware-aware parallel algorithm in recurrent mode. We integrate these\nselective SSMs into a simplified end-to-end neural network architecture without\nattention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\\times$\nhigher throughput than Transformers) and linear scaling in sequence length, and\nits performance improves on real data up to million-length sequences. As a\ngeneral sequence model backbone, Mamba achieves state-of-the-art performance\nacross several modalities such as language, audio, and genomics. On language\nmodeling, our Mamba-3B model outperforms Transformers of the same size and\nmatches Transformers twice its size, both in pretraining and downstream\nevaluation.\n","authors":["Albert Gu","Tri Dao"],"pdf_url":"https://arxiv.org/pdf/2312.00752v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21064v1","updated":"2024-05-31T17:53:00Z","published":"2024-05-31T17:53:00Z","title":"Recurrent neural networks: vanishing and exploding gradients are not the\n end of the story","summary":" Recurrent neural networks (RNNs) notoriously struggle to learn long-term\nmemories, primarily due to vanishing and exploding gradients. The recent\nsuccess of state-space models (SSMs), a subclass of RNNs, to overcome such\ndifficulties challenges our theoretical understanding. In this paper, we delve\ninto the optimization challenges of RNNs and discover that, as the memory of a\nnetwork increases, changes in its parameters result in increasingly large\noutput variations, making gradient-based learning highly sensitive, even\nwithout exploding gradients. Our analysis further reveals the importance of the\nelement-wise recurrence design pattern combined with careful parametrizations\nin mitigating this effect. This feature is present in SSMs, as well as in other\narchitectures, such as LSTMs. Overall, our insights provide a new explanation\nfor some of the difficulties in gradient-based learning of RNNs and why some\narchitectures perform better than others.\n","authors":["Nicolas Zucchet","Antonio Orvieto"],"pdf_url":"https://arxiv.org/pdf/2405.21064v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21063v1","updated":"2024-05-31T17:51:07Z","published":"2024-05-31T17:51:07Z","title":"Neural Network Verification with Branch-and-Bound for General\n Nonlinearities","summary":" Branch-and-bound (BaB) is among the most effective methods for neural network\n(NN) verification. However, existing works on BaB have mostly focused on NNs\nwith piecewise linear activations, especially ReLU networks. In this paper, we\ndevelop a general framework, named GenBaB, to conduct BaB for general\nnonlinearities in general computational graphs based on linear bound\npropagation. To decide which neuron to branch, we design a new branching\nheuristic which leverages linear bounds as shortcuts to efficiently estimate\nthe potential improvement after branching. To decide nontrivial branching\npoints for general nonlinear functions, we propose to optimize branching points\noffline, which can be efficiently leveraged during verification with a lookup\ntable. We demonstrate the effectiveness of our GenBaB on verifying a wide range\nof NNs, including networks with activation functions such as Sigmoid, Tanh,\nSine and GeLU, as well as networks involving multi-dimensional nonlinear\noperations such as multiplications in LSTMs and Vision Transformers. Our\nframework also allows the verification of general nonlinear computation graphs\nand enables verification applications beyond simple neural networks,\nparticularly for AC Optimal Power Flow (ACOPF). GenBaB is part of the latest\n$\\alpha,\\!\\beta$-CROWN, the winner of the 4th International Verification of\nNeural Networks Competition (VNN-COMP 2023).\n","authors":["Zhouxing Shi","Qirui Jin","Zico Kolter","Suman Jana","Cho-Jui Hsieh","Huan Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.21063v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2405.21061v1","updated":"2024-05-31T17:50:27Z","published":"2024-05-31T17:50:27Z","title":"Graph External Attention Enhanced Transformer","summary":" The Transformer architecture has recently gained considerable attention in\nthe field of graph representation learning, as it naturally overcomes several\nlimitations of Graph Neural Networks (GNNs) with customized attention\nmechanisms or positional and structural encodings. Despite making some\nprogress, existing works tend to overlook external information of graphs,\nspecifically the correlation between graphs. Intuitively, graphs with similar\nstructures should have similar representations. Therefore, we propose Graph\nExternal Attention (GEA) -- a novel attention mechanism that leverages multiple\nexternal node/edge key-value units to capture inter-graph correlations\nimplicitly. On this basis, we design an effective architecture called Graph\nExternal Attention Enhanced Transformer (GEAET), which integrates local\nstructure and global interaction information for more comprehensive graph\nrepresentations. Extensive experiments on benchmark datasets demonstrate that\nGEAET achieves state-of-the-art empirical performance. The source code is\navailable for reproducibility at: https://github.com/icm1018/GEAET.\n","authors":["Jianqing Liang","Min Chen","Jiye Liang"],"pdf_url":"https://arxiv.org/pdf/2405.21061v1.pdf","comment":"In Proceedings of ICML 2024"},{"id":"http://arxiv.org/abs/2405.21060v1","updated":"2024-05-31T17:50:01Z","published":"2024-05-31T17:50:01Z","title":"Transformers are SSMs: Generalized Models and Efficient Algorithms\n Through Structured State Space Duality","summary":" While Transformers have been the main architecture behind deep learning's\nsuccess in language modeling, state-space models (SSMs) such as Mamba have\nrecently been shown to match or outperform Transformers at small to medium\nscale. We show that these families of models are actually quite closely\nrelated, and develop a rich framework of theoretical connections between SSMs\nand variants of attention, connected through various decompositions of a\nwell-studied class of structured semiseparable matrices. Our state space\nduality (SSD) framework allows us to design a new architecture (Mamba-2) whose\ncore layer is an a refinement of Mamba's selective SSM that is 2-8X faster,\nwhile continuing to be competitive with Transformers on language modeling.\n","authors":["Tri Dao","Albert Gu"],"pdf_url":"https://arxiv.org/pdf/2405.21060v1.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2402.15938v3","updated":"2024-05-31T17:49:03Z","published":"2024-02-24T23:54:41Z","title":"Generalization or Memorization: Data Contamination and Trustworthy\n Evaluation for Large Language Models","summary":" Recent statements about the impressive capabilities of large language models\n(LLMs) are usually supported by evaluating on open-access benchmarks.\nConsidering the vast size and wide-ranging sources of LLMs' training data, it\ncould explicitly or implicitly include test data, leading to LLMs being more\nsusceptible to data contamination. However, due to the opacity of training\ndata, the black-box access of models, and the rapid growth of synthetic\ntraining data, detecting and mitigating data contamination for LLMs faces\nsignificant challenges. In this paper, we propose CDD, which stands for\nContamination Detection via output Distribution for LLMs. CDD necessitates only\nthe sampled texts to detect data contamination, by identifying the peakedness\nof LLM's output distribution. To mitigate the impact of data contamination in\nevaluation, we also present TED: Trustworthy Evaluation via output\nDistribution, based on the correction of LLM's output distribution. To\nfacilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval,\nfor data contamination detection and contamination mitigation evaluation tasks.\nExtensive experimental results show that CDD achieves the average relative\nimprovements of 21.8\\%-30.2\\% over other contamination detection approaches in\nterms of Accuracy, F1 Score, and AUC metrics, and can effectively detect\nimplicit contamination. TED substantially mitigates performance improvements up\nto 66.9\\% attributed to data contamination across various contamination setups.\nIn real-world applications, we reveal that ChatGPT exhibits a high potential to\nsuffer from data contamination on HumanEval benchmark.\n","authors":["Yihong Dong","Xue Jiang","Huanyu Liu","Zhi Jin","Bin Gu","Mengfei Yang","Ge Li"],"pdf_url":"https://arxiv.org/pdf/2402.15938v3.pdf","comment":"Accepted to ACL"},{"id":"http://arxiv.org/abs/2405.17697v2","updated":"2024-05-31T17:47:52Z","published":"2024-05-27T23:04:37Z","title":"P4: Towards private, personalized, and Peer-to-Peer learning","summary":" Personalized learning is a proposed approach to address the problem of data\nheterogeneity in collaborative machine learning. In a decentralized setting,\nthe two main challenges of personalization are client clustering and data\nprivacy. In this paper, we address these challenges by developing P4\n(Personalized Private Peer-to-Peer) a method that ensures that each client\nreceives a personalized model while maintaining differential privacy guarantee\nof each client's local dataset during and after the training. Our approach\nincludes the design of a lightweight algorithm to identify similar clients and\ngroup them in a private, peer-to-peer (P2P) manner. Once grouped, we develop\ndifferentially-private knowledge distillation for clients to co-train with\nminimal impact on accuracy. We evaluate our proposed method on three benchmark\ndatasets (FEMNIST or Federated EMNIST, CIFAR-10 and CIFAR-100) and two\ndifferent neural network architectures (Linear and CNN-based networks) across a\nrange of privacy parameters. The results demonstrate the potential of P4, as it\noutperforms the state-of-the-art of differential private P2P by up to 40\npercent in terms of accuracy. We also show the practicality of P4 by\nimplementing it on resource constrained devices, and validating that it has\nminimal overhead, e.g., about 7 seconds to run collaborative training between\ntwo clients.\n","authors":["Mohammad Mahdi Maheri","Sandra Siby","Sina Abdollahi","Anastasia Borovykh","Hamed Haddadi"],"pdf_url":"https://arxiv.org/pdf/2405.17697v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.04240v2","updated":"2024-05-31T17:43:54Z","published":"2024-04-05T17:41:52Z","title":"Dynamic Conditional Optimal Transport through Simulation-Free Flows","summary":" We study the geometry of conditional optimal transport (COT) and prove a\ndynamical formulation which generalizes the Benamou-Brenier Theorem. Equipped\nwith these tools, we propose a simulation-free flow-based method for\nconditional generative modeling. Our method couples an arbitrary source\ndistribution to a specified target distribution through a triangular COT plan,\nand a conditional generative model is obtained by approximating the geodesic\npath of measures induced by this COT plan. Our theory and methods are\napplicable in infinite-dimensional settings, making them well suited for a wide\nclass of Bayesian inverse problems. Empirically, we demonstrate that our method\nis competitive on several challenging conditional generation tasks, including\nan infinite-dimensional inverse problem.\n","authors":["Gavin Kerrigan","Giosue Migliorini","Padhraic Smyth"],"pdf_url":"https://arxiv.org/pdf/2404.04240v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21050v1","updated":"2024-05-31T17:43:35Z","published":"2024-05-31T17:43:35Z","title":"Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models","summary":" Adapting large-scale pre-trained generative models in a parameter-efficient\nmanner is gaining traction. Traditional methods like low rank adaptation\nachieve parameter efficiency by imposing constraints but may not be optimal for\ntasks requiring high representation capacity. We propose a novel spectrum-aware\nadaptation framework for generative models. Our method adjusts both singular\nvalues and their basis vectors of pretrained weights. Using the Kronecker\nproduct and efficient Stiefel optimizers, we achieve parameter-efficient\nadaptation of orthogonal matrices. We introduce Spectral Orthogonal\nDecomposition Adaptation (SODA), which balances computational efficiency and\nrepresentation capacity. Extensive evaluations on text-to-image diffusion\nmodels demonstrate SODA's effectiveness, offering a spectrum-aware alternative\nto existing fine-tuning methods.\n","authors":["Xinxi Zhang","Song Wen","Ligong Han","Felix Juefei-Xu","Akash Srivastava","Junzhou Huang","Hao Wang","Molei Tao","Dimitris N. Metaxas"],"pdf_url":"https://arxiv.org/pdf/2405.21050v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21047v1","updated":"2024-05-31T17:39:15Z","published":"2024-05-31T17:39:15Z","title":"Grammar-Aligned Decoding","summary":" Large Language Models (LLMs) struggle with reliably generating highly\nstructured outputs, such as program code, mathematical formulas, or well-formed\nmarkup. Constrained decoding approaches mitigate this problem by greedily\nrestricting what tokens an LLM can output at each step to guarantee that the\noutput matches a given constraint. Specifically, in grammar-constrained\ndecoding (GCD), the LLM's output must follow a given grammar. In this paper we\ndemonstrate that GCD techniques (and in general constrained decoding\ntechniques) can distort the LLM's distribution, leading to outputs that are\ngrammatical but appear with likelihoods that are not proportional to the ones\ngiven by the LLM, and so ultimately are low-quality. We call the problem of\naligning sampling with a grammar constraint, grammar-aligned decoding (GAD),\nand propose adaptive sampling with approximate expected futures (ASAp), a\ndecoding algorithm that guarantees the output to be grammatical while provably\nproducing outputs that match the conditional probability of the LLM's\ndistribution conditioned on the given grammar constraint. Our algorithm uses\nprior sample outputs to soundly overapproximate the future grammaticality of\ndifferent output prefixes. Our evaluation on code generation and structured NLP\ntasks shows how ASAp often produces outputs with higher likelihood (according\nto the LLM's distribution) than existing GCD techniques, while still enforcing\nthe desired grammatical constraints.\n","authors":["Kanghee Park","Jiayu Wang","Taylor Berg-Kirkpatrick","Nadia Polikarpova","Loris D'Antoni"],"pdf_url":"https://arxiv.org/pdf/2405.21047v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21046v1","updated":"2024-05-31T17:39:06Z","published":"2024-05-31T17:39:06Z","title":"Exploratory Preference Optimization: Harnessing Implicit\n Q*-Approximation for Sample-Efficient RLHF","summary":" Reinforcement learning from human feedback (RLHF) has emerged as a central\ntool for language model alignment. We consider online exploration in RLHF,\nwhich exploits interactive access to human or AI feedback by deliberately\nencouraging the model to produce diverse, maximally informative responses. By\nallowing RLHF to confidently stray from the pre-trained model, online\nexploration offers the possibility of novel, potentially super-human\ncapabilities, but its full potential as a paradigm for language model training\nhas yet to be realized, owing to computational and statistical bottlenecks in\ndirectly adapting existing reinforcement learning techniques. We propose a new\nalgorithm for online exploration in RLHF, Exploratory Preference Optimization\n(XPO), which is simple and practical -- a one-line change to (online) Direct\nPreference Optimization (DPO; Rafailov et al., 2023) -- yet enjoys the\nstrongest known provable guarantees and promising empirical performance. XPO\naugments the DPO objective with a novel and principled exploration bonus,\nempowering the algorithm to explore outside the support of the initial model\nand human feedback data. In theory, we show that XPO is provably\nsample-efficient and converges to a near-optimal language model policy under\nnatural exploration conditions, irrespective of whether the initial model has\ngood coverage. Our analysis, which builds on the observation that DPO\nimplicitly performs a form of $Q^{\\star}$-approximation (or, Bellman error\nminimization), combines previously disparate techniques from language modeling\nand theoretical reinforcement learning in a serendipitous fashion through the\nperspective of KL-regularized Markov decision processes. Empirically, we find\nthat XPO is more sample-efficient than non-exploratory DPO variants in a\npreliminary evaluation.\n","authors":["Tengyang Xie","Dylan J. Foster","Akshay Krishnamurthy","Corby Rosset","Ahmed Awadallah","Alexander Rakhlin"],"pdf_url":"https://arxiv.org/pdf/2405.21046v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.18239v2","updated":"2024-05-31T17:38:51Z","published":"2024-04-28T16:31:32Z","title":"SOUL: Unlocking the Power of Second-Order Optimization for LLM\n Unlearning","summary":" Large Language Models (LLMs) have highlighted the necessity of effective\nunlearning mechanisms to comply with data regulations and ethical AI practices.\nLLM unlearning aims at removing undesired data influences and associated model\ncapabilities without compromising utility out of the scope of unlearning. While\ninterest in studying LLM unlearning is growing,the impact of the optimizer\nchoice for LLM unlearning remains under-explored. In this work, we shed light\non the significance of optimizer selection in LLM unlearning for the first\ntime, establishing a clear connection between {second-order optimization} and\ninfluence unlearning (a classical approach using influence functions to update\nthe model for data influence removal). This insight propels us to develop a\nsecond-order unlearning framework, termed SOUL, built upon the second-order\nclipped stochastic optimization (Sophia)-based LLM training method. SOUL\nextends the static, one-shot model update using influence unlearning to a\ndynamic, iterative unlearning process. Our extensive experiments show that SOUL\nconsistently outperforms conventional first-order methods across various\nunlearning tasks, models, and metrics, suggesting the promise of second-order\noptimization in providing a scalable and easily implementable solution for LLM\nunlearning.\n","authors":["Jinghan Jia","Yihua Zhang","Yimeng Zhang","Jiancheng Liu","Bharat Runwal","James Diffenderfer","Bhavya Kailkhura","Sijia Liu"],"pdf_url":"https://arxiv.org/pdf/2404.18239v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21045v1","updated":"2024-05-31T17:38:49Z","published":"2024-05-31T17:38:49Z","title":"An Attention-Based Multi-Context Convolutional Encoder-Decoder Neural\n Network for Work Zone Traffic Impact Prediction","summary":" Work zone is one of the major causes of non-recurrent traffic congestion and\nroad incidents. Despite the significance of its impact, studies on predicting\nthe traffic impact of work zones remain scarce. In this paper, we propose a\ndata integration pipeline that enhances the utilization of work zone and\ntraffic data from diversified platforms, and introduce a novel deep learning\nmodel to predict the traffic speed and incident likelihood during planned work\nzone events. The proposed model transforms traffic patterns into 2D space-time\nimages for both model input and output and employs an attention-based\nmulti-context convolutional encoder-decoder architecture to capture the\nspatial-temporal dependencies between work zone events and traffic variations.\nTrained and validated on four years of archived work zone traffic data from\nMaryland, USA, the model demonstrates superior performance over baseline models\nin predicting traffic speed, incident likelihood, and inferred traffic\nattributes such as queue length and congestion timings (i.e., start time and\nduration). Specifically, the proposed model outperforms the baseline models by\nreducing the prediction error of traffic speed by 5% to 34%, queue length by\n11% to 29%, congestion timing by 6% to 17%, and increasing the accuracy of\nincident predictions by 5% to 7%. Consequently, this model offers substantial\npromise for enhancing the planning and traffic management of work zones.\n","authors":["Qinhua Jiang","Xishun Liao","Yaofa Gong","Jiaqi Ma"],"pdf_url":"https://arxiv.org/pdf/2405.21045v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21043v1","updated":"2024-05-31T17:36:16Z","published":"2024-05-31T17:36:16Z","title":"Target Networks and Over-parameterization Stabilize Off-policy\n Bootstrapping with Function Approximation","summary":" We prove that the combination of a target network and over-parameterized\nlinear function approximation establishes a weaker convergence condition for\nbootstrapped value estimation in certain cases, even with off-policy data. Our\ncondition is naturally satisfied for expected updates over the entire\nstate-action space or learning with a batch of complete trajectories from\nepisodic Markov decision processes. Notably, using only a target network or an\nover-parameterized model does not provide such a convergence guarantee.\nAdditionally, we extend our results to learning with truncated trajectories,\nshowing that convergence is achievable for all tasks with minor modifications,\nakin to value truncation for the final states in trajectories. Our primary\nresult focuses on temporal difference estimation for prediction, providing\nhigh-probability value estimation error bounds and empirical analysis on\nBaird's counterexample and a Four-room task. Furthermore, we explore the\ncontrol setting, demonstrating that similar convergence conditions apply to\nQ-learning.\n","authors":["Fengdi Che","Chenjun Xiao","Jincheng Mei","Bo Dai","Ramki Gummadi","Oscar A Ramirez","Christopher K Harris","A. Rupam Mahmood","Dale Schuurmans"],"pdf_url":"https://arxiv.org/pdf/2405.21043v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21042v1","updated":"2024-05-31T17:33:07Z","published":"2024-05-31T17:33:07Z","title":"Comparing information content of representation spaces for\n disentanglement with VAE ensembles","summary":" Disentanglement is the endeavour to use machine learning to divide\ninformation about a dataset into meaningful fragments. In practice these\nfragments are representation (sub)spaces, often the set of channels in the\nlatent space of a variational autoencoder (VAE). Assessments of disentanglement\npredominantly employ metrics that are coarse-grained at the model level, but\nthis approach can obscure much about the process of information fragmentation.\nHere we propose to study the learned channels in aggregate, as the fragments of\ninformation learned by an ensemble of repeat training runs. Additionally, we\ndepart from prior work where measures of similarity between individual\nsubspaces neglected the nature of data embeddings as probability distributions.\nInstead, we view representation subspaces as communication channels that\nperform a soft clustering of the data; consequently, we generalize two classic\ninformation-theoretic measures of similarity between clustering assignments to\ncompare representation spaces. We develop a lightweight method of estimation\nbased on fingerprinting representation subspaces by their ability to\ndistinguish dataset samples, allowing us to identify, analyze, and leverage\nmeaningful structure in ensembles of VAEs trained on synthetic and natural\ndatasets. Using this fully unsupervised pipeline we identify \"hotspots\" in the\nspace of information fragments: groups of nearly identical representation\nsubspaces that appear repeatedly in an ensemble of VAEs, particularly as\nregularization is increased. Finally, we leverage the proposed methodology to\nachieve ensemble learning with VAEs, boosting the information content of a set\nof weak learners -- a capability not possible with previous methods of\nassessing channel similarity.\n","authors":["Kieran A. Murphy","Sam Dillavou","Dani S. Bassett"],"pdf_url":"https://arxiv.org/pdf/2405.21042v1.pdf","comment":"Code:\n https://github.com/murphyka/representation-space-info-comparison"},{"id":"http://arxiv.org/abs/2402.09615v3","updated":"2024-05-31T17:31:38Z","published":"2024-02-14T23:09:15Z","title":"API Pack: A Massive Multi-Programming Language Dataset for API Call\n Generation","summary":" We introduce API Pack, a massive multi-programming language dataset\ncontaining more than 1 million instruction-API call pairs to improve the API\ncall generation capabilities of large language models. By fine-tuning\nCodeLlama-13B on 20,000 Python instances from API Pack, we achieved around 10%\nand 5% higher accuracy compared to GPT-3.5 and GPT-4, respectively, in\ngenerating unseen API calls. Fine-tuning on API Pack enables cross-programming\nlanguage generalization by leveraging a large amount of data in one language\nand small amounts of data from other languages. Scaling the training data to 1\nmillion instances further improves the model's generalization to new APIs not\nencountered during training. We open-source the API Pack dataset, trained\nmodels, and associated source code at https://github.com/zguo0525/API-Pack to\nfacilitate further research.\n","authors":["Zhen Guo","Adriana Meza Soria","Wei Sun","Yikang Shen","Rameswar Panda"],"pdf_url":"https://arxiv.org/pdf/2402.09615v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21036v1","updated":"2024-05-31T17:29:39Z","published":"2024-05-31T17:29:39Z","title":"A-PETE: Adaptive Prototype Explanations of Tree Ensembles","summary":" The need for interpreting machine learning models is addressed through\nprototype explanations within the context of tree ensembles. An algorithm named\nAdaptive Prototype Explanations of Tree Ensembles (A-PETE) is proposed to\nautomatise the selection of prototypes for these classifiers. Its unique\ncharacteristics is using a specialised distance measure and a modified k-medoid\napproach. Experiments demonstrated its competitive predictive accuracy with\nrespect to earlier explanation algorithms. It also provides a a sufficient\nnumber of prototypes for the purpose of interpreting the random forest\nclassifier.\n","authors":["Jacek Karolczak","Jerzy Stefanowski"],"pdf_url":"https://arxiv.org/pdf/2405.21036v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08097v2","updated":"2024-05-31T17:20:29Z","published":"2024-02-12T22:34:53Z","title":"An Accelerated Gradient Method for Convex Smooth Simple Bilevel\n Optimization","summary":" In this paper, we focus on simple bilevel optimization problems, where we\nminimize a convex smooth objective function over the optimal solution set of\nanother convex smooth constrained optimization problem. We present a novel\nbilevel optimization method that locally approximates the solution set of the\nlower-level problem using a cutting plane approach and employs an accelerated\ngradient-based update to reduce the upper-level objective function over the\napproximated solution set. We measure the performance of our method in terms of\nsuboptimality and infeasibility errors and provide non-asymptotic convergence\nguarantees for both error criteria. Specifically, when the feasible set is\ncompact, we show that our method requires at most\n$\\mathcal{O}(\\max\\{1/\\sqrt{\\epsilon_{f}}, 1/\\epsilon_g\\})$ iterations to find a\nsolution that is $\\epsilon_f$-suboptimal and $\\epsilon_g$-infeasible. Moreover,\nunder the additional assumption that the lower-level objective satisfies the\n$r$-th H\\\"olderian error bound, we show that our method achieves an iteration\ncomplexity of\n$\\mathcal{O}(\\max\\{\\epsilon_{f}^{-\\frac{2r-1}{2r}},\\epsilon_{g}^{-\\frac{2r-1}{2r}}\\})$,\nwhich matches the optimal complexity of single-level convex constrained\noptimization when $r=1$.\n","authors":["Jincheng Cao","Ruichen Jiang","Erfan Yazdandoost Hamedani","Aryan Mokhtari"],"pdf_url":"https://arxiv.org/pdf/2402.08097v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19961v2","updated":"2024-05-31T17:18:35Z","published":"2024-05-30T11:32:42Z","title":"Collective Variable Free Transition Path Sampling with Generative Flow\n Network","summary":" Understanding transition paths between meta-stable states in molecular\nsystems is fundamental for material design and drug discovery. However,\nsampling these paths via molecular dynamics simulations is computationally\nprohibitive due to the high-energy barriers between the meta-stable states.\nRecent machine learning approaches are often restricted to simple systems or\nrely on collective variables (CVs) extracted from expensive domain knowledge.\nIn this work, we propose to leverage generative flow networks (GFlowNets) to\nsample transition paths without relying on CVs. We reformulate the problem as\namortized energy-based sampling over molecular trajectories and train a bias\npotential by minimizing the squared log-ratio between the target distribution\nand the generator, derived from the flow matching objective of GFlowNets. Our\nevaluation on three proteins (Alanine Dipeptide, Polyproline, and Chignolin)\ndemonstrates that our approach, called TPS-GFN, generates more realistic and\ndiverse transition paths than the previous CV-free machine learning approach.\n","authors":["Kiyoung Seong","Seonghyun Park","Seonghwan Kim","Woo Youn Kim","Sungsoo Ahn"],"pdf_url":"https://arxiv.org/pdf/2405.19961v2.pdf","comment":"9 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2405.21027v1","updated":"2024-05-31T17:16:29Z","published":"2024-05-31T17:16:29Z","title":"Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles","summary":" For solving zero-sum games involving non-transitivity, a common approach is\nto maintain population policies to approximate the Nash Equilibrium (NE).\nPrevious research has shown that the Policy Space Response Oracle (PSRO) is an\neffective multi-agent reinforcement learning framework for these games.\nHowever, repeatedly training new policies from scratch to approximate the Best\nResponse (BR) to opponents' mixed policies at each iteration is inefficient and\ncostly. While some PSRO methods initialize a new BR policy by inheriting from\npast BR policies, this approach limits the exploration of new policies,\nespecially against challenging opponents.To address this issue, we propose\nFusion-PSRO, which uses model fusion to initialize the policy for better\napproximation to BR. With Top-k probabilities from NE, we select high-quality\nbase policies and fuse them into a new BR policy through model averaging. This\napproach allows the initialized policy to incorporate multiple expert policies,\nmaking it easier to handle difficult opponents compared to inheriting or\ninitializing from scratch. Additionally, our method only modifies the policy\ninitialization, enabling its application to nearly all PSRO variants without\nadditional training overhead.Our experiments with non-transitive matrix games,\nLeduc poker, and the more complex Liars Dice demonstrate that Fusion-PSRO\nenhances the performance of nearly all PSRO variants, achieving lower\nexploitability.\n","authors":["Jiesong Lian","Yucong Huang","Mingzhi Wang","Chengdong Ma","Yixue Hao","Ying Wen","Yaodong Yang"],"pdf_url":"https://arxiv.org/pdf/2405.21027v1.pdf","comment":"20 pages, 5 figures"},{"id":"http://arxiv.org/abs/2405.21021v1","updated":"2024-05-31T17:09:07Z","published":"2024-05-31T17:09:07Z","title":"Beyond Conventional Parametric Modeling: Data-Driven Framework for\n Estimation and Prediction of Time Activity Curves in Dynamic PET Imaging","summary":" Dynamic Positron Emission Tomography (dPET) imaging and Time-Activity Curve\n(TAC) analyses are essential for understanding and quantifying the\nbiodistribution of radiopharmaceuticals over time and space. Traditional\ncompartmental modeling, while foundational, commonly struggles to fully capture\nthe complexities of biological systems, including non-linear dynamics and\nvariability. This study introduces an innovative data-driven neural\nnetwork-based framework, inspired by Reaction Diffusion systems, designed to\naddress these limitations. Our approach, which adaptively fits TACs from dPET,\nenables the direct calibration of diffusion coefficients and reaction terms\nfrom observed data, offering significant improvements in predictive accuracy\nand robustness over traditional methods, especially in complex biological\nscenarios. By more accurately modeling the spatio-temporal dynamics of\nradiopharmaceuticals, our method advances modeling of pharmacokinetic and\npharmacodynamic processes, enabling new possibilities in quantitative nuclear\nmedicine.\n","authors":["Niloufar Zakariaei","Arman Rahmim","Eldad Haber"],"pdf_url":"https://arxiv.org/pdf/2405.21021v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21018v1","updated":"2024-05-31T17:07:15Z","published":"2024-05-31T17:07:15Z","title":"Improved Techniques for Optimization-Based Jailbreaking on Large\n Language Models","summary":" Large language models (LLMs) are being rapidly developed, and a key component\nof their widespread deployment is their safety-related alignment. Many\nred-teaming efforts aim to jailbreak LLMs, where among these efforts, the\nGreedy Coordinate Gradient (GCG) attack's success has led to a growing interest\nin the study of optimization-based jailbreaking techniques. Although GCG is a\nsignificant milestone, its attacking efficiency remains unsatisfactory. In this\npaper, we present several improved (empirical) techniques for\noptimization-based jailbreaks like GCG. We first observe that the single target\ntemplate of \"Sure\" largely limits the attacking performance of GCG; given this,\nwe propose to apply diverse target templates containing harmful self-suggestion\nand/or guidance to mislead LLMs. Besides, from the optimization aspects, we\npropose an automatic multi-coordinate updating strategy in GCG (i.e.,\nadaptively deciding how many tokens to replace in each step) to accelerate\nconvergence, as well as tricks like easy-to-hard initialisation. Then, we\ncombine these improved technologies to develop an efficient jailbreak method,\ndubbed $\\mathcal{I}$-GCG. In our experiments, we evaluate on a series of\nbenchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate\nthat our improved techniques can help GCG outperform state-of-the-art\njailbreaking attacks and achieve nearly 100% attack success rate. The code is\nreleased at https://github.com/jiaxiaojunQAQ/I-GCG.\n","authors":["Xiaojun Jia","Tianyu Pang","Chao Du","Yihao Huang","Jindong Gu","Yang Liu","Xiaochun Cao","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2405.21018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.00825v4","updated":"2024-05-31T17:07:04Z","published":"2022-04-27T05:03:45Z","title":"Stochastic Online Fisher Markets: Static Pricing Limits and Adaptive\n Enhancements","summary":" Fisher markets are one of the most fundamental models for resource\nallocation. However, the problem of computing equilibrium prices in Fisher\nmarkets typically relies on complete knowledge of users' budgets and utility\nfunctions and requires transactions to happen in a static market where all\nusers are present simultaneously. Motivated by these practical considerations,\nwe study an online variant of Fisher markets, wherein users with privately\nknown utility and budget parameters, drawn i.i.d. from a distribution, arrive\nsequentially. In this setting, we first study the limitations of static pricing\nalgorithms, which set uniform prices for all users, along two performance\nmetrics: (i) regret, i.e., the optimality gap in the objective of the\nEisenberg-Gale program between an online algorithm and an oracle with complete\ninformation, and (ii) capacity violations, i.e., the over-consumption of goods\nrelative to their capacities. Given the limitations of static pricing, we\ndesign adaptive posted-pricing algorithms, one with knowledge of the\ndistribution of users' budget and utility parameters and another that adjusts\nprices solely based on past observations of user consumption, i.e., revealed\npreference feedback, with improved performance guarantees. Finally, we present\nnumerical experiments to compare our revealed preference algorithm's\nperformance to several benchmarks.\n","authors":["Devansh Jalota","Yinyu Ye"],"pdf_url":"https://arxiv.org/pdf/2205.00825v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18418v2","updated":"2024-05-31T17:03:00Z","published":"2024-05-28T17:57:23Z","title":"Hierarchical World Models as Visual Whole-Body Humanoid Controllers","summary":" Whole-body control for humanoids is challenging due to the high-dimensional\nnature of the problem, coupled with the inherent instability of a bipedal\nmorphology. Learning from visual observations further exacerbates this\ndifficulty. In this work, we explore highly data-driven approaches to visual\nwhole-body humanoid control based on reinforcement learning, without any\nsimplifying assumptions, reward design, or skill primitives. Specifically, we\npropose a hierarchical world model in which a high-level agent generates\ncommands based on visual observations for a low-level agent to execute, both of\nwhich are trained with rewards. Our approach produces highly performant control\npolicies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing\nmotions that are broadly preferred by humans. Code and videos:\nhttps://nicklashansen.com/rlpuppeteer\n","authors":["Nicklas Hansen","Jyothir S V","Vlad Sobal","Yann LeCun","Xiaolong Wang","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2405.18418v2.pdf","comment":"Code and videos at https://nicklashansen.com/rlpuppeteer"},{"id":"http://arxiv.org/abs/2305.09938v4","updated":"2024-05-31T17:02:37Z","published":"2023-05-17T03:52:40Z","title":"Mastering Long-Tail Complexity on Graphs: Characterization, Learning,\n and Generalization","summary":" In the context of long-tail classification on graphs, the vast majority of\nexisting work primarily revolves around the development of model debiasing\nstrategies, intending to mitigate class imbalances and enhance the overall\nperformance. Despite the notable success, there is very limited literature that\nprovides a theoretical tool for characterizing the behaviors of long-tail\nclasses in graphs and gaining insight into generalization performance in\nreal-world scenarios. To bridge this gap, we propose a generalization bound for\nlong-tail classification on graphs by formulating the problem in the fashion of\nmulti-task learning, i.e., each task corresponds to the prediction of one\nparticular class. Our theoretical results show that the generalization\nperformance of long-tail classification is dominated by the overall loss range\nand the task complexity. Building upon the theoretical findings, we propose a\nnovel generic framework HierTail for long-tail classification on graphs. In\nparticular, we start with a hierarchical task grouping module that allows us to\nassign related tasks into hypertasks and thus control the complexity of the\ntask space; then, we further design a balanced contrastive learning module to\nadaptively balance the gradients of both head and tail classes to control the\nloss range across all tasks in a unified fashion. Extensive experiments\ndemonstrate the effectiveness of HierTail in characterizing long-tail classes\non real graphs, which achieves up to 12.9% improvement over the leading\nbaseline method in accuracy.\n","authors":["Haohui Wang","Baoyu Jing","Kaize Ding","Yada Zhu","Wei Cheng","Si Zhang","Yonghui Fan","Liqing Zhang","Dawei Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.09938v4.pdf","comment":"Accepted at KDD 2024"},{"id":"http://arxiv.org/abs/2405.21012v1","updated":"2024-05-31T16:52:51Z","published":"2024-05-31T16:52:51Z","title":"G-Transformer for Conditional Average Potential Outcome Estimation over\n Time","summary":" Estimating potential outcomes for treatments over time based on observational\ndata is important for personalized decision-making in medicine. Yet, existing\nneural methods for this task suffer from either (a) bias or (b) large variance.\nIn order to address both limitations, we introduce the G-transformer (GT). Our\nGT is a novel, neural end-to-end model designed for unbiased, low-variance\nestimation of conditional average potential outcomes (CAPOs) over time.\nSpecifically, our GT is the first neural model to perform regression-based\niterative G-computation for CAPOs in the time-varying setting. We evaluate the\neffectiveness of our GT across various experiments. In sum, this work\nrepresents a significant step towards personalized decision-making from\nelectronic health records.\n","authors":["Konstantin Hess","Dennis Frauen","Valentyn Melnychuk","Stefan Feuerriegel"],"pdf_url":"https://arxiv.org/pdf/2405.21012v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21003v1","updated":"2024-05-31T16:44:40Z","published":"2024-05-31T16:44:40Z","title":"Explaining Predictions by Characteristic Rules","summary":" Characteristic rules have been advocated for their ability to improve\ninterpretability over discriminative rules within the area of rule learning.\nHowever, the former type of rule has not yet been used by techniques for\nexplaining predictions. A novel explanation technique, called CEGA\n(Characteristic Explanatory General Association rules), is proposed, which\nemploys association rule mining to aggregate multiple explanations generated by\nany standard local explanation technique into a set of characteristic rules. An\nempirical investigation is presented, in which CEGA is compared to two\nstate-of-the-art methods, Anchors and GLocalX, for producing local and\naggregated explanations in the form of discriminative rules. The results\nsuggest that the proposed approach provides a better trade-off between fidelity\nand complexity compared to the two state-of-the-art approaches; CEGA and\nAnchors significantly outperform GLocalX with respect to fidelity, while CEGA\nand GLocalX significantly outperform Anchors with respect to the number of\ngenerated rules. The effect of changing the format of the explanations of CEGA\nto discriminative rules and using LIME and SHAP as local explanation techniques\ninstead of Anchors are also investigated. The results show that the\ncharacteristic explanatory rules still compete favorably with rules in the\nstandard discriminative format. The results also indicate that using CEGA in\ncombination with either SHAP or Anchors consistently leads to a higher fidelity\ncompared to using LIME as the local explanation technique.\n","authors":["Amr Alkhatib","Henrik Boström","Michalis Vazirgiannis"],"pdf_url":"https://arxiv.org/pdf/2405.21003v1.pdf","comment":"Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022"},{"id":"http://arxiv.org/abs/2212.07946v3","updated":"2024-05-31T16:40:03Z","published":"2022-12-15T16:28:06Z","title":"Active Inference and Reinforcement Learning: A unified inference on\n continuous state and action spaces under partial observability","summary":" Reinforcement learning (RL) has garnered significant attention for developing\ndecision-making agents that aim to maximize rewards, specified by an external\nsupervisor, within fully observable environments. However, many real-world\nproblems involve partial observations, formulated as partially observable\nMarkov decision processes (POMDPs). Previous studies have tackled RL in POMDPs\nby either incorporating the memory of past actions and observations or by\ninferring the true state of the environment from observed data. However,\naggregating observed data over time becomes impractical in continuous spaces.\nMoreover, inference-based RL approaches often require many samples to perform\nwell, as they focus solely on reward maximization and neglect uncertainty in\nthe inferred state. Active inference (AIF) is a framework formulated in POMDPs\nand directs agents to select actions by minimizing a function called expected\nfree energy (EFE). This supplies reward-maximizing (exploitative) behaviour, as\nin RL, with information-seeking (exploratory) behaviour. Despite this\nexploratory behaviour of AIF, its usage is limited to discrete spaces due to\nthe computational challenges associated with EFE. In this paper, we propose a\nunified principle that establishes a theoretical connection between AIF and RL,\nenabling seamless integration of these two approaches and overcoming their\naforementioned limitations in continuous space POMDP settings. We substantiate\nour findings with theoretical analysis, providing novel perspectives for\nutilizing AIF in the design of artificial agents. Experimental results\ndemonstrate the superior learning capabilities of our method in solving\ncontinuous space partially observable tasks. Notably, our approach harnesses\ninformation-seeking exploration, enabling it to effectively solve reward-free\nproblems and rendering explicit task reward design by an external supervisor\noptional.\n","authors":["Parvin Malekzadeh","Konstantinos N. Plataniotis"],"pdf_url":"https://arxiv.org/pdf/2212.07946v3.pdf","comment":"90 pages including appendices"},{"id":"http://arxiv.org/abs/2405.20993v1","updated":"2024-05-31T16:38:35Z","published":"2024-05-31T16:38:35Z","title":"Information limits and Thouless-Anderson-Palmer equations for spiked\n matrix models with structured noise","summary":" We consider a prototypical problem of Bayesian inference for a structured\nspiked model: a low-rank signal is corrupted by additive noise. While both\ninformation-theoretic and algorithmic limits are well understood when the noise\nis i.i.d. Gaussian, the more realistic case of structured noise still proves to\nbe challenging. To capture the structure while maintaining mathematical\ntractability, a line of work has focused on rotationally invariant noise.\nHowever, existing studies either provide sub-optimal algorithms or they are\nlimited to a special class of noise ensembles. In this paper, we establish the\nfirst characterization of the information-theoretic limits for a noise matrix\ndrawn from a general trace ensemble. These limits are then achieved by an\nefficient algorithm inspired by the theory of adaptive Thouless-Anderson-Palmer\n(TAP) equations. Our approach leverages tools from statistical physics (replica\nmethod) and random matrix theory (generalized spherical integrals), and it\nunveils the equivalence between the rotationally invariant model and a\nsurrogate Gaussian model.\n","authors":["Jean Barbier","Francesco Camilli","Marco Mondelli","Yizhou Xu"],"pdf_url":"https://arxiv.org/pdf/2405.20993v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14622v3","updated":"2024-05-31T16:37:53Z","published":"2024-05-23T14:30:33Z","title":"Calibrated Self-Rewarding Vision Language Models","summary":" Large Vision-Language Models (LVLMs) have made substantial progress by\nintegrating pre-trained large language models (LLMs) and vision models through\ninstruction tuning. Despite these advancements, LVLMs often exhibit the\nhallucination phenomenon, where generated text responses appear linguistically\nplausible but contradict the input image, indicating a misalignment between\nimage and text pairs. This misalignment arises because the model tends to\nprioritize textual information over visual input, even when both the language\nmodel and visual representations are of high quality. Existing methods leverage\nadditional models or human annotations to curate preference data and enhance\nmodality alignment through preference optimization. These approaches may not\neffectively reflect the target LVLM's preferences, making the curated\npreferences easily distinguishable. Our work addresses these challenges by\nproposing the Calibrated Self-Rewarding (CSR) approach, which enables the model\nto self-improve by iteratively generating candidate responses, evaluating the\nreward for each response, and curating preference data for fine-tuning. In the\nreward modeling, we employ a step-wise strategy and incorporate visual\nconstraints into the self-rewarding process to place greater emphasis on visual\ninput. Empirical results demonstrate that CSR enhances performance and reduces\nhallucinations across ten benchmarks and tasks, achieving substantial\nimprovements over existing methods by 7.62%. Our empirical results are further\nsupported by rigorous theoretical analysis, under mild assumptions, verifying\nthe effectiveness of introducing visual constraints into the self-rewarding\nparadigm. Additionally, CSR shows compatibility with different vision-language\nmodels and the ability to incrementally improve performance through iterative\nfine-tuning. Our data and code are available at\nhttps://github.com/YiyangZhou/CSR.\n","authors":["Yiyang Zhou","Zhiyuan Fan","Dongjie Cheng","Sihan Yang","Zhaorun Chen","Chenhang Cui","Xiyao Wang","Yun Li","Linjun Zhang","Huaxiu Yao"],"pdf_url":"https://arxiv.org/pdf/2405.14622v3.pdf","comment":"fix some typos and add acknowledgement section in V3"},{"id":"http://arxiv.org/abs/2405.20991v1","updated":"2024-05-31T16:35:41Z","published":"2024-05-31T16:35:41Z","title":"Hard Cases Detection in Motion Prediction by Vision-Language Foundation\n Models","summary":" Addressing hard cases in autonomous driving, such as anomalous road users,\nextreme weather conditions, and complex traffic interactions, presents\nsignificant challenges. To ensure safety, it is crucial to detect and manage\nthese scenarios effectively for autonomous driving systems. However, the rarity\nand high-risk nature of these cases demand extensive, diverse datasets for\ntraining robust models. Vision-Language Foundation Models (VLMs) have shown\nremarkable zero-shot capabilities as being trained on extensive datasets. This\nwork explores the potential of VLMs in detecting hard cases in autonomous\ndriving. We demonstrate the capability of VLMs such as GPT-4v in detecting hard\ncases in traffic participant motion prediction on both agent and scenario\nlevels. We introduce a feasible pipeline where VLMs, fed with sequential image\nframes with designed prompts, effectively identify challenging agents or\nscenarios, which are verified by existing prediction models. Moreover, by\ntaking advantage of this detection of hard cases by VLMs, we further improve\nthe training efficiency of the existing motion prediction pipeline by\nperforming data selection for the training samples suggested by GPT. We show\nthe effectiveness and feasibility of our pipeline incorporating VLMs with\nstate-of-the-art methods on NuScenes datasets. The code is accessible at\nhttps://github.com/KTH-RPL/Detect_VLM.\n","authors":["Yi Yang","Qingwen Zhang","Kei Ikemura","Nazre Batool","John Folkesson"],"pdf_url":"https://arxiv.org/pdf/2405.20991v1.pdf","comment":"IEEE Intelligent Vehicles Symposium (IV) 2024"},{"id":"http://arxiv.org/abs/2405.20990v1","updated":"2024-05-31T16:35:29Z","published":"2024-05-31T16:35:29Z","title":"Locking Machine Learning Models into Hardware","summary":" Modern Machine Learning models are expensive IP and business competitiveness\noften depends on keeping this IP confidential. This in turn restricts how these\nmodels are deployed -- for example it is unclear how to deploy a model\non-device without inevitably leaking the underlying model. At the same time,\nconfidential computing technologies such as Multi-Party Computation or\nHomomorphic encryption remain impractical for wide adoption. In this paper we\ntake a different approach and investigate feasibility of ML-specific mechanisms\nthat deter unauthorized model use by restricting the model to only be usable on\nspecific hardware, making adoption on unauthorized hardware inconvenient. That\nway, even if IP is compromised, it cannot be trivially used without specialised\nhardware or major model adjustment. In a sense, we seek to enable cheap locking\nof machine learning models into specific hardware. We demonstrate that locking\nmechanisms are feasible by either targeting efficiency of model\nrepresentations, such making models incompatible with quantisation, or tie the\nmodel's operation on specific characteristics of hardware, such as number of\ncycles for arithmetic operations. We demonstrate that locking comes with\nnegligible work and latency overheads, while significantly restricting\nusability of the resultant model on unauthorized hardware.\n","authors":["Eleanor Clifford","Adhithya Saravanan","Harry Langford","Cheng Zhang","Yiren Zhao","Robert Mullins","Ilia Shumailov","Jamie Hayes"],"pdf_url":"https://arxiv.org/pdf/2405.20990v1.pdf","comment":"10 pages, 2 figures of main text; 14 pages, 16 figures of appendices"},{"id":"http://arxiv.org/abs/2405.20988v1","updated":"2024-05-31T16:34:11Z","published":"2024-05-31T16:34:11Z","title":"Communication-Efficient Distributed Deep Learning via Federated Dynamic\n Averaging","summary":" Driven by the ever-growing volume and decentralized nature of data, coupled\nwith the escalating size of modern models, distributed deep learning (DDL) has\nbeen entrenched as the preferred paradigm for training. However, frequent\nsynchronization of DL models, encompassing millions to many billions of\nparameters, creates a communication bottleneck, severely hindering scalability.\nWorse yet, DDL algorithms typically waste valuable bandwidth, and make\nthemselves less practical in bandwidth-constrained federated settings, by\nrelying on overly simplistic, periodic, and rigid synchronization schedules. To\naddress these shortcomings, we propose Federated Dynamic Averaging (FDA), a\ncommunication-efficient DDL strategy that dynamically triggers synchronization\nbased on the value of the model variance. Through extensive experiments across\na wide range of learning tasks we demonstrate that FDA reduces communication\ncost by orders of magnitude, compared to both traditional and cutting-edge\ncommunication-efficient algorithms. Remarkably, FDA achieves this without\nsacrificing convergence speed - in stark contrast to the trade-offs encountered\nin the field. Additionally, we show that FDA maintains robust performance\nacross diverse data heterogeneity settings.\n","authors":["Michail Theologitis","Georgios Frangias","Georgios Anestis","Vasilis Samoladas","Antonios Deligiannakis"],"pdf_url":"https://arxiv.org/pdf/2405.20988v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20987v1","updated":"2024-05-31T16:33:20Z","published":"2024-05-31T16:33:20Z","title":"Early Stopping Criteria for Training Generative Adversarial Networks in\n Biomedical Imaging","summary":" Generative Adversarial Networks (GANs) have high computational costs to train\ntheir complex architectures. Throughout the training process, GANs' output is\nanalyzed qualitatively based on the loss and synthetic images' diversity and\nquality. Based on this qualitative analysis, training is manually halted once\nthe desired synthetic images are generated. By utilizing an early stopping\ncriterion, the computational cost and dependence on manual oversight can be\nreduced yet impacted by training problems such as mode collapse,\nnon-convergence, and instability. This is particularly prevalent in biomedical\nimagery, where training problems degrade the diversity and quality of synthetic\nimages, and the high computational cost associated with training makes complex\narchitectures increasingly inaccessible. This work proposes a novel early\nstopping criteria to quantitatively detect training problems, halt training,\nand reduce the computational costs associated with synthesizing biomedical\nimages. Firstly, the range of generator and discriminator loss values is\ninvestigated to assess whether mode collapse, non-convergence, and instability\noccur sequentially, concurrently, or interchangeably throughout the training of\nGANs. Secondly, utilizing these occurrences in conjunction with the Mean\nStructural Similarity Index (MS-SSIM) and Fr\\'echet Inception Distance (FID)\nscores of synthetic images forms the basis of the proposed early stopping\ncriteria. This work helps identify the occurrence of training problems in GANs\nusing low-resource computational cost and reduces training time to generate\ndiversified and high-quality synthetic images.\n","authors":["Muhammad Muneeb Saad","Mubashir Husain Rehmani","Ruairi O'Reilly"],"pdf_url":"https://arxiv.org/pdf/2405.20987v1.pdf","comment":"This paper is accepted at the 35th IEEE Irish Signals and Systems\n Conference (ISSC 2024)"},{"id":"http://arxiv.org/abs/2405.20986v1","updated":"2024-05-31T16:32:46Z","published":"2024-05-31T16:32:46Z","title":"Uncertainty Quantification for Bird's Eye View Semantic Segmentation:\n Methods and Benchmarks","summary":" The fusion of raw features from multiple sensors on an autonomous vehicle to\ncreate a Bird's Eye View (BEV) representation is crucial for planning and\ncontrol systems. There is growing interest in using deep learning models for\nBEV semantic segmentation. Anticipating segmentation errors and improving the\nexplainability of DNNs is essential for autonomous driving, yet it is\nunder-studied. This paper introduces a benchmark for predictive uncertainty\nquantification in BEV segmentation. The benchmark assesses various approaches\nacross three popular datasets using two representative backbones and focuses on\nthe effectiveness of predicted uncertainty in identifying misclassified and\nout-of-distribution (OOD) pixels, as well as calibration. Empirical findings\nhighlight the challenges in uncertainty quantification. Our results find that\nevidential deep learning based approaches show the most promise by efficiently\nquantifying aleatoric and epistemic uncertainty. We propose the\nUncertainty-Focal-Cross-Entropy (UFCE) loss, designed for highly imbalanced\ndata, which consistently improves the segmentation quality and calibration.\nAdditionally, we introduce a vacuity-scaled regularization term that enhances\nthe model's focus on high uncertainty pixels, improving epistemic uncertainty\nquantification.\n","authors":["Linlin Yu","Bowen Yang","Tianhao Wang","Kangshuo Li","Feng Chen"],"pdf_url":"https://arxiv.org/pdf/2405.20986v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20984v1","updated":"2024-05-31T16:31:07Z","published":"2024-05-31T16:31:07Z","title":"Bayesian Design Principles for Offline-to-Online Reinforcement Learning","summary":" Offline reinforcement learning (RL) is crucial for real-world applications\nwhere exploration can be costly or unsafe. However, offline learned policies\nare often suboptimal, and further online fine-tuning is required. In this\npaper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if\nthe agent remains pessimistic, it may fail to learn a better policy, while if\nit becomes optimistic directly, performance may suffer from a sudden drop. We\nshow that Bayesian design principles are crucial in solving such a dilemma.\nInstead of adopting optimistic or pessimistic policies, the agent should act in\na way that matches its belief in optimal policies.\n Such a probability-matching agent can avoid a sudden performance drop while\nstill being guaranteed to find the optimal policy. Based on our theoretical\nfindings, we introduce a novel algorithm that outperforms existing methods on\nvarious benchmarks, demonstrating the efficacy of our approach. Overall, the\nproposed approach provides a new perspective on offline-to-online RL that has\nthe potential to enable more effective learning from offline data.\n","authors":["Hao Hu","Yiqin Yang","Jianing Ye","Chengjie Wu","Ziqing Mai","Yujing Hu","Tangjie Lv","Changjie Fan","Qianchuan Zhao","Chongjie Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20984v1.pdf","comment":"Forty-first International Conference on Machine Learning (ICML), 2024"},{"id":"http://arxiv.org/abs/2402.15259v3","updated":"2024-05-31T16:28:10Z","published":"2024-02-23T11:04:33Z","title":"Open Ad Hoc Teamwork with Cooperative Game Theory","summary":" Ad hoc teamwork poses a challenging problem, requiring the design of an agent\nto collaborate with teammates without prior coordination or joint training.\nOpen ad hoc teamwork (OAHT) further complicates this challenge by considering\nenvironments with a changing number of teammates, referred to as open teams.\nOne promising solution in practice to this problem is leveraging the\ngeneralizability of graph neural networks to handle an unrestricted number of\nagents, named graph-based policy learning (GPL). However, its joint Q-value\nrepresentation over a coordination graph lacks convincing explanations. In this\npaper, we establish a new theory to understand the joint Q-value representation\nfor OAHT, from the perspective of cooperative game theory, and validate its\nlearning paradigm. Building on our theory, we propose a novel algorithm named\nCIAO, compatible with GPL framework, with additional provable implementation\ntricks that can facilitate learning. The demos of experimental results are\navailable on https://sites.google.com/view/ciao2024, and the code of\nexperiments is published on https://github.com/hsvgbkhgbv/CIAO.\n","authors":["Jianhong Wang","Yang Li","Yuan Zhang","Wei Pan","Samuel Kaski"],"pdf_url":"https://arxiv.org/pdf/2402.15259v3.pdf","comment":"Published at ICML 2024, 29 pages"},{"id":"http://arxiv.org/abs/2310.02905v2","updated":"2024-05-31T16:27:53Z","published":"2023-10-02T02:01:16Z","title":"Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural\n bandits Coupled with Transformers","summary":" Large language models (LLMs) have shown remarkable instruction-following\ncapabilities and achieved impressive performances in various applications.\nHowever, the performances of LLMs depend heavily on the instructions given to\nthem, which are typically manually tuned with substantial human efforts. Recent\nwork has used the query-efficient Bayesian optimization (BO) algorithm to\nautomatically optimize the instructions given to black-box LLMs. However, BO\nusually falls short when optimizing highly sophisticated (e.g.,\nhigh-dimensional) objective functions, such as the functions mapping an\ninstruction to the performance of an LLM. This is mainly due to the limited\nexpressive power of the Gaussian process (GP) which is used by BO as a\nsurrogate to model the objective function. Meanwhile, it has been repeatedly\nshown that neural networks (NNs), especially pre-trained transformers, possess\nstrong expressive power and can model highly complex functions. So, we adopt a\nneural bandit algorithm which replaces the GP in BO by an NN surrogate to\noptimize instructions for black-box LLMs. More importantly, the neural bandit\nalgorithm allows us to naturally couple the NN surrogate with the hidden\nrepresentation learned by a pre-trained transformer (i.e., an open-source LLM),\nwhich significantly boosts its performance. These motivate us to propose our\nINSTruction optimization usIng Neural bandits Coupled with Transformers\n(INSTINCT) algorithm. We perform instruction optimization for ChatGPT and use\nextensive experiments to show that INSTINCT consistently outperforms baselines\nin different tasks, e.g., various instruction induction tasks and the task of\nimproving zero-shot chain-of-thought instructions. Our code is available at\nhttps://github.com/xqlin98/INSTINCT.\n","authors":["Xiaoqiang Lin","Zhaoxuan Wu","Zhongxiang Dai","Wenyang Hu","Yao Shu","See-Kiong Ng","Patrick Jaillet","Bryan Kian Hsiang Low"],"pdf_url":"https://arxiv.org/pdf/2310.02905v2.pdf","comment":"Accepted to ICML 2024"},{"id":"http://arxiv.org/abs/2405.20980v1","updated":"2024-05-31T16:26:08Z","published":"2024-05-31T16:26:08Z","title":"Neural Gaussian Scale-Space Fields","summary":" Gaussian scale spaces are a cornerstone of signal representation and\nprocessing, with applications in filtering, multiscale analysis, anti-aliasing,\nand many more. However, obtaining such a scale space is costly and cumbersome,\nin particular for continuous representations such as neural fields. We present\nan efficient and lightweight method to learn the fully continuous, anisotropic\nGaussian scale space of an arbitrary signal. Based on Fourier feature\nmodulation and Lipschitz bounding, our approach is trained self-supervised,\ni.e., training does not require any manual filtering. Our neural Gaussian\nscale-space fields faithfully capture multiscale representations across a broad\nrange of modalities, and support a diverse set of applications. These include\nimages, geometry, light-stage data, texture anti-aliasing, and multiscale\noptimization.\n","authors":["Felix Mujkanovic","Ntumba Elie Nsampi","Christian Theobalt","Hans-Peter Seidel","Thomas Leimkühler"],"pdf_url":"https://arxiv.org/pdf/2405.20980v1.pdf","comment":"15 pages; SIGGRAPH 2024; project page at\n https://neural-gaussian-scale-space-fields.mpi-inf.mpg.de"},{"id":"http://arxiv.org/abs/2405.20975v1","updated":"2024-05-31T16:21:55Z","published":"2024-05-31T16:21:55Z","title":"ACE: A Model Poisoning Attack on Contribution Evaluation Methods in\n Federated Learning","summary":" In Federated Learning (FL), a set of clients collaboratively train a machine\nlearning model (called global model) without sharing their local training data.\nThe local training data of clients is typically non-i.i.d. and heterogeneous,\nresulting in varying contributions from individual clients to the final\nperformance of the global model. In response, many contribution evaluation\nmethods were proposed, where the server could evaluate the contribution made by\neach client and incentivize the high-contributing clients to sustain their\nlong-term participation in FL. Existing studies mainly focus on developing new\nmetrics or algorithms to better measure the contribution of each client.\nHowever, the security of contribution evaluation methods of FL operating in\nadversarial environments is largely unexplored. In this paper, we propose the\nfirst model poisoning attack on contribution evaluation methods in FL, termed\nACE. Specifically, we show that any malicious client utilizing ACE could\nmanipulate the parameters of its local model such that it is evaluated to have\na high contribution by the server, even when its local training data is indeed\nof low quality. We perform both theoretical analysis and empirical evaluations\nof ACE. Theoretically, we show our design of ACE can effectively boost the\nmalicious client's perceived contribution when the server employs the\nwidely-used cosine distance metric to measure contribution. Empirically, our\nresults show ACE effectively and efficiently deceive five state-of-the-art\ncontribution evaluation methods. In addition, ACE preserves the accuracy of the\nfinal global models on testing inputs. We also explore six countermeasures to\ndefend ACE. Our results show they are inadequate to thwart ACE, highlighting\nthe urgent need for new defenses to safeguard the contribution evaluation\nmethods in FL.\n","authors":["Zhangchen Xu","Fengqing Jiang","Luyao Niu","Jinyuan Jia","Bo Li","Radha Poovendran"],"pdf_url":"https://arxiv.org/pdf/2405.20975v1.pdf","comment":"To appear in the 33rd USENIX Security Symposium, 2024"},{"id":"http://arxiv.org/abs/2405.20974v1","updated":"2024-05-31T16:21:16Z","published":"2024-05-31T16:21:16Z","title":"SaySelf: Teaching LLMs to Express Confidence with Self-Reflective\n Rationales","summary":" Large language models (LLMs) often generate inaccurate or fabricated\ninformation and generally fail to indicate their confidence, which limits their\nbroader applications. Previous work elicits confidence from LLMs by direct or\nself-consistency prompting, or constructing specific datasets for supervised\nfinetuning. The prompting-based approaches have inferior performance, and the\ntraining-based approaches are limited to binary or inaccurate group-level\nconfidence estimates. In this work, we present the advanced SaySelf, a training\nframework that teaches LLMs to express more accurate fine-grained confidence\nestimates. In addition, beyond the confidence scores, SaySelf initiates the\nprocess of directing LLMs to produce self-reflective rationales that clearly\nidentify gaps in their parametric knowledge and explain their uncertainty. This\nis achieved by using an LLM to automatically summarize the uncertainties in\nspecific knowledge via natural language. The summarization is based on the\nanalysis of the inconsistency in multiple sampled reasoning chains, and the\nresulting data is utilized for supervised fine-tuning. Moreover, we utilize\nreinforcement learning with a meticulously crafted reward function to calibrate\nthe confidence estimates, motivating LLMs to deliver accurate, high-confidence\npredictions and to penalize overconfidence in erroneous outputs. Experimental\nresults in both in-distribution and out-of-distribution datasets demonstrate\nthe effectiveness of SaySelf in reducing the confidence calibration error and\nmaintaining the task performance. We show that the generated self-reflective\nrationales are reasonable and can further contribute to the calibration. The\ncode is made public at \\url{https://github.com/xu1868/SaySelf}.\n","authors":["Tianyang Xu","Shujin Wu","Shizhe Diao","Xiaoze Liu","Xingyao Wang","Yangyi Chen","Jing Gao"],"pdf_url":"https://arxiv.org/pdf/2405.20974v1.pdf","comment":"The code is available at \\url{https://github.com/xu1868/SaySelf}"},{"id":"http://arxiv.org/abs/2405.20973v1","updated":"2024-05-31T16:21:05Z","published":"2024-05-31T16:21:05Z","title":"LCQ: Low-Rank Codebook based Quantization for Large Language Models","summary":" Large language models~(LLMs) have recently demonstrated promising performance\nin many tasks. However, the high storage and computational cost of LLMs has\nbecome a challenge for deploying LLMs. Weight quantization has been widely used\nfor model compression, which can reduce both storage and computational cost.\nMost existing weight quantization methods for LLMs use a rank-one codebook for\nquantization, which results in substantial accuracy loss when the compression\nratio is high. In this paper, we propose a novel weight quantization method,\ncalled low-rank codebook based quantization~(LCQ), for LLMs. LCQ adopts a\nlow-rank codebook, the rank of which can be larger than one, for quantization.\nExperiments show that LCQ can achieve better accuracy than existing methods\nwith a negligibly extra storage cost.\n","authors":["Wen-Pu Cai","Wu-Jun Li"],"pdf_url":"https://arxiv.org/pdf/2405.20973v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2405.20971v1","updated":"2024-05-31T16:18:46Z","published":"2024-05-31T16:18:46Z","title":"Amortizing intractable inference in diffusion models for vision,\n language, and control","summary":" Diffusion models have emerged as effective distribution estimators in vision,\nlanguage, and reinforcement learning, but their use as priors in downstream\ntasks poses an intractable posterior inference problem. This paper studies\namortized sampling of the posterior over data, $\\mathbf{x}\\sim p^{\\rm\npost}(\\mathbf{x})\\propto p(\\mathbf{x})r(\\mathbf{x})$, in a model that consists\nof a diffusion generative model prior $p(\\mathbf{x})$ and a black-box\nconstraint or likelihood function $r(\\mathbf{x})$. We state and prove the\nasymptotic correctness of a data-free learning objective, relative trajectory\nbalance, for training a diffusion model that samples from this posterior, a\nproblem that existing methods solve only approximately or in restricted cases.\nRelative trajectory balance arises from the generative flow network perspective\non diffusion models, which allows the use of deep reinforcement learning\ntechniques to improve mode coverage. Experiments illustrate the broad potential\nof unbiased inference of arbitrary posteriors under diffusion priors: in vision\n(classifier guidance), language (infilling under a discrete diffusion LLM), and\nmultimodal data (text-to-image generation). Beyond generative modeling, we\napply relative trajectory balance to the problem of continuous control with a\nscore-based behavior prior, achieving state-of-the-art results on benchmarks in\noffline reinforcement learning.\n","authors":["Siddarth Venkatraman","Moksh Jain","Luca Scimeca","Minsu Kim","Marcin Sendera","Mohsin Hasan","Luke Rowe","Sarthak Mittal","Pablo Lemos","Emmanuel Bengio","Alexandre Adam","Jarrid Rector-Brooks","Yoshua Bengio","Glen Berseth","Nikolay Malkin"],"pdf_url":"https://arxiv.org/pdf/2405.20971v1.pdf","comment":"Code: https://github.com/GFNOrg/diffusion-finetuning"},{"id":"http://arxiv.org/abs/2405.20970v1","updated":"2024-05-31T16:18:06Z","published":"2024-05-31T16:18:06Z","title":"PUAL: A Classifier on Trifurcate Positive-Unlabeled Data","summary":" Positive-unlabeled (PU) learning aims to train a classifier using the data\ncontaining only labeled-positive instances and unlabeled instances. However,\nexisting PU learning methods are generally hard to achieve satisfactory\nperformance on trifurcate data, where the positive instances distribute on both\nsides of the negative instances. To address this issue, firstly we propose a PU\nclassifier with asymmetric loss (PUAL), by introducing a structure of\nasymmetric loss on positive instances into the objective function of the global\nand local learning classifier. Then we develop a kernel-based algorithm to\nenable PUAL to obtain non-linear decision boundary. We show that, through\nexperiments on both simulated and real-world datasets, PUAL can achieve\nsatisfactory classification on trifurcate data.\n","authors":["Xiaoke Wang","Xiaochen Yang","Rui Zhu","Jing-Hao Xue"],"pdf_url":"https://arxiv.org/pdf/2405.20970v1.pdf","comment":"24 pages, 6 figures"},{"id":"http://arxiv.org/abs/2311.10879v3","updated":"2024-05-31T16:15:01Z","published":"2023-11-17T21:48:41Z","title":"Pre- to Post-Contrast Breast MRI Synthesis for Enhanced Tumour\n Segmentation","summary":" Despite its benefits for tumour detection and treatment, the administration\nof contrast agents in dynamic contrast-enhanced MRI (DCE-MRI) is associated\nwith a range of issues, including their invasiveness, bioaccumulation, and a\nrisk of nephrogenic systemic fibrosis. This study explores the feasibility of\nproducing synthetic contrast enhancements by translating pre-contrast\nT1-weighted fat-saturated breast MRI to their corresponding first DCE-MRI\nsequence leveraging the capabilities of a generative adversarial network (GAN).\nAdditionally, we introduce a Scaled Aggregate Measure (SAMe) designed for\nquantitatively evaluating the quality of synthetic data in a principled manner\nand serving as a basis for selecting the optimal generative model. We assess\nthe generated DCE-MRI data using quantitative image quality metrics and apply\nthem to the downstream task of 3D breast tumour segmentation. Our results\nhighlight the potential of post-contrast DCE-MRI synthesis in enhancing the\nrobustness of breast tumour segmentation models via data augmentation. Our code\nis available at https://github.com/RichardObi/pre_post_synthesis.\n","authors":["Richard Osuala","Smriti Joshi","Apostolia Tsirikoglou","Lidia Garrucho","Walter H. L. Pinaya","Oliver Diaz","Karim Lekadir"],"pdf_url":"https://arxiv.org/pdf/2311.10879v3.pdf","comment":"Accepted as oral presentation at SPIE Medical Imaging 2024 (Image\n Processing)"},{"id":"http://arxiv.org/abs/2310.00154v2","updated":"2024-05-31T16:11:27Z","published":"2023-09-29T21:23:27Z","title":"Primal Dual Continual Learning: Balancing Stability and Plasticity\n through Adaptive Memory Allocation","summary":" Continual learning is inherently a constrained learning problem. The goal is\nto learn a predictor under a no-forgetting requirement. Although several prior\nstudies formulate it as such, they do not solve the constrained problem\nexplicitly. In this work, we show that it is both possible and beneficial to\nundertake the constrained optimization problem directly. To do this, we\nleverage recent results in constrained learning through Lagrangian duality. We\nfocus on memory-based methods, where a small subset of samples from previous\ntasks can be stored in a replay buffer. In this setting, we analyze two\nversions of the continual learning problem: a coarse approach with constraints\nat the task level and a fine approach with constraints at the sample level. We\nshow that dual variables indicate the sensitivity of the optimal value of the\ncontinual learning problem with respect to constraint perturbations. We then\nleverage this result to partition the buffer in the coarse approach, allocating\nmore resources to harder tasks, and to populate the buffer in the fine\napproach, including only impactful samples. We derive a deviation bound on dual\nvariables as sensitivity indicators, and empirically corroborate this result in\ndiverse continual learning benchmarks. We also discuss the limitations of these\nmethods with respect to the amount of memory available and the expressiveness\nof the parametrization.\n","authors":["Juan Elenter","Navid NaderiAlizadeh","Tara Javidi","Alejandro Ribeiro"],"pdf_url":"https://arxiv.org/pdf/2310.00154v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.01371v2","updated":"2024-05-31T16:05:37Z","published":"2024-03-03T02:19:49Z","title":"eXponential FAmily Dynamical Systems (XFADS): Large-scale nonlinear\n Gaussian state-space modeling","summary":" State-space graphical models and the variational autoencoder framework\nprovide a principled apparatus for learning dynamical systems from data.\nState-of-the-art probabilistic approaches are often able to scale to large\nproblems at the cost of flexibility of the variational posterior or\nexpressivity of the dynamics model. However, those consolidations can be\ndetrimental if the ultimate goal is to learn a generative model capable of\nexplaining the spatiotemporal structure of the data and making accurate\nforecasts. We introduce a low-rank structured variational autoencoding\nframework for nonlinear Gaussian state-space graphical models capable of\ncapturing dense covariance structures that are important for learning dynamical\nsystems with predictive capabilities. Our inference algorithm exploits the\ncovariance structures that arise naturally from sample based approximate\nGaussian message passing and low-rank amortized posterior updates --\neffectively performing approximate variational smoothing with time complexity\nscaling linearly in the state dimensionality. In comparisons with other deep\nstate-space model architectures our approach consistently demonstrates the\nability to learn a more predictive generative model. Furthermore, when applied\nto neural physiological recordings, our approach is able to learn a dynamical\nsystem capable of forecasting population spiking and behavioral correlates from\na small portion of single trials.\n","authors":["Matthew Dowling","Yuan Zhao","Il Memming Park"],"pdf_url":"https://arxiv.org/pdf/2403.01371v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04513v2","updated":"2024-05-31T15:59:34Z","published":"2024-02-07T01:46:50Z","title":"Online Cascade Learning for Efficient Inference over Streams","summary":" Large Language Models (LLMs) have a natural role in answering complex queries\nabout data streams, but the high computational cost of LLM inference makes them\ninfeasible in many such tasks. We propose online cascade learning, the first\napproach to address this challenge. The objective here is to learn a \"cascade\"\nof models, starting with lower-capacity models (such as logistic regression)\nand ending with a powerful LLM, along with a deferral policy that determines\nthe model to be used on a given input. We formulate the task of learning\ncascades online as an imitation-learning problem, where smaller models are\nupdated over time imitating the collected LLM demonstrations, and give a\nno-regret algorithm for the problem. Experimental results across four\nbenchmarks show that our method parallels LLMs in accuracy while cutting down\ninference costs by as much as 90% with strong robustness against input\ndistribution shifts, underscoring its efficacy and adaptability in stream\nprocessing.\n","authors":["Lunyiu Nie","Zhimin Ding","Erdong Hu","Christopher Jermaine","Swarat Chaudhuri"],"pdf_url":"https://arxiv.org/pdf/2402.04513v2.pdf","comment":"ICML 2024 Main Conference Paper"},{"id":"http://arxiv.org/abs/2405.20954v1","updated":"2024-05-31T15:54:01Z","published":"2024-05-31T15:54:01Z","title":"Aligning Multiclass Neural Network Classifier Criterion with Task\n Performance via $F_β$-Score","summary":" Multiclass neural network classifiers are typically trained using\ncross-entropy loss. Following training, the performance of this same neural\nnetwork is evaluated using an application-specific metric based on the\nmulticlass confusion matrix, such as the Macro $F_\\beta$-Score. It is\nquestionable whether the use of cross-entropy will yield a classifier that\naligns with the intended application-specific performance criteria,\nparticularly in scenarios where there is a need to emphasize one aspect of\nclassifier performance. For example, if greater precision is preferred over\nrecall, the $\\beta$ value in the $F_\\beta$ evaluation metric can be adjusted\naccordingly, but the cross-entropy objective remains unaware of this preference\nduring training. We propose a method that addresses this training-evaluation\ngap for multiclass neural network classifiers such that users can train these\nmodels informed by the desired final $F_\\beta$-Score. Following prior work in\nbinary classification, we utilize the concepts of the soft-set confusion\nmatrices and a piecewise-linear approximation of the Heaviside step function.\nOur method extends the $2 \\times 2$ binary soft-set confusion matrix to a\nmulticlass $d \\times d$ confusion matrix and proposes dynamic adaptation of the\nthreshold value $\\tau$, which parameterizes the piecewise-linear Heaviside\napproximation during run-time. We present a theoretical analysis that shows\nthat our method can be used to optimize for a soft-set based approximation of\nMacro-$F_\\beta$ that is a consistent estimator of Macro-$F_\\beta$, and our\nextensive experiments show the practical effectiveness of our approach.\n","authors":["Nathan Tsoi","Deyuan Li","Taesoo Daniel Lee","Marynel Vázquez"],"pdf_url":"https://arxiv.org/pdf/2405.20954v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18669v2","updated":"2024-05-31T15:42:53Z","published":"2024-05-29T00:23:55Z","title":"Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities","summary":" Integrating multiple generative foundation models, especially those trained\non different modalities, into something greater than the sum of its parts poses\nsignificant challenges. Two key hurdles are the availability of aligned data\n(concepts that contain similar meaning but is expressed differently in\ndifferent modalities), and effectively leveraging unimodal representations in\ncross-domain generative tasks, without compromising their original unimodal\ncapabilities.\n We propose Zipper, a multi-tower decoder architecture that addresses these\nconcerns by using cross-attention to flexibly compose multimodal generative\nmodels from independently pre-trained unimodal decoders. In our experiments\nfusing speech and text modalities, we show the proposed architecture performs\nvery competitively in scenarios with limited aligned text-speech data. We also\nshowcase the flexibility of our model to selectively maintain unimodal (e.g.,\ntext-to-text generation) generation performance by freezing the corresponding\nmodal tower (e.g. text). In cross-modal tasks such as automatic speech\nrecognition (ASR) where the output modality is text, we show that freezing the\ntext backbone results in negligible performance degradation. In cross-modal\ntasks such as text-to-speech generation (TTS) where the output modality is\nspeech, we show that using a pre-trained speech backbone results in superior\nperformance to the baseline.\n","authors":["Vicky Zayats","Peter Chen","Melissa Ferrari","Dirk Padfield"],"pdf_url":"https://arxiv.org/pdf/2405.18669v2.pdf","comment":"Under review at NeurIPS"},{"id":"http://arxiv.org/abs/2405.20935v1","updated":"2024-05-31T15:34:13Z","published":"2024-05-31T15:34:13Z","title":"Effective Interplay between Sparsity and Quantization: From Theory to\n Practice","summary":" The increasing size of deep neural networks necessitates effective model\ncompression to improve computational efficiency and reduce their memory\nfootprint. Sparsity and quantization are two prominent compression methods that\nhave individually demonstrated significant reduction in computational and\nmemory footprints while preserving model accuracy. While effective, the\ninterplay between these two methods remains an open question. In this paper, we\ninvestigate the interaction between these two methods and assess whether their\ncombination impacts final model accuracy. We mathematically prove that applying\nsparsity before quantization is the optimal sequence for these operations,\nminimizing error in computation. Our empirical studies across a wide range of\nmodels, including OPT and Llama model families (125M-8B) and ViT corroborate\nthese theoretical findings. In addition, through rigorous analysis, we\ndemonstrate that sparsity and quantization are not orthogonal; their\ninteraction can significantly harm model accuracy, with quantization error\nplaying a dominant role in this degradation. Our findings extend to the\nefficient deployment of large models in resource-limited compute platforms and\nreduce serving cost, offering insights into best practices for applying these\ncompression methods to maximize efficacy without compromising accuracy.\n","authors":["Simla Burcu Harma","Ayan Chakraborty","Elizaveta Kostenok","Danila Mishin","Dongho Ha","Babak Falsafi","Martin Jaggi","Ming Liu","Yunho Oh","Suvinay Subramanian","Amir Yazdanbakhsh"],"pdf_url":"https://arxiv.org/pdf/2405.20935v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20933v1","updated":"2024-05-31T15:32:43Z","published":"2024-05-31T15:32:43Z","title":"Concentration Bounds for Optimized Certainty Equivalent Risk Estimation","summary":" We consider the problem of estimating the Optimized Certainty Equivalent\n(OCE) risk from independent and identically distributed (i.i.d.) samples. For\nthe classic sample average approximation (SAA) of OCE, we derive mean-squared\nerror as well as concentration bounds (assuming sub-Gaussianity). Further, we\nanalyze an efficient stochastic approximation-based OCE estimator, and derive\nfinite sample bounds for the same. To show the applicability of our bounds, we\nconsider a risk-aware bandit problem, with OCE as the risk. For this problem,\nwe derive bound on the probability of mis-identification. Finally, we conduct\nnumerical experiments to validate the theoretical findings.\n","authors":["Ayon Ghosh","L. A. Prashanth","Krishna Jagannathan"],"pdf_url":"https://arxiv.org/pdf/2405.20933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.01399v2","updated":"2024-05-31T15:32:02Z","published":"2023-07-31T17:57:49Z","title":"Learning to Model the World with Language","summary":" To interact with humans and act in the world, agents need to understand the\nrange of language that people use and relate it to the visual world. While\ncurrent agents can learn to execute simple language instructions, we aim to\nbuild agents that leverage diverse language -- language like \"this button turns\non the TV\" or \"I put the bowls away\" -- that conveys general knowledge,\ndescribes the state of the world, provides interactive feedback, and more. Our\nkey idea is that agents should interpret such diverse language as a signal that\nhelps them predict the future: what they will observe, how the world will\nbehave, and which situations will be rewarded. This perspective unifies\nlanguage understanding with future prediction as a powerful self-supervised\nlearning objective. We instantiate this in Dynalang, an agent that learns a\nmultimodal world model to predict future text and image representations, and\nlearns to act from imagined model rollouts. While current methods that learn\nlanguage-conditioned policies degrade in performance with more diverse types of\nlanguage, we show that Dynalang learns to leverage environment descriptions,\ngame rules, and instructions to excel on tasks ranging from game-playing to\nnavigating photorealistic home scans. Finally, we show that our method enables\nadditional capabilities due to learning a generative model: Dynalang can be\npretrained on text-only data, enabling learning from offline datasets, and\ngenerate language grounded in an environment.\n","authors":["Jessy Lin","Yuqing Du","Olivia Watkins","Danijar Hafner","Pieter Abbeel","Dan Klein","Anca Dragan"],"pdf_url":"https://arxiv.org/pdf/2308.01399v2.pdf","comment":"ICML 2024. Website: https://dynalang.github.io/"},{"id":"http://arxiv.org/abs/2403.03938v2","updated":"2024-05-31T15:31:16Z","published":"2024-03-06T18:47:32Z","title":"GUIDE: Guidance-based Incremental Learning with Diffusion Models","summary":" We introduce GUIDE, a novel continual learning approach that directs\ndiffusion models to rehearse samples at risk of being forgotten. Existing\ngenerative strategies combat catastrophic forgetting by randomly sampling\nrehearsal examples from a generative model. Such an approach contradicts\nbuffer-based approaches where sampling strategy plays an important role. We\npropose to bridge this gap by incorporating classifier guidance into the\ndiffusion process to produce rehearsal examples specifically targeting\ninformation forgotten by a continuously trained model. This approach enables\nthe generation of samples from preceding task distributions, which are more\nlikely to be misclassified in the context of recently encountered classes. Our\nexperimental results show that GUIDE significantly reduces catastrophic\nforgetting, outperforming conventional random sampling approaches and\nsurpassing recent state-of-the-art methods in continual learning with\ngenerative replay.\n","authors":["Bartosz Cywiński","Kamil Deja","Tomasz Trzciński","Bartłomiej Twardowski","Łukasz Kuciński"],"pdf_url":"https://arxiv.org/pdf/2403.03938v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20917v1","updated":"2024-05-31T15:21:53Z","published":"2024-05-31T15:21:53Z","title":"Learning to Estimate System Specifications in Linear Temporal Logic\n using Transformers and Mamba","summary":" Temporal logic is a framework for representing and reasoning about\npropositions that evolve over time. It is commonly used for specifying\nrequirements in various domains, including hardware and software systems, as\nwell as robotics. Specification mining or formula generation involves\nextracting temporal logic formulae from system traces and has numerous\napplications, such as detecting bugs and improving interpretability. Although\nthere has been a surge of deep learning-based methods for temporal logic\nsatisfiability checking in recent years, the specification mining literature\nhas been lagging behind in adopting deep learning methods despite their many\nadvantages, such as scalability. In this paper, we introduce autoregressive\nmodels that can generate linear temporal logic formulae from traces, towards\naddressing the specification mining problem. We propose multiple architectures\nfor this task: transformer encoder-decoder, decoder-only transformer, and\nMamba, which is an emerging alternative to transformer models. Additionally, we\ndevise a metric for quantifying the distinctiveness of the generated formulae\nand a straightforward algorithm to enforce the syntax constraints. Our\nexperiments show that the proposed architectures yield promising results,\ngenerating correct and distinct formulae at a fraction of the compute cost\nneeded for the combinatorial baseline.\n","authors":["İlker Işık","Ebru Aydin Gol","Ramazan Gokberk Cinbis"],"pdf_url":"https://arxiv.org/pdf/2405.20917v1.pdf","comment":"20 pages, 15 figures"},{"id":"http://arxiv.org/abs/2405.20915v1","updated":"2024-05-31T15:21:44Z","published":"2024-05-31T15:21:44Z","title":"Fast yet Safe: Early-Exiting with Risk Control","summary":" Scaling machine learning models significantly improves their performance.\nHowever, such gains come at the cost of inference being slow and\nresource-intensive. Early-exit neural networks (EENNs) offer a promising\nsolution: they accelerate inference by allowing intermediate layers to exit and\nproduce a prediction early. Yet a fundamental issue with EENNs is how to\ndetermine when to exit without severely degrading performance. In other words,\nwhen is it 'safe' for an EENN to go 'fast'? To address this issue, we\ninvestigate how to adapt frameworks of risk control to EENNs. Risk control\noffers a distribution-free, post-hoc solution that tunes the EENN's exiting\nmechanism so that exits only occur when the output is of sufficient quality. We\nempirically validate our insights on a range of vision and language tasks,\ndemonstrating that risk control can produce substantial computational savings,\nall the while preserving user-specified performance goals.\n","authors":["Metod Jazbec","Alexander Timans","Tin Hadži Veljković","Kaspar Sakmann","Dan Zhang","Christian A. Naesseth","Eric Nalisnick"],"pdf_url":"https://arxiv.org/pdf/2405.20915v1.pdf","comment":"25 pages, 11 figures, 4 tables (incl. appendix)"},{"id":"http://arxiv.org/abs/2405.20905v1","updated":"2024-05-31T15:16:48Z","published":"2024-05-31T15:16:48Z","title":"VENI, VINDy, VICI: a variational reduced-order modeling framework with\n uncertainty quantification","summary":" The simulation of many complex phenomena in engineering and science requires\nsolving expensive, high-dimensional systems of partial differential equations\n(PDEs). To circumvent this, reduced-order models (ROMs) have been developed to\nspeed up computations. However, when governing equations are unknown or\npartially known, typically ROMs lack interpretability and reliability of the\npredicted solutions.\n In this work we present a data-driven, non-intrusive framework for building\nROMs where the latent variables and dynamics are identified in an interpretable\nmanner and uncertainty is quantified. Starting from a limited amount of\nhigh-dimensional, noisy data the proposed framework constructs an efficient ROM\nby leveraging variational autoencoders for dimensionality reduction along with\na newly introduced, variational version of sparse identification of nonlinear\ndynamics (SINDy), which we refer to as Variational Identification of Nonlinear\nDynamics (VINDy).\n In detail, the method consists of Variational Encoding of Noisy Inputs (VENI)\nto identify the distribution of reduced coordinates. Simultaneously, we learn\nthe distribution of the coefficients of a pre-determined set of candidate\nfunctions by VINDy. Once trained offline, the identified model can be queried\nfor new parameter instances and new initial conditions to compute the\ncorresponding full-time solutions. The probabilistic setup enables uncertainty\nquantification as the online testing consists of Variational Inference\nnaturally providing Certainty Intervals (VICI). In this work we showcase the\neffectiveness of the newly proposed VINDy method in identifying interpretable\nand accurate dynamical system for the R\\\"ossler system with different noise\nintensities and sources. Then the performance of the overall method - named\nVENI, VINDy, VICI - is tested on PDE benchmarks including structural mechanics\nand fluid dynamics.\n","authors":["Paolo Conti","Jonas Kneifl","Andrea Manzoni","Attilio Frangi","Jörg Fehr","Steven L. Brunton","J. Nathan Kutz"],"pdf_url":"https://arxiv.org/pdf/2405.20905v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07856v3","updated":"2024-05-31T15:14:58Z","published":"2023-06-13T15:35:01Z","title":"Bayesian Program Learning by Decompiling Amortized Knowledge","summary":" DreamCoder is an inductive program synthesis system that, whilst solving\nproblems, learns to simplify search in an iterative wake-sleep procedure. The\ncost of search is amortized by training a neural search policy, reducing search\nbreadth and effectively \"compiling\" useful information to compose program\nsolutions across tasks. Additionally, a library of program components is learnt\nto compress and express discovered solutions in fewer components, reducing\nsearch depth. We present a novel approach for library learning that directly\nleverages the neural search policy, effectively \"decompiling\" its amortized\nknowledge to extract relevant program components. This provides stronger\namortized inference: the amortized knowledge learnt to reduce search breadth is\nnow also used to reduce search depth. We integrate our approach with DreamCoder\nand demonstrate faster domain proficiency with improved generalization on a\nrange of domains, particularly when fewer example solutions are available.\n","authors":["Alessandro B. Palmarini","Christopher G. Lucas","N. Siddharth"],"pdf_url":"https://arxiv.org/pdf/2306.07856v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.05300v3","updated":"2024-05-31T15:14:43Z","published":"2024-03-08T13:29:46Z","title":"Unity by Diversity: Improved Representation Learning in Multimodal VAEs","summary":" Variational Autoencoders for multimodal data hold promise for many tasks in\ndata analysis, such as representation learning, conditional generation, and\nimputation. Current architectures either share the encoder output, decoder\ninput, or both across modalities to learn a shared representation. Such\narchitectures impose hard constraints on the model. In this work, we show that\na better latent representation can be obtained by replacing these hard\nconstraints with a soft constraint. We propose a new mixture-of-experts prior,\nsoftly guiding each modality's latent representation towards a shared aggregate\nposterior. This approach results in a superior latent representation and allows\neach encoding to preserve information better from its uncompressed original\nfeatures. In extensive experiments on multiple benchmark datasets and two\nchallenging real-world datasets, we show improved learned latent\nrepresentations and imputation of missing data modalities compared to existing\nmethods.\n","authors":["Thomas M. Sutter","Yang Meng","Andrea Agostini","Daphné Chopard","Norbert Fortin","Julia E. Vogt","Bahbak Shahbaba","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2403.05300v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.06650v2","updated":"2024-05-31T15:05:40Z","published":"2023-01-17T01:12:44Z","title":"Enhancing Deep Traffic Forecasting Models with Dynamic Regression","summary":" Deep learning models for traffic forecasting often assume the residual is\nindependent and isotropic across time and space. This assumption simplifies\nloss functions such as mean absolute error, but real-world residual processes\noften exhibit significant autocorrelation and structured spatiotemporal\ncorrelation. This paper introduces a dynamic regression (DR) framework to\nenhance existing spatiotemporal traffic forecasting models by incorporating\nstructured learning for the residual process. We assume the residual of the\nbase model (i.e., a well-developed traffic forecasting model) follows a\nmatrix-variate seasonal autoregressive (AR) model, which is seamlessly\nintegrated into the training process through the redesign of the loss function.\nImportantly, the parameters of the DR framework are jointly optimized alongside\nthe base model. We evaluate the effectiveness of the proposed framework on\nstate-of-the-art (SOTA) deep traffic forecasting models using both speed and\nflow datasets, demonstrating improved performance and providing interpretable\nAR coefficients and spatiotemporal covariance matrices.\n","authors":["Vincent Zhihao Zheng","Seongjin Choi","Lijun Sun"],"pdf_url":"https://arxiv.org/pdf/2301.06650v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.16785v2","updated":"2024-05-31T15:03:11Z","published":"2024-02-26T18:00:29Z","title":"CARTE: Pretraining and Transfer for Tabular Learning","summary":" Pretrained deep-learning models are the go-to solution for images or text.\nHowever, for tabular data the standard is still to train tree-based models.\nIndeed, transfer learning on tables hits the challenge of data integration:\nfinding correspondences, correspondences in the entries (entity matching) where\ndifferent words may denote the same entity, correspondences across columns\n(schema matching), which may come in different orders, names... We propose a\nneural architecture that does not need such correspondences. As a result, we\ncan pretrain it on background data that has not been matched. The architecture\n-- CARTE for Context Aware Representation of Table Entries -- uses a graph\nrepresentation of tabular (or relational) data to process tables with different\ncolumns, string embedding of entries and columns names to model an open\nvocabulary, and a graph-attentional network to contextualize entries with\ncolumn names and neighboring entries. An extensive benchmark shows that CARTE\nfacilitates learning, outperforming a solid set of baselines including the best\ntree-based models. CARTE also enables joint learning across tables with\nunmatched columns, enhancing a small table with bigger ones. CARTE opens the\ndoor to large pretrained models for tabular data.\n","authors":["Myung Jun Kim","Léo Grinsztajn","Gaël Varoquaux"],"pdf_url":"https://arxiv.org/pdf/2402.16785v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.10780v2","updated":"2024-05-31T15:00:36Z","published":"2024-05-13T21:37:50Z","title":"Intelligent and Miniaturized Neural Interfaces: An Emerging Era in\n Neurotechnology","summary":" Integrating smart algorithms on neural devices presents significant\nopportunities for various brain disorders. In this paper, we review the latest\nadvancements in the development of three categories of intelligent neural\nprostheses featuring embedded signal processing on the implantable or wearable\ndevice. These include: 1) Neural interfaces for closed-loop symptom tracking\nand responsive stimulation; 2) Neural interfaces for emerging network-related\nconditions, such as psychiatric disorders; and 3) Intelligent BMI SoCs for\nmovement recovery following paralysis.\n","authors":["Mahsa Shoaran","Uisub Shin","MohammadAli Shaeri"],"pdf_url":"https://arxiv.org/pdf/2405.10780v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.00664v5","updated":"2024-05-31T14:58:20Z","published":"2023-05-01T05:26:33Z","title":"EvoluNet: Advancing Dynamic Non-IID Transfer Learning on Graphs","summary":" Non-IID transfer learning on graphs is crucial in many high-stakes domains.\nThe majority of existing works assume stationary distribution for both source\nand target domains. However, real-world graphs are intrinsically dynamic,\npresenting challenges in terms of domain evolution and dynamic discrepancy\nbetween source and target domains. To bridge the gap, we shift the problem to\nthe dynamic setting and pose the question: given the label-rich source graphs\nand the label-scarce target graphs both observed in previous T timestamps, how\ncan we effectively characterize the evolving domain discrepancy and optimize\nthe generalization performance of the target domain at the incoming T+1\ntimestamp? To answer it, we propose a generalization bound for dynamic non-IID\ntransfer learning on graphs, which implies the generalization performance is\ndominated by domain evolution and domain discrepancy between source and target\ngraphs. Inspired by the theoretical results, we introduce a novel generic\nframework named EvoluNet. It leverages a transformer-based temporal encoding\nmodule to model temporal information of the evolving domains and then uses a\ndynamic domain unification module to efficiently learn domain-invariant\nrepresentations across the source and target domains. Finally, EvoluNet\noutperforms the state-of-the-art models by up to 12.1%, demonstrating its\neffectiveness in transferring knowledge from dynamic source graphs to dynamic\ntarget graphs.\n","authors":["Haohui Wang","Yuzhen Mao","Yujun Yan","Yaoqing Yang","Jianhui Sun","Kevin Choi","Balaji Veeramani","Alison Hu","Edward Bowen","Tyler Cody","Dawei Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.00664v5.pdf","comment":"Accepted at ICML 2024"},{"id":"http://arxiv.org/abs/2405.20882v1","updated":"2024-05-31T14:55:38Z","published":"2024-05-31T14:55:38Z","title":"Sheaf HyperNetworks for Personalized Federated Learning","summary":" Graph hypernetworks (GHNs), constructed by combining graph neural networks\n(GNNs) with hypernetworks (HNs), leverage relational data across various\ndomains such as neural architecture search, molecular property prediction and\nfederated learning. Despite GNNs and HNs being individually successful, we show\nthat GHNs present problems compromising their performance, such as\nover-smoothing and heterophily. Moreover, we cannot apply GHNs directly to\npersonalized federated learning (PFL) scenarios, where a priori client relation\ngraph may be absent, private, or inaccessible. To mitigate these limitations in\nthe context of PFL, we propose a novel class of HNs, sheaf hypernetworks\n(SHNs), which combine cellular sheaf theory with HNs to improve parameter\nsharing for PFL. We thoroughly evaluate SHNs across diverse PFL tasks,\nincluding multi-class classification, traffic and weather forecasting.\nAdditionally, we provide a methodology for constructing client relation graphs\nin scenarios where such graphs are unavailable. We show that SHNs consistently\noutperform existing PFL solutions in complex non-IID scenarios. While the\nbaselines' performance fluctuates depending on the task, SHNs show improvements\nof up to 2.7% in accuracy and 5.3% in lower mean squared error over the\nbest-performing baseline.\n","authors":["Bao Nguyen","Lorenzo Sani","Xinchi Qiu","Pietro Liò","Nicholas D. Lane"],"pdf_url":"https://arxiv.org/pdf/2405.20882v1.pdf","comment":"25 pages, 12 figures, 7 tables, pre-print under review"},{"id":"http://arxiv.org/abs/2405.20879v1","updated":"2024-05-31T14:54:51Z","published":"2024-05-31T14:54:51Z","title":"Flow matching achieves minimax optimal convergence","summary":" Flow matching (FM) has gained significant attention as a simulation-free\ngenerative model. Unlike diffusion models, which are based on stochastic\ndifferential equations, FM employs a simpler approach by solving an ordinary\ndifferential equation with an initial condition from a normal distribution,\nthus streamlining the sample generation process. This paper discusses the\nconvergence properties of FM in terms of the $p$-Wasserstein distance, a\nmeasure of distributional discrepancy. We establish that FM can achieve the\nminmax optimal convergence rate for $1 \\leq p \\leq 2$, presenting the first\ntheoretical evidence that FM can reach convergence rates comparable to those of\ndiffusion models. Our analysis extends existing frameworks by examining a\nbroader class of mean and variance functions for the vector fields and\nidentifies specific conditions necessary to attain these optimal rates.\n","authors":["Kenji Fukumizu","Taiji Suzuki","Noboru Isobe","Kazusato Oko","Masanori Koyama"],"pdf_url":"https://arxiv.org/pdf/2405.20879v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20877v1","updated":"2024-05-31T14:52:58Z","published":"2024-05-31T14:52:58Z","title":"Waveform Design for Over-the-Air Computing","summary":" In response to the increasing number of devices anticipated in\nnext-generation networks, a shift toward over-the-air (OTA) computing has been\nproposed. Leveraging the superposition of multiple access channels, OTA\ncomputing enables efficient resource management by supporting simultaneous\nuncoded transmission in the time and the frequency domain. Thus, to advance the\nintegration of OTA computing, our study presents a theoretical analysis\naddressing practical issues encountered in current digital communication\ntransceivers, such as time sampling error and intersymbol interference (ISI).\nTo this end, we examine the theoretical mean squared error (MSE) for OTA\ntransmission under time sampling error and ISI, while also exploring methods\nfor minimizing the MSE in the OTA transmission. Utilizing alternating\noptimization, we also derive optimal power policies for both the devices and\nthe base station. Additionally, we propose a novel deep neural network\n(DNN)-based approach to design waveforms enhancing OTA transmission performance\nunder time sampling error and ISI. To ensure fair comparison with existing\nwaveforms like the raised cosine (RC) and the better-than-raised-cosine (BRTC),\nwe incorporate a custom loss function integrating energy and bandwidth\nconstraints, along with practical design considerations such as waveform\nsymmetry. Simulation results validate our theoretical analysis and demonstrate\nperformance gains of the designed pulse over RC and BTRC waveforms. To\nfacilitate testing of our results without necessitating the DNN structure\nrecreation, we provide curve fitting parameters for select DNN-based waveforms\nas well.\n","authors":["Nikos G. Evgenidis","Nikos A. Mitsiou","Sotiris A. Tegos","Panagiotis D. Diamantoulakis","Panagiotis Sarigiannidis","Ioannis T. Rekanos","George K. Karagiannidis"],"pdf_url":"https://arxiv.org/pdf/2405.20877v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2310.18953v2","updated":"2024-05-31T14:51:58Z","published":"2023-10-29T09:54:03Z","title":"TIC-TAC: A Framework for Improved Covariance Estimation in Deep\n Heteroscedastic Regression","summary":" Deep heteroscedastic regression involves jointly optimizing the mean and\ncovariance of the predicted distribution using the negative log-likelihood.\nHowever, recent works show that this may result in sub-optimal convergence due\nto the challenges associated with covariance estimation. While the literature\naddresses this by proposing alternate formulations to mitigate the impact of\nthe predicted covariance, we focus on improving the predicted covariance\nitself. We study two questions: (1) Does the predicted covariance truly capture\nthe randomness of the predicted mean? (2) In the absence of supervision, how\ncan we quantify the accuracy of covariance estimation? We address (1) with a\nTaylor Induced Covariance (TIC), which captures the randomness of the predicted\nmean by incorporating its gradient and curvature through the second order\nTaylor polynomial. Furthermore, we tackle (2) by introducing a Task Agnostic\nCorrelations (TAC) metric, which combines the notion of correlations and\nabsolute error to evaluate the covariance. We evaluate TIC-TAC across multiple\nexperiments spanning synthetic and real-world datasets. Our results show that\nnot only does TIC accurately learn the covariance, it additionally facilitates\nan improved convergence of the negative log-likelihood. Our code is available\nat https://github.com/vita-epfl/TIC-TAC\n","authors":["Megh Shukla","Mathieu Salzmann","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2310.18953v2.pdf","comment":"ICML 2024. Please feel free to provide feedback!"},{"id":"http://arxiv.org/abs/2402.01000v3","updated":"2024-05-31T14:49:11Z","published":"2024-02-01T20:27:19Z","title":"Multivariate Probabilistic Time Series Forecasting with Correlated\n Errors","summary":" Accurately modeling the correlation structure of errors is essential for\nreliable uncertainty quantification in probabilistic time series forecasting.\nRecent deep learning models for multivariate time series have developed\nefficient parameterizations for time-varying contemporaneous covariance, but\nthey often assume temporal independence of errors for simplicity. However,\nreal-world data frequently exhibit significant error autocorrelation and\ncross-lag correlation due to factors such as missing covariates. In this paper,\nwe present a plug-and-play method that learns the covariance structure of\nerrors over multiple steps for autoregressive models with Gaussian-distributed\nerrors. To achieve scalable inference and computational efficiency, we model\nthe contemporaneous covariance using a low-rank-plus-diagonal parameterization\nand characterize cross-covariance through a group of independent latent\ntemporal processes. The learned covariance matrix can be used to calibrate\npredictions based on observed residuals. We evaluate our method on\nprobabilistic models built on RNN and Transformer architectures, and the\nresults confirm the effectiveness of our approach in enhancing predictive\naccuracy and uncertainty quantification without significantly increasing the\nparameter size.\n","authors":["Vincent Zhihao Zheng","Lijun Sun"],"pdf_url":"https://arxiv.org/pdf/2402.01000v3.pdf","comment":"This paper extends the work presented in arXiv:2305.17028 to a\n multivariate setting"},{"id":"http://arxiv.org/abs/2211.10737v4","updated":"2024-05-31T14:47:25Z","published":"2022-11-19T16:17:11Z","title":"Accuracy Booster: Enabling 4-bit Fixed-point Arithmetic for DNN Training","summary":" The unprecedented demand for computing resources to train DNN models has led\nto a search for minimal numerical encoding. Recent state-of-the-art (SOTA)\nproposals advocate for multi-level scaled narrow bitwidth numerical formats. In\nthis paper, we show that single-level scaling is sufficient to maintain\ntraining accuracy while maximizing arithmetic density. We identify a previously\nproposed single-level scaled format for 8-bit training, Hybrid Block Floating\nPoint (HBFP), as the optimal candidate to minimize. We perform a full-scale\nexploration of the HBFP design space using mathematical tools to study the\ninterplay among various parameters and identify opportunities for even smaller\nencodings across layers and epochs. Based on our findings, we propose Accuracy\nBooster, a mixed-mantissa HBFP technique that uses 4-bit mantissas for over 99%\nof all arithmetic operations in training and 6-bit mantissas only in the last\nepoch and first/last layers. We show Accuracy Booster enables increasing\narithmetic density over all other SOTA formats by at least 2.3x while achieving\nstate-of-the-art accuracies in 4-bit training.\n","authors":["Simla Burcu Harma","Ayan Chakraborty","Nicholas Sperry","Babak Falsafi","Martin Jaggi","Yunho Oh"],"pdf_url":"https://arxiv.org/pdf/2211.10737v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20860v1","updated":"2024-05-31T14:44:05Z","published":"2024-05-31T14:44:05Z","title":"Enhancing Efficiency of Safe Reinforcement Learning via Sample\n Manipulation","summary":" Safe reinforcement learning (RL) is crucial for deploying RL agents in\nreal-world applications, as it aims to maximize long-term rewards while\nsatisfying safety constraints. However, safe RL often suffers from sample\ninefficiency, requiring extensive interactions with the environment to learn a\nsafe policy. We propose Efficient Safe Policy Optimization (ESPO), a novel\napproach that enhances the efficiency of safe RL through sample manipulation.\nESPO employs an optimization framework with three modes: maximizing rewards,\nminimizing costs, and balancing the trade-off between the two. By dynamically\nadjusting the sampling process based on the observed conflict between reward\nand safety gradients, ESPO theoretically guarantees convergence, optimization\nstability, and improved sample complexity bounds. Experiments on the\nSafety-MuJoCo and Omnisafe benchmarks demonstrate that ESPO significantly\noutperforms existing primal-based and primal-dual-based baselines in terms of\nreward maximization and constraint satisfaction. Moreover, ESPO achieves\nsubstantial gains in sample efficiency, requiring 25--29% fewer samples than\nbaselines, and reduces training time by 21--38%.\n","authors":["Shangding Gu","Laixi Shi","Yuhao Ding","Alois Knoll","Costas Spanos","Adam Wierman","Ming Jin"],"pdf_url":"https://arxiv.org/pdf/2405.20860v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20042v2","updated":"2024-05-31T14:42:52Z","published":"2024-05-30T13:23:02Z","title":"CycleFormer : TSP Solver Based on Language Modeling","summary":" We propose a new transformer model for the Traveling Salesman Problem (TSP)\ncalled CycleFormer. We identified distinctive characteristics that need to be\nconsidered when applying a conventional transformer model to TSP and aimed to\nfully incorporate these elements into the TSP-specific transformer. Unlike the\ntoken sets in typical language models, which are limited and static, the token\n(node) set in TSP is unlimited and dynamic. To exploit this fact to the\nfullest, we equated the encoder output with the decoder linear layer and\ndirectly connected the context vector of the encoder to the decoder encoding.\nAdditionally, we added a positional encoding to the encoder tokens that\nreflects the two-dimensional nature of TSP, and devised a circular positional\nencoding for the decoder tokens that considers the cyclic properties of a tour.\nBy incorporating these ideas, CycleFormer outperforms state-of-the-art (SOTA)\ntransformer models for TSP from TSP-50 to TSP-500. Notably, on TSP-500, the\noptimality gap was reduced by approximately 2.8 times, from 3.09% to 1.10%,\ncompared to the existing SOTA. The code will be made available at\nhttps://github.com/Giventicket/CycleFormer.\n","authors":["Jieun Yook","Junpyo Seo","Joon Huh","Han Joon Byun","Byung-ro Mooon"],"pdf_url":"https://arxiv.org/pdf/2405.20042v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2103.05621v4","updated":"2024-05-31T14:35:18Z","published":"2021-03-09T18:46:01Z","title":"The Common Intuition to Transfer Learning Can Win or Lose: Case Studies\n for Linear Regression","summary":" We study a fundamental transfer learning process from source to target linear\nregression tasks, including overparameterized settings where there are more\nlearned parameters than data samples. The target task learning is addressed by\nusing its training data together with the parameters previously computed for\nthe source task. We define a transfer learning approach to the target task as a\nlinear regression optimization with a regularization on the distance between\nthe to-be-learned target parameters and the already-learned source parameters.\nWe analytically characterize the generalization performance of our transfer\nlearning approach and demonstrate its ability to resolve the peak in\ngeneralization errors in double descent phenomena of the minimum L2-norm\nsolution to linear regression. Moreover, we show that for sufficiently related\ntasks, the optimally tuned transfer learning approach can outperform the\noptimally tuned ridge regression method, even when the true parameter vector\nconforms to an isotropic Gaussian prior distribution. Namely, we demonstrate\nthat transfer learning can beat the minimum mean square error (MMSE) solution\nof the independent target task. Our results emphasize the ability of transfer\nlearning to extend the solution space to the target task and, by that, to have\nan improved MMSE solution. We formulate the linear MMSE solution to our\ntransfer learning setting and point out its key differences from the common\ndesign philosophy to transfer learning.\n","authors":["Yehuda Dar","Daniel LeJeune","Richard G. Baraniuk"],"pdf_url":"https://arxiv.org/pdf/2103.05621v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20848v1","updated":"2024-05-31T14:32:31Z","published":"2024-05-31T14:32:31Z","title":"SLIM: a Scalable Light-weight Root Cause Analysis for Imbalanced Data in\n Microservice","summary":" The newly deployed service -- one kind of change service, could lead to a new\ntype of minority fault. Existing state-of-the-art methods for fault\nlocalization rarely consider the imbalanced fault classification in change\nservice. This paper proposes a novel method that utilizes decision rule sets to\ndeal with highly imbalanced data by optimizing the F1 score subject to\ncardinality constraints. The proposed method greedily generates the rule with\nmaximal marginal gain and uses an efficient minorize-maximization (MM) approach\nto select rules iteratively, maximizing a non-monotone submodular lower bound.\nCompared with existing fault localization algorithms, our algorithm can adapt\nto the imbalanced fault scenario of change service, and provide interpretable\nfault causes which are easy to understand and verify. Our method can also be\ndeployed in the online training setting, with only about 15% training overhead\ncompared to the current SOTA methods. Empirical studies showcase that our\nalgorithm outperforms existing fault localization algorithms in both accuracy\nand model interpretability.\n","authors":["Rui Ren","Jingbang Yang","Linxiao Yang","Xinyue Gu","Liang Sun"],"pdf_url":"https://arxiv.org/pdf/2405.20848v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.00846v2","updated":"2024-05-31T14:26:47Z","published":"2024-05-01T20:21:44Z","title":"Gameplay Filters: Safe Robot Walking through Adversarial Imagination","summary":" Ensuring the safe operation of legged robots in uncertain, novel environments\nis crucial to their widespread adoption. Despite recent advances in safety\nfilters that can keep arbitrary task-driven policies from incurring safety\nfailures, existing solutions for legged robot locomotion still rely on\nsimplified dynamics and may fail when the robot is perturbed away from\npredefined stable gaits. This paper presents a general approach that leverages\noffline game-theoretic reinforcement learning to synthesize a highly robust\nsafety filter for high-order nonlinear dynamics. This gameplay filter then\nmaintains runtime safety by continually simulating adversarial futures and\nprecluding task-driven actions that would cause it to lose future games (and\nthereby violate safety). Validated on a 36-dimensional quadruped robot\nlocomotion task, the gameplay safety filter exhibits inherent robustness to the\nsim-to-real gap without manual tuning or heuristic designs. Physical\nexperiments demonstrate the effectiveness of the gameplay safety filter under\nperturbations, such as tugging and unmodeled irregular terrains, while\nsimulation studies shed light on how to trade off computation and\nconservativeness without compromising safety.\n","authors":["Duy P. Nguyen","Kai-Chieh Hsu","Wenhao Yu","Jie Tan","Jaime F. Fisac"],"pdf_url":"https://arxiv.org/pdf/2405.00846v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16069v2","updated":"2024-05-31T14:25:58Z","published":"2024-05-25T05:40:16Z","title":"IncomeSCM: From tabular data set to time-series simulator and causal\n estimation benchmark","summary":" Evaluating observational estimators of causal effects demands information\nthat is rarely available: unconfounded interventions and outcomes from the\npopulation of interest, created either by randomization or adjustment. As a\nresult, it is customary to fall back on simulators when creating benchmark\ntasks. Simulators offer great control but are often too simplistic to make\nchallenging tasks, either because they are hand-designed and lack the nuances\nof real-world data, or because they are fit to observational data without\nstructural constraints. In this work, we propose a general, repeatable strategy\nfor turning observational data into sequential structural causal models and\nchallenging estimation tasks by following two simple principles: 1) fitting\nreal-world data where possible, and 2) creating complexity by composing simple,\nhand-designed mechanisms. We implement these ideas in a highly configurable\nsoftware package and apply it to the well-known Adult income data set to\nconstruct the \\tt IncomeSCM simulator. From this, we devise multiple estimation\ntasks and sample data sets to compare established estimators of causal effects.\nThe tasks present a suitable challenge, with effect estimates varying greatly\nin quality between methods, despite similar performance in the modeling of\nfactual outcomes, highlighting the need for dedicated causal estimators and\nmodel selection criteria.\n","authors":["Fredrik D. Johansson"],"pdf_url":"https://arxiv.org/pdf/2405.16069v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20838v1","updated":"2024-05-31T14:25:45Z","published":"2024-05-31T14:25:45Z","title":"einspace: Searching for Neural Architectures from Fundamental Operations","summary":" Neural architecture search (NAS) finds high performing networks for a given\ntask. Yet the results of NAS are fairly prosaic; they did not e.g. create a\nshift from convolutional structures to transformers. This is not least because\nthe search spaces in NAS often aren't diverse enough to include such\ntransformations a priori. Instead, for NAS to provide greater potential for\nfundamental design shifts, we need a novel expressive search space design which\nis built from more fundamental operations. To this end, we introduce einspace,\na search space based on a parameterised probabilistic context-free grammar. Our\nspace is versatile, supporting architectures of various sizes and complexities,\nwhile also containing diverse network operations which allow it to model\nconvolutions, attention components and more. It contains many existing\ncompetitive architectures, and provides flexibility for discovering new ones.\nUsing this search space, we perform experiments to find novel architectures as\nwell as improvements on existing ones on the diverse Unseen NAS datasets. We\nshow that competitive architectures can be obtained by searching from scratch,\nand we consistently find large improvements when initialising the search with\nstrong baselines. We believe that this work is an important advancement towards\na transformative NAS paradigm where search space expressivity and strategic\nsearch initialisation play key roles.\n","authors":["Linus Ericsson","Miguel Espinosa","Chenhongyi Yang","Antreas Antoniou","Amos Storkey","Shay B. Cohen","Steven McDonagh","Elliot J. Crowley"],"pdf_url":"https://arxiv.org/pdf/2405.20838v1.pdf","comment":"Project page at https://linusericsson.github.io/einspace/"},{"id":"http://arxiv.org/abs/2405.20836v1","updated":"2024-05-31T14:24:39Z","published":"2024-05-31T14:24:39Z","title":"Solving partial differential equations with sampled neural networks","summary":" Approximation of solutions to partial differential equations (PDE) is an\nimportant problem in computational science and engineering. Using neural\nnetworks as an ansatz for the solution has proven a challenge in terms of\ntraining time and approximation accuracy. In this contribution, we discuss how\nsampling the hidden weights and biases of the ansatz network from data-agnostic\nand data-dependent probability distributions allows us to progress on both\nchallenges. In most examples, the random sampling schemes outperform iterative,\ngradient-based optimization of physics-informed neural networks regarding\ntraining time and accuracy by several orders of magnitude. For time-dependent\nPDE, we construct neural basis functions only in the spatial domain and then\nsolve the associated ordinary differential equation with classical methods from\nscientific computing over a long time horizon. This alleviates one of the\ngreatest challenges for neural PDE solvers because it does not require us to\nparameterize the solution in time. For second-order elliptic PDE in Barron\nspaces, we prove the existence of sampled networks with $L^2$ convergence to\nthe solution. We demonstrate our approach on several time-dependent and static\nPDEs. We also illustrate how sampled networks can effectively solve inverse\nproblems in this setting. Benefits compared to common numerical schemes include\nspectral convergence and mesh-free construction of basis functions.\n","authors":["Chinmay Datar","Taniya Kapoor","Abhishek Chandra","Qing Sun","Iryna Burak","Erik Lien Bolager","Anna Veselovska","Massimo Fornasier","Felix Dietrich"],"pdf_url":"https://arxiv.org/pdf/2405.20836v1.pdf","comment":"16 pages, 15 figures"},{"id":"http://arxiv.org/abs/2405.20835v1","updated":"2024-05-31T14:24:33Z","published":"2024-05-31T14:24:33Z","title":"Outliers and Calibration Sets have Diminishing Effect on Quantization of\n Modern LLMs","summary":" Post-Training Quantization (PTQ) enhances the efficiency of Large Language\nModels (LLMs) by enabling faster operation and compatibility with more\naccessible hardware through reduced memory usage, at the cost of small\nperformance drops. We explore the role of calibration sets in PTQ, specifically\ntheir effect on hidden activations in various notable open-source LLMs.\nCalibration sets are crucial for evaluating activation magnitudes and\nidentifying outliers, which can distort the quantization range and negatively\nimpact performance. Our analysis reveals a marked contrast in quantization\neffectiveness across models. The older OPT model, which much of the\nquantization literature is based on, shows significant performance\ndeterioration and high susceptibility to outliers with varying calibration\nsets. In contrast, newer models like Llama-2 7B, Llama-3 8B, Command-R 35B, and\nMistral 7B demonstrate strong robustness, with Mistral 7B showing near-immunity\nto outliers and stable activations. These findings suggest a shift in PTQ\nstrategies might be needed. As advancements in pre-training methods reduce the\nrelevance of outliers, there is an emerging need to reassess the fundamentals\nof current quantization literature. The emphasis should pivot towards\noptimizing inference speed, rather than primarily focusing on outlier\npreservation, to align with the evolving characteristics of state-of-the-art\nLLMs.\n","authors":["Davide Paglieri","Saurabh Dash","Tim Rocktäschel","Jack Parker-Holder"],"pdf_url":"https://arxiv.org/pdf/2405.20835v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.07217v2","updated":"2024-05-31T14:23:09Z","published":"2024-02-23T10:08:45Z","title":"Attention-aware Semantic Communications for Collaborative Inference","summary":" We propose a communication-efficient collaborative inference framework in the\ndomain of edge inference, focusing on the efficient use of vision transformer\n(ViT) models. The partitioning strategy of conventional collaborative inference\nfails to reduce communication cost because of the inherent architecture of ViTs\nmaintaining consistent layer dimensions across the entire transformer encoder.\nTherefore, instead of employing the partitioning strategy, our framework\nutilizes a lightweight ViT model on the edge device, with the server deploying\na complicated ViT model. To enhance communication efficiency and achieve the\nclassification accuracy of the server model, we propose two strategies: 1)\nattention-aware patch selection and 2) entropy-aware image transmission.\nAttention-aware patch selection leverages the attention scores generated by the\nedge device's transformer encoder to identify and select the image patches\ncritical for classification. This strategy enables the edge device to transmit\nonly the essential patches to the server, significantly improving communication\nefficiency. Entropy-aware image transmission uses min-entropy as a metric to\naccurately determine whether to depend on the lightweight model on the edge\ndevice or to request the inference from the server model. In our framework, the\nlightweight ViT model on the edge device acts as a semantic encoder,\nefficiently identifying and selecting the crucial image information required\nfor the classification task. Our experiments demonstrate that the proposed\ncollaborative inference framework can reduce communication overhead by 68% with\nonly a minimal loss in accuracy compared to the server model on the ImageNet\ndataset.\n","authors":["Jiwoong Im","Nayoung Kwon","Taewoo Park","Jiheon Woo","Jaeho Lee","Yongjune Kim"],"pdf_url":"https://arxiv.org/pdf/2404.07217v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20830v1","updated":"2024-05-31T14:21:04Z","published":"2024-05-31T14:21:04Z","title":"Self-Augmented Preference Optimization: Off-Policy Paradigms for\n Language Model Alignment","summary":" Traditional language model alignment methods, such as Direct Preference\nOptimization (DPO), are limited by their dependence on static, pre-collected\npaired preference data, which hampers their adaptability and practical\napplicability. To overcome this limitation, we introduce Self-Augmented\nPreference Optimization (SAPO), an effective and scalable training paradigm\nthat does not require existing paired data. Building on the self-play concept,\nwhich autonomously generates negative responses, we further incorporate an\noff-policy learning pipeline to enhance data exploration and exploitation.\nSpecifically, we employ an Exponential Moving Average (EMA) model in\nconjunction with a replay buffer to enable dynamic updates of response\nsegments, effectively integrating real-time feedback with insights from\nhistorical data. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B\nmodels across benchmarks, including the Open LLM Leaderboard, IFEval,\nAlpacaEval 2.0, and MT-Bench, demonstrate that SAPO matches or surpasses\nestablished offline contrastive baselines, such as DPO and Odds Ratio\nPreference Optimization, and outperforms offline self-play methods like SPIN.\nOur code is available at https://github.com/yinyueqin/SAPO\n","authors":["Yueqin Yin","Zhendong Wang","Yujia Xie","Weizhu Chen","Mingyuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.20830v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20829v1","updated":"2024-05-31T14:21:00Z","published":"2024-05-31T14:21:00Z","title":"Rethinking Open-World Semi-Supervised Learning: Distribution Mismatch\n and Inductive Inference","summary":" Open-world semi-supervised learning (OWSSL) extends conventional\nsemi-supervised learning to open-world scenarios by taking account of novel\ncategories in unlabeled datasets. Despite the recent advancements in OWSSL, the\nsuccess often relies on the assumptions that 1) labeled and unlabeled datasets\nshare the same balanced class prior distribution, which does not generally hold\nin real-world applications, and 2) unlabeled training datasets are utilized for\nevaluation, where such transductive inference might not adequately address\nchallenges in the wild. In this paper, we aim to generalize OWSSL by addressing\nthem. Our work suggests that practical OWSSL may require different training\nsettings, evaluation methods, and learning strategies compared to those\nprevalent in the existing literature.\n","authors":["Seongheon Park","Hyuk Kwon","Kwanghoon Sohn","Kibok Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20829v1.pdf","comment":"CVPR Workshop on Computer Vision in the Wild (CVinW), 2024"},{"id":"http://arxiv.org/abs/2312.10045v2","updated":"2024-05-31T14:19:03Z","published":"2023-12-01T11:27:08Z","title":"Interpretable Knowledge Tracing via Response Influence-based\n Counterfactual Reasoning","summary":" Knowledge tracing (KT) plays a crucial role in computer-aided education and\nintelligent tutoring systems, aiming to assess students' knowledge proficiency\nby predicting their future performance on new questions based on their past\nresponse records. While existing deep learning knowledge tracing (DLKT) methods\nhave significantly improved prediction accuracy and achieved state-of-the-art\nresults, they often suffer from a lack of interpretability. To address this\nlimitation, current approaches have explored incorporating psychological\ninfluences to achieve more explainable predictions, but they tend to overlook\nthe potential influences of historical responses. In fact, understanding how\nmodels make predictions based on response influences can enhance the\ntransparency and trustworthiness of the knowledge tracing process, presenting\nan opportunity for a new paradigm of interpretable KT. However, measuring\nunobservable response influences is challenging. In this paper, we resort to\ncounterfactual reasoning that intervenes in each response to answer\n\\textit{what if a student had answered a question incorrectly that he/she\nactually answered correctly, and vice versa}. Based on this, we propose RCKT, a\nnovel response influence-based counterfactual knowledge tracing framework. RCKT\ngenerates response influences by comparing prediction outcomes from factual\nsequences and constructed counterfactual sequences after interventions.\nAdditionally, we introduce maximization and inference techniques to leverage\naccumulated influences from different past responses, further improving the\nmodel's performance and credibility. Extensive experimental results demonstrate\nthat our RCKT method outperforms state-of-the-art knowledge tracing methods on\nfour datasets against six baselines, and provides credible interpretations of\nresponse influences.\n","authors":["Jiajun Cui","Minghe Yu","Bo Jiang","Aimin Zhou","Jianyong Wang","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.10045v2.pdf","comment":"ICDE'24 (fixing a few typos). Source code at\n https://github.com/JJCui96/RCKT. Keywords: knowledge tracing, interpretable\n machine learning, counterfactual reasoning, artificial intelligence for\n education"},{"id":"http://arxiv.org/abs/2405.20825v1","updated":"2024-05-31T14:18:37Z","published":"2024-05-31T14:18:37Z","title":"Analysis of clinical, dosimetric and radiomic features for predicting\n local failure after stereotactic radiotherapy of brain metastases in\n malignant melanoma","summary":" Background: The aim of this study was to investigate the role of clinical,\ndosimetric and pretherapeutic magnetic resonance imaging (MRI) features for\nlesion-specific outcome prediction of stereotactic radiotherapy (SRT) in\npatients with brain metastases from malignant melanoma (MBM).\n Methods: In this multicenter, retrospective analysis, we reviewed 517 MBM\nfrom 130 patients treated with SRT (single fraction or hypofractionated). For\neach gross tumor volume (GTV) 1576 radiomic features (RF) were calculated (788\neach for the GTV and for a 3 mm margin around the GTV). Clinical parameters,\nradiation dose and RF from pretherapeutic contrast-enhanced T1-weighted MRI\nfrom different institutions were evaluated with a feature processing and\nelimination pipeline in a nested cross-validation scheme.\n Results: Seventy-two (72) of 517 lesions (13.9%) showed a local failure (LF)\nafter SRT. The processing pipeline showed clinical, dosimetric and radiomic\nfeatures providing information for LF prediction. The most prominent ones were\nthe correlation of the gray level co-occurrence matrix of the margin (hazard\nratio (HR): 0.37, confidence interval (CI): 0.23-0.58) and systemic therapy\nbefore SRT (HR: 0.55, CI: 0.42-0.70). The majority of RF associated with LF was\ncalculated in the margin around the GTV.\n Conclusions: Pretherapeutic MRI based RF connected with lesion-specific\noutcome after SRT could be identified, despite multicentric data and minor\ndifferences in imaging protocols. Image data analysis of the surrounding\nmetastatic environment may provide therapy-relevant information with the\npotential to further individualize radiotherapy strategies.\n","authors":["Nanna E. Hartong","Ilias Sachpazidis","Oliver Blanck","Lucas Etzel","Jan C. Peeken","Stephanie E. Combs","Horst Urbach","Maxim Zaitsev","Dimos Baltas","Ilinca Popp","Anca-Ligia Grosu","Tobias Fechter"],"pdf_url":"https://arxiv.org/pdf/2405.20825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.14041v4","updated":"2024-05-31T14:18:31Z","published":"2022-12-02T16:34:56Z","title":"Deciphering RNA Secondary Structure Prediction: A Probabilistic K-Rook\n Matching Perspective","summary":" The secondary structure of ribonucleic acid (RNA) is more stable and\naccessible in the cell than its tertiary structure, making it essential for\nfunctional prediction. Although deep learning has shown promising results in\nthis field, current methods suffer from poor generalization and high\ncomplexity. In this work, we reformulate the RNA secondary structure prediction\nas a K-Rook problem, thereby simplifying the prediction process into\nprobabilistic matching within a finite solution space. Building on this\ninnovative perspective, we introduce RFold, a simple yet effective method that\nlearns to predict the most matching K-Rook solution from the given sequence.\nRFold employs a bi-dimensional optimization strategy that decomposes the\nprobabilistic matching problem into row-wise and column-wise components to\nreduce the matching complexity, simplifying the solving process while\nguaranteeing the validity of the output. Extensive experiments demonstrate that\nRFold achieves competitive performance and about eight times faster inference\nefficiency than the state-of-the-art approaches. The code and Colab demo are\navailable in\n\\href{http://github.com/A4Bio/RFold}{http://github.com/A4Bio/RFold}.\n","authors":["Cheng Tan","Zhangyang Gao","Hanqun Cao","Xingran Chen","Ge Wang","Lirong Wu","Jun Xia","Jiangbin Zheng","Stan Z. Li"],"pdf_url":"https://arxiv.org/pdf/2212.14041v4.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2405.20824v1","updated":"2024-05-31T14:16:52Z","published":"2024-05-31T14:16:52Z","title":"Online Convex Optimisation: The Optimal Switching Regret for all\n Segmentations Simultaneously","summary":" We consider the classic problem of online convex optimisation. Whereas the\nnotion of static regret is relevant for stationary problems, the notion of\nswitching regret is more appropriate for non-stationary problems. A switching\nregret is defined relative to any segmentation of the trial sequence, and is\nequal to the sum of the static regrets of each segment. In this paper we show\nthat, perhaps surprisingly, we can achieve the asymptotically optimal switching\nregret on every possible segmentation simultaneously. Our algorithm for doing\nso is very efficient: having a space and per-trial time complexity that is\nlogarithmic in the time-horizon. Our algorithm also obtains novel bounds on its\ndynamic regret: being adaptive to variations in the rate of change of the\ncomparator sequence.\n","authors":["Stephen Pasteris","Chris Hicks","Vasilios Mavroudis","Mark Herbster"],"pdf_url":"https://arxiv.org/pdf/2405.20824v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20821v1","updated":"2024-05-31T14:15:44Z","published":"2024-05-31T14:15:44Z","title":"Pursuing Overall Welfare in Federated Learning through Sequential\n Decision Making","summary":" In traditional federated learning, a single global model cannot perform\nequally well for all clients. Therefore, the need to achieve the client-level\nfairness in federated system has been emphasized, which can be realized by\nmodifying the static aggregation scheme for updating the global model to an\nadaptive one, in response to the local signals of the participating clients.\nOur work reveals that existing fairness-aware aggregation strategies can be\nunified into an online convex optimization framework, in other words, a central\nserver's sequential decision making process. To enhance the decision making\ncapability, we propose simple and intuitive improvements for suboptimal designs\nwithin existing methods, presenting AAggFF. Considering practical requirements,\nwe further subdivide our method tailored for the cross-device and the\ncross-silo settings, respectively. Theoretical analyses guarantee sublinear\nregret upper bounds for both settings: $\\mathcal{O}(\\sqrt{T \\log{K}})$ for the\ncross-device setting, and $\\mathcal{O}(K \\log{T})$ for the cross-silo setting,\nwith $K$ clients and $T$ federation rounds. Extensive experiments demonstrate\nthat the federated system equipped with AAggFF achieves better degree of\nclient-level fairness than existing methods in both practical settings. Code is\navailable at https://github.com/vaseline555/AAggFF\n","authors":["Seok-Ju Hahn","Gi-Soo Kim","Junghye Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20821v1.pdf","comment":"Accepted at ICML 2024"},{"id":"http://arxiv.org/abs/2405.20808v1","updated":"2024-05-31T14:07:33Z","published":"2024-05-31T14:07:33Z","title":"Optimally Improving Cooperative Learning in a Social Setting","summary":" We consider a cooperative learning scenario where a collection of networked\nagents with individually owned classifiers dynamically update their\npredictions, for the same classification task, through communication or\nobservations of each other's predictions. Clearly if highly influential\nvertices use erroneous classifiers, there will be a negative effect on the\naccuracy of all the agents in the network. We ask the following question: how\ncan we optimally fix the prediction of a few classifiers so as maximize the\noverall accuracy in the entire network. To this end we consider an aggregate\nand an egalitarian objective function. We show a polynomial time algorithm for\noptimizing the aggregate objective function, and show that optimizing the\negalitarian objective function is NP-hard. Furthermore, we develop\napproximation algorithms for the egalitarian improvement. The performance of\nall of our algorithms are guaranteed by mathematical analysis and backed by\nexperiments on synthetic and real data.\n","authors":["Shahrzad Haddadan","Cheng Xin","Jie Gao"],"pdf_url":"https://arxiv.org/pdf/2405.20808v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.17810v2","updated":"2024-05-31T14:07:00Z","published":"2024-02-27T12:43:09Z","title":"BioT5+: Towards Generalized Biological Understanding with IUPAC\n Integration and Multi-task Tuning","summary":" Recent research trends in computational biology have increasingly focused on\nintegrating text and bio-entity modeling, especially in the context of\nmolecules and proteins. However, previous efforts like BioT5 faced challenges\nin generalizing across diverse tasks and lacked a nuanced understanding of\nmolecular structures, particularly in their textual representations (e.g.,\nIUPAC). This paper introduces BioT5+, an extension of the BioT5 framework,\ntailored to enhance biological research and drug discovery. BioT5+ incorporates\nseveral novel features: integration of IUPAC names for molecular understanding,\ninclusion of extensive bio-text and molecule data from sources like bioRxiv and\nPubChem, the multi-task instruction tuning for generality across tasks, and a\nnumerical tokenization technique for improved processing of numerical data.\nThese enhancements allow BioT5+ to bridge the gap between molecular\nrepresentations and their textual descriptions, providing a more holistic\nunderstanding of biological entities, and largely improving the grounded\nreasoning of bio-text and bio-sequences. The model is pre-trained and\nfine-tuned with a large number of experiments, including \\emph{3 types of\nproblems (classification, regression, generation), 15 kinds of tasks, and 21\ntotal benchmark datasets}, demonstrating the remarkable performance and\nstate-of-the-art results in most cases. BioT5+ stands out for its ability to\ncapture intricate relationships in biological data, thereby contributing\nsignificantly to bioinformatics and computational biology. Our code is\navailable at \\url{https://github.com/QizhiPei/BioT5}.\n","authors":["Qizhi Pei","Lijun Wu","Kaiyuan Gao","Xiaozhuan Liang","Yin Fang","Jinhua Zhu","Shufang Xie","Tao Qin","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2402.17810v2.pdf","comment":"Accepted by ACL 2024 (Findings)"},{"id":"http://arxiv.org/abs/2402.12550v2","updated":"2024-05-31T14:04:05Z","published":"2024-02-19T21:20:22Z","title":"Multilinear Mixture of Experts: Scalable Expert Specialization through\n Factorization","summary":" The Mixture of Experts (MoE) paradigm provides a powerful way to decompose\ndense layers into smaller, modular computations often more amenable to human\ninterpretation, debugging, and editability. However, a major challenge lies in\nthe computational cost of scaling the number of experts high enough to achieve\nfine-grained specialization. In this paper, we propose the Multilinear Mixture\nof Experts ($\\mu$MoE) layer to address this, focusing on vision models.\n$\\mu$MoE layers enable scalable expert specialization by performing an implicit\ncomputation on prohibitively large weight tensors entirely in factorized form.\nConsequently, $\\mu$MoEs (1) avoid the restrictively high inference-time costs\nof 'soft' MoEs, yet (2) do not inherit the training issues of the popular\n'sparse' MoEs' discrete (non-differentiable) expert routing. We present both\nqualitative and quantitative evidence that scaling $\\mu$MoE layers when\nfine-tuning foundation models for vision tasks leads to more specialized\nexperts at the class-level, further enabling manual bias correction in CelebA\nattribute classification. Finally, we show qualitative results demonstrating\nthe expert specialism achieved when pre-training large GPT2 and MLP-Mixer\nmodels with parameter-matched $\\mu$MoE blocks at every layer, maintaining\ncomparable accuracy. Our code is available at:\nhttps://github.com/james-oldfield/muMoE.\n","authors":["James Oldfield","Markos Georgopoulos","Grigorios G. Chrysos","Christos Tzelepis","Yannis Panagakis","Mihalis A. Nicolaou","Jiankang Deng","Ioannis Patras"],"pdf_url":"https://arxiv.org/pdf/2402.12550v2.pdf","comment":"Github: https://github.com/james-oldfield/muMoE. Project page:\n https://james-oldfield.github.io/muMoE/"},{"id":"http://arxiv.org/abs/2305.15805v3","updated":"2024-05-31T14:02:24Z","published":"2023-05-25T07:39:41Z","title":"Dynamic Context Pruning for Efficient and Interpretable Autoregressive\n Transformers","summary":" Autoregressive Transformers adopted in Large Language Models (LLMs) are hard\nto scale to long sequences. Despite several works trying to reduce their\ncomputational cost, most of LLMs still adopt attention layers between all pairs\nof tokens in the sequence, thus incurring a quadratic cost. In this study, we\npresent a novel approach that dynamically prunes contextual information while\npreserving the model's expressiveness, resulting in reduced memory and\ncomputational requirements during inference. Our method employs a learnable\nmechanism that determines which uninformative tokens can be dropped from the\ncontext at any point across the generation process. By doing so, our approach\nnot only addresses performance concerns but also enhances interpretability,\nproviding valuable insight into the model's decision-making process. Our\ntechnique can be applied to existing pre-trained models through a\nstraightforward fine-tuning process, and the pruning strength can be specified\nby a sparsity parameter. Notably, our empirical findings demonstrate that we\ncan effectively prune up to 80\\% of the context without significant performance\ndegradation on downstream tasks, offering a valuable tool for mitigating\ninference costs. Our reference implementation achieves up to $2\\times$ increase\nin inference throughput and even greater memory savings.\n","authors":["Sotiris Anagnostidis","Dario Pavllo","Luca Biggio","Lorenzo Noci","Aurelien Lucchi","Thomas Hofmann"],"pdf_url":"https://arxiv.org/pdf/2305.15805v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15154v2","updated":"2024-05-31T14:01:32Z","published":"2024-05-24T02:13:46Z","title":"Online Prompt Pricing based on Combinatorial Multi-Armed Bandit and\n Hierarchical Stackelberg Game","summary":" Generation models have shown promising performance in various tasks, making\ntrading around machine learning models possible. In this paper, we aim at a\nnovel prompt trading scenario, prompt bundle trading (PBT) system, and propose\nan online pricing mechanism. Based on the combinatorial multi-armed bandit\n(CMAB) and three-stage hierarchical Stackelburg (HS) game, our pricing\nmechanism considers the profits of the consumer, platform, and seller,\nsimultaneously achieving the profit satisfaction of these three participants.\nWe break down the pricing issue into two steps, namely unknown category\nselection and incentive strategy optimization. The former step is to select a\nset of categories with the highest qualities, and the latter is to derive the\noptimal strategy for each participant based on the chosen categories. Unlike\nthe existing fixed pricing mode, the PBT pricing mechanism we propose is more\nflexible and diverse, which is more in accord with the transaction needs of\nreal-world scenarios. We test our method on a simulated text-to-image dataset.\nThe experimental results demonstrate the effectiveness of our algorithm, which\nprovides a feasible price-setting standard for the prompt marketplaces.\n","authors":["Meiling Li","Hongrun Ren","Haixu Xiong","Zhenxing Qian","Xinpeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.15154v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20800v1","updated":"2024-05-31T14:01:12Z","published":"2024-05-31T14:01:12Z","title":"Shape Constraints in Symbolic Regression using Penalized Least Squares","summary":" We study the addition of shape constraints and their consideration during the\nparameter estimation step of symbolic regression (SR). Shape constraints serve\nas a means to introduce prior knowledge about the shape of the otherwise\nunknown model function into SR. Unlike previous works that have explored shape\nconstraints in SR, we propose minimizing shape constraint violations during\nparameter estimation using gradient-based numerical optimization.\n We test three algorithm variants to evaluate their performance in identifying\nthree symbolic expressions from a synthetically generated data set. This paper\nexamines two benchmark scenarios: one with varying noise levels and another\nwith reduced amounts of training data. The results indicate that incorporating\nshape constraints into the expression search is particularly beneficial when\ndata is scarce. Compared to using shape constraints only in the selection\nprocess, our approach of minimizing violations during parameter estimation\nshows a statistically significant benefit in some of our test cases, without\nbeing significantly worse in any instance.\n","authors":["Viktor Martinek","Julia Reuter","Ophelia Frotscher","Sanaz Mostaghim","Markus Richter","Roland Herzog"],"pdf_url":"https://arxiv.org/pdf/2405.20800v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20799v1","updated":"2024-05-31T14:00:44Z","published":"2024-05-31T14:00:44Z","title":"Rough Transformers: Lightweight Continuous-Time Sequence Modelling with\n Path Signatures","summary":" Time-series data in real-world settings typically exhibit long-range\ndependencies and are observed at non-uniform intervals. In these settings,\ntraditional sequence-based recurrent models struggle. To overcome this,\nresearchers often replace recurrent architectures with Neural ODE-based models\nto account for irregularly sampled data and use Transformer-based architectures\nto account for long-range dependencies. Despite the success of these two\napproaches, both incur very high computational costs for input sequences of\neven moderate length. To address this challenge, we introduce the Rough\nTransformer, a variation of the Transformer model that operates on\ncontinuous-time representations of input sequences and incurs significantly\nlower computational costs. In particular, we propose \\textit{multi-view\nsignature attention}, which uses path signatures to augment vanilla attention\nand to capture both local and global (multi-scale) dependencies in the input\ndata, while remaining robust to changes in the sequence length and sampling\nfrequency and yielding improved spatial processing. We find that, on a variety\nof time-series-related tasks, Rough Transformers consistently outperform their\nvanilla attention counterparts while obtaining the representational benefits of\nNeural ODE-based models, all at a fraction of the computational time and memory\nresources.\n","authors":["Fernando Moreno-Pino","Álvaro Arroyo","Harrison Waldon","Xiaowen Dong","Álvaro Cartea"],"pdf_url":"https://arxiv.org/pdf/2405.20799v1.pdf","comment":"Preprint. Under review. arXiv admin note: text overlap with\n arXiv:2403.10288"},{"id":"http://arxiv.org/abs/2402.09838v2","updated":"2024-05-31T13:59:44Z","published":"2024-02-15T10:00:13Z","title":"Performative Reinforcement Learning in Gradually Shifting Environments","summary":" When Reinforcement Learning (RL) agents are deployed in practice, they might\nimpact their environment and change its dynamics. We propose a new framework to\nmodel this phenomenon, where the current environment depends on the deployed\npolicy as well as its previous dynamics. This is a generalization of\nPerformative RL (PRL) [Mandal et al., 2023]. Unlike PRL, our framework allows\nto model scenarios where the environment gradually adjusts to a deployed\npolicy. We adapt two algorithms from the performative prediction literature to\nour setting and propose a novel algorithm called Mixed Delayed Repeated\nRetraining (MDRR). We provide conditions under which these algorithms converge\nand compare them using three metrics: number of retrainings, approximation\nguarantee, and number of samples per deployment. MDRR is the first algorithm in\nthis setting which combines samples from multiple deployments in its training.\nThis makes MDRR particularly suitable for scenarios where the environment's\nresponse strongly depends on its previous dynamics, which are common in\npractice. We experimentally compare the algorithms using a simulation-based\ntestbed and our results show that MDRR converges significantly faster than\nprevious approaches.\n","authors":["Ben Rank","Stelios Triantafyllou","Debmalya Mandal","Goran Radanovic"],"pdf_url":"https://arxiv.org/pdf/2402.09838v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20797v1","updated":"2024-05-31T13:59:18Z","published":"2024-05-31T13:59:18Z","title":"Ovis: Structural Embedding Alignment for Multimodal Large Language Model","summary":" Current Multimodal Large Language Models (MLLMs) typically integrate a\npre-trained LLM with another pre-trained vision transformer through a\nconnector, such as an MLP, endowing the LLM with visual capabilities. However,\nthe misalignment between two embedding strategies in MLLMs -- the structural\ntextual embeddings based on an embedding look-up table and the continuous\nembeddings generated directly by the vision encoder -- makes challenges for a\nmore seamless fusion of visual and textual information. We propose Ovis, a\nnovel MLLM architecture designed to structurally align visual and textual\nembeddings. Ovis integrates an additional learnable visual embedding table into\nthe visual encoder's process. To capture rich visual semantics, each image\npatch indexes the visual embedding table multiple times, resulting in a final\nvisual embedding that is a probabilistic combination of the indexed embeddings.\nThis structural approach mirrors the method used for generating textual\nembeddings. Empirical evaluations on various multimodal benchmarks demonstrate\nthat Ovis outperforms open-source MLLMs of similar parameter scales and even\nsurpasses the proprietary model Qwen-VL-Plus overall. These results highlight\nthe potential of Ovis' structured visual representation for advancing MLLM\narchitectural design and promoting more effective multimodal learning. Both the\nsource code and the training dataset of Ovis will be made publicly available.\n","authors":["Shiyin Lu","Yang Li","Qing-Guo Chen","Zhao Xu","Weihua Luo","Kaifu Zhang","Han-Jia Ye"],"pdf_url":"https://arxiv.org/pdf/2405.20797v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20794v1","updated":"2024-05-31T13:54:25Z","published":"2024-05-31T13:54:25Z","title":"Model Interpretation and Explainability: Towards Creating Transparency\n in Prediction Models","summary":" Explainable AI (XAI) has a counterpart in analytical modeling which we refer\nto as model explainability. We tackle the issue of model explainability in the\ncontext of prediction models. We analyze a dataset of loans from a credit card\ncompany and apply three stages: execute and compare four different prediction\nmethods, apply the best known explainability techniques in the current\nliterature to the model training sets to identify feature importance (FI)\n(static case), and finally to cross-check whether the FI set holds up under\nwhat if prediction scenarios for continuous and categorical variables (dynamic\ncase). We found inconsistency in FI identification between the static and\ndynamic cases. We summarize the state of the art in model explainability and\nsuggest further research to advance the field.\n","authors":["Donald Kridel","Jacob Dineen","Daniel Dolk","David Castillo"],"pdf_url":"https://arxiv.org/pdf/2405.20794v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13397v3","updated":"2024-05-31T13:53:26Z","published":"2023-10-20T10:12:06Z","title":"Equivariant Deep Weight Space Alignment","summary":" Permutation symmetries of deep networks make basic operations like model\nmerging and similarity estimation challenging. In many cases, aligning the\nweights of the networks, i.e., finding optimal permutations between their\nweights, is necessary. Unfortunately, weight alignment is an NP-hard problem.\nPrior research has mainly focused on solving relaxed versions of the alignment\nproblem, leading to either time-consuming methods or sub-optimal solutions. To\naccelerate the alignment process and improve its quality, we propose a novel\nframework aimed at learning to solve the weight alignment problem, which we\nname Deep-Align. To that end, we first prove that weight alignment adheres to\ntwo fundamental symmetries and then, propose a deep architecture that respects\nthese symmetries. Notably, our framework does not require any labeled data. We\nprovide a theoretical analysis of our approach and evaluate Deep-Align on\nseveral types of network architectures and learning setups. Our experimental\nresults indicate that a feed-forward pass with Deep-Align produces better or\nequivalent alignments compared to those produced by current optimization\nalgorithms. Additionally, our alignments can be used as an effective\ninitialization for other methods, leading to improved solutions with a\nsignificant speedup in convergence.\n","authors":["Aviv Navon","Aviv Shamsian","Ethan Fetaya","Gal Chechik","Nadav Dym","Haggai Maron"],"pdf_url":"https://arxiv.org/pdf/2310.13397v3.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2405.20791v1","updated":"2024-05-31T13:48:54Z","published":"2024-05-31T13:48:54Z","title":"GS-Phong: Meta-Learned 3D Gaussians for Relightable Novel View Synthesis","summary":" Decoupling the illumination in 3D scenes is crucial for novel view synthesis\nand relighting. In this paper, we propose a novel method for representing a\nscene illuminated by a point light using a set of relightable 3D Gaussian\npoints. Inspired by the Blinn-Phong model, our approach decomposes the scene\ninto ambient, diffuse, and specular components, enabling the synthesis of\nrealistic lighting effects. To facilitate the decomposition of geometric\ninformation independent of lighting conditions, we introduce a novel bilevel\noptimization-based meta-learning framework. The fundamental idea is to view the\nrendering tasks under various lighting positions as a multi-task learning\nproblem, which our meta-learning approach effectively addresses by generalizing\nthe learned Gaussian geometries not only across different viewpoints but also\nacross diverse light positions. Experimental results demonstrate the\neffectiveness of our approach in terms of training efficiency and rendering\nquality compared to existing methods for free-viewpoint relighting.\n","authors":["Yumeng He","Yunbo Wang","Xiaokang Yang"],"pdf_url":"https://arxiv.org/pdf/2405.20791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20790v1","updated":"2024-05-31T13:45:52Z","published":"2024-05-31T13:45:52Z","title":"Intersectional Unfairness Discovery","summary":" AI systems have been shown to produce unfair results for certain subgroups of\npopulation, highlighting the need to understand bias on certain sensitive\nattributes. Current research often falls short, primarily focusing on the\nsubgroups characterized by a single sensitive attribute, while neglecting the\nnature of intersectional fairness of multiple sensitive attributes. This paper\nfocuses on its one fundamental aspect by discovering diverse high-bias\nsubgroups under intersectional sensitive attributes. Specifically, we propose a\nBias-Guided Generative Network (BGGN). By treating each bias value as a reward,\nBGGN efficiently generates high-bias intersectional sensitive attributes.\nExperiments on real-world text and image datasets demonstrate a diverse and\nefficient discovery of BGGN. To further evaluate the generated unseen but\npossible unfair intersectional sensitive attributes, we formulate them as\nprompts and use modern generative AI to produce new texts and images. The\nresults of frequently generating biased data provides new insights of\ndiscovering potential unfairness in popular modern generative AI systems.\nWarning: This paper contains generative examples that are offensive in nature.\n","authors":["Gezheng Xu","Qi Chen","Charles Ling","Boyu Wang","Changjian Shui"],"pdf_url":"https://arxiv.org/pdf/2405.20790v1.pdf","comment":"ICML-2024 Camera-ready"},{"id":"http://arxiv.org/abs/2405.20772v1","updated":"2024-05-31T13:28:37Z","published":"2024-05-31T13:28:37Z","title":"Reinforcement Learning for Sociohydrology","summary":" In this study, we discuss how reinforcement learning (RL) provides an\neffective and efficient framework for solving sociohydrology problems. The\nefficacy of RL for these types of problems is evident because of its ability to\nupdate policies in an iterative manner - something that is also foundational to\nsociohydrology, where we are interested in representing the co-evolution of\nhuman-water interactions. We present a simple case study to demonstrate the\nimplementation of RL in a problem of runoff reduction through management\ndecisions related to changes in land-use land-cover (LULC). We then discuss the\nbenefits of RL for these types of problems and share our perspectives on the\nfuture research directions in this area.\n","authors":["Tirthankar Roy","Shivendra Srivastava","Beichen Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20772v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.10144v2","updated":"2024-05-31T13:11:15Z","published":"2024-03-15T09:43:52Z","title":"NLP Verification: Towards a General Methodology for Certifying\n Robustness","summary":" Deep neural networks have exhibited substantial success in the field of\nNatural Language Processing and ensuring their safety and reliability is\ncrucial: there are safety critical contexts where such models must be robust to\nvariability or attack, and give guarantees over their output. Unlike Computer\nVision, NLP lacks a unified verification methodology and, despite recent\nadvancements in literature, they are often light on the pragmatical issues of\nNLP verification. In this paper, we attempt to distil and evaluate general\ncomponents of an NLP verification pipeline, that emerges from the progress in\nthe field to date. Our contributions are two-fold. Firstly, we give a general\n(i.e. algorithm-independent) characterisation of verifiable subspaces that\nresult from embedding sentences into continuous spaces. We identify, and give\nan effective method to deal with, the technical challenge of semantic\ngeneralisability of verified subspaces; and propose it as a standard metric in\nthe NLP verification pipelines (alongside with the standard metrics of model\naccuracy and model verifiability). Secondly, we propose a general methodology\nto analyse the effect of the embedding gap -- a problem that refers to the\ndiscrepancy between verification of geometric subspaces, and the semantic\nmeaning of sentences which the geometric subspaces are supposed to represent.\nIn extreme cases, poor choices in embedding of sentences may invalidate\nverification results. We propose a number of practical NLP methods that can\nhelp to quantify the effects of the embedding gap; and in particular we propose\nthe metric of falsifiability of semantic subspaces as another fundamental\nmetric to be reported as part of the NLP verification pipeline. We believe that\ntogether these general principles pave the way towards a more consolidated and\neffective development of this new domain.\n","authors":["Marco Casadio","Tanvi Dinkar","Ekaterina Komendantskaya","Luca Arnaboldi","Matthew L. Daggitt","Omri Isac","Guy Katz","Verena Rieser","Oliver Lemon"],"pdf_url":"https://arxiv.org/pdf/2403.10144v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20763v1","updated":"2024-05-31T12:32:34Z","published":"2024-05-31T12:32:34Z","title":"Improving Generalization and Convergence by Enhancing Implicit\n Regularization","summary":" In this work, we propose an Implicit Regularization Enhancement (IRE)\nframework to accelerate the discovery of flat solutions in deep learning,\nthereby improving generalization and convergence. Specifically, IRE decouples\nthe dynamics of flat and sharp directions, which boosts the sharpness reduction\nalong flat directions while maintaining the training stability in sharp\ndirections. We show that IRE can be practically incorporated with {\\em generic\nbase optimizers} without introducing significant computational overload.\nExperiments show that IRE consistently improves the generalization performance\nfor image classification tasks across a variety of benchmark datasets\n(CIFAR-10/100, ImageNet) and models (ResNets and ViTs). Surprisingly, IRE also\nachieves a $2\\times$ {\\em speed-up} compared to AdamW in the pre-training of\nLlama models (of sizes ranging from 60M to 229M) on datasets including\nWikitext-103, Minipile, and Openwebtext. Moreover, we provide theoretical\nguarantees, showing that IRE can substantially accelerate the convergence\ntowards flat minima in Sharpness-aware Minimization (SAM).\n","authors":["Mingze Wang","Haotian He","Jinbo Wang","Zilin Wang","Guanhua Huang","Feiyu Xiong","Zhiyu Li","Weinan E","Lei Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20763v1.pdf","comment":"35 pages"},{"id":"http://arxiv.org/abs/2402.07043v2","updated":"2024-05-31T12:27:52Z","published":"2024-02-10T21:06:34Z","title":"A Tale of Tails: Model Collapse as a Change of Scaling Laws","summary":" As AI model size grows, neural scaling laws have become a crucial tool to\npredict the improvements of large models when increasing capacity and the size\nof original (human or natural) training data. Yet, the widespread use of\npopular models means that the ecosystem of online data and text will co-evolve\nto progressively contain increased amounts of synthesized data. In this paper\nwe ask: How will the scaling laws change in the inevitable regime where\nsynthetic data makes its way into the training corpus? Will future models,\nstill improve, or be doomed to degenerate up to total (model) collapse? We\ndevelop a theoretical framework of model collapse through the lens of scaling\nlaws. We discover a wide range of decay phenomena, analyzing loss of scaling,\nshifted scaling with number of generations, the ''un-learning\" of skills, and\ngrokking when mixing human and synthesized data. Our theory is validated by\nlarge-scale experiments with a transformer on an arithmetic task and text\ngeneration using the large language model Llama2.\n","authors":["Elvis Dohmatob","Yunzhen Feng","Pu Yang","Francois Charton","Julia Kempe"],"pdf_url":"https://arxiv.org/pdf/2402.07043v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20761v1","updated":"2024-05-31T12:27:38Z","published":"2024-05-31T12:27:38Z","title":"Share Your Secrets for Privacy! Confidential Forecasting with Vertical\n Federated Learning","summary":" Vertical federated learning (VFL) is a promising area for time series\nforecasting in industrial applications, such as predictive maintenance and\nmachine control. Critical challenges to address in manufacturing include data\nprivacy and over-fitting on small and noisy datasets during both training and\ninference. Additionally, to increase industry adaptability, such forecasting\nmodels must scale well with the number of parties while ensuring strong\nconvergence and low-tuning complexity. We address those challenges and propose\n'Secret-shared Time Series Forecasting with VFL' (STV), a novel framework that\nexhibits the following key features: i) a privacy-preserving algorithm for\nforecasting with SARIMAX and autoregressive trees on vertically partitioned\ndata; ii) serverless forecasting using secret sharing and multi-party\ncomputation; iii) novel N-party algorithms for matrix multiplication and\ninverse operations for direct parameter optimization, giving strong convergence\nwith minimal hyperparameter tuning complexity. We conduct evaluations on six\nrepresentative datasets from public and industry-specific contexts. Our results\ndemonstrate that STV's forecasting accuracy is comparable to those of\ncentralized approaches. They also show that our direct optimization can\noutperform centralized methods, which include state-of-the-art diffusion models\nand long-short-term memory, by 23.81% on forecasting accuracy. We also conduct\na scalability analysis by examining the communication costs of direct and\niterative optimization to navigate the choice between the two. Code and\nappendix are available: https://github.com/adis98/STV\n","authors":["Aditya Shankar","Lydia Y. Chen","Jérémie Decouchant","Dimitra Gkorou","Rihan Hai"],"pdf_url":"https://arxiv.org/pdf/2405.20761v1.pdf","comment":"Submitted to the 27TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE\n (ECAI 2024)"},{"id":"http://arxiv.org/abs/2309.16476v2","updated":"2024-05-31T12:25:31Z","published":"2023-09-28T14:39:50Z","title":"High-dimensional robust regression under heavy-tailed data: Asymptotics\n and Universality","summary":" We investigate the high-dimensional properties of robust regression\nestimators in the presence of heavy-tailed contamination of both the covariates\nand response functions. In particular, we provide a sharp asymptotic\ncharacterisation of M-estimators trained on a family of elliptical covariate\nand noise data distributions including cases where second and higher moments do\nnot exist. We show that, despite being consistent, the Huber loss with\noptimally tuned location parameter $\\delta$ is suboptimal in the\nhigh-dimensional regime in the presence of heavy-tailed noise, highlighting the\nnecessity of further regularisation to achieve optimal performance. This result\nalso uncovers the existence of a transition in $\\delta$ as a function of the\nsample complexity and contamination. Moreover, we derive the decay rates for\nthe excess risk of ridge regression. We show that, while it is both optimal and\nuniversal for covariate distributions with finite second moment, its decay rate\ncan be considerably faster when the covariates' second moment does not exist.\nFinally, we show that our formulas readily generalise to a richer family of\nmodels and data distributions, such as generalised linear estimation with\narbitrary convex regularisation trained on mixture models.\n","authors":["Urte Adomaityte","Leonardo Defilippis","Bruno Loureiro","Gabriele Sicuro"],"pdf_url":"https://arxiv.org/pdf/2309.16476v2.pdf","comment":"13 pages + Supplementary information"},{"id":"http://arxiv.org/abs/2405.20759v1","updated":"2024-05-31T12:20:02Z","published":"2024-05-31T12:20:02Z","title":"Information Theoretic Text-to-Image Alignment","summary":" Diffusion models for Text-to-Image (T2I) conditional generation have seen\ntremendous success recently. Despite their success, accurately capturing user\nintentions with these models still requires a laborious trial and error\nprocess. This challenge is commonly identified as a model alignment problem, an\nissue that has attracted considerable attention by the research community.\nInstead of relying on fine-grained linguistic analyses of prompts, human\nannotation, or auxiliary vision-language models to steer image generation, in\nthis work we present a novel method that relies on an information-theoretic\nalignment measure. In a nutshell, our method uses self-supervised fine-tuning\nand relies on point-wise mutual information between prompts and images to\ndefine a synthetic training set to induce model alignment. Our comparative\nanalysis shows that our method is on-par or superior to the state-of-the-art,\nyet requires nothing but a pre-trained denoising network to estimate MI and a\nlightweight fine-tuning strategy.\n","authors":["Chao Wang","Giulio Franzese","Alessandro Finamore","Massimo Gallo","Pietro Michiardi"],"pdf_url":"https://arxiv.org/pdf/2405.20759v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16056v3","updated":"2024-05-31T11:44:39Z","published":"2024-05-25T04:51:41Z","title":"FedSheafHN: Personalized Federated Learning on Graph-structured Data","summary":" Personalized subgraph Federated Learning (FL) is a task that customizes Graph\nNeural Networks (GNNs) to individual client needs, accommodating diverse data\ndistributions. However, applying hypernetworks in FL, while aiming to\nfacilitate model personalization, often encounters challenges due to inadequate\nrepresentation of client-specific characteristics. To overcome these\nlimitations, we propose a model called FedSheafHN, using enhanced collaboration\ngraph embedding and efficient personalized model parameter generation.\nSpecifically, our model embeds each client's local subgraph into a\nserver-constructed collaboration graph. We utilize sheaf diffusion in the\ncollaboration graph to learn client representations. Our model improves the\nintegration and interpretation of complex client characteristics. Furthermore,\nour model ensures the generation of personalized models through advanced\nhypernetworks optimized for parallel operations across clients. Empirical\nevaluations demonstrate that FedSheafHN outperforms existing methods in most\nscenarios, in terms of client model performance on various graph-structured\ndatasets. It also has fast model convergence and effective new clients\ngeneralization.\n","authors":["Wenfei Liang","Yanan Zhao","Rui She","Yiming Li","Wee Peng Tay"],"pdf_url":"https://arxiv.org/pdf/2405.16056v3.pdf","comment":"This paper was submitted to ICML 2024 in Feb 2024. You can find a\n record\n here:https://github.com/CarrieWFF/ICML-2024-submission-recording/blob/main/Screenshot%20of%20FedSheafHN%20submission%20to%20ICML%202024.png"},{"id":"http://arxiv.org/abs/2405.19542v2","updated":"2024-05-31T11:31:12Z","published":"2024-05-29T22:04:40Z","title":"Anatomical Region Recognition and Real-time Bone Tracking Methods by\n Dynamically Decoding A-Mode Ultrasound Signals","summary":" Accurate bone tracking is crucial for kinematic analysis in orthopedic\nsurgery and prosthetic robotics. Traditional methods (e.g., skin markers) are\nsubject to soft tissue artifacts, and the bone pins used in surgery introduce\nthe risk of additional trauma and infection. For electromyography (EMG), its\ninability to directly measure joint angles requires complex algorithms for\nkinematic estimation. To address these issues, A-mode ultrasound-based tracking\nhas been proposed as a non-invasive and safe alternative. However, this\napproach suffers from limited accuracy in peak detection when processing\nreceived ultrasound signals. To build a precise and real-time bone tracking\napproach, this paper introduces a deep learning-based method for anatomical\nregion recognition and bone tracking using A-mode ultrasound signals,\nspecifically focused on the knee joint. The algorithm is capable of\nsimultaneously performing bone tracking and identifying the anatomical region\nwhere the A-mode ultrasound transducer is placed. It contains the fully\nconnection between all encoding and decoding layers of the cascaded U-Nets to\nfocus only on the signal region that is most likely to have the bone peak, thus\npinpointing the exact location of the peak and classifying the anatomical\nregion of the signal. The experiment showed a 97% accuracy in the\nclassification of the anatomical regions and a precision of around 0.5$\\pm$1mm\nunder dynamic tracking conditions for various anatomical areas surrounding the\nknee joint. In general, this approach shows great potential beyond the\ntraditional method, in terms of the accuracy achieved and the recognition of\nthe anatomical region where the ultrasound has been attached as an additional\nfunctionality.\n","authors":["Bangyu Lan","Stefano Stramigioli","Kenan Niu"],"pdf_url":"https://arxiv.org/pdf/2405.19542v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.02617v3","updated":"2024-05-31T11:30:31Z","published":"2023-06-05T06:31:14Z","title":"Permutation Decision Trees","summary":" Decision Tree is a well understood Machine Learning model that is based on\nminimizing impurities in the internal nodes. The most common impurity measures\nare Shannon entropy and Gini impurity. These impurity measures are insensitive\nto the order of training data and hence the final tree obtained is invariant to\nany permutation of the data. This is a limitation in terms of modeling when\nthere are temporal order dependencies between data instances. In this research,\nwe propose the adoption of Effort-To-Compress (ETC) - a complexity measure, for\nthe first time, as an alternative impurity measure. Unlike Shannon entropy and\nGini impurity, structural impurity based on ETC is able to capture order\ndependencies in the data, thus obtaining potentially different decision trees\nfor different permutations of the same data instances, a concept we term as\nPermutation Decision Trees (PDT). We then introduce the notion of Permutation\nBagging achieved using permutation decision trees without the need for random\nfeature selection and sub-sampling. We conduct a performance comparison between\nPermutation Decision Trees and classical decision trees across various\nreal-world datasets, including Appendicitis, Breast Cancer Wisconsin, Diabetes\nPima Indian, Ionosphere, Iris, Sonar, and Wine. Our findings reveal that PDT\ndemonstrates comparable performance to classical decision trees across most\ndatasets. Remarkably, in certain instances, PDT even slightly surpasses the\nperformance of classical decision trees. In comparing Permutation Bagging with\nRandom Forest, we attain comparable performance to Random Forest models\nconsisting of 50 to 1000 trees, using merely 21 trees. This highlights the\nefficiency and effectiveness of Permutation Bagging in achieving comparable\nperformance outcomes with significantly fewer trees.\n","authors":["Harikrishnan N B","Arham Jain","Nithin Nagaraj"],"pdf_url":"https://arxiv.org/pdf/2306.02617v3.pdf","comment":"15 pages, 8 figures"},{"id":"http://arxiv.org/abs/2311.01906v2","updated":"2024-05-31T11:14:16Z","published":"2023-11-03T13:30:52Z","title":"Simplifying Transformer Blocks","summary":" A simple design recipe for deep Transformers is to compose identical building\nblocks. But standard transformer blocks are far from simple, interweaving\nattention and MLP sub-blocks with skip connections & normalisation layers in\nprecise arrangements. This complexity leads to brittle architectures, where\nseemingly minor changes can significantly reduce training speed, or render\nmodels untrainable.\n In this work, we ask to what extent the standard transformer block can be\nsimplified? Combining signal propagation theory and empirical observations, we\nmotivate modifications that allow many block components to be removed with no\nloss of training speed, including skip connections, projection or value\nparameters, sequential sub-blocks and normalisation layers. In experiments on\nboth autoregressive decoder-only and BERT encoder-only models, our simplified\ntransformers emulate the per-update training speed and performance of standard\ntransformers, while enjoying 15% faster training throughput, and using 15%\nfewer parameters.\n","authors":["Bobby He","Thomas Hofmann"],"pdf_url":"https://arxiv.org/pdf/2311.01906v2.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2405.20748v1","updated":"2024-05-31T10:30:14Z","published":"2024-05-31T10:30:14Z","title":"OpenTensor: Reproducing Faster Matrix Multiplication Discovering\n Algorithms","summary":" OpenTensor is a reproduction of AlphaTensor, which discovered a new algorithm\nthat outperforms the state-of-the-art methods for matrix multiplication by Deep\nReinforcement Learning (DRL). While AlphaTensor provides a promising framework\nfor solving scientific problems, it is really hard to reproduce due to the\nmassive tricks and lack of source codes. In this paper, we clean up the\nalgorithm pipeline, clarify the technical details, and make some improvements\nto the training process. Computational results show that OpenTensor can\nsuccessfully find efficient matrix multiplication algorithms.\n","authors":["Yiwen Sun","Wenye Li"],"pdf_url":"https://arxiv.org/pdf/2405.20748v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20743v1","updated":"2024-05-31T10:13:17Z","published":"2024-05-31T10:13:17Z","title":"Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent\n Codes","summary":" Trajectory forecasting is crucial for video surveillance analytics, as it\nenables the anticipation of future movements for a set of agents, e.g.\nbasketball players engaged in intricate interactions with long-term intentions.\nDeep generative models offer a natural learning approach for trajectory\nforecasting, yet they encounter difficulties in achieving an optimal balance\nbetween sampling fidelity and diversity. We address this challenge by\nleveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a\ndiscrete latent space to tackle the issue of posterior collapse. Specifically,\nwe introduce an instance-based codebook that allows tailored latent\nrepresentations for each example. In a nutshell, the rows of the codebook are\ndynamically adjusted to reflect contextual information (i.e., past motion\npatterns extracted from the observed trajectories). In this way, the\ndiscretization process gains flexibility, leading to improved reconstructions.\nNotably, instance-level dynamics are injected into the codebook through\nlow-rank updates, which restrict the customization of the codebook to a lower\ndimension space. The resulting discrete space serves as the basis of the\nsubsequent step, which regards the training of a diffusion-based predictive\nmodel. We show that such a two-fold framework, augmented with instance-level\ndiscretization, leads to accurate and diverse forecasts, yielding\nstate-of-the-art performance on three established benchmarks.\n","authors":["Riccardo Benaglia","Angelo Porrello","Pietro Buzzega","Simone Calderara","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2405.20743v1.pdf","comment":"15 pages, 3 figures, 5 tables"},{"id":"http://arxiv.org/abs/2405.20738v1","updated":"2024-05-31T10:07:24Z","published":"2024-05-31T10:07:24Z","title":"Federated Random Forest for Partially Overlapping Clinical Data","summary":" In the healthcare sector, a consciousness surrounding data privacy and\ncorresponding data protection regulations, as well as heterogeneous and\nnon-harmonized data, pose huge challenges to large-scale data analysis.\nMoreover, clinical data often involves partially overlapping features, as some\nobservations may be missing due to various reasons, such as differences in\nprocedures, diagnostic tests, or other recorded patient history information\nacross hospitals or institutes. To address the challenges posed by partially\noverlapping features and incomplete data in clinical datasets, a comprehensive\napproach is required. Particularly in the domain of medical data, promising\noutcomes are achieved by federated random forests whenever features align.\nHowever, for most standard algorithms, like random forest, it is essential that\nall data sets have identical parameters. Therefore, in this work the concept of\nfederated random forest is adapted to a setting with partially overlapping\nfeatures. Moreover, our research assesses the effectiveness of the newly\ndeveloped federated random forest models for partially overlapping clinical\ndata. For aggregating the federated, globally optimized model, only features\navailable locally at each site can be used. We tackled two issues in\nfederation: (i) the quantity of involved parties, (ii) the varying overlap of\nfeatures. This evaluation was conducted across three clinical datasets. The\nfederated random forest model even in cases where only a subset of features\noverlaps consistently demonstrates superior performance compared to its local\ncounterpart. This holds true across various scenarios, including datasets with\nimbalanced classes. Consequently, federated random forests for partially\noverlapped data offer a promising solution to transcend barriers in\ncollaborative research and corporate cooperation.\n","authors":["Youngjun Park","Cord Eric Schmidt","Benedikt Marcel Batton","Anne-Christin Hauschild"],"pdf_url":"https://arxiv.org/pdf/2405.20738v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.06958v3","updated":"2024-05-31T09:58:08Z","published":"2023-11-12T20:52:14Z","title":"Towards Climate Variable Prediction with Conditioned Spatio-Temporal\n Normalizing Flows","summary":" This study investigates how conditional normalizing flows can be applied to\nremote sensing data products in climate science for spatio-temporal prediction.\nThe method is chosen due to its desired properties such as exact likelihood\ncomputation, predictive uncertainty estimation and efficient inference and\nsampling which facilitates faster exploration of climate scenarios.\nExperimental findings reveal that the conditioned spatio-temporal flow\nsurpasses both deterministic and stochastic baselines in prolonged rollout\nscenarios. It exhibits stable extrapolation beyond the training time horizon\nfor extended rollout durations. These findings contribute valuable insights to\nthe field of spatio-temporal modeling, with potential applications spanning\ndiverse scientific disciplines.\n","authors":["Christina Winkler","David Rolnick"],"pdf_url":"https://arxiv.org/pdf/2311.06958v3.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2405.20731v1","updated":"2024-05-31T09:39:41Z","published":"2024-05-31T09:39:41Z","title":"Maximum Temperature Prediction Using Remote Sensing Data Via\n Convolutional Neural Network","summary":" Urban heat islands, defined as specific zones exhibiting substantially higher\ntemperatures than their immediate environs, pose significant threats to\nenvironmental sustainability and public health. This study introduces a novel\nmachine-learning model that amalgamates data from the Sentinel-3 satellite,\nmeteorological predictions, and additional remote sensing inputs. The primary\naim is to generate detailed spatiotemporal maps that forecast the peak\ntemperatures within a 24-hour period in Turin. Experimental results validate\nthe model's proficiency in predicting temperature patterns, achieving a Mean\nAbsolute Error (MAE) of 2.09 degrees Celsius for the year 2023 at a resolution\nof 20 meters per pixel, thereby enriching our knowledge of urban climatic\nbehavior. This investigation enhances the understanding of urban microclimates,\nemphasizing the importance of cross-disciplinary data integration, and laying\nthe groundwork for informed policy-making aimed at alleviating the negative\nimpacts of extreme urban temperatures.\n","authors":["Lorenzo Innocenti","Giacomo Blanco","Luca Barco","Claudio Rossi"],"pdf_url":"https://arxiv.org/pdf/2405.20731v1.pdf","comment":"4 pages, submitted to IEEE MetroLivEnv 2024 conference"},{"id":"http://arxiv.org/abs/2405.20091v2","updated":"2024-05-31T09:35:36Z","published":"2024-05-30T14:27:40Z","title":"Visual Attention Analysis in Online Learning","summary":" In this paper, we present an approach in the Multimodal Learning Analytics\nfield. Within this approach, we have developed a tool to visualize and analyze\neye movement data collected during learning sessions in online courses. The\ntool is named VAAD (an acronym for Visual Attention Analysis Dashboard). These\neye movement data have been gathered using an eye-tracker and subsequently\nprocessed and visualized for interpretation. The purpose of the tool is to\nconduct a descriptive analysis of the data by facilitating its visualization,\nenabling the identification of differences and learning patterns among various\nlearner populations. Additionally, it integrates a predictive module capable of\nanticipating learner activities during a learning session. Consequently, VAAD\nholds the potential to offer valuable insights into online learning behaviors\nfrom both descriptive and predictive perspectives.\n","authors":["Miriam Navarro","Álvaro Becerra","Roberto Daza","Ruth Cobos","Aythami Morales","Julian Fierrez"],"pdf_url":"https://arxiv.org/pdf/2405.20091v2.pdf","comment":"Accepted in CEDI 2024 (VII Congreso Espa\\~nol de Inform\\'atica), A\n Coru\\~na, Spain"},{"id":"http://arxiv.org/abs/2405.20724v1","updated":"2024-05-31T09:26:26Z","published":"2024-05-31T09:26:26Z","title":"Learning on Large Graphs using Intersecting Communities","summary":" Message Passing Neural Networks (MPNNs) are a staple of graph machine\nlearning. MPNNs iteratively update each node's representation in an input graph\nby aggregating messages from the node's neighbors, which necessitates a memory\ncomplexity of the order of the number of graph edges. This complexity might\nquickly become prohibitive for large graphs provided they are not very sparse.\nIn this paper, we propose a novel approach to alleviate this problem by\napproximating the input graph as an intersecting community graph (ICG) -- a\ncombination of intersecting cliques. The key insight is that the number of\ncommunities required to approximate a graph does not depend on the graph size.\nWe develop a new constructive version of the Weak Graph Regularity Lemma to\nefficiently construct an approximating ICG for any input graph. We then devise\nan efficient graph learning algorithm operating directly on ICG in linear\nmemory and time with respect to the number of nodes (rather than edges). This\noffers a new and fundamentally different pipeline for learning on very large\nnon-sparse graphs, whose applicability is demonstrated empirically on node\nclassification tasks and spatio-temporal data processing.\n","authors":["Ben Finkelshtein","İsmail İlkan Ceylan","Michael Bronstein","Ron Levie"],"pdf_url":"https://arxiv.org/pdf/2405.20724v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10843v2","updated":"2024-05-31T09:20:24Z","published":"2023-06-19T10:45:10Z","title":"Female mosquito detection by means of AI techniques inside release\n containers in the context of a Sterile Insect Technique program","summary":" The Sterile Insect Technique (SIT) is a biological pest control technique\nbased on the release into the environment of sterile males of the insect\nspecies whose population is to be controlled. The entire SIT process involves\nmass-rearing within a biofactory, sorting of the specimens by sex,\nsterilization, and subsequent release of the sterile males into the\nenvironment. The reason for avoiding the release of female specimens is\nbecause, unlike males, females bite, with the subsequent risk of disease\ntransmission. In the case of Aedes mosquito biofactories for SIT, the key point\nof the whole process is sex separation. This process is nowadays performed by a\ncombination of mechanical devices and AI-based vision systems. However, there\nis still a possibility of false negatives, so a last stage of verification is\nnecessary before releasing them into the environment. It is known that the\nsound produced by the flapping of adult male mosquitoes is different from that\nproduced by females, so this feature can be used to detect the presence of\nfemales in containers prior to environmental release. This paper presents a\nstudy for the detection of females in Aedes mosquito release vessels for SIT\nprograms. The containers used consist of PVC a tubular design of 8.8cm diameter\nand 12.5cm height. The containers were placed in an experimental setup that\nallowed the recording of the sound of mosquito flight inside of them. Each\ncontainer was filled with 250 specimens considering the cases of (i) only male\nmosquitoes, (ii) only female mosquitoes, and (iii) 75% males and 25% females.\nCase (i) was used for training and testing, whereas cases (ii) and (iii) were\nused only for testing. Two algorithms were implemented for the detection of\nfemale mosquitoes: an unsupervised outlier detection algorithm (iForest) and a\none-class SVM trained with male-only recordings.\n","authors":["Javier Naranjo-Alcazar","Jordi Grau-Haro","David Almenar","Pedro Zuccarello"],"pdf_url":"https://arxiv.org/pdf/2306.10843v2.pdf","comment":"Accepted EUSIPCO 2024"},{"id":"http://arxiv.org/abs/2405.20717v1","updated":"2024-05-31T09:14:36Z","published":"2024-05-31T09:14:36Z","title":"Cyclic image generation using chaotic dynamics","summary":" Successive image generation using cyclic transformations is demonstrated by\nextending the CycleGAN model to transform images among three different\ncategories. Repeated application of the trained generators produces sequences\nof images that transition among the different categories. The generated image\nsequences occupy a more limited region of the image space compared with the\noriginal training dataset. Quantitative evaluation using precision and recall\nmetrics indicates that the generated images have high quality but reduced\ndiversity relative to the training dataset. Such successive generation\nprocesses are characterized as chaotic dynamics in terms of dynamical system\ntheory. Positive Lyapunov exponents estimated from the generated trajectories\nconfirm the presence of chaotic dynamics, with the Lyapunov dimension of the\nattractor found to be comparable to the intrinsic dimension of the training\ndata manifold. The results suggest that chaotic dynamics in the image space\ndefined by the deep generative model contribute to the diversity of the\ngenerated images, constituting a novel approach for multi-class image\ngeneration. This model can be interpreted as an extension of classical\nassociative memory to perform hetero-association among image categories.\n","authors":["Takaya Tanaka","Yutaka Yamaguti"],"pdf_url":"https://arxiv.org/pdf/2405.20717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19967v2","updated":"2024-05-31T08:54:24Z","published":"2024-05-30T11:46:42Z","title":"Improved Out-of-Scope Intent Classification with Dual Encoding and\n Threshold-based Re-Classification","summary":" Detecting out-of-scope user utterances is essential for task-oriented\ndialogues and intent classification. Current methodologies face difficulties\nwith the unpredictable distribution of outliers and often rely on assumptions\nabout data distributions. We present the Dual Encoder for Threshold-Based\nRe-Classification (DETER) to address these challenges. This end-to-end\nframework efficiently detects out-of-scope intents without requiring\nassumptions on data distributions or additional post-processing steps. The core\nof DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and\nthe Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance\nembeddings, which are classified through a branched neural architecture.\nFurther, DETER generates synthetic outliers using self-supervision and\nincorporates out-of-scope phrases from open-domain datasets. This approach\nensures a comprehensive training set for out-of-scope detection. Additionally,\na threshold-based re-classification mechanism refines the model's initial\npredictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77\ndatasets demonstrate DETER's efficacy. Our model outperforms previous\nbenchmarks, increasing up to 13% and 5% in F1 score for known and unknown\nintents on CLINC-150 and Stackoverflow, and 16% for known and 24% % for unknown\nintents on Banking77. The source code has been released at\nhttps://github.com/Hossam-Mohammed-tech/Intent_Classification_OOS.\n","authors":["Hossam M. Zawbaa","Wael Rashwan","Sourav Dutta","Haytham Assem"],"pdf_url":"https://arxiv.org/pdf/2405.19967v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20052v2","updated":"2024-05-31T08:50:25Z","published":"2024-05-30T13:38:28Z","title":"Hardware-Efficient EMG Decoding for Next-Generation Hand Prostheses","summary":" Advancements in neural engineering have enabled the development of Robotic\nProsthetic Hands (RPHs) aimed at restoring hand functionality. Current\ncommercial RPHs offer limited control through basic on/off commands. Recent\nprogresses in machine learning enable finger movement decoding with higher\ndegrees of freedom, yet the high computational complexity of such models limits\ntheir application in portable devices. Future RPH designs must balance\nportability, low power consumption, and high decoding accuracy to be practical\nfor individuals with disabilities. To this end, we introduce a novel\nattractor-based neural network to realize on-chip movement decoding for\nnext-generation portable RPHs. The proposed architecture comprises an encoder,\nan attention layer, an attractor network, and a refinement regressor. We tested\nour model on four healthy subjects and achieved a decoding accuracy of 80.3%.\nOur proposed model is over 120 and 50 times more compact compared to\nstate-of-the-art LSTM and CNN models, respectively, with comparable (or\nsuperior) decoding accuracy. Therefore, it exhibits minimal hardware complexity\nand can be effectively integrated as a System-on-Chip.\n","authors":["Mohammad Kalbasi","MohammadAli Shaeri","Vincent Alexandre Mendez","Solaiman Shokur","Silvestro Micera","Mahsa Shoaran"],"pdf_url":"https://arxiv.org/pdf/2405.20052v2.pdf","comment":"\\{copyright} 2024 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2103.03636v2","updated":"2024-05-31T08:50:18Z","published":"2021-03-05T12:44:22Z","title":"CoDeGAN: Contrastive Disentanglement for Generative Adversarial Network","summary":" Disentanglement, a critical concern in interpretable machine learning, has\nalso garnered significant attention from the computer vision community. Many\nexisting GAN-based class disentanglement (unsupervised) approaches, such as\nInfoGAN and its variants, primarily aim to maximize the mutual information (MI)\nbetween the generated image and its latent codes. However, this focus may lead\nto a tendency for the network to generate highly similar images when presented\nwith the same latent class factor, potentially resulting in mode collapse or\nmode dropping. To alleviate this problem, we propose \\texttt{CoDeGAN}\n(Contrastive Disentanglement for Generative Adversarial Networks), where we\nrelax similarity constraints for disentanglement from the image domain to the\nfeature domain. This modification not only enhances the stability of GAN\ntraining but also improves their disentangling capabilities. Moreover, we\nintegrate self-supervised pre-training into CoDeGAN to learn semantic\nrepresentations, significantly facilitating unsupervised disentanglement.\nExtensive experimental results demonstrate the superiority of our method over\nstate-of-the-art approaches across multiple benchmarks. The code is available\nat https://github.com/learninginvision/CoDeGAN.\n","authors":["Jiangwei Zhao","Zejia Liu","Xiaohan Guo","Lili Pan"],"pdf_url":"https://arxiv.org/pdf/2103.03636v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.09636v2","updated":"2024-05-31T08:50:05Z","published":"2024-04-15T10:12:33Z","title":"All-in-one simulation-based inference","summary":" Amortized Bayesian inference trains neural networks to solve stochastic\ninference problems using model simulations, thereby making it possible to\nrapidly perform Bayesian inference for any newly observed data. However,\ncurrent simulation-based amortized inference methods are simulation-hungry and\ninflexible: They require the specification of a fixed parametric prior,\nsimulator, and inference tasks ahead of time. Here, we present a new amortized\ninference method -- the Simformer -- which overcomes these limitations. By\ntraining a probabilistic diffusion model with transformer architectures, the\nSimformer outperforms current state-of-the-art amortized inference approaches\non benchmark tasks and is substantially more flexible: It can be applied to\nmodels with function-valued parameters, it can handle inference scenarios with\nmissing or unstructured data, and it can sample arbitrary conditionals of the\njoint distribution of parameters and data, including both posterior and\nlikelihood. We showcase the performance and flexibility of the Simformer on\nsimulators from ecology, epidemiology, and neuroscience, and demonstrate that\nit opens up new possibilities and application domains for amortized Bayesian\ninference on simulation-based models.\n","authors":["Manuel Gloeckler","Michael Deistler","Christian Weilbach","Frank Wood","Jakob H. Macke"],"pdf_url":"https://arxiv.org/pdf/2404.09636v2.pdf","comment":"To be published in the proceedings of the 41st International\n Conference on Machine Learning (ICML 2024), Vienna, Austria. PMLR 235, 2024"},{"id":"http://arxiv.org/abs/2306.08595v3","updated":"2024-05-31T08:39:03Z","published":"2023-06-14T15:55:19Z","title":"TensorKrowch: Smooth integration of tensor networks in machine learning","summary":" Tensor networks are factorizations of high-dimensional tensors into networks\nof smaller tensors. They have applications in physics and mathematics, and\nrecently have been proposed as promising machine learning architectures. To\nease the integration of tensor networks in machine learning pipelines, we\nintroduce TensorKrowch, an open source Python library built on top of PyTorch.\nProviding a user-friendly interface, TensorKrowch allows users to construct any\ntensor network, train it, and integrate it as a layer in more intricate deep\nlearning models. In this paper, we describe the main functionality and basic\nusage of TensorKrowch, and provide technical details on its building blocks and\nthe optimizations performed to achieve efficient operation.\n","authors":["José Ramón Pareja Monturiol","David Pérez-García","Alejandro Pozas-Kerstjens"],"pdf_url":"https://arxiv.org/pdf/2306.08595v3.pdf","comment":"20 pages, 2 figures. The TensorKrowch GitHub repository is in\n https://github.com/joserapa98/tensorkrowch and the TensorKrowch documentation\n is in https://joserapa98.github.io/tensorkrowch. V3: Accepted version,\n corrected acknowledgments"},{"id":"http://arxiv.org/abs/2405.20692v1","updated":"2024-05-31T08:38:25Z","published":"2024-05-31T08:38:25Z","title":"In-Context Decision Transformer: Reinforcement Learning via Hierarchical\n Chain-of-Thought","summary":" In-context learning is a promising approach for offline reinforcement\nlearning (RL) to handle online tasks, which can be achieved by providing task\nprompts. Recent works demonstrated that in-context RL could emerge with\nself-improvement in a trial-and-error manner when treating RL tasks as an\nacross-episodic sequential prediction problem. Despite the self-improvement not\nrequiring gradient updates, current works still suffer from high computational\ncosts when the across-episodic sequence increases with task horizons. To this\nend, we propose an In-context Decision Transformer (IDT) to achieve\nself-improvement in a high-level trial-and-error manner. Specifically, IDT is\ninspired by the efficient hierarchical structure of human decision-making and\nthus reconstructs the sequence to consist of high-level decisions instead of\nlow-level actions that interact with environments. As one high-level decision\ncan guide multi-step low-level actions, IDT naturally avoids excessively long\nsequences and solves online tasks more efficiently. Experimental results show\nthat IDT achieves state-of-the-art in long-horizon tasks over current\nin-context RL methods. In particular, the online evaluation time of our IDT is\n\\textbf{36$\\times$} times faster than baselines in the D4RL benchmark and\n\\textbf{27$\\times$} times faster in the Grid World benchmark.\n","authors":["Sili Huang","Jifeng Hu","Hechang Chen","Lichao Sun","Bo Yang"],"pdf_url":"https://arxiv.org/pdf/2405.20692v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20690v1","updated":"2024-05-31T08:35:56Z","published":"2024-05-31T08:35:56Z","title":"Unleashing the Potential of Diffusion Models for Incomplete Data\n Imputation","summary":" This paper introduces DiffPuter, an iterative method for missing data\nimputation that leverages the Expectation-Maximization (EM) algorithm and\nDiffusion Models. By treating missing data as hidden variables that can be\nupdated during model training, we frame the missing data imputation task as an\nEM problem. During the M-step, DiffPuter employs a diffusion model to learn the\njoint distribution of both the observed and currently estimated missing data.\nIn the E-step, DiffPuter re-estimates the missing data based on the conditional\nprobability given the observed data, utilizing the diffusion model learned in\nthe M-step. Starting with an initial imputation, DiffPuter alternates between\nthe M-step and E-step until convergence. Through this iterative process,\nDiffPuter progressively refines the complete data distribution, yielding\nincreasingly accurate estimations of the missing data. Our theoretical analysis\ndemonstrates that the unconditional training and conditional sampling processes\nof the diffusion model align precisely with the objectives of the M-step and\nE-step, respectively. Empirical evaluations across 10 diverse datasets and\ncomparisons with 16 different imputation methods highlight DiffPuter's superior\nperformance. Notably, DiffPuter achieves an average improvement of 8.10% in MAE\nand 5.64% in RMSE compared to the most competitive existing method.\n","authors":["Hengrui Zhang","Liancheng Fang","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2405.20690v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20687v1","updated":"2024-05-31T08:31:26Z","published":"2024-05-31T08:31:26Z","title":"Conditioning GAN Without Training Dataset","summary":" Deep learning algorithms have a large number of trainable parameters often\nwith sizes of hundreds of thousands or more. Training this algorithm requires a\nlarge amount of training data and generating a sufficiently large dataset for\nthese algorithms is costly\\cite{noguchi2019image}.\n GANs are generative neural networks that use two deep learning networks that\nare competing with each other. The networks are generator and discriminator\nnetworks. The generator tries to generate realistic images which resemble the\nactual training dataset by approximating the training data distribution and the\ndiscriminator is trained to classify images as real or\nfake(generated)\\cite{goodfellow2016nips}. Training these GAN algorithms also\nrequires a large amount of training dataset\\cite{noguchi2019image}.\n In this study, the aim is to address the question, \"Given an unconditioned\npretrained generator network and a pretrained classifier, is it feasible to\ndevelop a conditioned generator without relying on any training dataset?\"\n The paper begins with a general introduction to the problem. The subsequent\nsections are structured as follows: Section 2 provides background information\non the problem. Section 3 reviews relevant literature on the topic. Section 4\noutlines the methodology employed in this study. Section 5 presents the\nexperimental results. Section 6 discusses the findings and proposes potential\nfuture research directions. Finally, Section 7 offers concluding remarks.\n The implementation can be accessed\n\\href{https://github.com/kidist-amde/BigGAN-PyTorch}{here}.\n","authors":["Kidist Amde Mekonnen"],"pdf_url":"https://arxiv.org/pdf/2405.20687v1.pdf","comment":"5 pages, 2 figures, Part of my MSc project course, School Project\n Course 2022"},{"id":"http://arxiv.org/abs/2405.19383v2","updated":"2024-05-31T08:29:26Z","published":"2024-05-29T08:48:52Z","title":"Network Analytics for Anti-Money Laundering -- A Systematic Literature\n Review and Experimental Evaluation","summary":" Money laundering presents a pervasive challenge, burdening society by\nfinancing illegal activities. To more effectively combat and detect money\nlaundering, the use of network information is increasingly being explored,\nexploiting that money laundering necessarily involves interconnected parties.\nThis has lead to a surge in literature on network analytics (NA) for anti-money\nlaundering (AML). The literature, however, is fragmented and a comprehensive\noverview of existing work is missing. This results in limited understanding of\nthe methods that may be applied and their comparative detection power.\nTherefore, this paper presents an extensive and systematic review of the\nliterature. We identify and analyse 97 papers in the Web of Science and Scopus\ndatabases, resulting in a taxonomy of approaches following the fraud analytics\nframework of Bockel-Rickermann et al.. Moreover, this paper presents a\ncomprehensive experimental framework to evaluate and compare the performance of\nprominent NA methods in a uniform setup. The framework is applied on the\npublicly available Elliptic data set and implements manual feature engineering,\nrandom walk-based methods, and deep learning GNNs. We conclude from the results\nthat network analytics increases the predictive power of the AML model with\ngraph neural networks giving the best results. An open source implementation of\nthe experimental framework is provided to facilitate researchers and\npractitioners to extend upon these results and experiment on proprietary data.\nAs such, we aim to promote a standardised approach towards the analysis and\nevaluation of network analytics for AML.\n","authors":["Bruno Deprez","Toon Vanderschueren","Bart Baesens","Tim Verdonck","Wouter Verbeke"],"pdf_url":"https://arxiv.org/pdf/2405.19383v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20685v1","updated":"2024-05-31T08:26:53Z","published":"2024-05-31T08:26:53Z","title":"Enhancing Counterfactual Image Generation Using Mahalanobis Distance\n with Distribution Preferences in Feature Space","summary":" In the realm of Artificial Intelligence (AI), the importance of Explainable\nArtificial Intelligence (XAI) is increasingly recognized, particularly as AI\nmodels become more integral to our lives. One notable single-instance XAI\napproach is counterfactual explanation, which aids users in comprehending a\nmodel's decisions and offers guidance on altering these decisions. Specifically\nin the context of image classification models, effective image counterfactual\nexplanations can significantly enhance user understanding. This paper\nintroduces a novel method for computing feature importance within the feature\nspace of a black-box model. By employing information fusion techniques, our\nmethod maximizes the use of data to address feature counterfactual explanations\nin the feature space. Subsequently, we utilize an image generation model to\ntransform these feature counterfactual explanations into image counterfactual\nexplanations. Our experiments demonstrate that the counterfactual explanations\ngenerated by our method closely resemble the original images in both pixel and\nfeature spaces. Additionally, our method outperforms established baselines,\nachieving impressive experimental results.\n","authors":["Yukai Zhang","Ao Xu","Zihao Li","Tieru Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04928v2","updated":"2024-05-31T08:23:42Z","published":"2024-01-10T04:55:24Z","title":"Relaxed Contrastive Learning for Federated Learning","summary":" We propose a novel contrastive learning framework to effectively address the\nchallenges of data heterogeneity in federated learning. We first analyze the\ninconsistency of gradient updates across clients during local training and\nestablish its dependence on the distribution of feature representations,\nleading to the derivation of the supervised contrastive learning (SCL)\nobjective to mitigate local deviations. In addition, we show that a na\\\"ive\nadoption of SCL in federated learning leads to representation collapse,\nresulting in slow convergence and limited performance gains. To address this\nissue, we introduce a relaxed contrastive learning loss that imposes a\ndivergence penalty on excessively similar sample pairs within each class. This\nstrategy prevents collapsed representations and enhances feature\ntransferability, facilitating collaborative training and leading to significant\nperformance improvements. Our framework outperforms all existing federated\nlearning approaches by huge margins on the standard benchmarks through\nextensive experimental results.\n","authors":["Seonguk Seo","Jinkyu Kim","Geeho Kim","Bohyung Han"],"pdf_url":"https://arxiv.org/pdf/2401.04928v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20678v1","updated":"2024-05-31T08:21:11Z","published":"2024-05-31T08:21:11Z","title":"No-Regret Learning for Fair Multi-Agent Social Welfare Optimization","summary":" We consider the problem of online multi-agent Nash social welfare (NSW)\nmaximization. While previous works of Hossain et al. [2021], Jones et al.\n[2023] study similar problems in stochastic multi-agent multi-armed bandits and\nshow that $\\sqrt{T}$-regret is possible after $T$ rounds, their fairness\nmeasure is the product of all agents' rewards, instead of their NSW (that is,\ntheir geometric mean). Given the fundamental role of NSW in the fairness\nliterature, it is more than natural to ask whether no-regret fair learning with\nNSW as the objective is possible. In this work, we provide a complete answer to\nthis question in various settings. Specifically, in stochastic $N$-agent\n$K$-armed bandits, we develop an algorithm with\n$\\widetilde{\\mathcal{O}}\\left(K^{\\frac{2}{N}}T^{\\frac{N-1}{N}}\\right)$ regret\nand prove that the dependence on $T$ is tight, making it a sharp contrast to\nthe $\\sqrt{T}$-regret bounds of Hossain et al. [2021], Jones et al. [2023]. We\nthen consider a more challenging version of the problem with adversarial\nrewards. Somewhat surprisingly, despite NSW being a concave function, we prove\nthat no algorithm can achieve sublinear regret. To circumvent such negative\nresults, we further consider a setting with full-information feedback and\ndesign two algorithms with $\\sqrt{T}$-regret: the first one has no dependence\non $N$ at all and is applicable to not just NSW but a broad class of welfare\nfunctions, while the second one has better dependence on $K$ and is preferable\nwhen $N$ is small. Finally, we also show that logarithmic regret is possible\nwhenever there exists one agent who is indifferent about different arms.\n","authors":["Mengxiao Zhang","Ramiro Deo-Campo Vuong","Haipeng Luo"],"pdf_url":"https://arxiv.org/pdf/2405.20678v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20677v1","updated":"2024-05-31T08:21:09Z","published":"2024-05-31T08:21:09Z","title":"Provably Efficient Interactive-Grounded Learning with Personalized\n Reward","summary":" Interactive-Grounded Learning (IGL) [Xie et al., 2021] is a powerful\nframework in which a learner aims at maximizing unobservable rewards through\ninteracting with an environment and observing reward-dependent feedback on the\ntaken actions. To deal with personalized rewards that are ubiquitous in\napplications such as recommendation systems, Maghakian et al. [2022] study a\nversion of IGL with context-dependent feedback, but their algorithm does not\ncome with theoretical guarantees. In this work, we consider the same problem\nand provide the first provably efficient algorithms with sublinear regret under\nrealizability. Our analysis reveals that the step-function estimator of prior\nwork can deviate uncontrollably due to finite-sample effects. Our solution is a\nnovel Lipschitz reward estimator which underestimates the true reward and\nenjoys favorable generalization performances. Building on this estimator, we\npropose two algorithms, one based on explore-then-exploit and the other based\non inverse-gap weighting. We apply IGL to learning from image feedback and\nlearning from text feedback, which are reward-free settings that arise in\npractice. Experimental results showcase the importance of using our Lipschitz\nreward estimator and the overall effectiveness of our algorithms.\n","authors":["Mengxiao Zhang","Yuheng Zhang","Haipeng Luo","Paul Mineiro"],"pdf_url":"https://arxiv.org/pdf/2405.20677v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20675v1","updated":"2024-05-31T08:19:44Z","published":"2024-05-31T08:19:44Z","title":"Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling","summary":" Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of\ndeep generative models, achieving remarkable performance in image synthesis\ntasks. However, these models face challenges in terms of widespread adoption\ndue to their reliance on sequential denoising steps during sample generation.\nThis dependence leads to substantial computational requirements, making them\nunsuitable for resource-constrained or real-time processing systems. To address\nthese challenges, we propose a novel method that integrates denoising phases\ndirectly into the model's architecture, thereby reducing the need for\nresource-intensive computations. Our approach combines diffusion models with\ngenerative adversarial networks (GANs) through knowledge distillation, enabling\nmore efficient training and evaluation. By utilizing a pre-trained diffusion\nmodel as a teacher model, we train a student model through adversarial\nlearning, employing layerwise transformations for denoising and submodules for\npredicting the teacher model's output at various points in time. This\nintegration significantly reduces the number of parameters and denoising steps\nrequired, leading to improved sampling speed at test time. We validate our\nmethod with extensive experiments, demonstrating comparable performance with\nreduced computational requirements compared to existing approaches. By enabling\nthe deployment of diffusion models on resource-constrained devices, our\nresearch mitigates their computational burden and paves the way for wider\naccessibility and practical use across the research community and end-users.\n Our code is publicly available at https://github.com/kidist-amde/Adv-KD\n","authors":["Kidist Amde Mekonnen","Nicola Dall'Asen","Paolo Rota"],"pdf_url":"https://arxiv.org/pdf/2405.20675v1.pdf","comment":"7 pages, 11 figures, ELLIS Doctoral Symposium 2023 in Helsinki,\n Finland"},{"id":"http://arxiv.org/abs/2402.09050v2","updated":"2024-05-31T08:15:06Z","published":"2024-02-14T09:46:53Z","title":"End-to-End Training Induces Information Bottleneck through Layer-Role\n Differentiation: A Comparative Analysis with Layer-wise Training","summary":" End-to-end (E2E) training, optimizing the entire model through error\nbackpropagation, fundamentally supports the advancements of deep learning.\nDespite its high performance, E2E training faces the problems of memory\nconsumption, parallel computing, and discrepancy with the functionalities of\nthe actual brain. Various alternative methods have been proposed to overcome\nthese difficulties; however, no one can yet match the performance of E2E\ntraining, thereby falling short in practicality. Furthermore, there is no deep\nunderstanding regarding differences in the trained model properties beyond the\nperformance gap. In this paper, we reconsider why E2E training demonstrates a\nsuperior performance through a comparison with layer-wise training, a non-E2E\nmethod that locally sets errors. On the basis of the observation that E2E\ntraining has an advantage in propagating input information, we analyze the\ninformation plane dynamics of intermediate representations based on the\nHilbert-Schmidt independence criterion (HSIC). The results of our normalized\nHSIC value analysis reveal the E2E training ability to exhibit different\ninformation dynamics across layers, in addition to efficient information\npropagation. Furthermore, we show that this layer-role differentiation leads to\nthe final representation following the information bottleneck principle. It\nsuggests the need to consider the cooperative interactions between layers, not\njust the final layer when analyzing the information bottleneck of deep\nlearning.\n","authors":["Keitaro Sakamoto","Issei Sato"],"pdf_url":"https://arxiv.org/pdf/2402.09050v2.pdf","comment":"TMLR2024"},{"id":"http://arxiv.org/abs/2402.14991v2","updated":"2024-05-31T08:14:11Z","published":"2024-02-22T22:03:16Z","title":"Quantum Theory and Application of Contextual Optimal Transport","summary":" Optimal Transport (OT) has fueled machine learning (ML) across many domains.\nWhen paired data measurements $(\\boldsymbol{\\mu}, \\boldsymbol{\\nu})$ are\ncoupled to covariates, a challenging conditional distribution learning setting\narises. Existing approaches for learning a $\\textit{global}$ transport map\nparameterized through a potentially unseen context utilize Neural OT and\nlargely rely on Brenier's theorem. Here, we propose a first-of-its-kind quantum\ncomputing formulation for amortized optimization of contextualized\ntransportation plans. We exploit a direct link between doubly stochastic\nmatrices and unitary operators thus unravelling a natural connection between OT\nand quantum computation. We verify our method (QontOT) on synthetic and real\ndata by predicting variations in cell type distributions conditioned on drug\ndosage. Importantly we conduct a 24-qubit hardware experiment on a task\nchallenging for classical computers and report a performance that cannot be\nmatched with our classical neural OT approach. In sum, this is a first step\ntoward learning to predict contextualized transportation plans through quantum\ncomputing.\n","authors":["Nicola Mariella","Albert Akhriev","Francesco Tacchino","Christa Zoufal","Juan Carlos Gonzalez-Espitia","Benedek Harsanyi","Eugene Koskin","Ivano Tavernelli","Stefan Woerner","Marianna Rapsomaniki","Sergiy Zhuk","Jannis Born"],"pdf_url":"https://arxiv.org/pdf/2402.14991v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2405.20671v1","updated":"2024-05-31T08:13:35Z","published":"2024-05-31T08:13:35Z","title":"Position Coupling: Leveraging Task Structure for Improved Length\n Generalization of Transformers","summary":" Even for simple arithmetic tasks like integer addition, it is challenging for\nTransformers to generalize to longer sequences than those encountered during\ntraining. To tackle this problem, we propose position coupling, a simple yet\neffective method that directly embeds the structure of the tasks into the\npositional encoding of a (decoder-only) Transformer. Taking a departure from\nthe vanilla absolute position mechanism assigning unique position IDs to each\nof the tokens, we assign the same position IDs to two or more \"relevant\"\ntokens; for integer addition tasks, we regard digits of the same significance\nas in the same position. On the empirical side, we show that with the proposed\nposition coupling, a small (1-layer) Transformer trained on 1 to 30-digit\nadditions can generalize up to 200-digit additions (6.67x of the trained\nlength). On the theoretical side, we prove that a 1-layer Transformer with\ncoupled positions can solve the addition task involving exponentially many\ndigits, whereas any 1-layer Transformer without positional information cannot\nentirely solve it. We also demonstrate that position coupling can be applied to\nother algorithmic tasks such as addition with multiple summands, Nx2\nmultiplication, copy/reverse, and a two-dimensional task.\n","authors":["Hanseul Cho","Jaeyoung Cha","Pranjal Awasthi","Srinadh Bhojanapalli","Anupam Gupta","Chulhee Yun"],"pdf_url":"https://arxiv.org/pdf/2405.20671v1.pdf","comment":"73 pages, 20 figures, 90 tables"},{"id":"http://arxiv.org/abs/2405.19732v2","updated":"2024-05-31T08:13:34Z","published":"2024-05-30T06:24:14Z","title":"Two Optimizers Are Better Than One: LLM Catalyst for Enhancing\n Gradient-Based Optimization","summary":" Learning a skill generally relies on both practical experience by doer and\ninsightful high-level guidance by instructor. Will this strategy also work well\nfor solving complex non-convex optimization problems? Here, a common\ngradient-based optimizer acts like a disciplined doer, making locally optimal\nupdate at each step. Recent methods utilize large language models (LLMs) to\noptimize solutions for concrete problems by inferring from natural language\ninstructions, akin to a high-level instructor. In this paper, we show that\nthese two optimizers are complementary to each other, suggesting a\ncollaborative optimization approach. The gradient-based optimizer and LLM-based\noptimizer are combined in an interleaved manner. We instruct LLMs using task\ndescriptions and timely optimization trajectories recorded during\ngradient-based optimization. Inferred results from LLMs are used as restarting\npoints for the next stage of gradient optimization. By leveraging both the\nlocally rigorous gradient-based optimizer and the high-level deductive\nLLM-based optimizer, our combined optimization method consistently yields\nimprovements over competitive baseline prompt tuning methods. Our results\ndemonstrate the synergistic effect of conventional gradient-based optimization\nand the inference ability of LLMs. The code is released at\nhttps://github.com/guozix/LLM-catalyst.\n","authors":["Zixian Guo","Ming Liu","Zhilong Ji","Jinfeng Bai","Yiwen Guo","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2405.19732v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2002.01605v2","updated":"2024-05-31T08:11:57Z","published":"2020-02-05T02:06:56Z","title":"Exploratory Machine Learning with Unknown Unknowns","summary":" In conventional supervised learning, a training dataset is given with\nground-truth labels from a known label set, and the learned model will classify\nunseen instances to known labels. This paper studies a new problem setting in\nwhich there are unknown classes in the training data misperceived as other\nlabels, and thus their existence appears unknown from the given supervision. We\nattribute the unknown unknowns to the fact that the training dataset is badly\nadvised by the incompletely perceived label space due to the insufficient\nfeature information. To this end, we propose the exploratory machine learning,\nwhich examines and investigates training data by actively augmenting the\nfeature space to discover potentially hidden classes. Our method consists of\nthree ingredients including rejection model, feature exploration, and model\ncascade. We provide theoretical analysis to justify its superiority, and\nvalidate the effectiveness on both synthetic and real datasets.\n","authors":["Peng Zhao","Jia-Wei Shan","Yu-Jie Zhang","Zhi-Hua Zhou"],"pdf_url":"https://arxiv.org/pdf/2002.01605v2.pdf","comment":"published at Artificial Intelligence, preliminary conference version\n published at AAAI'21"},{"id":"http://arxiv.org/abs/2405.20668v1","updated":"2024-05-31T08:09:36Z","published":"2024-05-31T08:09:36Z","title":"Improving Paratope and Epitope Prediction by Multi-Modal Contrastive\n Learning and Interaction Informativeness Estimation","summary":" Accurately predicting antibody-antigen binding residues, i.e., paratopes and\nepitopes, is crucial in antibody design. However, existing methods solely focus\non uni-modal data (either sequence or structure), disregarding the\ncomplementary information present in multi-modal data, and most methods predict\nparatopes and epitopes separately, overlooking their specific spatial\ninteractions. In this paper, we propose a novel Multi-modal contrastive\nlearning and Interaction informativeness estimation-based method for Paratope\nand Epitope prediction, named MIPE, by using both sequence and structure data\nof antibodies and antigens. MIPE implements a multi-modal contrastive learning\nstrategy, which maximizes representations of binding and non-binding residues\nwithin each modality and meanwhile aligns uni-modal representations towards\neffective modal representations. To exploit the spatial interaction\ninformation, MIPE also incorporates an interaction informativeness estimation\nthat computes the estimated interaction matrices between antibodies and\nantigens, thereby approximating them to the actual ones. Extensive experiments\ndemonstrate the superiority of our method compared to baselines. Additionally,\nthe ablation studies and visualizations demonstrate the superiority of MIPE\nowing to the better representations acquired through multi-modal contrastive\nlearning and the interaction patterns comprehended by the interaction\ninformativeness estimation.\n","authors":["Zhiwei Wang","Yongkang Wang","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20668v1.pdf","comment":"This paper is accepted by IJCAI 2024"},{"id":"http://arxiv.org/abs/2405.20664v1","updated":"2024-05-31T08:03:52Z","published":"2024-05-31T08:03:52Z","title":"Weak Robust Compatibility Between Learning Algorithms and Counterfactual\n Explanation Generation Algorithms","summary":" Counterfactual explanation generation is a powerful method for Explainable\nArtificial Intelligence. It can help users understand why machine learning\nmodels make specific decisions, and how to change those decisions. Evaluating\nthe robustness of counterfactual explanation algorithms is therefore crucial.\nPrevious literature has widely studied the robustness based on the perturbation\nof input instances. However, the robustness defined from the perspective of\nperturbed instances is sometimes biased, because this definition ignores the\nimpact of learning algorithms on robustness. In this paper, we propose a more\nreasonable definition, Weak Robust Compatibility, based on the perspective of\nexplanation strength. In practice, we propose WRC-Test to help us generate more\nrobust counterfactuals. Meanwhile, we designed experiments to verify the\neffectiveness of WRC-Test. Theoretically, we introduce the concepts of PAC\nlearning theory and define the concept of PAC WRC-Approximability. Based on\nreasonable assumptions, we establish oracle inequalities about weak robustness,\nwhich gives a sufficient condition for PAC WRC-Approximability.\n","authors":["Ao Xu","Tieru Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20664v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14578v2","updated":"2024-05-31T08:01:56Z","published":"2024-05-23T13:52:36Z","title":"Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling","summary":" In current deep learning tasks, Adam style optimizers such as Adam, Adagrad,\nRMSProp, Adafactor, and Lion have been widely used as alternatives to SGD style\noptimizers. These optimizers typically update model parameters using the sign\nof gradients, resulting in more stable convergence curves. The learning rate\nand the batch size are the most critical hyperparameters for optimizers, which\nrequire careful tuning to enable effective convergence. Previous research has\nshown that the optimal learning rate increases linearly or follows similar\nrules with batch size for SGD style optimizers. However, this conclusion is not\napplicable to Adam style optimizers. In this paper, we elucidate the connection\nbetween optimal learning rates and batch sizes for Adam style optimizers\nthrough both theoretical analysis and extensive experiments. First, we raise\nthe scaling law between batch sizes and optimal learning rates in the sign of\ngradient case, in which we prove that the optimal learning rate first rises and\nthen falls as the batch size increases. Moreover, the peak value of the surge\nwill gradually move toward the larger batch size as training progresses.\nSecond, we conducted experiments on various CV and NLP tasks and verified the\ncorrectness of the scaling law.\n","authors":["Shuaipeng Li","Penghao Zhao","Hailin Zhang","Xingwu Sun","Hao Wu","Dian Jiao","Weiyan Wang","Chengjun Liu","Zheng Fang","Jinbao Xue","Yangyu Tao","Bin Cui","Di Wang"],"pdf_url":"https://arxiv.org/pdf/2405.14578v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19059v2","updated":"2024-05-31T07:45:53Z","published":"2024-05-29T13:00:10Z","title":"Robust Entropy Search for Safe Efficient Bayesian Optimization","summary":" The practical use of Bayesian Optimization (BO) in engineering applications\nimposes special requirements: high sampling efficiency on the one hand and\nfinding a robust solution on the other hand. We address the case of adversarial\nrobustness, where all parameters are controllable during the optimization\nprocess, but a subset of them is uncontrollable or even adversely perturbed at\nthe time of application. To this end, we develop an efficient information-based\nacquisition function that we call Robust Entropy Search (RES). We empirically\ndemonstrate its benefits in experiments on synthetic and real-life data. The\nresults showthat RES reliably finds robust optima, outperforming\nstate-of-the-art algorithms.\n","authors":["Dorina Weichert","Alexander Kister","Sebastian Houben","Patrick Link","Gunar Ernis"],"pdf_url":"https://arxiv.org/pdf/2405.19059v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2110.10927v4","updated":"2024-05-31T07:39:50Z","published":"2021-10-21T06:49:10Z","title":"SecureBoost+ : A High Performance Gradient Boosting Tree Framework for\n Large Scale Vertical Federated Learning","summary":" Gradient boosting decision tree (GBDT) is a widely used ensemble algorithm in\nthe industry. Its vertical federated learning version, SecureBoost, is one of\nthe most popular algorithms used in cross-silo privacy-preserving modeling. As\nthe area of privacy computation thrives in recent years, demands for\nlarge-scale and high-performance federated learning have grown dramatically in\nreal-world applications. In this paper, to fulfill these requirements, we\npropose SecureBoost+ that is both novel and improved from the prior work\nSecureBoost. SecureBoost+ integrates several ciphertext calculation\noptimizations and engineering optimizations. The experimental results\ndemonstrate that Secureboost+ has significant performance improvements on large\nand high dimensional data sets compared to SecureBoost. It makes effective and\nefficient large-scale vertical federated learning possible.\n","authors":["Weijing Chen","Guoqiang Ma","Tao Fan","Yan Kang","Qian Xu","Qiang Yang"],"pdf_url":"https://arxiv.org/pdf/2110.10927v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20652v1","updated":"2024-05-31T07:39:22Z","published":"2024-05-31T07:39:22Z","title":"Sign is Not a Remedy: Multiset-to-Multiset Message Passing for Learning\n on Heterophilic Graphs","summary":" Graph Neural Networks (GNNs) have gained significant attention as a powerful\nmodeling and inference method, especially for homophilic graph-structured data.\nTo empower GNNs in heterophilic graphs, where adjacent nodes exhibit dissimilar\nlabels or features, Signed Message Passing (SMP) has been widely adopted.\nHowever, there is a lack of theoretical and empirical analysis regarding the\nlimitations of SMP. In this work, we unveil some potential pitfalls of SMP and\ntheir remedies. We first identify two limitations of SMP: undesirable\nrepresentation update for multi-hop neighbors and vulnerability against\noversmoothing issues. To overcome these challenges, we propose a novel message\npassing function called Multiset to Multiset GNN(M2M-GNN). Our theoretical\nanalyses and extensive experiments demonstrate that M2M-GNN effectively\nalleviates the aforementioned limitations of SMP, yielding superior performance\nin comparison\n","authors":["Langzhang Liang","Sunwoo Kim","Kijung Shin","Zenglin Xu","Shirui Pan","Yuan Qi"],"pdf_url":"https://arxiv.org/pdf/2405.20652v1.pdf","comment":"Published as a conference paper at ICML 2024"},{"id":"http://arxiv.org/abs/2405.20649v1","updated":"2024-05-31T07:30:34Z","published":"2024-05-31T07:30:34Z","title":"Reward-based Input Construction for Cross-document Relation Extraction","summary":" Relation extraction (RE) is a fundamental task in natural language\nprocessing, aiming to identify relations between target entities in text. While\nmany RE methods are designed for a single sentence or document, cross-document\nRE has emerged to address relations across multiple long documents. Given the\nnature of long documents in cross-document RE, extracting document embeddings\nis challenging due to the length constraints of pre-trained language models.\nTherefore, we propose REward-based Input Construction (REIC), the first\nlearning-based sentence selector for cross-document RE. REIC extracts sentences\nbased on relational evidence, enabling the RE module to effectively infer\nrelations. Since supervision of evidence sentences is generally unavailable, we\ntrain REIC using reinforcement learning with RE prediction scores as rewards.\nExperimental results demonstrate the superiority of our method over heuristic\nmethods for different RE structures and backbones in cross-document RE. Our\ncode is publicly available at https://github.com/aailabkaist/REIC.\n","authors":["Byeonghu Na","Suhyeon Jo","Yeongmin Kim","Il-Chul Moon"],"pdf_url":"https://arxiv.org/pdf/2405.20649v1.pdf","comment":"Accepted at ACL 2024 main conference"},{"id":"http://arxiv.org/abs/2405.20648v1","updated":"2024-05-31T07:30:24Z","published":"2024-05-31T07:30:24Z","title":"Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision\n Models For Video Captioning and Summarization","summary":" Video is an increasingly prominent and information-dense medium, yet it poses\nsubstantial challenges for language models. A typical video consists of a\nsequence of shorter segments, or shots, that collectively form a coherent\nnarrative. Each shot is analogous to a word in a sentence where multiple data\nstreams of information (such as visual and auditory data) must be processed\nsimultaneously. Comprehension of the entire video requires not only\nunderstanding the visual-audio information of each shot but also requires that\nthe model links the ideas between each shot to generate a larger,\nall-encompassing story. Despite significant progress in the field, current\nworks often overlook videos' more granular shot-by-shot semantic information.\nIn this project, we propose a family of efficient large language vision models\n(LLVMs) to boost video summarization and captioning called Shotluck Holmes. By\nleveraging better pretraining and data collection strategies, we extend the\nabilities of existing small LLVMs from being able to understand a picture to\nbeing able to understand a sequence of frames. Specifically, we show that\nShotluck Holmes achieves better performance than state-of-the-art results on\nthe Shot2Story video captioning and summary task with significantly smaller and\nmore computationally efficient models.\n","authors":["Richard Luo","Austin Peng","Adithya Vasudev","Rishabh Jain"],"pdf_url":"https://arxiv.org/pdf/2405.20648v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08970v3","updated":"2024-05-31T07:29:20Z","published":"2023-06-15T09:05:36Z","title":"An Efficient and Multi-private Key Secure Aggregation for Federated\n Learning","summary":" With the emergence of privacy leaks in federated learning, secure aggregation\nprotocols that mainly adopt either homomorphic encryption or threshold secret\nsharing have been widely developed for federated learning to protect the\nprivacy of the local training data of each client. However, these existing\nprotocols suffer from many shortcomings, such as the dependence on a trusted\nthird party, the vulnerability to clients being corrupted, low efficiency, the\ntrade-off between security and fault tolerance, etc. To solve these\ndisadvantages, we propose an efficient and multi-private key secure aggregation\nscheme for federated learning. Specifically, we skillfully modify the variant\nElGamal encryption technique to achieve homomorphic addition operation, which\nhas two important advantages: 1) The server and each client can freely select\npublic and private keys without introducing a trust third party and 2) Compared\nto the variant ElGamal encryption, the plaintext space is relatively large,\nwhich is more suitable for the deep model. Besides, for the high dimensional\ndeep model parameter, we introduce a super-increasing sequence to compress\nmulti-dimensional data into 1-D, which can greatly reduce encryption and\ndecryption times as well as communication for ciphertext transmission. Detailed\nsecurity analyses show that our proposed scheme achieves the semantic security\nof both individual local gradients and the aggregated result while achieving\noptimal robustness in tolerating both client collusion and dropped clients.\nExtensive simulations demonstrate that the accuracy of our scheme is almost the\nsame as the non-private approach, while the efficiency of our scheme is much\nbetter than the state-of-the-art homomorphic encryption-based secure\naggregation schemes. More importantly, the efficiency advantages of our scheme\nwill become increasingly prominent as the number of model parameters increases.\n","authors":["Xue Yang","Zifeng Liu","Xiaohu Tang","Rongxing Lu","Bo Liu"],"pdf_url":"https://arxiv.org/pdf/2306.08970v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.12995v3","updated":"2024-05-31T07:28:40Z","published":"2024-03-05T13:35:41Z","title":"ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular\n Modeling","summary":" Protein language models have demonstrated significant potential in the field\nof protein engineering. However, current protein language models primarily\noperate at the residue scale, which limits their ability to provide information\nat the atom level. This limitation prevents us from fully exploiting the\ncapabilities of protein language models for applications involving both\nproteins and small molecules. In this paper, we propose ESM-AA (ESM All-Atom),\na novel approach that enables atom-scale and residue-scale unified molecular\nmodeling. ESM-AA achieves this by pre-training on multi-scale code-switch\nprotein sequences and utilizing a multi-scale position encoding to capture\nrelationships among residues and atoms. Experimental results indicate that\nESM-AA surpasses previous methods in protein-molecule tasks, demonstrating the\nfull utilization of protein language models. Further investigations reveal that\nthrough unified molecular modeling, ESM-AA not only gains molecular knowledge\nbut also retains its understanding of proteins. The source codes of ESM-AA are\npublicly released at https://github.com/zhengkangjie/ESM-AA.\n","authors":["Kangjie Zheng","Siyu Long","Tianyu Lu","Junwei Yang","Xinyu Dai","Ming Zhang","Zaiqing Nie","Wei-Ying Ma","Hao Zhou"],"pdf_url":"https://arxiv.org/pdf/2403.12995v3.pdf","comment":"ICML2024 camera-ready, update some experimental results, add github\n url"}],"Multimedia":[{"id":"http://arxiv.org/abs/2405.20078v2","updated":"2024-05-31T16:49:19Z","published":"2024-05-30T14:08:09Z","title":"NeRF View Synthesis: Subjective Quality Assessment and Objective Metrics\n Evaluation","summary":" Neural radiance fields (NeRF) are a groundbreaking computer vision technology\nthat enables the generation of high-quality, immersive visual content from\nmultiple viewpoints. This capability holds significant advantages for\napplications such as virtual/augmented reality, 3D modelling and content\ncreation for the film and entertainment industry. However, the evaluation of\nNeRF methods poses several challenges, including a lack of comprehensive\ndatasets, reliable assessment methodologies, and objective quality metrics.\nThis paper addresses the problem of NeRF quality assessment thoroughly, by\nconducting a rigorous subjective quality assessment test that considers several\nscene classes and recently proposed NeRF view synthesis methods. Additionally,\nthe performance of a wide range of state-of-the-art conventional and\nlearning-based full-reference 2D image and video quality assessment metrics is\nevaluated against the subjective scores of the subjective study. The\nexperimental results are analyzed in depth, providing a comparative evaluation\nof several NeRF methods and objective quality metrics, across different classes\nof visual scenes, including real and synthetic content for front-face and\n360-degree camera trajectories.\n","authors":["Pedro Martin","Antonio Rodrigues","Joao Ascenso","Maria Paula Queluz"],"pdf_url":"https://arxiv.org/pdf/2405.20078v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20687v1","updated":"2024-05-31T08:31:26Z","published":"2024-05-31T08:31:26Z","title":"Conditioning GAN Without Training Dataset","summary":" Deep learning algorithms have a large number of trainable parameters often\nwith sizes of hundreds of thousands or more. Training this algorithm requires a\nlarge amount of training data and generating a sufficiently large dataset for\nthese algorithms is costly\\cite{noguchi2019image}.\n GANs are generative neural networks that use two deep learning networks that\nare competing with each other. The networks are generator and discriminator\nnetworks. The generator tries to generate realistic images which resemble the\nactual training dataset by approximating the training data distribution and the\ndiscriminator is trained to classify images as real or\nfake(generated)\\cite{goodfellow2016nips}. Training these GAN algorithms also\nrequires a large amount of training dataset\\cite{noguchi2019image}.\n In this study, the aim is to address the question, \"Given an unconditioned\npretrained generator network and a pretrained classifier, is it feasible to\ndevelop a conditioned generator without relying on any training dataset?\"\n The paper begins with a general introduction to the problem. The subsequent\nsections are structured as follows: Section 2 provides background information\non the problem. Section 3 reviews relevant literature on the topic. Section 4\noutlines the methodology employed in this study. Section 5 presents the\nexperimental results. Section 6 discusses the findings and proposes potential\nfuture research directions. Finally, Section 7 offers concluding remarks.\n The implementation can be accessed\n\\href{https://github.com/kidist-amde/BigGAN-PyTorch}{here}.\n","authors":["Kidist Amde Mekonnen"],"pdf_url":"https://arxiv.org/pdf/2405.20687v1.pdf","comment":"5 pages, 2 figures, Part of my MSc project course, School Project\n Course 2022"},{"id":"http://arxiv.org/abs/2405.20675v1","updated":"2024-05-31T08:19:44Z","published":"2024-05-31T08:19:44Z","title":"Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling","summary":" Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of\ndeep generative models, achieving remarkable performance in image synthesis\ntasks. However, these models face challenges in terms of widespread adoption\ndue to their reliance on sequential denoising steps during sample generation.\nThis dependence leads to substantial computational requirements, making them\nunsuitable for resource-constrained or real-time processing systems. To address\nthese challenges, we propose a novel method that integrates denoising phases\ndirectly into the model's architecture, thereby reducing the need for\nresource-intensive computations. Our approach combines diffusion models with\ngenerative adversarial networks (GANs) through knowledge distillation, enabling\nmore efficient training and evaluation. By utilizing a pre-trained diffusion\nmodel as a teacher model, we train a student model through adversarial\nlearning, employing layerwise transformations for denoising and submodules for\npredicting the teacher model's output at various points in time. This\nintegration significantly reduces the number of parameters and denoising steps\nrequired, leading to improved sampling speed at test time. We validate our\nmethod with extensive experiments, demonstrating comparable performance with\nreduced computational requirements compared to existing approaches. By enabling\nthe deployment of diffusion models on resource-constrained devices, our\nresearch mitigates their computational burden and paves the way for wider\naccessibility and practical use across the research community and end-users.\n Our code is publicly available at https://github.com/kidist-amde/Adv-KD\n","authors":["Kidist Amde Mekonnen","Nicola Dall'Asen","Paolo Rota"],"pdf_url":"https://arxiv.org/pdf/2405.20675v1.pdf","comment":"7 pages, 11 figures, ELLIS Doctoral Symposium 2023 in Helsinki,\n Finland"},{"id":"http://arxiv.org/abs/2405.20606v1","updated":"2024-05-31T03:40:15Z","published":"2024-05-31T03:40:15Z","title":"Vision-Language Meets the Skeleton: Progressively Distillation with\n Cross-Modal Knowledge for 3D Action Representation Learning","summary":" Supervised and self-supervised learning are two main training paradigms for\nskeleton-based human action recognition. However, the former one-hot\nclassification requires labor-intensive predefined action categories\nannotations, while the latter involves skeleton transformations (e.g.,\ncropping) in the pretext tasks that may impair the skeleton structure. To\naddress these challenges, we introduce a novel skeleton-based training\nframework (C$^2$VL) based on Cross-modal Contrastive learning that uses the\nprogressive distillation to learn task-agnostic human skeleton action\nrepresentation from the Vision-Language knowledge prompts. Specifically, we\nestablish the vision-language action concept space through vision-language\nknowledge prompts generated by pre-trained large multimodal models (LMMs),\nwhich enrich the fine-grained details that the skeleton action space lacks.\nMoreover, we propose the intra-modal self-similarity and inter-modal\ncross-consistency softened targets in the cross-modal contrastive process to\nprogressively control and guide the degree of pulling vision-language knowledge\nprompts and corresponding skeletons closer. These soft instance discrimination\nand self-knowledge distillation strategies contribute to the learning of better\nskeleton-based action representations from the noisy skeleton-vision-language\npairs. During the inference phase, our method requires only the skeleton data\nas the input for action recognition and no longer for vision-language prompts.\nExtensive experiments show that our method achieves state-of-the-art results on\nNTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The code will be available\nin the future.\n","authors":["Yang Chen","Tian He","Junfeng Fu","Ling Wang","Jingcai Guo","Hong Cheng"],"pdf_url":"https://arxiv.org/pdf/2405.20606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00135v1","updated":"2024-05-31T18:55:10Z","published":"2024-05-31T18:55:10Z","title":"Advancing Ear Biometrics: Enhancing Accuracy and Robustness through Deep\n Learning","summary":" Biometric identification is a reliable method to verify individuals based on\ntheir unique physical or behavioral traits, offering a secure alternative to\ntraditional methods like passwords or PINs. This study focuses on ear biometric\nidentification, exploiting its distinctive features for enhanced accuracy,\nreliability, and usability. While past studies typically investigate face\nrecognition and fingerprint analysis, our research demonstrates the\neffectiveness of ear biometrics in overcoming limitations such as variations in\nfacial expressions and lighting conditions. We utilized two datasets: AMI (700\nimages from 100 individuals) and EarNV1.0 (28,412 images from 164 individuals).\nTo improve the accuracy and robustness of our ear biometric identification\nsystem, we applied various techniques including data preprocessing and\naugmentation. Our models achieved a testing accuracy of 99.35% on the AMI\nDataset and 98.1% on the EarNV1.0 dataset, showcasing the effectiveness of our\napproach in precisely identifying individuals based on ear biometric\ncharacteristics.\n","authors":["Youssef Mohamed","Zeyad Youssef","Ahmed Heakl","Ahmed Zaky"],"pdf_url":"https://arxiv.org/pdf/2406.00135v1.pdf","comment":"6 pages, 8 figures, 3 tables, International IEEE Conference on the\n Intelligent Methods, Systems, and Applications"},{"id":"http://arxiv.org/abs/2406.00093v1","updated":"2024-05-31T17:59:56Z","published":"2024-05-31T17:59:56Z","title":"Bootstrap3D: Improving 3D Content Creation with Synthetic Data","summary":" Recent years have witnessed remarkable progress in multi-view diffusion\nmodels for 3D content creation. However, there remains a significant gap in\nimage quality and prompt-following ability compared to 2D diffusion models. A\ncritical bottleneck is the scarcity of high-quality 3D assets with detailed\ncaptions. To address this challenge, we propose Bootstrap3D, a novel framework\nthat automatically generates an arbitrary quantity of multi-view images to\nassist in training multi-view diffusion models. Specifically, we introduce a\ndata generation pipeline that employs (1) 2D and video diffusion models to\ngenerate multi-view images based on constructed text prompts, and (2) our\nfine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting\ninaccurate captions. Leveraging this pipeline, we have generated 1 million\nhigh-quality synthetic multi-view images with dense descriptive captions to\naddress the shortage of high-quality 3D data. Furthermore, we present a\nTraining Timestep Reschedule (TTR) strategy that leverages the denoising\nprocess to learn multi-view consistency while maintaining the original 2D\ndiffusion prior. Extensive experiments demonstrate that Bootstrap3D can\ngenerate high-quality multi-view images with superior aesthetic quality,\nimage-text alignment, and maintained view consistency.\n","authors":["Zeyi Sun","Tong Wu","Pan Zhang","Yuhang Zang","Xiaoyi Dong","Yuanjun Xiong","Dahua Lin","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2406.00093v1.pdf","comment":"Project Page: https://sunzey.github.io/Bootstrap3D/"}]},"2024-06-03T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2404.14387v2","updated":"2024-06-03T17:47:30Z","published":"2024-04-22T17:43:23Z","title":"A Survey on Self-Evolution of Large Language Models","summary":" Large language models (LLMs) have significantly advanced in various fields\nand intelligent agent applications. However, current LLMs that learn from human\nor external model supervision are costly and may face performance ceilings as\ntask complexity and diversity increase. To address this issue, self-evolution\napproaches that enable LLM to autonomously acquire, refine, and learn from\nexperiences generated by the model itself are rapidly growing. This new\ntraining paradigm inspired by the human experiential learning process offers\nthe potential to scale LLMs towards superintelligence. In this work, we present\na comprehensive survey of self-evolution approaches in LLMs. We first propose a\nconceptual framework for self-evolution and outline the evolving process as\niterative cycles composed of four phases: experience acquisition, experience\nrefinement, updating, and evaluation. Second, we categorize the evolution\nobjectives of LLMs and LLM-based agents; then, we summarize the literature and\nprovide taxonomy and insights for each module. Lastly, we pinpoint existing\nchallenges and propose future directions to improve self-evolution frameworks,\nequipping researchers with critical insights to fast-track the development of\nself-evolving LLMs. Our corresponding GitHub repository is available at\nhttps://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/Awesome-Self-Evolution-of-LLM\n","authors":["Zhengwei Tao","Ting-En Lin","Xiancai Chen","Hangyu Li","Yuchuan Wu","Yongbin Li","Zhi Jin","Fei Huang","Dacheng Tao","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2404.14387v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.09808v3","updated":"2024-06-03T17:43:23Z","published":"2023-11-16T11:32:47Z","title":"PixT3: Pixel-based Table-To-Text Generation","summary":" Table-to-text generation involves generating appropriate textual descriptions\ngiven structured tabular data. It has attracted increasing attention in recent\nyears thanks to the popularity of neural network models and the availability of\nlarge-scale datasets. A common feature across existing methods is their\ntreatment of the input as a string, i.e., by employing linearization techniques\nthat do not always preserve information in the table, are verbose, and lack\nspace efficiency. We propose to rethink data-to-text generation as a visual\nrecognition task, removing the need for rendering the input in a string format.\nWe present PixT3, a multimodal table-to-text model that overcomes the\nchallenges of linearization and input size limitations encountered by existing\nmodels. PixT3 is trained with a new self-supervised learning objective to\nreinforce table structure awareness and is applicable to open-ended and\ncontrolled generation settings. Experiments on the ToTTo and Logic2Text\nbenchmarks show that PixT3 is competitive and, in some settings, superior to\ngenerators that operate solely on text.\n","authors":["Iñigo Alonso","Eneko Agirre","Mirella Lapata"],"pdf_url":"https://arxiv.org/pdf/2311.09808v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.17505v3","updated":"2024-06-03T17:35:04Z","published":"2024-01-30T23:46:35Z","title":"Arrows of Time for Large Language Models","summary":" We study the probabilistic modeling performed by Autoregressive Large\nLanguage Models (LLMs) through the angle of time directionality, addressing a\nquestion first raised in (Shannon, 1951). For large enough models, we\nempirically find a time asymmetry in their ability to learn natural language: a\ndifference in the average log-perplexity when trying to predict the next token\nversus when trying to predict the previous one. This difference is at the same\ntime subtle and very consistent across various modalities (language, model\nsize, training time, ...). Theoretically, this is surprising: from an\ninformation-theoretic point of view, there should be no such difference. We\nprovide a theoretical framework to explain how such an asymmetry can appear\nfrom sparsity and computational complexity considerations, and outline a number\nof perspectives opened by our results.\n","authors":["Vassilis Papadopoulos","Jérémie Wenger","Clément Hongler"],"pdf_url":"https://arxiv.org/pdf/2401.17505v3.pdf","comment":"Re-arranged and updated figures. Added experiments. 12 figures, 20\n pages"},{"id":"http://arxiv.org/abs/2403.17846v2","updated":"2024-06-03T17:12:25Z","published":"2024-03-26T16:36:43Z","title":"Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot\n Navigation","summary":" Recent open-vocabulary robot mapping methods enrich dense geometric maps with\npre-trained visual-language features. While these maps allow for the prediction\nof point-wise saliency maps when queried for a certain language concept,\nlarge-scale environments and abstract queries beyond the object level still\npose a considerable hurdle, ultimately limiting language-grounded robotic\nnavigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D\nscene graph mapping approach for language-grounded robot navigation. Leveraging\nopen-vocabulary vision foundation models, we first obtain state-of-the-art\nopen-vocabulary segment-level maps in 3D and subsequently construct a 3D scene\ngraph hierarchy consisting of floor, room, and object concepts, each enriched\nwith open-vocabulary features. Our approach is able to represent multi-story\nbuildings and allows robotic traversal of those using a cross-floor Voronoi\ngraph. HOV-SG is evaluated on three distinct datasets and surpasses previous\nbaselines in open-vocabulary semantic accuracy on the object, room, and floor\nlevel while producing a 75% reduction in representation size compared to dense\nopen-vocabulary maps. In order to prove the efficacy and generalization\ncapabilities of HOV-SG, we showcase successful long-horizon\nlanguage-conditioned robot navigation within real-world multi-storage\nenvironments. We provide code and trial video data at http://hovsg.github.io/.\n","authors":["Abdelrhman Werby","Chenguang Huang","Martin Büchner","Abhinav Valada","Wolfram Burgard"],"pdf_url":"https://arxiv.org/pdf/2403.17846v2.pdf","comment":"Code and video are available at http://hovsg.github.io/"},{"id":"http://arxiv.org/abs/2401.04854v3","updated":"2024-06-03T17:01:06Z","published":"2024-01-10T00:05:45Z","title":"Are Language Models More Like Libraries or Like Librarians?\n Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs","summary":" Are LLMs cultural technologies like photocopiers or printing presses, which\ntransmit information but cannot create new content? A challenge for this idea,\nwhich we call bibliotechnism, is that LLMs generate novel text. We begin with a\ndefense of bibliotechnism, showing how even novel text may inherit its meaning\nfrom original human-generated text. We then argue that bibliotechnism faces an\nindependent challenge from examples in which LLMs generate novel reference,\nusing new names to refer to new entities. Such examples could be explained if\nLLMs were not cultural technologies but had beliefs, desires, and intentions.\nAccording to interpretationism in the philosophy of mind, a system has such\nattitudes if and only if its behavior is well explained by the hypothesis that\nit does. Interpretationists may hold that LLMs have attitudes, and thus have a\nsimple solution to the novel reference problem. We emphasize, however, that\ninterpretationism is compatible with very simple creatures having attitudes and\ndiffers sharply from views that presuppose these attitudes require\nconsciousness, sentience, or intelligence (topics about which we make no\nclaims).\n","authors":["Harvey Lederman","Kyle Mahowald"],"pdf_url":"https://arxiv.org/pdf/2401.04854v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14867v2","updated":"2024-06-03T16:59:20Z","published":"2023-12-22T17:45:19Z","title":"VIEScore: Towards Explainable Metrics for Conditional Image Synthesis\n Evaluation","summary":" In the rapidly advancing field of conditional image generation research,\nchallenges such as limited explainability lie in effectively evaluating the\nperformance and capabilities of various models. This paper introduces VIEScore,\na Visual Instruction-guided Explainable metric for evaluating any conditional\nimage generation tasks. VIEScore leverages general knowledge from Multimodal\nLarge Language Models (MLLMs) as the backbone and does not require training or\nfine-tuning. We evaluate VIEScore on seven prominent tasks in conditional image\ntasks and found: (1) VIEScore (GPT4-o) achieves a high Spearman correlation of\n0.4 with human evaluations, while the human-to-human correlation is 0.45. (2)\nVIEScore (with open-source MLLM) is significantly weaker than GPT-4o and GPT-4v\nin evaluating synthetic images. (3) VIEScore achieves a correlation on par with\nhuman ratings in the generation tasks but struggles in editing tasks. With\nthese results, we believe VIEScore shows its great potential to replace human\njudges in evaluating image synthesis tasks.\n","authors":["Max Ku","Dongfu Jiang","Cong Wei","Xiang Yue","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14867v2.pdf","comment":"Accepted to ACL2024 main"},{"id":"http://arxiv.org/abs/2403.01748v3","updated":"2024-06-03T16:58:04Z","published":"2024-03-04T05:55:01Z","title":"NeuSpeech: Decode Neural signal as Speech","summary":" Decoding language from brain dynamics is an important open direction in the\nrealm of brain-computer interface (BCI), especially considering the rapid\ngrowth of large language models. Compared to invasive-based signals which\nrequire electrode implantation surgery, non-invasive neural signals (e.g. EEG,\nMEG) have attracted increasing attention considering their safety and\ngenerality. However, the exploration is not adequate in three aspects: 1)\nprevious methods mainly focus on EEG but none of the previous works address\nthis problem on MEG with better signal quality; 2) prior works have\npredominantly used $``teacher-forcing\"$ during generative decoding, which is\nimpractical; 3) prior works are mostly $``BART-based\"$ not fully\nauto-regressive, which performs better in other sequence tasks. In this paper,\nwe explore the brain-to-text translation of MEG signals in a speech-decoding\nformation. Here we are the first to investigate a cross-attention-based\n``whisper\" model for generating text directly from MEG signals without teacher\nforcing. Our model achieves impressive BLEU-1 scores of 60.30 and 52.89 without\npretraining $\\&$ teacher-forcing on two major datasets ($\\textit{GWilliams}$\nand $\\textit{Schoffelen}$). This paper conducts a comprehensive review to\nunderstand how speech decoding formation performs on the neural decoding tasks,\nincluding pretraining initialization, training $\\&$ evaluation set splitting,\naugmentation, and scaling law. Code is available at\nhttps://github.com/NeuSpeech/NeuSpeech1$.\n","authors":["Yiqian Yang","Yiqun Duan","Qiang Zhang","Hyejeong Jo","Jinni Zhou","Won Hee Lee","Renjing Xu","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2403.01748v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08277v5","updated":"2024-06-03T16:48:59Z","published":"2024-02-13T08:12:48Z","title":"Towards Faithful and Robust LLM Specialists for Evidence-Based\n Question-Answering","summary":" Advances towards more faithful and traceable answers of Large Language Models\n(LLMs) are crucial for various research and practical endeavors. One avenue in\nreaching this goal is basing the answers on reliable sources. However, this\nEvidence-Based QA has proven to work insufficiently with LLMs in terms of\nciting the correct sources (source quality) and truthfully representing the\ninformation within sources (answer attributability). In this work, we\nsystematically investigate how to robustly fine-tune LLMs for better source\nquality and answer attributability. Specifically, we introduce a data\ngeneration pipeline with automated data quality filters, which can synthesize\ndiversified high-quality training and testing data at scale. We further\nintroduce four test sets to benchmark the robustness of fine-tuned specialist\nmodels. Extensive evaluation shows that fine-tuning on synthetic data improves\nperformance on both in- and out-of-distribution. Furthermore, we show that data\nquality, which can be drastically improved by proposed quality filters, matters\nmore than quantity in improving Evidence-Based QA.\n","authors":["Tobias Schimanski","Jingwei Ni","Mathias Kraus","Elliott Ash","Markus Leippold"],"pdf_url":"https://arxiv.org/pdf/2402.08277v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14555v4","updated":"2024-06-03T16:43:16Z","published":"2024-05-23T13:35:34Z","title":"Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating\n Representative and Affinity Bias in Large Language Models","summary":" Research on Large Language Models (LLMs) has often neglected subtle biases\nthat, although less apparent, can significantly influence the models' outputs\ntoward particular social narratives. This study addresses two such biases\nwithin LLMs: representative bias, which denotes a tendency of LLMs to generate\noutputs that mirror the experiences of certain identity groups, and affinity\nbias, reflecting the models' evaluative preferences for specific narratives or\nviewpoints. We introduce two novel metrics to measure these biases: the\nRepresentative Bias Score (RBS) and the Affinity Bias Score (ABS), and present\nthe Creativity-Oriented Generation Suite (CoGS), a collection of open-ended\ntasks such as short story writing and poetry composition, designed with\ncustomized rubrics to detect these subtle biases. Our analysis uncovers marked\nrepresentative biases in prominent LLMs, with a preference for identities\nassociated with being white, straight, and men. Furthermore, our investigation\nof affinity bias reveals distinctive evaluative patterns within each model,\nakin to `bias fingerprints'. This trend is also seen in human evaluators,\nhighlighting a complex interplay between human and machine bias perceptions.\n","authors":["Abhishek Kumar","Sarfaroz Yunusov","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2405.14555v4.pdf","comment":"9 pages (excluding references), accepted to ACL 2024 Main Conference"},{"id":"http://arxiv.org/abs/2405.16277v3","updated":"2024-06-03T16:42:55Z","published":"2024-05-25T15:28:22Z","title":"Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge","summary":" Large Language Models (LLMs) have demonstrated remarkable success in tasks\nlike the Winograd Schema Challenge (WSC), showcasing advanced textual\ncommon-sense reasoning. However, applying this reasoning to multimodal domains,\nwhere understanding text and images together is essential, remains a\nsubstantial challenge. To address this, we introduce WinoVis, a novel dataset\nspecifically designed to probe text-to-image models on pronoun disambiguation\nwithin multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion\nAttentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel\nevaluation framework that isolates the models' ability in pronoun\ndisambiguation from other visual processing challenges. Evaluation of\nsuccessive model versions reveals that, despite incremental advancements,\nStable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally\nsurpassing random guessing. Further error analysis identifies important areas\nfor future research aimed at advancing text-to-image models in their ability to\ninterpret and interact with the complex visual world.\n","authors":["Brendan Park","Madeline Janecek","Naser Ezzati-Jivan","Yifeng Li","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2405.16277v3.pdf","comment":"9 pages (excluding references), accepted to ACL 2024 Main Conference"},{"id":"http://arxiv.org/abs/2405.16282v3","updated":"2024-06-03T16:41:53Z","published":"2024-05-25T15:42:04Z","title":"Confidence Under the Hood: An Investigation into the\n Confidence-Probability Alignment in Large Language Models","summary":" As the use of Large Language Models (LLMs) becomes more widespread,\nunderstanding their self-evaluation of confidence in generated responses\nbecomes increasingly important as it is integral to the reliability of the\noutput of these models. We introduce the concept of Confidence-Probability\nAlignment, that connects an LLM's internal confidence, quantified by token\nprobabilities, to the confidence conveyed in the model's response when\nexplicitly asked about its certainty. Using various datasets and prompting\ntechniques that encourage model introspection, we probe the alignment between\nmodels' internal and expressed confidence. These techniques encompass using\nstructured evaluation scales to rate confidence, including answer options when\nprompting, and eliciting the model's confidence level for outputs it does not\nrecognize as its own. Notably, among the models analyzed, OpenAI's GPT-4 showed\nthe strongest confidence-probability alignment, with an average Spearman's\n$\\hat{\\rho}$ of 0.42, across a wide range of tasks. Our work contributes to the\nongoing efforts to facilitate risk assessment in the application of LLMs and to\nfurther our understanding of model trustworthiness.\n","authors":["Abhishek Kumar","Robert Morabito","Sanzhar Umbet","Jad Kabbara","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2405.16282v3.pdf","comment":"9 pages (excluding references), accepted to ACL 2024 Main Conference"},{"id":"http://arxiv.org/abs/2307.02863v5","updated":"2024-06-03T16:32:02Z","published":"2023-07-06T09:03:10Z","title":"ValiTex -- a unified validation framework for computational text-based\n measures of social constructs","summary":" Guidance on how to validate computational text-based measures of social\nconstructs is fragmented. While researchers generally acknowledge the\nimportance of validating text-based measures, they often lack a shared\nvocabulary and a unified framework to do so. This paper introduces ValiText, a\nnew validation framework designed to assist scholars in validly measuring\nsocial constructs in textual data. The framework is built on a conceptual\nfoundation of validity in the social sciences, strengthened by an empirical\nreview of validation practices in the social sciences and consultations with\nexperts. Ultimately, ValiText prescribes researchers to demonstrate three types\nof validation evidence: substantive evidence (outlining the theoretical\nunderpinning of the measure), structural evidence (examining the properties of\nthe text model and its output) and external evidence (testing for how the\nmeasure relates to independent information). The framework is further\nsupplemented by a checklist of validation steps, offering practical guidance in\nthe form of documentation sheets that guide researchers in the validation\nprocess.\n","authors":["Lukas Birkenmaier","Claudia Wagner","Clemens Lechner"],"pdf_url":"https://arxiv.org/pdf/2307.02863v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.06270v3","updated":"2024-06-03T16:23:28Z","published":"2024-05-10T06:52:44Z","title":"XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced\n In-Context Learning in Healthcare","summary":" The integration of Large Language Models (LLMs) into healthcare diagnostics\noffers a promising avenue for clinical decision-making. This study outlines the\ndevelopment of a novel method for zero-shot/few-shot in-context learning (ICL)\nby integrating medical domain knowledge using a multi-layered structured\nprompt. We also explore the efficacy of two communication styles between the\nuser and LLMs: the Numerical Conversational (NC) style, which processes data\nincrementally, and the Natural Language Single-Turn (NL-ST) style, which\nemploys long narrative prompts.\n Our study systematically evaluates the diagnostic accuracy and risk factors,\nincluding gender bias and false negative rates, using a dataset of 920 patient\nrecords in various few-shot scenarios. Results indicate that traditional\nclinical machine learning (ML) models generally outperform LLMs in zero-shot\nand few-shot settings. However, the performance gap narrows significantly when\nemploying few-shot examples alongside effective explainable AI (XAI) methods as\nsources of domain knowledge. Moreover, with sufficient time and an increased\nnumber of examples, the conversational style (NC) nearly matches the\nperformance of ML models. Most notably, LLMs demonstrate comparable or superior\ncost-sensitive accuracy relative to ML models.\n This research confirms that, with appropriate domain knowledge and tailored\ncommunication strategies, LLMs can significantly enhance diagnostic processes.\nThe findings highlight the importance of optimizing the number of training\nexamples and communication styles to improve accuracy and reduce biases in LLM\napplications.\n","authors":["Fatemeh Nazary","Yashar Deldjoo","Tommaso Di Noia","Eugenio di Sciascio"],"pdf_url":"https://arxiv.org/pdf/2405.06270v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.19334v2","updated":"2024-06-03T16:19:03Z","published":"2024-02-29T16:37:08Z","title":"Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge","summary":" The democratization of pre-trained language models through open-source\ninitiatives has rapidly advanced innovation and expanded access to cutting-edge\ntechnologies. However, this openness also brings significant security risks,\nincluding backdoor attacks, where hidden malicious behaviors are triggered by\nspecific inputs, compromising natural language processing (NLP) system\nintegrity and reliability. This paper suggests that merging a backdoored model\nwith other homogeneous models can significantly remediate backdoor\nvulnerabilities even if such models are not entirely secure. In our\nexperiments, we verify our hypothesis on various models (BERT-Base,\nRoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News,\nand QNLI). Compared to multiple advanced defensive approaches, our method\noffers an effective and efficient inference-stage defense against backdoor\nattacks on classification and instruction-tuned tasks without additional\nresources or specific knowledge. Our approach consistently outperforms recent\nadvanced baselines, leading to an average of about 75% reduction in the attack\nsuccess rate. Since model merging has been an established approach for\nimproving model performance, the extra advantage it provides regarding defense\ncan be seen as a cost-free bonus.\n","authors":["Ansh Arora","Xuanli He","Maximilian Mozes","Srinibas Swain","Mark Dras","Qiongkai Xu"],"pdf_url":"https://arxiv.org/pdf/2402.19334v2.pdf","comment":"accepted to ACL2024 (Findings)"},{"id":"http://arxiv.org/abs/2405.19701v2","updated":"2024-06-03T15:59:34Z","published":"2024-05-30T05:26:57Z","title":"Significance of Chain of Thought in Gender Bias Mitigation for\n English-Dravidian Machine Translation","summary":" Gender bias in machine translation (MT) sys- tems poses a significant\nchallenge to achieving accurate and inclusive translations. This paper examines\ngender bias in machine translation systems for languages such as Telugu and\nKan- nada from the Dravidian family, analyzing how gender inflections affect\ntranslation accuracy and neutrality using Google Translate and Chat- GPT. It\nfinds that while plural forms can reduce bias, individual-centric sentences\noften main- tain the bias due to historical stereotypes. The study evaluates\nthe Chain of Thought process- ing, noting significant bias mitigation from 80%\nto 4% in Telugu and from 40% to 0% in Kan- nada. It also compares Telugu and\nKannada translations, emphasizing the need for language specific strategies to\naddress these challenges and suggesting directions for future research to\nenhance fairness in both data preparation and prompts during inference.\n","authors":["Lavanya Prahallad","Radhika Mamidi"],"pdf_url":"https://arxiv.org/pdf/2405.19701v2.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2405.19266v2","updated":"2024-06-03T15:27:10Z","published":"2024-05-29T16:59:38Z","title":"PediatricsGPT: Large Language Models as Chinese Medical Assistants for\n Pediatric Applications","summary":" Developing intelligent pediatric consultation systems offers promising\nprospects for improving diagnostic efficiency, especially in China, where\nhealthcare resources are scarce. Despite recent advances in Large Language\nModels (LLMs) for Chinese medicine, their performance is sub-optimal in\npediatric applications due to inadequate instruction data and vulnerable\ntraining procedures. To address the above issues, this paper builds PedCorpus,\na high-quality dataset of over 300,000 multi-task instructions from pediatric\ntextbooks, guidelines, and knowledge graph resources to fulfil diverse\ndiagnostic demands. Upon well-designed PedCorpus, we propose PediatricsGPT, the\nfirst Chinese pediatric LLM assistant built on a systematic and robust training\npipeline. In the continuous pre-training phase, we introduce a hybrid\ninstruction pre-training mechanism to mitigate the internal-injected knowledge\ninconsistency of LLMs for medical domain adaptation. Immediately, the\nfull-parameter Supervised Fine-Tuning (SFT) is utilized to incorporate the\ngeneral medical knowledge schema into the models. After that, we devise a\ndirect following preference optimization to enhance the generation of\npediatrician-like humanistic responses. In the parameter-efficient secondary\nSFT phase, a mixture of universal-specific experts strategy is presented to\nresolve the competency conflict between medical generalist and pediatric\nexpertise mastery. Extensive results based on the metrics, GPT-4, and doctor\nevaluations on distinct doctor downstream tasks show that PediatricsGPT\nconsistently outperforms previous Chinese medical LLMs. Our model and dataset\nwill be open-source for community development.\n","authors":["Dingkang Yang","Jinjie Wei","Dongling Xiao","Shunli Wang","Tong Wu","Gang Li","Mingcheng Li","Shuaibing Wang","Jiawei Chen","Yue Jiang","Qingyao Xu","Ke Li","Peng Zhai","Lihua Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.19266v2.pdf","comment":"A Technical Report on a Chinese Medical Large Language Model"},{"id":"http://arxiv.org/abs/2405.20962v2","updated":"2024-06-03T15:10:53Z","published":"2024-05-31T16:07:33Z","title":"Large Language Models are Zero-Shot Next Location Predictors","summary":" Predicting the locations an individual will visit in the future is crucial\nfor solving many societal issues like disease diffusion and reduction of\npollution among many others. The models designed to tackle next-location\nprediction, however, require a significant amount of individual-level\ninformation to be trained effectively. Such data may be scarce or even\nunavailable in some geographic regions or peculiar scenarios (e.g., cold-start\nin recommendation systems). Moreover, the design of a next-location predictor\nable to generalize or geographically transfer knowledge is still an open\nresearch challenge. Recent advances in natural language processing have led to\na rapid diffusion of Large Language Models (LLMs) which have shown good\ngeneralization and reasoning capabilities. These insights, coupled with the\nrecent findings that LLMs are rich in geographical knowledge, allowed us to\nbelieve that these models can act as zero-shot next-location predictors. This\npaper evaluates the capabilities of many popular LLMs in this role,\nspecifically Llama, GPT-3.5 and Mistral 7B. After designing a proper prompt, we\ntested the models on three real-world mobility datasets. The results show that\nLLMs can obtain accuracies up to 32.4%, a significant relative improvement of\nover 600% when compared to sophisticated DL models specifically designed for\nhuman mobility. Moreover, we show that other LLMs are unable to perform the\ntask properly. To prevent positively biased results, we also propose a\nframework inspired by other studies to test data contamination. Finally, we\nexplored the possibility of using LLMs as text-based explainers for\nnext-location prediction showing that can effectively provide an explanation\nfor their decision. Notably, 7B models provide more generic, but still\nreliable, explanations compared to larger counterparts. Code:\ngithub.com/ssai-trento/LLM-zero-shot-NL\n","authors":["Ciro Beneduce","Bruno Lepri","Massimiliano Luca"],"pdf_url":"https://arxiv.org/pdf/2405.20962v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03407v2","updated":"2024-06-03T15:00:47Z","published":"2024-03-06T02:23:32Z","title":"Human vs. Machine: Behavioral Differences Between Expert Humans and\n Language Models in Wargame Simulations","summary":" To some, the advent of artificial intelligence (AI) promises better\ndecision-making and increased military effectiveness while reducing the\ninfluence of human error and emotions. However, there is still debate about how\nAI systems, especially large language models (LLMs), behave compared to humans\nin high-stakes military decision-making scenarios with the potential for\nincreased risks towards escalation and unnecessary conflicts. To test this\npotential and scrutinize the use of LLMs for such purposes, we use a new\nwargame experiment with 107 national security experts designed to look at\ncrisis escalation in a fictional US-China scenario and compare human players to\nLLM-simulated responses in separate simulations. Wargames have a long history\nin the development of military strategy and the response of nations to threats\nor attacks. Here, we show a considerable high-level agreement in the LLM and\nhuman responses and significant quantitative and qualitative differences in\nindividual actions and strategic tendencies. These differences depend on\nintrinsic biases in LLMs regarding the appropriate level of violence following\nstrategic instructions, the choice of LLM, and whether the LLMs are tasked to\ndecide for a team of players directly or first to simulate dialog between\nplayers. When simulating the dialog, the discussions lack quality and maintain\na farcical harmony. The LLM simulations cannot account for human player\ncharacteristics, showing no significant difference even for extreme traits,\nsuch as \"pacifist\" or \"aggressive sociopath\". Our results motivate policymakers\nto be cautious before granting autonomy or following AI-based strategy\nrecommendations.\n","authors":["Max Lamparth","Anthony Corso","Jacob Ganz","Oriana Skylar Mastro","Jacquelyn Schneider","Harold Trinkunas"],"pdf_url":"https://arxiv.org/pdf/2403.03407v2.pdf","comment":"Updated with new plot and more details"},{"id":"http://arxiv.org/abs/2306.06427v3","updated":"2024-06-03T14:59:11Z","published":"2023-06-10T12:42:36Z","title":"Boosting Language Models Reasoning with Chain-of-Knowledge Prompting","summary":" Recently, Chain-of-Thought (CoT) prompting has delivered success on complex\nreasoning tasks, which aims at designing a simple prompt like ``Let's think\nstep by step'' or multiple in-context exemplars with well-designed rationales\nto elicit Large Language Models (LLMs) to generate intermediate reasoning\nsteps. However, the generated rationales often come with mistakes, making\nunfactual and unfaithful reasoning chains. To mitigate this brittleness, we\npropose a novel Chain-of-Knowledge (CoK) prompting, where we aim at eliciting\nLLMs to generate explicit pieces of knowledge evidence in the form of structure\ntriple. This is inspired by our human behaviors, i.e., we can draw a mind map\nor knowledge map as the reasoning evidence in the brain before answering a\ncomplex question. Benefiting from CoK, we additionally introduce a\nF^2-Verification method to estimate the reliability of the reasoning chains in\nterms of factuality and faithfulness. For the unreliable response, the wrong\nevidence can be indicated to prompt the LLM to rethink. Extensive experiments\ndemonstrate that our method can further improve the performance of commonsense,\nfactual, symbolic, and arithmetic reasoning tasks.\n","authors":["Jianing Wang","Qiushi Sun","Xiang Li","Ming Gao"],"pdf_url":"https://arxiv.org/pdf/2306.06427v3.pdf","comment":"ACL 2024"},{"id":"http://arxiv.org/abs/2403.15447v2","updated":"2024-06-03T14:49:00Z","published":"2024-03-18T01:38:19Z","title":"Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient\n LLMs Under Compression","summary":" Compressing high-capability Large Language Models (LLMs) has emerged as a\nfavored strategy for resource-efficient inferences. While state-of-the-art\n(SoTA) compression methods boast impressive advancements in preserving benign\ntask performance, the potential risks of compression in terms of safety and\ntrustworthiness have been largely neglected. This study conducts the first,\nthorough evaluation of three (3) leading LLMs using five (5) SoTA compression\ntechniques across eight (8) trustworthiness dimensions. Our experiments\nhighlight the intricate interplay between compression and trustworthiness,\nrevealing some interesting patterns. We find that quantization is currently a\nmore effective approach than pruning in achieving efficiency and\ntrustworthiness simultaneously. For instance, a 4-bit quantized model retains\nthe trustworthiness of its original counterpart, but model pruning\nsignificantly degrades trustworthiness, even at 50% sparsity. Moreover,\nemploying quantization within a moderate bit range could unexpectedly improve\ncertain trustworthiness dimensions such as ethics and fairness. Conversely,\nextreme quantization to very low bit levels (3 bits) tends to reduce\ntrustworthiness significantly. This increased risk cannot be uncovered by\nlooking at benign performance alone, in turn, mandating comprehensive\ntrustworthiness evaluation in practice. These findings culminate in practical\nrecommendations for simultaneously achieving high utility, efficiency, and\ntrustworthiness in LLMs. Code and models are available at\nhttps://decoding-comp-trust.github.io.\n","authors":["Junyuan Hong","Jinhao Duan","Chenhui Zhang","Zhangheng Li","Chulin Xie","Kelsey Lieberman","James Diffenderfer","Brian Bartoldson","Ajay Jaiswal","Kaidi Xu","Bhavya Kailkhura","Dan Hendrycks","Dawn Song","Zhangyang Wang","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2403.15447v2.pdf","comment":"Accepted to ICML'24"},{"id":"http://arxiv.org/abs/2402.02416v3","updated":"2024-06-03T14:33:45Z","published":"2024-02-04T09:24:51Z","title":"Aligner: Efficient Alignment by Learning to Correct","summary":" With the rapid development of large language models (LLMs) and ever-evolving\npractical requirements, finding an efficient and effective alignment method has\nnever been more critical. However, the tension between the complexity of\ncurrent alignment methods and the need for rapid iteration in deployment\nscenarios necessitates the development of a model-agnostic alignment approach\nthat can operate under these constraints. In this paper, we introduce Aligner,\na novel and simple alignment paradigm that learns the correctional residuals\nbetween preferred and dispreferred answers using a small model. Designed as a\nmodel-agnostic, plug-and-play module, Aligner can be directly applied to\nvarious open-source and API-based models with only one-off training, making it\nsuitable for rapid iteration. Notably, Aligner can be applied to any powerful,\nlarge-scale upstream models. Moreover, it can even iteratively bootstrap the\nupstream models using corrected responses as synthetic human preference data,\nbreaking through the model's performance ceiling. Our experiments demonstrate\nperformance improvements by deploying the same Aligner model across 11\ndifferent LLMs, evaluated on the 3H dimensions (helpfulness, harmlessness, and\nhonesty). Specifically, Aligner-7B has achieved an average improvement of\n68.9\\% in helpfulness and 23.8\\% in harmlessness across the tested LLMs while\nalso effectively reducing hallucination. In the Alpaca-Eval leaderboard,\nstacking Aligner-2B on GPT-4 Turbo improved its LC Win Rate from 55.0\\% to\n58.3\\%, surpassing GPT-4 Omni's 57.5\\% Win Rate (community report).\n","authors":["Jiaming Ji","Boyuan Chen","Hantao Lou","Donghai Hong","Borong Zhang","Xuehai Pan","Juntao Dai","Tianyi Qiu","Yaodong Yang"],"pdf_url":"https://arxiv.org/pdf/2402.02416v3.pdf","comment":"29 pages"},{"id":"http://arxiv.org/abs/2402.08567v2","updated":"2024-06-03T14:15:03Z","published":"2024-02-13T16:06:17Z","title":"Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM\n Agents Exponentially Fast","summary":" A multimodal large language model (MLLM) agent can receive instructions,\ncapture images, retrieve histories from memory, and decide which tools to use.\nNonetheless, red-teaming efforts have revealed that adversarial images/prompts\ncan jailbreak an MLLM and cause unaligned behaviors. In this work, we report an\neven more severe safety issue in multi-agent environments, referred to as\ninfectious jailbreak. It entails the adversary simply jailbreaking a single\nagent, and without any further intervention from the adversary, (almost) all\nagents will become infected exponentially fast and exhibit harmful behaviors.\nTo validate the feasibility of infectious jailbreak, we simulate multi-agent\nenvironments containing up to one million LLaVA-1.5 agents, and employ\nrandomized pair-wise chat as a proof-of-concept instantiation for multi-agent\ninteraction. Our results show that feeding an (infectious) adversarial image\ninto the memory of any randomly chosen agent is sufficient to achieve\ninfectious jailbreak. Finally, we derive a simple principle for determining\nwhether a defense mechanism can provably restrain the spread of infectious\njailbreak, but how to design a practical defense that meets this principle\nremains an open question to investigate. Our project page is available at\nhttps://sail-sg.github.io/Agent-Smith/.\n","authors":["Xiangming Gu","Xiaosen Zheng","Tianyu Pang","Chao Du","Qian Liu","Ye Wang","Jing Jiang","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2402.08567v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2402.14856v2","updated":"2024-06-03T13:53:01Z","published":"2024-02-20T12:58:14Z","title":"Comparing Inferential Strategies of Humans and Large Language Models in\n Deductive Reasoning","summary":" Deductive reasoning plays a pivotal role in the formulation of sound and\ncohesive arguments. It allows individuals to draw conclusions that logically\nfollow, given the truth value of the information provided. Recent progress in\nthe domain of large language models (LLMs) has showcased their capability in\nexecuting deductive reasoning tasks. Nonetheless, a significant portion of\nresearch primarily assesses the accuracy of LLMs in solving such tasks, often\noverlooking a deeper analysis of their reasoning behavior. In this study, we\ndraw upon principles from cognitive psychology to examine inferential\nstrategies employed by LLMs, through a detailed evaluation of their responses\nto propositional logic problems. Our findings indicate that LLMs display\nreasoning patterns akin to those observed in humans, including strategies like\n$\\textit{supposition following}$ or $\\textit{chain construction}$. Moreover,\nour research demonstrates that the architecture and scale of the model\nsignificantly affect its preferred method of reasoning, with more advanced\nmodels tending to adopt strategies more frequently than less sophisticated\nones. Importantly, we assert that a model's accuracy, that is the correctness\nof its final conclusion, does not necessarily reflect the validity of its\nreasoning process. This distinction underscores the necessity for more nuanced\nevaluation procedures in the field.\n","authors":["Philipp Mondorf","Barbara Plank"],"pdf_url":"https://arxiv.org/pdf/2402.14856v2.pdf","comment":"ACL 2024 main, 31 pages, 19 figures"},{"id":"http://arxiv.org/abs/2312.10302v4","updated":"2024-06-03T13:46:16Z","published":"2023-12-16T03:33:12Z","title":"One-Shot Learning as Instruction Data Prospector for Large Language\n Models","summary":" Contemporary practices in instruction tuning often hinge on enlarging data\nscaling without a clear strategy for ensuring data quality, inadvertently\nintroducing noise that may compromise model performance. To address this\nchallenge, we introduce \\textsc{Nuggets}, a novel and efficient methodology\nthat leverages one-shot learning to discern and select high-quality instruction\ndata from extensive datasets. \\textsc{Nuggets} assesses the potential of\nindividual instruction examples to act as effective one-shot learning\ninstances, thereby identifying those that can significantly improve performance\nacross diverse tasks. \\textsc{Nuggets} utilizes a scoring system based on the\nimpact of candidate examples on the perplexity of a diverse anchor set,\nfacilitating the selection of the most advantageous data for instruction\ntuning. Through comprehensive evaluations on two benchmarks, including MT-Bench\nand Alpaca-Eval, we show that instruction tuning with the top 1\\% of examples\ncurated by \\textsc{Nuggets} substantially outperforms conventional methods\nemploying the entire dataset.\n","authors":["Yunshui Li","Binyuan Hui","Xiaobo Xia","Jiaxi Yang","Min Yang","Lei Zhang","Shuzheng Si","Ling-Hao Chen","Junhao Liu","Tongliang Liu","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2312.10302v4.pdf","comment":"ACL 2024"},{"id":"http://arxiv.org/abs/2309.08637v5","updated":"2024-06-03T13:39:40Z","published":"2023-09-14T15:34:01Z","title":"TextBind: Multi-turn Interleaved Multimodal Instruction-following in the\n Wild","summary":" Large language models with instruction-following abilities have\nrevolutionized the field of artificial intelligence. These models show\nexceptional generalizability to tackle various real-world tasks through their\nnatural language interfaces. However, their performance heavily relies on\nhigh-quality exemplar data, which is often difficult to obtain. This challenge\nis further exacerbated when it comes to multimodal instruction following. We\nintroduce TextBind, an almost annotation-free framework for empowering larger\nlanguage models with the multi-turn interleaved multimodal\ninstruction-following capabilities. Our approach requires only image-caption\npairs and generates multi-turn multimodal instruction-response conversations\nfrom a language model. To accommodate interleaved image-text inputs and\noutputs, we devise MIM, a language model-centric architecture that seamlessly\nintegrates image encoder and decoder models. We release our dataset, model, and\ndemo to foster future research in the area of multimodal instruction following.\n","authors":["Huayang Li","Siheng Li","Deng Cai","Longyue Wang","Lemao Liu","Taro Watanabe","Yujiu Yang","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2309.08637v5.pdf","comment":"Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2401.10695v2","updated":"2024-06-03T13:32:45Z","published":"2024-01-19T14:00:19Z","title":"LangBridge: Multilingual Reasoning Without Multilingual Supervision","summary":" We introduce LangBridge, a zero-shot approach to adapt language models for\nmultilingual reasoning tasks without multilingual supervision. LangBridge\noperates by bridging two models, each specialized in different aspects: (1) one\nspecialized in understanding multiple languages (e.g., mT5 encoder) and (2) one\nspecialized in reasoning (e.g., MetaMath). LangBridge connects the two models\nby introducing minimal trainable parameters between them. Despite utilizing\nonly English data for training, LangBridge considerably enhances the\nperformance of language models on low-resource languages across mathematical\nreasoning, code completion, logical reasoning, and commonsense reasoning. Our\nanalysis suggests that the efficacy of LangBridge stems from the\nlanguage-agnostic characteristics of multilingual representations. We publicly\nrelease our code and models.\n","authors":["Dongkeun Yoon","Joel Jang","Sungdong Kim","Seungone Kim","Sheikh Shafayat","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2401.10695v2.pdf","comment":"ACL 2024 Main"},{"id":"http://arxiv.org/abs/2402.02805v2","updated":"2024-06-03T13:07:06Z","published":"2024-02-05T08:26:33Z","title":"Graph-enhanced Large Language Models in Asynchronous Plan Reasoning","summary":" Planning is a fundamental property of human intelligence. Reasoning about\nasynchronous plans is challenging since it requires sequential and parallel\nplanning to optimize time costs. Can large language models (LLMs) succeed at\nthis task? Here, we present the first large-scale study investigating this\nquestion. We find that a representative set of closed and open-source LLMs,\nincluding GPT-4 and LLaMA-2, behave poorly when not supplied with illustrations\nabout the task-solving process in our benchmark AsyncHow. We propose a novel\ntechnique called Plan Like a Graph (PLaG) that combines graphs with natural\nlanguage prompts and achieves state-of-the-art results. We show that although\nPLaG can boost model performance, LLMs still suffer from drastic degradation\nwhen task complexity increases, highlighting the limits of utilizing LLMs for\nsimulating digital devices. We see our study as an exciting step towards using\nLLMs as efficient autonomous agents. Our code and data are available at\nhttps://github.com/fangru-lin/graph-llm-asynchow-plan.\n","authors":["Fangru Lin","Emanuele La Malfa","Valentin Hofmann","Elle Michelle Yang","Anthony Cohn","Janet B. Pierrehumbert"],"pdf_url":"https://arxiv.org/pdf/2402.02805v2.pdf","comment":"Accepted at ICML-2024"},{"id":"http://arxiv.org/abs/2303.06458v3","updated":"2024-06-03T12:47:12Z","published":"2023-03-11T17:14:33Z","title":"ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and\n Multilingual Natural Language Generation","summary":" Natural Language Generation (NLG) accepts input data in the form of images,\nvideos, or text and generates corresponding natural language text as output.\nExisting NLG methods mainly adopt a supervised approach and rely heavily on\ncoupled data-to-text pairs. However, for many targeted scenarios and for\nnon-English languages, sufficient quantities of labeled data are often not\navailable. To relax the dependency on labeled data of downstream tasks, we\npropose an intuitive and effective zero-shot learning framework, ZeroNLG, which\ncan deal with multiple NLG tasks, including image-to-text (image captioning),\nvideo-to-text (video captioning), and text-to-text (neural machine\ntranslation), across English, Chinese, German, and French within a unified\nframework. ZeroNLG does not require any labeled downstream pairs for training.\nDuring training, ZeroNLG (i) projects different domains (across modalities and\nlanguages) to corresponding coordinates in a shared common latent space; (ii)\nbridges different domains by aligning their corresponding coordinates in this\nspace; and (iii) builds an unsupervised multilingual auto-encoder to learn to\ngenerate text by reconstructing the input text given its coordinate in shared\nlatent space. Consequently, during inference, based on the data-to-text\npipeline, ZeroNLG can generate target sentences across different languages\ngiven the coordinate of input data in the common space. Within this unified\nframework, given visual (imaging or video) data as input, ZeroNLG can perform\nzero-shot visual captioning; given textual sentences as input, ZeroNLG can\nperform zero-shot machine translation. We present the results of extensive\nexperiments on twelve NLG tasks, showing that, without using any labeled\ndownstream pairs for training, ZeroNLG generates high-quality and believable\noutputs and significantly outperforms existing zero-shot methods.\n","authors":["Bang Yang","Fenglin Liu","Yuexian Zou","Xian Wu","Yaowei Wang","David A. Clifton"],"pdf_url":"https://arxiv.org/pdf/2303.06458v3.pdf","comment":"Accepted by TPAMI (Our code and data are available at\n https://github.com/yangbang18/ZeroNLG)"},{"id":"http://arxiv.org/abs/2405.11143v2","updated":"2024-06-03T12:19:18Z","published":"2024-05-20T01:04:40Z","title":"OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework","summary":" As large language models (LLMs) continue to grow by scaling laws,\nreinforcement learning from human feedback (RLHF) has gained significant\nattention due to its outstanding performance. However, unlike pretraining or\nfine-tuning a single model, scaling reinforcement learning from human feedback\n(RLHF) for training large language models poses coordination challenges across\nfour models. We present OpenRLHF, an open-source framework enabling efficient\nRLHF scaling. Unlike existing RLHF frameworks that co-locate four models on the\nsame GPUs, OpenRLHF re-designs scheduling for the models beyond 70B parameters\nusing Ray, vLLM, and DeepSpeed, leveraging improved resource utilization and\ndiverse training approaches. Integrating seamlessly with Hugging Face, OpenRLHF\nprovides an out-of-the-box solution with optimized algorithms and launch\nscripts, which ensures user-friendliness. OpenRLHF implements RLHF, DPO,\nrejection sampling, and other alignment techniques. Empowering state-of-the-art\nLLM development, OpenRLHF's code is available at\nhttps://github.com/OpenLLMAI/OpenRLHF.\n","authors":["Jian Hu","Xibin Wu","Weixun Wang"," Xianyu","Dehao Zhang","Yu Cao"],"pdf_url":"https://arxiv.org/pdf/2405.11143v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.11082v6","updated":"2024-06-03T12:19:16Z","published":"2023-04-19T17:50:09Z","title":"Fundamental Limitations of Alignment in Large Language Models","summary":" An important aspect in developing language models that interact with humans\nis aligning their behavior to be useful and unharmful for their human users.\nThis is usually achieved by tuning the model in a way that enhances desired\nbehaviors and inhibits undesired ones, a process referred to as alignment. In\nthis paper, we propose a theoretical approach called Behavior Expectation\nBounds (BEB) which allows us to formally investigate several inherent\ncharacteristics and limitations of alignment in large language models.\nImportantly, we prove that within the limits of this framework, for any\nbehavior that has a finite probability of being exhibited by the model, there\nexist prompts that can trigger the model into outputting this behavior, with\nprobability that increases with the length of the prompt. This implies that any\nalignment process that attenuates an undesired behavior but does not remove it\naltogether, is not safe against adversarial prompting attacks. Furthermore, our\nframework hints at the mechanism by which leading alignment approaches such as\nreinforcement learning from human feedback make the LLM prone to being prompted\ninto the undesired behaviors. This theoretical result is being experimentally\ndemonstrated in large scale by the so called contemporary \"chatGPT jailbreaks\",\nwhere adversarial users trick the LLM into breaking its alignment guardrails by\ntriggering it into acting as a malicious persona. Our results expose\nfundamental limitations in alignment of LLMs and bring to the forefront the\nneed to devise reliable mechanisms for ensuring AI safety.\n","authors":["Yotam Wolf","Noam Wies","Oshri Avnery","Yoav Levine","Amnon Shashua"],"pdf_url":"https://arxiv.org/pdf/2304.11082v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.07105v3","updated":"2024-06-03T12:14:34Z","published":"2024-01-13T16:09:49Z","title":"Graph Language Models","summary":" While Language Models (LMs) are the workhorses of NLP, their interplay with\nstructured knowledge graphs (KGs) is still actively researched. Current methods\nfor encoding such graphs typically either (i) linearize them for embedding with\nLMs -- which underutilize structural information, or (ii) use Graph Neural\nNetworks (GNNs) to preserve the graph structure -- but GNNs cannot represent\ntext features as well as pretrained LMs. In our work we introduce a novel LM\ntype, the Graph Language Model (GLM), that integrates the strengths of both\napproaches and mitigates their weaknesses. The GLM parameters are initialized\nfrom a pretrained LM to enhance understanding of individual graph concepts and\ntriplets. Simultaneously, we design the GLM's architecture to incorporate graph\nbiases, thereby promoting effective knowledge distribution within the graph.\nThis enables GLMs to process graphs, texts, and interleaved inputs of both.\nEmpirical evaluations on relation classification tasks show that GLM embeddings\nsurpass both LM- and GNN-based baselines in supervised and zero-shot setting,\ndemonstrating their versatility.\n","authors":["Moritz Plenz","Anette Frank"],"pdf_url":"https://arxiv.org/pdf/2401.07105v3.pdf","comment":"Accepted at ACL 2024. 9 pages, 10 figures, 9 tables"},{"id":"http://arxiv.org/abs/2404.04232v2","updated":"2024-06-03T12:08:20Z","published":"2024-04-05T17:26:22Z","title":"Benchmarking and Improving Compositional Generalization of Multi-aspect\n Controllable Text Generation","summary":" Compositional generalization, representing the model's ability to generate\ntext with new attribute combinations obtained by recombining single attributes\nfrom the training data, is a crucial property for multi-aspect controllable\ntext generation (MCTG) methods. Nonetheless, a comprehensive compositional\ngeneralization evaluation benchmark of MCTG is still lacking. We propose\nCompMCTG, a benchmark encompassing diverse multi-aspect labeled datasets and a\ncrafted three-dimensional evaluation protocol, to holistically evaluate the\ncompositional generalization of MCTG approaches. We observe that existing MCTG\nworks generally confront a noticeable performance drop in compositional\ntesting. To mitigate this issue, we introduce Meta-MCTG, a training framework\nincorporating meta-learning, where we enable models to learn how to generalize\nby simulating compositional generalization scenarios in the training phase. We\ndemonstrate the effectiveness of Meta-MCTG through achieving obvious\nimprovement (by at most 3.64%) for compositional testing performance in 94.4%\ncases.\n","authors":["Tianqi Zhong","Zhaoyi Li","Quan Wang","Linqi Song","Ying Wei","Defu Lian","Zhendong Mao"],"pdf_url":"https://arxiv.org/pdf/2404.04232v2.pdf","comment":"Accepted to ACL 2024 (Main); 32 pages"},{"id":"http://arxiv.org/abs/2403.06833v2","updated":"2024-06-03T12:04:50Z","published":"2024-03-11T15:48:56Z","title":"Can LLMs Separate Instructions From Data? And What Do We Even Mean By\n That?","summary":" Instruction-tuned Large Language Models (LLMs) show impressive results in\nnumerous practical applications, but they lack essential safety features that\nare common in other areas of computer science, particularly an explicit\nseparation of instructions and data. This makes them vulnerable to\nmanipulations such as indirect prompt injections and generally unsuitable for\nsafety-critical tasks. Surprisingly, there is currently no established\ndefinition or benchmark to quantify this phenomenon. In this work, we close\nthis gap by introducing a formal measure for instruction-data separation and an\nempirical variant that is calculable from a model's outputs. We also present a\nnew dataset, SEP, that allows estimating the measure for real-world models. Our\nresults on various LLMs show that the problem of instruction-data separation is\nreal: all models fail to achieve high separation, and canonical mitigation\ntechniques, such as prompt engineering and fine-tuning, either fail to\nsubstantially improve separation or reduce model utility. The source code and\nSEP dataset are openly accessible at\nhttps://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.\n","authors":["Egor Zverev","Sahar Abdelnabi","Soroush Tabesh","Mario Fritz","Christoph H. Lampert"],"pdf_url":"https://arxiv.org/pdf/2403.06833v2.pdf","comment":"GitHub:\n https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed. 10 pages main\n text, 30 pages in total"},{"id":"http://arxiv.org/abs/2404.08817v2","updated":"2024-06-03T11:56:38Z","published":"2024-04-12T21:28:18Z","title":"Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit\n Distance","summary":" This paper revisits recent code similarity evaluation metrics, particularly\nfocusing on the application of Abstract Syntax Tree (AST) editing distance in\ndiverse programming languages. In particular, we explore the usefulness of\nthese metrics and compare them to traditional sequence similarity metrics. Our\nexperiments showcase the effectiveness of AST editing distance in capturing\nintricate code structures, revealing a high correlation with established\nmetrics. Furthermore, we explore the strengths and weaknesses of AST editing\ndistance and prompt-based GPT similarity scores in comparison to BLEU score,\nexecution match, and Jaccard Similarity. We propose, optimize, and publish an\nadaptable metric that demonstrates effectiveness across all tested languages,\nrepresenting an enhanced version of Tree Similarity of Edit Distance (TSED).\n","authors":["Yewei Song","Cedric Lothritz","Daniel Tang","Tegawendé F. Bissyandé","Jacques Klein"],"pdf_url":"https://arxiv.org/pdf/2404.08817v2.pdf","comment":"ACL 2024 Main"},{"id":"http://arxiv.org/abs/2311.08045v4","updated":"2024-06-03T11:34:05Z","published":"2023-11-14T10:10:31Z","title":"Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM\n Game","summary":" Human preference alignment is essential to improve the interaction quality of\nlarge language models (LLMs). Existing alignment methods depend on manually\nannotated preference data to guide the LLM optimization directions. However,\ncontinuously updating LLMs for alignment raises a distribution gap between\nmodel-generated samples and human-annotated responses, hindering training\neffectiveness. To mitigate this issue, previous methods require additional\npreference annotation on newly generated samples to adapt to the shifted\ndistribution, which consumes a large amount of annotation resources. Targeting\nmore efficient human preference optimization, we propose an Adversarial\nPreference Optimization (APO) framework, in which the LLM and the reward model\nupdate alternatively via a min-max game. Through adversarial training, the\nreward model can adapt to the shifted generation distribution of the LLM\nwithout any additional annotation. With comprehensive experiments, we find the\nproposed adversarial training framework further enhances existing alignment\nbaselines in terms of LLM helpfulness and harmlessness. The code is at\nhttps://github.com/Linear95/APO.\n","authors":["Pengyu Cheng","Yifan Yang","Jian Li","Yong Dai","Tianhao Hu","Peixin Cao","Nan Du","Xiaolong Li"],"pdf_url":"https://arxiv.org/pdf/2311.08045v4.pdf","comment":"Accepted by ACL2024 findings"},{"id":"http://arxiv.org/abs/2401.17167v3","updated":"2024-06-03T11:28:29Z","published":"2024-01-30T16:52:56Z","title":"Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool\n Utilization in Real-World Complex Scenarios","summary":" The recent trend of using Large Language Models (LLMs) as tool agents in\nreal-world applications underscores the necessity for comprehensive evaluations\nof their capabilities, particularly in complex scenarios involving planning,\ncreating, and using tools. However, existing benchmarks typically focus on\nsimple synthesized queries that do not reflect real-world complexity, thereby\noffering limited perspectives in evaluating tool utilization. To address this\nissue, we present UltraTool, a novel benchmark designed to improve and evaluate\nLLMs' ability in tool utilization within real-world scenarios. UltraTool\nfocuses on the entire process of using tools - from planning and creating to\napplying them in complex tasks. It emphasizes real-world complexities,\ndemanding accurate, multi-step planning for effective problem-solving. A key\nfeature of UltraTool is its independent evaluation of planning with natural\nlanguage, which happens before tool usage and simplifies the task solving by\nmapping out the intermediate steps. Thus, unlike previous work, it eliminates\nthe restriction of pre-defined toolset. Through extensive experiments on\nvarious LLMs, we offer novel insights into the evaluation of capabilities of\nLLMs in tool utilization, thereby contributing a fresh perspective to this\nrapidly evolving field. The benchmark is publicly available at\nhttps://github.com/JoeYing1019/UltraTool.\n","authors":["Shijue Huang","Wanjun Zhong","Jianqiao Lu","Qi Zhu","Jiahui Gao","Weiwen Liu","Yutai Hou","Xingshan Zeng","Yasheng Wang","Lifeng Shang","Xin Jiang","Ruifeng Xu","Qun Liu"],"pdf_url":"https://arxiv.org/pdf/2401.17167v3.pdf","comment":"Accepted by ACL2024 Findings"},{"id":"http://arxiv.org/abs/2401.15641v2","updated":"2024-06-03T11:11:13Z","published":"2024-01-28T12:33:14Z","title":"PRE: A Peer Review Based Large Language Model Evaluator","summary":" The impressive performance of large language models (LLMs) has attracted\nconsiderable attention from the academic and industrial communities. Besides\nhow to construct and train LLMs, how to effectively evaluate and compare the\ncapacity of LLMs has also been well recognized as an important yet difficult\nproblem. Existing paradigms rely on either human annotators or model-based\nevaluators to evaluate the performance of LLMs on different tasks. However,\nthese paradigms often suffer from high cost, low generalizability, and\ninherited biases in practice, which make them incapable of supporting the\nsustainable development of LLMs in long term. In order to address these issues,\ninspired by the peer review systems widely used in academic publication\nprocess, we propose a novel framework that can automatically evaluate LLMs\nthrough a peer-review process. Specifically, for the evaluation of a specific\ntask, we first construct a small qualification exam to select \"reviewers\" from\na couple of powerful LLMs. Then, to actually evaluate the \"submissions\" written\nby different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to\nrate or compare the submissions. The final ranking of evaluatee LLMs is\ngenerated based on the results provided by all reviewers. We conducted\nextensive experiments on text summarization tasks with eleven LLMs including\nGPT-4. The results demonstrate the existence of biasness when evaluating using\na single LLM. Also, our PRE model outperforms all the baselines, illustrating\nthe effectiveness of the peer review mechanism.\n","authors":["Zhumin Chu","Qingyao Ai","Yiteng Tu","Haitao Li","Yiqun Liu"],"pdf_url":"https://arxiv.org/pdf/2401.15641v2.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2402.06044v3","updated":"2024-06-03T10:48:16Z","published":"2024-02-08T20:35:06Z","title":"OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind\n Reasoning Capabilities of Large Language Models","summary":" Neural Theory-of-Mind (N-ToM), machine's ability to understand and keep track\nof the mental states of others, is pivotal in developing socially intelligent\nagents. However, prevalent N-ToM benchmarks have several shortcomings,\nincluding the presence of ambiguous and artificial narratives, absence of\npersonality traits and preferences, a lack of questions addressing characters'\npsychological mental states, and limited diversity in the questions posed. In\nresponse to these issues, we construct OpenToM, a new benchmark for assessing\nN-ToM with (1) longer and clearer narrative stories, (2) characters with\nexplicit personality traits, (3) actions that are triggered by character\nintentions, and (4) questions designed to challenge LLMs' capabilities of\nmodeling characters' mental states of both the physical and psychological\nworld. Using OpenToM, we reveal that state-of-the-art LLMs thrive at modeling\ncertain aspects of mental states in the physical world but fall short when\ntracking characters' mental states in the psychological world.\n","authors":["Hainiu Xu","Runcong Zhao","Lixing Zhu","Jinhua Du","Yulan He"],"pdf_url":"https://arxiv.org/pdf/2402.06044v3.pdf","comment":"ACL 2024"},{"id":"http://arxiv.org/abs/2404.10306v4","updated":"2024-06-03T10:42:36Z","published":"2024-04-16T06:27:39Z","title":"Balancing Speciality and Versatility: a Coarse to Fine Framework for\n Supervised Fine-tuning Large Language Model","summary":" Aligned Large Language Models (LLMs) showcase remarkable versatility, capable\nof handling diverse real-world tasks. Meanwhile, aligned LLMs are also expected\nto exhibit speciality, excelling in specific applications. However, fine-tuning\nwith extra data, a common practice to gain speciality, often leads to\ncatastrophic forgetting (CF) of previously acquired versatility, hindering the\nmodel's performance across diverse tasks. In response to this challenge, we\npropose CoFiTune, a coarse to fine framework in an attempt to strike the\nbalance between speciality and versatility. At the coarse-grained level, an\nempirical tree-search algorithm is utilized to pinpoint and update specific\nmodules that are crucial for speciality, while keeping other parameters frozen;\nat the fine-grained level, a soft-masking mechanism regulates the update to the\nLLMs, mitigating the CF issue without harming speciality. In an overall\nevaluation of both speciality and versatility, CoFiTune consistently\noutperforms baseline methods across diverse tasks and model scales. Compared to\nthe full-parameter SFT, CoFiTune leads to about 14% versatility improvement and\nmarginal speciality loss on a 13B model. Lastly, based on further analysis, we\nprovide a speculative insight into the information forwarding process in LLMs,\nwhich helps explain the effectiveness of the proposed method. The code is\navailable at https://github.com/rattlesnakey/CoFiTune.\n","authors":["Hengyuan Zhang","Yanru Wu","Dawei Li","Sak Yang","Rui Zhao","Yong Jiang","Fei Tan"],"pdf_url":"https://arxiv.org/pdf/2404.10306v4.pdf","comment":"43 pages, 10 figures, accepted by ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2402.09259v2","updated":"2024-06-03T10:30:00Z","published":"2024-02-14T15:45:56Z","title":"SyntaxShap: Syntax-aware Explainability Method for Text Generation","summary":" To harness the power of large language models in safety-critical domains, we\nneed to ensure the explainability of their predictions. However, despite the\nsignificant attention to model interpretability, there remains an unexplored\ndomain in explaining sequence-to-sequence tasks using methods tailored for\ntextual data. This paper introduces SyntaxShap, a local, model-agnostic\nexplainability method for text generation that takes into consideration the\nsyntax in the text data. The presented work extends Shapley values to account\nfor parsing-based syntactic dependencies. Taking a game theoric approach,\nSyntaxShap only considers coalitions constraint by the dependency tree. We\nadopt a model-based evaluation to compare SyntaxShap and its weighted form to\nstate-of-the-art explainability methods adapted to text generation tasks, using\ndiverse metrics including faithfulness, coherency, and semantic alignment of\nthe explanations to the model. We show that our syntax-aware method produces\nexplanations that help build more faithful and coherent explanations for\npredictions by autoregressive models. Confronted with the misalignment of human\nand AI model reasoning, this paper also highlights the need for cautious\nevaluation strategies in explainable AI.\n","authors":["Kenza Amara","Rita Sevastjanova","Mennatallah El-Assady"],"pdf_url":"https://arxiv.org/pdf/2402.09259v2.pdf","comment":"Accepted to ACL 2024"},{"id":"http://arxiv.org/abs/2402.09631v3","updated":"2024-06-03T10:24:22Z","published":"2024-02-15T00:20:30Z","title":"Representation Surgery: Theory and Practice of Affine Steering","summary":" Language models often exhibit undesirable behavior, e.g., generating toxic or\ngender-biased text. In the case of neural language models, an encoding of the\nundesirable behavior is often present in the model's representations. Thus, one\nnatural (and common) approach to prevent the model from exhibiting undesirable\nbehavior is to steer the model's representations in a manner that reduces the\nprobability of it generating undesirable text. This paper investigates the\nformal and empirical properties of steering functions, i.e., transformation of\nthe neural language model's representations that alter its behavior. First, we\nderive two optimal, in the least-squares sense, affine steering functions under\ndifferent constraints. Our theory provides justification for existing\napproaches and offers a novel, improved steering approach. Second, we offer a\nseries of experiments that demonstrate the empirical effectiveness of the\nmethods in mitigating bias and reducing toxic generation.\n","authors":["Shashwat Singh","Shauli Ravfogel","Jonathan Herzig","Roee Aharoni","Ryan Cotterell","Ponnurangam Kumaraguru"],"pdf_url":"https://arxiv.org/pdf/2402.09631v3.pdf","comment":"Accepted in ICML 2024"},{"id":"http://arxiv.org/abs/2312.11075v4","updated":"2024-06-03T10:00:13Z","published":"2023-12-18T10:16:37Z","title":"Split and Rephrase with Large Language Models","summary":" The Split and Rephrase (SPRP) task, which consists in splitting complex\nsentences into a sequence of shorter grammatical sentences, while preserving\nthe original meaning, can facilitate the processing of complex texts for humans\nand machines alike. It is also a valuable testbed to evaluate natural language\nprocessing models, as it requires modelling complex grammatical aspects. In\nthis work, we evaluate large language models on the task, showing that they can\nprovide large improvements over the state of the art on the main metrics,\nalthough still lagging in terms of splitting compliance. Results from two human\nevaluations further support the conclusions drawn from automated metric\nresults. We provide a comprehensive study that includes prompting variants,\ndomain shift, fine-tuned pretrained language models of varying parameter size\nand training data volumes, contrasted with both zero-shot and few-shot\napproaches on instruction-tuned language models. Although the latter were\nmarkedly outperformed by fine-tuned models, they may constitute a reasonable\noff-the-shelf alternative. Our results provide a fine-grained analysis of the\npotential and limitations of large language models for SPRP, with significant\nimprovements achievable using relatively small amounts of training data and\nmodel parameters overall, and remaining limitations for all models on the task.\n","authors":["David Ponce","Thierry Etchegoyhen","Jesús Calleja Pérez","Harritxu Gete"],"pdf_url":"https://arxiv.org/pdf/2312.11075v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04251v2","updated":"2024-06-03T09:33:14Z","published":"2024-02-06T18:59:30Z","title":"Linear-time Minimum Bayes Risk Decoding with Reference Aggregation","summary":" Minimum Bayes Risk (MBR) decoding is a text generation technique that has\nbeen shown to improve the quality of machine translations, but is expensive,\neven if a sampling-based approximation is used. Besides requiring a large\nnumber of sampled sequences, it requires the pairwise calculation of a utility\nmetric, which has quadratic complexity. In this paper, we propose to\napproximate pairwise metric scores with scores calculated against aggregated\nreference representations. This changes the complexity of utility estimation\nfrom $O(n^2)$ to $O(n)$, while empirically preserving most of the quality gains\nof MBR decoding. We release our source code at https://github.com/ZurichNLP/mbr\n","authors":["Jannis Vamvas","Rico Sennrich"],"pdf_url":"https://arxiv.org/pdf/2402.04251v2.pdf","comment":"ACL 2024"},{"id":"http://arxiv.org/abs/2404.06395v3","updated":"2024-06-03T08:54:38Z","published":"2024-04-09T15:36:50Z","title":"MiniCPM: Unveiling the Potential of Small Language Models with Scalable\n Training Strategies","summary":" The burgeoning interest in developing Large Language Models (LLMs) with up to\ntrillion parameters has been met with concerns regarding resource efficiency\nand practical expense, particularly given the immense cost of experimentation.\nThis scenario underscores the importance of exploring the potential of Small\nLanguage Models (SLMs) as a resource-efficient alternative. In this context, we\nintroduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter\nvariants, not only excel in their respective categories but also demonstrate\ncapabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach\nexhibits scalability in both model and data dimensions for future LLM research.\nRegarding model scaling, we employ extensive model wind tunnel experiments for\nstable and optimal scaling. For data scaling, we introduce a\nWarmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to\ncontinuous training and domain adaptation. We present an in-depth analysis of\nthe intriguing training dynamics that occurred in the WSD LRS. With WSD LRS, we\nare now able to efficiently study data-model scaling law without extensive\nretraining experiments on both axes of model and data, from which we derive the\nmuch higher compute optimal data-model ratio than Chinchilla Optimal.\nAdditionally, we introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE\nand MiniCPM-128K, whose excellent performance further cementing MiniCPM's\nfoundation in diverse SLM applications. MiniCPM models are available publicly\nat https://github.com/OpenBMB/MiniCPM .\n","authors":["Shengding Hu","Yuge Tu","Xu Han","Chaoqun He","Ganqu Cui","Xiang Long","Zhi Zheng","Yewei Fang","Yuxiang Huang","Weilin Zhao","Xinrong Zhang","Zheng Leng Thai","Kaihuo Zhang","Chongyi Wang","Yuan Yao","Chenyang Zhao","Jie Zhou","Jie Cai","Zhongwu Zhai","Ning Ding","Chao Jia","Guoyang Zeng","Dahai Li","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2404.06395v3.pdf","comment":"revise according to peer review"},{"id":"http://arxiv.org/abs/2311.09189v2","updated":"2024-06-03T08:37:10Z","published":"2023-11-15T18:32:27Z","title":"PsyEval: A Suite of Mental Health Related Tasks for Evaluating Large\n Language Models","summary":" Evaluating Large Language Models (LLMs) in the mental health domain poses\ndistinct challenged from other domains, given the subtle and highly subjective\nnature of symptoms that exhibit significant variability among individuals. This\npaper presents PsyEval, the first comprehensive suite of mental health-related\ntasks for evaluating LLMs. PsyEval encompasses five sub-tasks that evaluate\nthree critical dimensions of mental health. This comprehensive framework is\ndesigned to thoroughly assess the unique challenges and intricacies of mental\nhealth-related tasks, making PsyEval a highly specialized and valuable tool for\nevaluating LLM performance in this domain. We evaluate twelve advanced LLMs\nusing PsyEval. Experiment results not only demonstrate significant room for\nimprovement in current LLMs concerning mental health but also unveil potential\ndirections for future model optimization.\n","authors":["Haoan Jin","Siyuan Chen","Dilawaier Dilixiati","Yewei Jiang","Mengyue Wu","Kenny Q. Zhu"],"pdf_url":"https://arxiv.org/pdf/2311.09189v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18741v2","updated":"2024-06-03T08:35:07Z","published":"2024-05-29T04:04:05Z","title":"Genshin: General Shield for Natural Language Processing with Large\n Language Models","summary":" Large language models (LLMs) like ChatGPT, Gemini, or LLaMA have been\ntrending recently, demonstrating considerable advancement and generalizability\npower in countless domains. However, LLMs create an even bigger black box\nexacerbating opacity, with interpretability limited to few approaches. The\nuncertainty and opacity embedded in LLMs' nature restrict their application in\nhigh-stakes domains like financial fraud, phishing, etc. Current approaches\nmainly rely on traditional textual classification with posterior interpretable\nalgorithms, suffering from attackers who may create versatile adversarial\nsamples to break the system's defense, forcing users to make trade-offs between\nefficiency and robustness. To address this issue, we propose a novel cascading\nframework called Genshin (General Shield for Natural Language Processing with\nLarge Language Models), utilizing LLMs as defensive one-time plug-ins. Unlike\nmost applications of LLMs that try to transform text into something new or\nstructural, Genshin uses LLMs to recover text to its original state. Genshin\naims to combine the generalizability of the LLM, the discrimination of the\nmedian model, and the interpretability of the simple model. Our experiments on\nthe task of sentimental analysis and spam detection have shown fatal flaws of\nthe current median models and exhilarating results on LLMs' recovery ability,\ndemonstrating that Genshin is both effective and efficient. In our ablation\nstudy, we unearth several intriguing observations. Utilizing the LLM defender,\na tool derived from the 4th paradigm, we have reproduced BERT's 15% optimal\nmask rate results in the 3rd paradigm of NLP. Additionally, when employing the\nLLM as a potential adversarial tool, attackers are capable of executing\neffective attacks that are nearly semantically lossless.\n","authors":["Xiao Peng","Tao Liu","Ying Wang"],"pdf_url":"https://arxiv.org/pdf/2405.18741v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19799v2","updated":"2024-06-03T08:13:10Z","published":"2024-05-30T08:10:50Z","title":"Unsupervised Mutual Learning of Dialogue Discourse Parsing and Topic\n Segmentation","summary":" The advancement of large language models (LLMs) has propelled the development\nof dialogue systems. Unlike the popular ChatGPT-like assistant model, which\nonly satisfies the user's preferences, task-oriented dialogue systems have also\nfaced new requirements and challenges in the broader business field. They are\nexpected to provide correct responses at each dialogue turn, at the same time,\nachieve the overall goal defined by the task. By understanding rhetorical\nstructures and topic structures via topic segmentation and discourse parsing, a\ndialogue system may do a better planning to achieve both objectives. However,\nwhile both structures belong to discourse structure in linguistics, rhetorical\nstructure and topic structure are mostly modeled separately or with one\nassisting the other in the prior work. The interaction between these two\nstructures has not been considered for joint modeling and mutual learning.\nFurthermore, unsupervised learning techniques to achieve the above are not well\nexplored. To fill this gap, we propose an unsupervised mutual learning\nframework of two structures leveraging the global and local connections between\nthem. We extend the topic modeling between non-adjacent discourse units to\nensure global structural relevance with rhetorical structures. We also\nincorporate rhetorical structures into the topic structure through a graph\nneural network model to ensure local coherence consistency. Finally, we utilize\nthe similarity between the two fused structures for mutual learning. The\nexperimental results demonstrate that our methods outperform all strong\nbaselines on two dialogue rhetorical datasets (STAC and Molweni), as well as\ndialogue topic datasets (Doc2Dial and TIAGE). We provide our code at\nhttps://github.com/Jeff-Sue/URT.\n","authors":["Jiahui Xu","Feng Jiang","Anningzhe Gao","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2405.19799v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.03161v3","updated":"2024-06-03T08:09:09Z","published":"2024-02-05T16:30:49Z","title":"Video-LaVIT: Unified Video-Language Pre-training with Decoupled\n Visual-Motional Tokenization","summary":" In light of recent advances in multimodal Large Language Models (LLMs), there\nis increasing attention to scaling them from image-text data to more\ninformative real-world videos. Compared to static images, video poses unique\nchallenges for effective large-scale pre-training due to the modeling of its\nspatiotemporal dynamics. In this paper, we address such limitations in\nvideo-language pre-training with an efficient video decomposition that\nrepresents each video as keyframes and temporal motions. These are then adapted\nto an LLM using well-designed tokenizers that discretize visual and temporal\ninformation as a few tokens, thus enabling unified generative pre-training of\nvideos, images, and text. At inference, the generated tokens from the LLM are\ncarefully recovered to the original continuous pixel space to create various\nvideo content. Our proposed framework is both capable of comprehending and\ngenerating image and video content, as demonstrated by its competitive\nperformance across 13 multimodal benchmarks in image and video understanding\nand generation. Our code and models are available at\nhttps://video-lavit.github.io.\n","authors":["Yang Jin","Zhicheng Sun","Kun Xu","Kun Xu","Liwei Chen","Hao Jiang","Quzhe Huang","Chengru Song","Yuliang Liu","Di Zhang","Yang Song","Kun Gai","Yadong Mu"],"pdf_url":"https://arxiv.org/pdf/2402.03161v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04854v3","updated":"2024-06-03T07:48:19Z","published":"2024-02-07T13:54:06Z","title":"Hierarchical Tree-structured Knowledge Graph For Academic Insight Survey","summary":" Research surveys have always posed a challenge for beginner researchers who\nlack of research training. These researchers struggle to understand the\ndirections within their research topic, and the discovery of new research\nfindings within a short time. One way to provide intuitive assistance to\nbeginner researchers is by offering relevant knowledge graphs(KG) and\nrecommending related academic papers. However, existing navigation knowledge\ngraphs primarily rely on keywords in the research field and often fail to\npresent the logical hierarchy among multiple related papers clearly. Moreover,\nmost recommendation systems for academic papers simply rely on high text\nsimilarity, which can leave researchers confused as to why a particular article\nis being recommended. They may lack of grasp important information about the\ninsight connection between \"Issue resolved\" and \"Issue finding\" that they hope\nto obtain. To address these issues, this study aims to support research insight\nsurveys for beginner researchers by establishing a hierarchical tree-structured\nknowledge graph that reflects the inheritance insight of research topics and\nthe relevance insight among the academic papers.\n","authors":["Jinghong Li","Huy Phan","Wen Gu","Koichi Ota","Shinobu Hasegawa"],"pdf_url":"https://arxiv.org/pdf/2402.04854v3.pdf","comment":"This paper will be submitted to 'The 18TH International Conference on\n INnovations in Intelligent SysTems and Applications (INISTA 2024)'"},{"id":"http://arxiv.org/abs/2311.01775v2","updated":"2024-06-03T07:42:15Z","published":"2023-11-03T08:20:48Z","title":"UP4LS: User Profile Constructed by Multiple Attributes for Enhancing\n Linguistic Steganalysis","summary":" Linguistic steganalysis (LS) tasks aim to detect whether a text contains\nsecret information. Existing LS methods focus on the deep-learning model design\nand they achieve excellent results in ideal data. However, they overlook the\nunique user characteristics, leading to weak performance in social networks.\nAnd a few stegos here that further complicate detection. We propose the UP4LS,\na framework with the User Profile for enhancing LS in realistic scenarios.\nThree kinds of user attributes like writing habits are explored to build the\nprofile. For each attribute, the specific feature extraction module is\ndesigned. The extracted features are mapped to high-dimensional user features\nvia the deep-learning model of the method to be improved. The content feature\nis extracted by the language model. Then user and content features are\nintegrated. Existing methods can improve LS results by adding the UP4LS\nframework without changing their deep-learning models. Experiments show that\nUP4LS can significantly enhance the performance of LS-task baselines in\nrealistic scenarios, with the overall Acc increased by 25%, F1 increased by\n51%, and SOTA results. The improvement is especially pronounced in fewer\nstegos. Additionally, UP4LS also sets the stage for the related-task SOTA\nmethods to efficient LS.\n","authors":["Yihao Wang","Ruiqi Song","Lingxiao Li","Yifan Tang","Ru Zhang","Jianyi Liu"],"pdf_url":"https://arxiv.org/pdf/2311.01775v2.pdf","comment":"15 pages, 7 figures, 14 tables"},{"id":"http://arxiv.org/abs/2402.02801v2","updated":"2024-06-03T07:35:25Z","published":"2024-02-05T08:19:56Z","title":"KS-Lottery: Finding Certified Lottery Tickets for Multilingual Language\n Models","summary":" The lottery ticket hypothesis posits the existence of ``winning tickets''\nwithin a randomly initialized neural network. Do winning tickets exist for LLMs\nin fine-tuning scenarios? How can we find such winning tickets? In this paper,\nwe propose KS-Lottery, a method to identify a small subset of LLM parameters\nhighly effective in multilingual fine-tuning. Our key idea is to use\nKolmogorov-Smirnov Test to analyze the distribution shift of parameters before\nand after fine-tuning. We further theoretically prove that KS-Lottery can find\nthe certified winning tickets in the embedding layer, fine-tuning on the found\nparameters is guaranteed to perform as well as full fine-tuning. Comparing\nKS-Lottery with other parameter-efficient tuning algorithms on translation\ntasks, the experimental results show that KS-Lottery finds a much smaller set\nof parameters for fine-tuning while achieving the comparable performance as\nfull fine-tuning LLM. Surprisingly, we find that fine-tuning 18 tokens'\nembedding of LLaMA suffices to reach the fine-tuning translation\nperformance~\\footnote{https://github.com/CONE-MT/KS-Lottery.}.\n","authors":["Fei Yuan","Chang Ma","Shuai Yuan","Qiushi Sun","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2402.02801v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.14652v2","updated":"2024-06-03T07:34:05Z","published":"2024-02-22T16:04:03Z","title":"Cleaner Pretraining Corpus Curation with Neural Web Scraping","summary":" The web contains large-scale, diverse, and abundant information to satisfy\nthe information-seeking needs of humans. Through meticulous data collection,\npreprocessing, and curation, webpages can be used as a fundamental data\nresource for language model pretraining. However, when confronted with the\nprogressively revolutionized and intricate nature of webpages,\nrule-based/feature-based web scrapers are becoming increasingly inadequate.\nThis paper presents a simple, fast, and effective Neural web Scraper\n(NeuScraper) to help extract primary and clean text contents from webpages.\nExperimental results show that NeuScraper surpasses the baseline scrapers by\nachieving more than a 20% improvement, demonstrating its potential in\nextracting higher-quality data to facilitate the language model pretraining.\nAll of the code is available at https://github.com/OpenMatch/NeuScraper.\n","authors":["Zhipeng Xu","Zhenghao Liu","Yukun Yan","Zhiyuan Liu","Ge Yu","Chenyan Xiong"],"pdf_url":"https://arxiv.org/pdf/2402.14652v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.09353v5","updated":"2024-06-03T07:27:15Z","published":"2024-02-14T17:59:34Z","title":"DoRA: Weight-Decomposed Low-Rank Adaptation","summary":" Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA\nand its variants have gained considerable popularity because of avoiding\nadditional inference costs. However, there still often exists an accuracy gap\nbetween these methods and full fine-tuning (FT). In this work, we first\nintroduce a novel weight decomposition analysis to investigate the inherent\ndifferences between FT and LoRA. Aiming to resemble the learning capacity of FT\nfrom the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA).\nDoRA decomposes the pre-trained weight into two components, magnitude and\ndirection, for fine-tuning, specifically employing LoRA for directional updates\nto efficiently minimize the number of trainable parameters. By employing \\ours,\nwe enhance both the learning capacity and training stability of LoRA while\navoiding any additional inference overhead. \\ours~consistently outperforms LoRA\non fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as\ncommonsense reasoning, visual instruction tuning, and image/video-text\nunderstanding. Code is available at https://github.com/NVlabs/DoRA.\n","authors":["Shih-Yang Liu","Chien-Yi Wang","Hongxu Yin","Pavlo Molchanov","Yu-Chiang Frank Wang","Kwang-Ting Cheng","Min-Hung Chen"],"pdf_url":"https://arxiv.org/pdf/2402.09353v5.pdf","comment":"Code available at https://github.com/NVlabs/DoRA"},{"id":"http://arxiv.org/abs/2401.04679v7","updated":"2024-06-03T06:59:31Z","published":"2024-01-09T17:09:01Z","title":"RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation","summary":" We investigate parameter-efficient fine-tuning (PEFT) methods that can\nprovide good accuracy under limited computational and memory budgets in the\ncontext of large language models (LLMs). We present a new PEFT method called\nRobust Adaptation (RoSA) inspired by robust principal component analysis that\njointly trains $\\textit{low-rank}$ and $\\textit{highly-sparse}$ components on\ntop of a set of fixed pretrained weights to efficiently approximate the\nperformance of a full-fine-tuning (FFT) solution. Across a series of\nchallenging generative tasks such as grade-school math and SQL query\ngeneration, which require fine-tuning for good performance, we show that RoSA\noutperforms LoRA, pure sparse fine-tuning, and alternative hybrid methods at\nthe same parameter budget, and can even recover the performance of FFT on some\ntasks. We provide system support for RoSA to complement the training algorithm,\nspecifically in the form of sparse GPU kernels which enable memory- and\ncomputationally-efficient training, and show that it is also compatible with\nlow-precision base weights, resulting in the first joint representation\ncombining quantization, low-rank and sparse approximations. Our code is\navailable at https://github.com/IST-DASLab/RoSA.\n","authors":["Mahdi Nikdan","Soroush Tabesh","Elvir Crnčević","Dan Alistarh"],"pdf_url":"https://arxiv.org/pdf/2401.04679v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.18018v4","updated":"2024-06-03T06:52:58Z","published":"2024-01-31T17:28:24Z","title":"On Prompt-Driven Safeguarding for Large Language Models","summary":" Prepending model inputs with safety prompts is a common practice for\nsafeguarding large language models (LLMs) against queries with harmful intents.\nHowever, the underlying working mechanisms of safety prompts have not been\nunraveled yet, restricting the possibility of automatically optimizing them to\nimprove LLM safety. In this work, we investigate how LLMs' behavior (i.e.,\ncomplying with or refusing user queries) is affected by safety prompts from the\nperspective of model representation. We find that in the representation space,\nthe input queries are typically moved by safety prompts in a \"higher-refusal\"\ndirection, in which models become more prone to refusing to provide assistance,\neven when the queries are harmless. On the other hand, LLMs are naturally\ncapable of distinguishing harmful and harmless queries without safety prompts.\nInspired by these findings, we propose a method for safety prompt optimization,\nnamely DRO (Directed Representation Optimization). Treating a safety prompt as\ncontinuous, trainable embeddings, DRO learns to move the queries'\nrepresentations along or opposite the refusal direction, depending on their\nharmfulness. Experiments with eight LLMs on out-of-domain and jailbreak\nbenchmarks demonstrate that DRO remarkably improves the safeguarding\nperformance of human-crafted safety prompts, without compromising the models'\ngeneral performance.\n","authors":["Chujie Zheng","Fan Yin","Hao Zhou","Fandong Meng","Jie Zhou","Kai-Wei Chang","Minlie Huang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2401.18018v4.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2401.04514v2","updated":"2024-06-03T06:50:26Z","published":"2024-01-09T12:12:50Z","title":"Rewriting the Code: A Simple Method for Large Language Model Augmented\n Code Search","summary":" In code search, the Generation-Augmented Retrieval (GAR) framework, which\ngenerates exemplar code snippets to augment queries, has emerged as a promising\nstrategy to address the principal challenge of modality misalignment between\ncode snippets and natural language queries, particularly with the demonstrated\ncode generation capabilities of Large Language Models (LLMs). Nevertheless, our\npreliminary investigations indicate that the improvements conferred by such an\nLLM-augmented framework are somewhat constrained. This limitation could\npotentially be ascribed to the fact that the generated codes, albeit\nfunctionally accurate, frequently display a pronounced stylistic deviation from\nthe ground truth code in the codebase. In this paper, we extend the\nfoundational GAR framework and propose a simple yet effective method that\nadditionally Rewrites the Code (ReCo) within the codebase for style\nnormalization. Experimental results demonstrate that ReCo significantly boosts\nretrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%),\nand fine-tuned dense (up to 23.6%) retrieval settings in diverse search\nscenarios. To further elucidate the advantages of ReCo and stimulate research\nin code style normalization, we introduce Code Style Similarity, the first\nmetric tailored to quantify stylistic similarities in code. Notably, our\nempirical findings reveal the inadequacy of existing metrics in capturing\nstylistic nuances. The source code and data are available at\n\\url{https://github.com/Alex-HaochenLi/ReCo}.\n","authors":["Haochen Li","Xin Zhou","Zhiqi Shen"],"pdf_url":"https://arxiv.org/pdf/2401.04514v2.pdf","comment":"Accepted to ACL 2024"},{"id":"http://arxiv.org/abs/2402.14526v2","updated":"2024-06-03T06:48:34Z","published":"2024-02-22T13:20:53Z","title":"Balanced Data Sampling for Language Model Training with Clustering","summary":" Data plays a fundamental role in the training of Large Language Models\n(LLMs). While attention has been paid to the collection and composition of\ndatasets, determining the data sampling strategy in training remains an open\nquestion. Most LLMs are trained with a simple strategy, random sampling.\nHowever, this sampling strategy ignores the unbalanced nature of training data\ndistribution, which can be sub-optimal. In this paper, we propose ClusterClip\nSampling to balance the text distribution of training data for better model\ntraining. Specifically, ClusterClip Sampling utilizes data clustering to\nreflect the data distribution of the training set and balances the common\nsamples and rare samples during training based on the cluster results. A\nrepetition clip operation is introduced to mitigate the overfitting issue led\nby samples from certain clusters. Extensive experiments validate the\neffectiveness of ClusterClip Sampling, which outperforms random sampling and\nother cluster-based sampling variants under various training datasets and large\nlanguage models.\n","authors":["Yunfan Shao","Linyang Li","Zhaoye Fei","Hang Yan","Dahua Lin","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2402.14526v2.pdf","comment":"ACL 2024 (findings), Code is released at\n https://github.com/choosewhatulike/cluster-clip"},{"id":"http://arxiv.org/abs/2405.10738v2","updated":"2024-06-03T06:42:32Z","published":"2024-05-17T12:32:53Z","title":"Feature-Adaptive and Data-Scalable In-Context Learning","summary":" In-context learning (ICL), which promotes inference with several\ndemonstrations, has become a widespread paradigm to stimulate LLM capabilities\nfor downstream tasks. Due to context length constraints, it cannot be further\nimproved in spite of more training data, and general features directly from\nLLMs in ICL are not adaptive to the specific downstream task. In this paper, we\npropose a feature-adaptive and data-scalable in-context learning framework\n(FADS-ICL), which can leverage task-adaptive features to promote inference on\nthe downstream task, with the supervision of beyond-context samples.\nSpecifically, it first extracts general features of beyond-context samples via\nthe LLM with ICL input form one by one, and introduces a task-specific\nmodulator to perform feature refinement and prediction after fitting a specific\ndownstream task. We conduct extensive experiments on FADS-ICL under varying\ndata settings (4$\\sim$128 shots) and LLM scale (0.8$\\sim$70B) settings.\nExperimental results show that FADS-ICL consistently outperforms previous\nstate-of-the-art methods by a significant margin under all settings, verifying\nthe effectiveness and superiority of FADS-ICL. For example, under the 1.5B and\n32 shots setting, FADS-ICL can achieve \\textbf{+14.3} average accuracy from\nfeature adaptation over vanilla ICL on 10 datasets, with \\textbf{+6.2} average\naccuracy over the previous state-of-the-art method, and the performance can\nfurther improve with increasing training data. Code and data are publicly\navailable at \\url{https://github.com/jiahaozhenbang/FADS-ICL}.\n","authors":["Jiahao Li","Quan Wang","Licheng Zhang","Guoqing Jin","Zhendong Mao"],"pdf_url":"https://arxiv.org/pdf/2405.10738v2.pdf","comment":"Accepted at ACL 2024 main conference"},{"id":"http://arxiv.org/abs/2405.20680v2","updated":"2024-06-03T06:20:18Z","published":"2024-05-31T08:22:49Z","title":"Unraveling and Mitigating Retriever Inconsistencies in\n Retrieval-Augmented Large Language Models","summary":" Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their\nsuperiority in terms of factuality, they do not consistently outperform the\noriginal retrieval-free Language Models (LMs). Our experiments reveal that this\nexample-level performance inconsistency exists not only between\nretrieval-augmented and retrieval-free LM but also among different retrievers.\nTo understand this phenomenon, we investigate the degeneration behavior of\nRALMs and theoretically decompose it into four categories. Further analysis\nbased on our decomposition reveals that the innate difference in knowledge\nsources and the unpredictable degeneration of the reader model contribute most\nto the inconsistency. Drawing from our analysis, we introduce Ensemble of\nRetrievers (EoR), a trainable framework that can adaptively retrieve from\ndifferent knowledge sources and effectively decrease unpredictable reader\nerrors. Our experiments on Open Domain Question Answering show that EoR\nsubstantially improves performance over the RALM with a single retriever by\nconsiderably reducing inconsistent behaviors.\n","authors":["Mingda Li","Xinyu Li","Yifan Chen","Wenfeng Xuan","Weinan Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.20680v2.pdf","comment":"ACL 2024 (findings)"},{"id":"http://arxiv.org/abs/2311.09071v2","updated":"2024-06-03T06:11:06Z","published":"2023-11-15T16:13:14Z","title":"How Vocabulary Sharing Facilitates Multilingualism in LLaMA?","summary":" Large Language Models (LLMs), often show strong performance on English tasks,\nwhile exhibiting limitations on other languages. What is an LLM's multilingual\ncapability when it is trained only on certain languages? The underlying\nmechanism remains unclear. This study endeavors to examine the multilingual\ncapability of LLMs from the vocabulary sharing perspective by conducting an\nexhaustive analysis across 101 languages. Through the investigation of the\nperformance gap before and after embedding fine-tuning, we discovered four\ndistinct quadrants. By delving into each quadrant we provide actionable and\nefficient guidelines for tuning these languages. Extensive experiments reveal\nthat existing LLMs possess multilingual capabilities that surpass our\nexpectations, and we can significantly improve the multilingual performance of\nLLMs based on these attributes of each\nquadrant~\\footnote{\\url{https://github.com/CONE-MT/Vocabulary-Sharing-Facilitates-Multilingualism}.}.\n","authors":["Fei Yuan","Shuai Yuan","Zhiyong Wu","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2311.09071v2.pdf","comment":"ACL-2024 Findings"},{"id":"http://arxiv.org/abs/2402.15043v2","updated":"2024-06-03T06:02:39Z","published":"2024-02-23T01:30:39Z","title":"KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large\n Language Models","summary":" Automatic evaluation methods for large language models (LLMs) are hindered by\ndata contamination, leading to inflated assessments of their effectiveness.\nExisting strategies, which aim to detect contaminated texts, focus on\nquantifying contamination status instead of accurately gauging model\nperformance. In this paper, we introduce KIEval, a Knowledge-grounded\nInteractive Evaluation framework, which incorporates an LLM-powered\n\"interactor\" role for the first time to accomplish a dynamic\ncontamination-resilient evaluation. Starting with a question in a conventional\nLLM benchmark involving domain-specific knowledge, KIEval utilizes dynamically\ngenerated, multi-round, and knowledge-focused dialogues to determine whether a\nmodel's response is merely a recall of benchmark answers or demonstrates a deep\ncomprehension to apply knowledge in more complex conversations. Extensive\nexperiments on seven leading LLMs across five datasets validate KIEval's\neffectiveness and generalization. We also reveal that data contamination brings\nno contribution or even negative effect to models' real-world applicability and\nunderstanding, and existing contamination detection methods for LLMs can only\nidentify contamination in pre-training but not during supervised fine-tuning.\n","authors":["Zhuohao Yu","Chang Gao","Wenjin Yao","Yidong Wang","Wei Ye","Jindong Wang","Xing Xie","Yue Zhang","Shikun Zhang"],"pdf_url":"https://arxiv.org/pdf/2402.15043v2.pdf","comment":"Accepted to ACL 2024 (main conference); 19 pages, 5 figures, 19\n tables, code is available at: https://github.com/zhuohaoyu/KIEval"},{"id":"http://arxiv.org/abs/2404.03653v2","updated":"2024-06-03T06:02:34Z","published":"2024-04-04T17:59:46Z","title":"CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept\n Matching","summary":" Diffusion models have demonstrated great success in the field of\ntext-to-image generation. However, alleviating the misalignment between the\ntext prompts and images is still challenging. The root reason behind the\nmisalignment has not been extensively investigated. We observe that the\nmisalignment is caused by inadequate token attention activation. We further\nattribute this phenomenon to the diffusion model's insufficient condition\nutilization, which is caused by its training paradigm. To address the issue, we\npropose CoMat, an end-to-end diffusion model fine-tuning strategy with an\nimage-to-text concept matching mechanism. We leverage an image captioning model\nto measure image-to-text alignment and guide the diffusion model to revisit\nignored tokens. A novel attribute concentration module is also proposed to\naddress the attribute binding problem. Without any image or human preference\ndata, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL.\nExtensive experiments show that CoMat-SDXL significantly outperforms the\nbaseline model SDXL in two text-to-image alignment benchmarks and achieves\nstart-of-the-art performance.\n","authors":["Dongzhi Jiang","Guanglu Song","Xiaoshi Wu","Renrui Zhang","Dazhong Shen","Zhuofan Zong","Yu Liu","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2404.03653v2.pdf","comment":"Project Page: https://caraj7.github.io/comat"},{"id":"http://arxiv.org/abs/2402.11078v3","updated":"2024-06-03T05:39:10Z","published":"2024-02-16T21:10:33Z","title":"Model Editing by Standard Fine-Tuning","summary":" Standard fine-tuning is considered not as effective as specialized methods\nfor model editing due to its comparatively poor performance. However, it is\nsimple, agnostic to the architectural details of the model being edited, and\nable to leverage advances in standard training techniques with no additional\nwork (e.g., black-box PEFT for computational efficiency), making it an\nappealing choice for a model editor. In this work, we show that standard\nfine-tuning alone can yield competitive model editing performance with two\nminor modifications. First, we optimize the conditional likelihood rather than\nthe full likelihood. Second, in addition to the typical practice of training on\nrandomly paraphrased edit prompts to encourage generalization, we also train on\nrandom or similar unedited facts to encourage locality. Our experiments on the\nZsRE and CounterFact datasets demonstrate that these simple modifications allow\nstandard fine-tuning to match or outperform highly specialized editors in terms\nof edit score.\n","authors":["Govind Gangadhar","Karl Stratos"],"pdf_url":"https://arxiv.org/pdf/2402.11078v3.pdf","comment":"Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2402.11281v2","updated":"2024-06-03T04:53:00Z","published":"2024-02-17T13:41:44Z","title":"Can Large Multimodal Models Uncover Deep Semantics Behind Images?","summary":" Understanding the deep semantics of images is essential in the era dominated\nby social media. However, current research works primarily on the superficial\ndescription of images, revealing a notable deficiency in the systematic\ninvestigation of the inherent deep semantics. In this work, we introduce\nDEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' (LMMs)\ncapacities of visual deep semantics. DEEPEVAL includes human-annotated dataset\nand three progressive subtasks: fine-grained description selection, in-depth\ntitle matching, and deep semantics understanding. Utilizing DEEPEVAL, we\nevaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a\nsubstantial gap between the deep semantic comprehension capabilities of\nexisting LMMs and humans. For example, GPT-4V is 30% behind humans in\nunderstanding deep semantics, even though it achieves human-comparable\nperformance in image description. Further analysis reveals that LMM performance\non DEEPEVAL varies according to the specific facets of deep semantics explored,\nindicating the fundamental challenges remaining in developing LMMs.\n","authors":["Yixin Yang","Zheng Li","Qingxiu Dong","Heming Xia","Zhifang Sui"],"pdf_url":"https://arxiv.org/pdf/2402.11281v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08540v5","updated":"2024-06-03T04:18:11Z","published":"2023-10-12T17:32:09Z","title":"Do pretrained Transformers Learn In-Context by Gradient Descent?","summary":" The emergence of In-Context Learning (ICL) in LLMs remains a remarkable\nphenomenon that is partially understood. To explain ICL, recent studies have\ncreated theoretical connections to Gradient Descent (GD). We ask, do such\nconnections hold up in actual pre-trained language models? We highlight the\nlimiting assumptions in prior works that make their setup considerably\ndifferent from the practical setup in which language models are trained. For\nexample, their experimental verification uses \\emph{ICL objective} (training\nmodels explicitly for ICL), which differs from the emergent ICL in the wild.\nFurthermore, the theoretical hand-constructed weights used in these studies\nhave properties that don't match those of real LLMs. We also look for evidence\nin real models. We observe that ICL and GD have different sensitivity to the\norder in which they observe demonstrations. Finally, we probe and compare the\nICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical\nanalyses on language models pre-trained on natural data (LLaMa-7B). Our\ncomparisons of three performance metrics highlight the inconsistent behavior of\nICL and GD as a function of various factors such as datasets, models, and the\nnumber of demonstrations. We observe that ICL and GD modify the output\ndistribution of language models differently. These results indicate that\n\\emph{the equivalence between ICL and GD remains an open hypothesis} and calls\nfor further studies.\n","authors":["Lingfeng Shen","Aayush Mishra","Daniel Khashabi"],"pdf_url":"https://arxiv.org/pdf/2310.08540v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12754v2","updated":"2024-06-03T03:17:01Z","published":"2023-12-20T04:27:13Z","title":"Spectral Prompt Tuning:Unveiling Unseen Classes for Zero-Shot Semantic\n Segmentation","summary":" Recently, CLIP has found practical utility in the domain of pixel-level\nzero-shot segmentation tasks. The present landscape features two-stage\nmethodologies beset by issues such as intricate pipelines and elevated\ncomputational costs. While current one-stage approaches alleviate these\nconcerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's\ngeneralization capacity, they still fall short in fully harnessing CLIP's\npotential for pixel-level unseen class demarcation and precise pixel\npredictions. To further stimulate CLIP's zero-shot dense prediction capability,\nwe propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from\nimage to pixel. Specifically, we initially introduce Spectral Prompt Tuning\n(SPT), incorporating spectral prompts into the CLIP visual encoder's shallow\nlayers to capture structural intricacies of images, thereby enhancing\ncomprehension of unseen classes. Subsequently, we introduce the Spectral Guided\nDecoder (SGD), utilizing both high and low-frequency information to steer the\nnetwork's spatial focus towards more prominent classification features,\nenabling precise pixel-level prediction outcomes. Through extensive experiments\non two public datasets, we demonstrate the superiority of our method over\nstate-of-the-art approaches, performing well across all classes and\nparticularly excelling in handling unseen classes. Code is available\nat:https://github.com/clearxu/SPT.\n","authors":["Wenhao Xu","Rongtao Xu","Changwei Wang","Shibiao Xu","Li Guo","Man Zhang","Xiaopeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12754v2.pdf","comment":"AAAI2024 Accepted"},{"id":"http://arxiv.org/abs/2311.09154v3","updated":"2024-06-03T03:06:55Z","published":"2023-11-15T17:50:30Z","title":"CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models","summary":" We are currently in an era of fierce competition among various large language\nmodels (LLMs) continuously pushing the boundaries of benchmark performance.\nHowever, genuinely assessing the capabilities of these LLMs has become a\nchallenging and critical issue due to potential data contamination, and it\nwastes dozens of time and effort for researchers and engineers to download and\ntry those contaminated models. To save our precious time, we propose a novel\nand useful method, Clean-Eval, which mitigates the issue of data contamination\nand evaluates the LLMs in a cleaner manner. Clean-Eval employs an LLM to\nparaphrase and back-translate the contaminated data into a candidate set,\ngenerating expressions with the same meaning but in different surface forms. A\nsemantic detector is then used to filter the generated low-quality samples to\nnarrow down this candidate set. The best candidate is finally selected from\nthis set based on the BLEURT score. According to human assessment, this best\ncandidate is semantically similar to the original contamination data but\nexpressed differently. All candidates can form a new benchmark to evaluate the\nmodel. Our experiments illustrate that Clean-Eval substantially restores the\nactual evaluation results on contaminated LLMs under both few-shot learning and\nfine-tuning scenarios.\n","authors":["Wenhong Zhu","Hongkun Hao","Zhiwei He","Yunze Song","Yumeng Zhang","Hanxu Hu","Yiran Wei","Rui Wang","Hongyuan Lu"],"pdf_url":"https://arxiv.org/pdf/2311.09154v3.pdf","comment":"NAACL2024(findings)"},{"id":"http://arxiv.org/abs/2402.18223v2","updated":"2024-06-03T03:02:44Z","published":"2024-02-28T10:38:21Z","title":"Improving Open-Ended Text Generation via Adaptive Decoding","summary":" Current language models decode text token by token according to probabilistic\ndistribution, and determining the appropriate candidates for the next token is\ncrucial to ensure generation quality. This study introduces adaptive decoding,\na mechanism that dynamically empowers language models to ascertain a sensible\ncandidate set during generation. Specifically, we introduce an entropy-based\nmetric called confidence and conceptualize determining the optimal candidate\nset as a confidence-increasing process. The rationality of including a token in\nthe candidate set is assessed by leveraging the increment of confidence.\nExperimental results reveal that our method balances diversity and coherence\nwell. The human evaluation shows that our method can generate human-preferred\ntext. Additionally, our method can potentially improve the reasoning ability of\nlanguage models.\n","authors":["Wenhong Zhu","Hongkun Hao","Zhiwei He","Yiming Ai","Rui Wang"],"pdf_url":"https://arxiv.org/pdf/2402.18223v2.pdf","comment":"ICML2024"},{"id":"http://arxiv.org/abs/2312.14591v3","updated":"2024-06-03T02:24:38Z","published":"2023-12-22T10:29:43Z","title":"Reasons to Reject? Aligning Language Models with Judgments","summary":" As humans, we consistently interact with our peers and receive feedback in\nthe form of natural language. This language feedback allows us to maintain\nappropriate behavior, and rectify potential errors. The question arises\nnaturally: can we use language feedback to align large language models (LLMs)?\nIn contrast to previous research that aligns LLMs with scalar rewards, we\npresent the first systematic exploration of alignment through the lens of\nlanguage feedback (i.e., judgment). We start with an in-depth investigation of\npotential methods that can be adapted for aligning LLMs with judgments,\nrevealing that these methods cannot fully capitalize on judgments. To\nfacilitate more effective utilization of judgments, we propose a novel\nframework, Contrastive Unlikelihood Training (CUT), that allows for\nfine-grained inappropriate content detection and correction based on judgments.\nOur results show that, with merely 1317 off-the-shelf judgment data, CUT\n(LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by\n50.84 points on AlpacaEval. CUT (LLaMA2-chat-13b) can also align LLMs in an\niterative fashion using up-to-date model-specific judgments, improving\nperformance from 81.09 to 91.68 points on AlpacaEval. Further analysis suggests\nthat judgments hold greater potential than rewards in LLM alignment.\n","authors":["Weiwen Xu","Deng Cai","Zhisong Zhang","Wai Lam","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2312.14591v3.pdf","comment":"Accepted at ACL 2024 Findings. Our source codes and models are\n publicly available at https://github.com/wwxu21/CUT"},{"id":"http://arxiv.org/abs/2402.04411v2","updated":"2024-06-03T01:40:46Z","published":"2024-02-06T21:14:45Z","title":"DFA-RAG: Conversational Semantic Router for Large Language Model with\n Definite Finite Automaton","summary":" This paper introduces the retrieval-augmented large language model with\nDefinite Finite Automaton (DFA-RAG), a novel framework designed to enhance the\ncapabilities of conversational agents using large language models (LLMs).\nTraditional LLMs face challenges in generating regulated and compliant\nresponses in special scenarios with predetermined response guidelines, like\nemotional support and customer service. Our framework addresses these\nchallenges by embedding a Definite Finite Automaton (DFA), learned from\ntraining dialogues, within the LLM. This structured approach acts as a semantic\nrouter which enables the LLM to adhere to a deterministic response pathway. The\nrouting is achieved by the retrieval-augmentation generation (RAG) strategy,\nwhich carefully selects dialogue examples aligned with the current\nconversational context. The advantages of DFA-RAG include an interpretable\nstructure through human-readable DFA, context-aware retrieval for responses in\nconversations, and plug-and-play compatibility with existing LLMs. Extensive\nbenchmarks validate DFA-RAG's effectiveness, indicating its potential as a\nvaluable contribution to the conversational agent.\n","authors":["Yiyou Sun","Junjie Hu","Wei Cheng","Haifeng Chen"],"pdf_url":"https://arxiv.org/pdf/2402.04411v2.pdf","comment":"Accepted to ICML 2024"},{"id":"http://arxiv.org/abs/2401.08417v4","updated":"2024-06-03T01:28:06Z","published":"2024-01-16T15:04:51Z","title":"Contrastive Preference Optimization: Pushing the Boundaries of LLM\n Performance in Machine Translation","summary":" Moderate-sized large language models (LLMs) -- those with 7B or 13B\nparameters -- exhibit promising machine translation (MT) performance. However,\neven the top-performing 13B LLM-based translation models, like ALMA, does not\nmatch the performance of state-of-the-art conventional encoder-decoder\ntranslation models or larger-scale LLMs such as GPT-4. In this study, we bridge\nthis performance gap. We first assess the shortcomings of supervised\nfine-tuning for LLMs in the MT task, emphasizing the quality issues present in\nthe reference data, despite being human-generated. Then, in contrast to SFT\nwhich mimics reference translations, we introduce Contrastive Preference\nOptimization (CPO), a novel approach that trains models to avoid generating\nadequate but not perfect translations. Applying CPO to ALMA models with only\n22K parallel sentences and 12M parameters yields significant improvements. The\nresulting model, called ALMA-R, can match or exceed the performance of the WMT\ncompetition winners and GPT-4 on WMT'21, WMT'22 and WMT'23 test datasets.\n","authors":["Haoran Xu","Amr Sharaf","Yunmo Chen","Weiting Tan","Lingfeng Shen","Benjamin Van Durme","Kenton Murray","Young Jin Kim"],"pdf_url":"https://arxiv.org/pdf/2401.08417v4.pdf","comment":"Accepted at ICML 2024"},{"id":"http://arxiv.org/abs/2404.18239v3","updated":"2024-06-03T01:10:53Z","published":"2024-04-28T16:31:32Z","title":"SOUL: Unlocking the Power of Second-Order Optimization for LLM\n Unlearning","summary":" Large Language Models (LLMs) have highlighted the necessity of effective\nunlearning mechanisms to comply with data regulations and ethical AI practices.\nLLM unlearning aims at removing undesired data influences and associated model\ncapabilities without compromising utility out of the scope of unlearning. While\ninterest in studying LLM unlearning is growing,the impact of the optimizer\nchoice for LLM unlearning remains under-explored. In this work, we shed light\non the significance of optimizer selection in LLM unlearning for the first\ntime, establishing a clear connection between {second-order optimization} and\ninfluence unlearning (a classical approach using influence functions to update\nthe model for data influence removal). This insight propels us to develop a\nsecond-order unlearning framework, termed SOUL, built upon the second-order\nclipped stochastic optimization (Sophia)-based LLM training method. SOUL\nextends the static, one-shot model update using influence unlearning to a\ndynamic, iterative unlearning process. Our extensive experiments show that SOUL\nconsistently outperforms conventional first-order methods across various\nunlearning tasks, models, and metrics, suggesting the promise of second-order\noptimization in providing a scalable and easily implementable solution for LLM\nunlearning.\n","authors":["Jinghan Jia","Yihua Zhang","Yimeng Zhang","Jiancheng Liu","Bharat Runwal","James Diffenderfer","Bhavya Kailkhura","Sijia Liu"],"pdf_url":"https://arxiv.org/pdf/2404.18239v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.11058v3","updated":"2024-06-03T01:09:38Z","published":"2024-02-16T20:14:47Z","title":"II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in\n Visual Question Answering","summary":" Visual Question Answering (VQA) often involves diverse reasoning scenarios\nacross Vision and Language (V&L). Most prior VQA studies, however, have merely\nfocused on assessing the model's overall accuracy without evaluating it on\ndifferent reasoning cases. Furthermore, some recent works observe that\nconventional Chain-of-Thought (CoT) prompting fails to generate effective\nreasoning for VQA, especially for complex scenarios requiring multi-hop\nreasoning. In this paper, we propose II-MMR, a novel idea to identify and\nimprove multi-modal multi-hop reasoning in VQA. In specific, II-MMR takes a VQA\nquestion with an image and finds a reasoning path to reach its answer using two\nnovel language promptings: (i) answer prediction-guided CoT prompt, or (ii)\nknowledge triplet-guided prompt. II-MMR then analyzes this path to identify\ndifferent reasoning cases in current VQA benchmarks by estimating how many hops\nand what types (i.e., visual or beyond-visual) of reasoning are required to\nanswer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR\nobserves that most of their VQA questions are easy to answer, simply demanding\n\"single-hop\" reasoning, whereas only a few questions require \"multi-hop\"\nreasoning. Moreover, while the recent V&L model struggles with such complex\nmulti-hop reasoning questions even using the traditional CoT method, II-MMR\nshows its effectiveness across all reasoning cases in both zero-shot and\nfine-tuning settings.\n","authors":["Jihyung Kil","Farideh Tavazoee","Dongyeop Kang","Joo-Kyung Kim"],"pdf_url":"https://arxiv.org/pdf/2402.11058v3.pdf","comment":"Accepted to ACL 2024 Findings"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2312.00700v2","updated":"2024-06-03T17:57:39Z","published":"2023-12-01T16:33:57Z","title":"GIFT: Generative Interpretable Fine-Tuning","summary":" We present Generative Interpretable Fine-Tuning (GIFT) for\nparameter-efficient fine-tuning of pretrained Transformer backbones, which can\nbe formulated as a simple factorized matrix multiplication in the parameter\nspace or equivalently in the activation space, and thus embraces built-in\ninterpretability. For a pretrained layer with weights $\\omega\\in\n\\mathbb{R}^{d_{out}\\times d_{in}}$, our proposed GIFT learns the fine-tuned\nweights $\\hat{\\omega}$ directly from $\\omega$ as $\\hat{\\omega}=\\omega \\cdot\n(\\mathbb{I}+\\phi_{d_{in}\\times r}\\cdot \\psi_{r\\times d_{in}})$ where\n$\\mathbb{I}$ is an identity matrix. $\\Theta=(\\phi, \\psi)$ are the learnable\nparameters of the two linear layers of GIFT with $r$ being a hyper-parameter.\n$\\Theta$ is shared by all the layers selected for fine-tuning, resulting in\nsignificantly fewer trainable parameters compared to Low-Rank Adaptation\n(LoRA). We perform comprehensive evaluations on natural language tasks\n(commonsense reasoning and sequence classification) and computer vision tasks\n(visual fine-grained classification). We obtain the best accuracy and parameter\nefficiency among baselines both on the Commonsense170k reasoning benchmark\nusing LLaMA-1 (7B) and Llama-2 (7B)/-3 (8B) and on the FGVC and VTAB visual\nrecognition benchmarks using ImageNet-21k pretrained Vision Transformer\n(ViT-B/16). Notably, we obtain 5.9% absolute increase in average accuracy with\n53.8 times reduction of parameters on Commonsense170k using Llama-3 (8B)\ncompared to LoRA. We obtain performance comparable to LoRA on the GLUE\nbenchmark but with significantly fewer parameters using RoBERTa-Base/Large. We\nshow the output of the first linear layer (i.e., $\\omega\\cdot \\phi$) is\nsurprisingly interpretable, which can play the role of a token-clustering head\nas a by-product to localize meaningful objects/parts in images for computer\nvision tasks. Our code is publicly available.\n","authors":["Chinmay Savadikar","Xi Song","Tianfu Wu"],"pdf_url":"https://arxiv.org/pdf/2312.00700v2.pdf","comment":"Project page and code: https://savadikarc.github.io/gift"},{"id":"http://arxiv.org/abs/2403.19780v2","updated":"2024-06-03T17:56:14Z","published":"2024-03-28T19:06:37Z","title":"Mitigating Motion Blur in Neural Radiance Fields with Events and Frames","summary":" Neural Radiance Fields (NeRFs) have shown great potential in novel view\nsynthesis. However, they struggle to render sharp images when the data used for\ntraining is affected by motion blur. On the other hand, event cameras excel in\ndynamic scenes as they measure brightness changes with microsecond resolution\nand are thus only marginally affected by blur. Recent methods attempt to\nenhance NeRF reconstructions under camera motion by fusing frames and events.\nHowever, they face challenges in recovering accurate color content or constrain\nthe NeRF to a set of predefined camera poses, harming reconstruction quality in\nchallenging conditions. This paper proposes a novel formulation addressing\nthese issues by leveraging both model- and learning-based modules. We\nexplicitly model the blur formation process, exploiting the event double\nintegral as an additional model-based prior. Additionally, we model the\nevent-pixel response using an end-to-end learnable response function, allowing\nour method to adapt to non-idealities in the real event-camera sensor. We show,\non synthetic and real data, that the proposed approach outperforms existing\ndeblur NeRFs that use only frames as well as those that combine frames and\nevents by +6.13dB and +2.48dB, respectively.\n","authors":["Marco Cannici","Davide Scaramuzza"],"pdf_url":"https://arxiv.org/pdf/2403.19780v2.pdf","comment":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\n 2024"},{"id":"http://arxiv.org/abs/2402.10093v2","updated":"2024-06-03T17:51:58Z","published":"2024-02-15T16:46:16Z","title":"MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained\n Representations","summary":" We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning\nboost for pre-trained MIM models. MIM-Refiner is motivated by the insight that\nstrong representations within MIM models generally reside in intermediate\nlayers. Accordingly, MIM-Refiner leverages multiple contrastive heads that are\nconnected to different intermediate layers. In each head, a modified nearest\nneighbor objective constructs semantic clusters that capture semantic\ninformation which improves performance on downstream tasks, including\noff-the-shelf and fine-tuning settings.\n The refinement process is short and simple - yet highly effective. Within a\nfew epochs, we refine the features of MIM models from subpar to\nstate-of-the-art, off-the-shelf features. Refining a ViT-H, pre-trained with\ndata2vec 2.0 on ImageNet-1K, sets a new state-of-the-art in linear probing\n(84.7%) and low-shot classification among models that are pre-trained on\nImageNet-1K. At ImageNet-1K 1-shot classification, MIM-Refiner advances the\nstate-of-the-art to 64.2%, outperforming larger models that were trained on up\nto 2000 times more data such as DINOv2-g, OpenCLIP-G and MAWS-6.5B.\n","authors":["Benedikt Alkin","Lukas Miklautz","Sepp Hochreiter","Johannes Brandstetter"],"pdf_url":"https://arxiv.org/pdf/2402.10093v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10689v3","updated":"2024-06-03T17:50:58Z","published":"2023-09-19T15:23:52Z","title":"ReShader: View-Dependent Highlights for Single Image View-Synthesis","summary":" In recent years, novel view synthesis from a single image has seen\nsignificant progress thanks to the rapid advancements in 3D scene\nrepresentation and image inpainting techniques. While the current approaches\nare able to synthesize geometrically consistent novel views, they often do not\nhandle the view-dependent effects properly. Specifically, the highlights in\ntheir synthesized images usually appear to be glued to the surfaces, making the\nnovel views unrealistic. To address this major problem, we make a key\nobservation that the process of synthesizing novel views requires changing the\nshading of the pixels based on the novel camera, and moving them to appropriate\nlocations. Therefore, we propose to split the view synthesis process into two\nindependent tasks of pixel reshading and relocation. During the reshading\nprocess, we take the single image as the input and adjust its shading based on\nthe novel camera. This reshaded image is then used as the input to an existing\nview synthesis method to relocate the pixels and produce the final novel view\nimage. We propose to use a neural network to perform reshading and generate a\nlarge set of synthetic input-reshaded pairs to train our network. We\ndemonstrate that our approach produces plausible novel view images with\nrealistic moving highlights on a variety of real world scenes.\n","authors":["Avinash Paliwal","Brandon Nguyen","Andrii Tsarov","Nima Khademi Kalantari"],"pdf_url":"https://arxiv.org/pdf/2309.10689v3.pdf","comment":"SIGGRAPH Asia 2023. Project page at\n https://people.engr.tamu.edu/nimak/Papers/SIGAsia2023_Reshader/index.html and\n video at https://www.youtube.com/watch?v=XW-tl48D3Ok"},{"id":"http://arxiv.org/abs/2311.18107v5","updated":"2024-06-03T17:46:49Z","published":"2023-11-29T21:45:33Z","title":"A Stochastic-Geometrical Framework for Object Pose Estimation based on\n Mixture Models Avoiding the Correspondence Problem","summary":" Background: Pose estimation of rigid objects is a practical challenge in\noptical metrology and computer vision. This paper presents a novel\nstochastic-geometrical modeling framework for object pose estimation based on\nobserving multiple feature points.\n Methods: This framework utilizes mixture models for feature point densities\nin object space and for interpreting real measurements. Advantages are the\navoidance to resolve individual feature correspondences and to incorporate\ncorrect stochastic dependencies in multi-view applications. First, the general\nmodeling framework is presented, second, a general algorithm for pose\nestimation is derived, and third, two example models (camera and lateration\nsetup) are presented.\n Results: Numerical experiments show the effectiveness of this modeling and\ngeneral algorithm by presenting four simulation scenarios for three observation\nsystems, including the dependence on measurement resolution, object\ndeformations and measurement noise. Probabilistic modeling utilizing mixture\nmodels shows the potential for accurate and robust pose estimations while\navoiding the correspondence problem.\n","authors":["Wolfgang Hoegele"],"pdf_url":"https://arxiv.org/pdf/2311.18107v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.01016v2","updated":"2024-06-03T17:36:47Z","published":"2024-05-02T05:35:10Z","title":"Addressing Diverging Training Costs using Local Restoration for Precise\n Bird's Eye View Map Construction","summary":" Recent advancements in Bird's Eye View (BEV) fusion for map construction have\ndemonstrated remarkable mapping of urban environments. However, their deep and\nbulky architecture incurs substantial amounts of backpropagation memory and\ncomputing latency. Consequently, the problem poses an unavoidable bottleneck in\nconstructing high-resolution (HR) BEV maps, as their large-sized features cause\nsignificant increases in costs including GPU memory consumption and computing\nlatency, named diverging training costs issue. Affected by the problem, most\nexisting methods adopt low-resolution (LR) BEV and struggle to estimate the\nprecise locations of urban scene components like road lanes, and sidewalks. As\nthe imprecision leads to risky self-driving, the diverging training costs issue\nhas to be resolved. In this paper, we address the issue with our novel Trumpet\nNeural Network (TNN) mechanism. The framework utilizes LR BEV space and outputs\nan up-sampled semantic BEV map to create a memory-efficient pipeline. To this\nend, we introduce Local Restoration of BEV representation. Specifically, the\nup-sampled BEV representation has severely aliased, blocky signals, and thick\nsemantic labels. Our proposed Local Restoration restores the signals and thins\n(or narrows down) the width of the labels. Our extensive experiments show that\nthe TNN mechanism provides a plug-and-play memory-efficient pipeline, thereby\nenabling the effective estimation of real-sized (or precise) semantic labels\nfor BEV map construction.\n","authors":["Minsu Kim","Giseop Kim","Sunwook Choi"],"pdf_url":"https://arxiv.org/pdf/2405.01016v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.02730v2","updated":"2024-06-03T17:14:56Z","published":"2024-05-04T18:27:29Z","title":"U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers","summary":" Diffusion Transformers (DiTs) introduce the transformer architecture to\ndiffusion tasks for latent-space image generation. With an isotropic\narchitecture that chains a series of transformer blocks, DiTs demonstrate\ncompetitive performance and good scalability; but meanwhile, the abandonment of\nU-Net by DiTs and their following improvements is worth rethinking. To this\nend, we conduct a simple toy experiment by comparing a U-Net architectured DiT\nwith an isotropic one. It turns out that the U-Net architecture only gain a\nslight advantage amid the U-Net inductive bias, indicating potential\nredundancies within the U-Net-style DiT. Inspired by the discovery that U-Net\nbackbone features are low-frequency-dominated, we perform token downsampling on\nthe query-key-value tuple for self-attention that bring further improvements\ndespite a considerable amount of reduction in computation. Based on\nself-attention with downsampled tokens, we propose a series of U-shaped DiTs\n(U-DiTs) in the paper and conduct extensive experiments to demonstrate the\nextraordinary performance of U-DiT models. The proposed U-DiT could outperform\nDiT-XL/2 with only 1/6 of its computation cost. Codes are available at\nhttps://github.com/YuchuanTian/U-DiT.\n","authors":["Yuchuan Tian","Zhijun Tu","Hanting Chen","Jie Hu","Chao Xu","Yunhe Wang"],"pdf_url":"https://arxiv.org/pdf/2405.02730v2.pdf","comment":"12 pages, 5 figures"},{"id":"http://arxiv.org/abs/2403.17846v2","updated":"2024-06-03T17:12:25Z","published":"2024-03-26T16:36:43Z","title":"Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot\n Navigation","summary":" Recent open-vocabulary robot mapping methods enrich dense geometric maps with\npre-trained visual-language features. While these maps allow for the prediction\nof point-wise saliency maps when queried for a certain language concept,\nlarge-scale environments and abstract queries beyond the object level still\npose a considerable hurdle, ultimately limiting language-grounded robotic\nnavigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D\nscene graph mapping approach for language-grounded robot navigation. Leveraging\nopen-vocabulary vision foundation models, we first obtain state-of-the-art\nopen-vocabulary segment-level maps in 3D and subsequently construct a 3D scene\ngraph hierarchy consisting of floor, room, and object concepts, each enriched\nwith open-vocabulary features. Our approach is able to represent multi-story\nbuildings and allows robotic traversal of those using a cross-floor Voronoi\ngraph. HOV-SG is evaluated on three distinct datasets and surpasses previous\nbaselines in open-vocabulary semantic accuracy on the object, room, and floor\nlevel while producing a 75% reduction in representation size compared to dense\nopen-vocabulary maps. In order to prove the efficacy and generalization\ncapabilities of HOV-SG, we showcase successful long-horizon\nlanguage-conditioned robot navigation within real-world multi-storage\nenvironments. We provide code and trial video data at http://hovsg.github.io/.\n","authors":["Abdelrhman Werby","Chenguang Huang","Martin Büchner","Abhinav Valada","Wolfram Burgard"],"pdf_url":"https://arxiv.org/pdf/2403.17846v2.pdf","comment":"Code and video are available at http://hovsg.github.io/"},{"id":"http://arxiv.org/abs/2312.14867v2","updated":"2024-06-03T16:59:20Z","published":"2023-12-22T17:45:19Z","title":"VIEScore: Towards Explainable Metrics for Conditional Image Synthesis\n Evaluation","summary":" In the rapidly advancing field of conditional image generation research,\nchallenges such as limited explainability lie in effectively evaluating the\nperformance and capabilities of various models. This paper introduces VIEScore,\na Visual Instruction-guided Explainable metric for evaluating any conditional\nimage generation tasks. VIEScore leverages general knowledge from Multimodal\nLarge Language Models (MLLMs) as the backbone and does not require training or\nfine-tuning. We evaluate VIEScore on seven prominent tasks in conditional image\ntasks and found: (1) VIEScore (GPT4-o) achieves a high Spearman correlation of\n0.4 with human evaluations, while the human-to-human correlation is 0.45. (2)\nVIEScore (with open-source MLLM) is significantly weaker than GPT-4o and GPT-4v\nin evaluating synthetic images. (3) VIEScore achieves a correlation on par with\nhuman ratings in the generation tasks but struggles in editing tasks. With\nthese results, we believe VIEScore shows its great potential to replace human\njudges in evaluating image synthesis tasks.\n","authors":["Max Ku","Dongfu Jiang","Cong Wei","Xiang Yue","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14867v2.pdf","comment":"Accepted to ACL2024 main"},{"id":"http://arxiv.org/abs/2405.16277v3","updated":"2024-06-03T16:42:55Z","published":"2024-05-25T15:28:22Z","title":"Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge","summary":" Large Language Models (LLMs) have demonstrated remarkable success in tasks\nlike the Winograd Schema Challenge (WSC), showcasing advanced textual\ncommon-sense reasoning. However, applying this reasoning to multimodal domains,\nwhere understanding text and images together is essential, remains a\nsubstantial challenge. To address this, we introduce WinoVis, a novel dataset\nspecifically designed to probe text-to-image models on pronoun disambiguation\nwithin multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion\nAttentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel\nevaluation framework that isolates the models' ability in pronoun\ndisambiguation from other visual processing challenges. Evaluation of\nsuccessive model versions reveals that, despite incremental advancements,\nStable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally\nsurpassing random guessing. Further error analysis identifies important areas\nfor future research aimed at advancing text-to-image models in their ability to\ninterpret and interact with the complex visual world.\n","authors":["Brendan Park","Madeline Janecek","Naser Ezzati-Jivan","Yifeng Li","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2405.16277v3.pdf","comment":"9 pages (excluding references), accepted to ACL 2024 Main Conference"},{"id":"http://arxiv.org/abs/2401.01163v3","updated":"2024-06-03T16:09:55Z","published":"2024-01-02T11:46:42Z","title":"NU-Class Net: A Novel Approach for Video Quality Enhancement","summary":" Video content has experienced a surge in popularity, asserting its dominance\nover internet traffic and Internet of Things (IoT) networks. Video compression\nhas long been regarded as the primary means of efficiently managing the\nsubstantial multimedia traffic generated by video-capturing devices.\nNevertheless, video compression algorithms entail significant computational\ndemands in order to achieve substantial compression ratios. This complexity\npresents a formidable challenge when implementing efficient video coding\nstandards in resource-constrained embedded systems, such as IoT edge node\ncameras. To tackle this challenge, this paper introduces NU-Class Net, an\ninnovative deep-learning model designed to mitigate compression artifacts\nstemming from lossy compression codecs. This enhancement significantly elevates\nthe perceptible quality of low-bit-rate videos. By employing the NU-Class Net,\nthe video encoder within the video-capturing node can reduce output quality,\nthereby generating low-bit-rate videos and effectively curtailing both\ncomputation and bandwidth requirements at the edge. On the decoder side, which\nis typically less encumbered by resource limitations, NU-Class Net is applied\nafter the video decoder to compensate for artifacts and approximate the quality\nof the original video. Experimental results affirm the efficacy of the proposed\nmodel in enhancing the perceptible quality of videos, especially those streamed\nat low bit rates.\n","authors":["Parham Zilouchian Moghaddam","Mehdi Modarressi","Mohammad Amin Sadeghi"],"pdf_url":"https://arxiv.org/pdf/2401.01163v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.00479v2","updated":"2024-06-03T15:55:06Z","published":"2023-07-02T05:26:54Z","title":"Domain Transfer Through Image-to-Image Translation for Uncertainty-Aware\n Prostate Cancer Classification","summary":" Prostate Cancer (PCa) is a prevalent disease among men, and multi-parametric\nMRIs offer a non-invasive method for its detection. While MRI-based deep\nlearning solutions have shown promise in supporting PCa diagnosis, acquiring\nsufficient training data, particularly in local clinics remains challenging.\nOne potential solution is to take advantage of publicly available datasets to\npre-train deep models and fine-tune them on the local data, but multi-source\nMRIs can pose challenges due to cross-domain distribution differences. These\nlimitations hinder the adoption of explainable and reliable deep-learning\nsolutions in local clinics for PCa diagnosis. In this work, we present a novel\napproach for unpaired image-to-image translation of prostate multi-parametric\nMRIs and an uncertainty-aware training approach for classifying clinically\nsignificant PCa, to be applied in data-constrained settings such as local and\nsmall clinics. Our approach involves a novel pipeline for translating unpaired\n3.0T multi-parametric prostate MRIs to 1.5T, thereby augmenting the available\ntraining data. Additionally, we introduce an evidential deep learning approach\nto estimate model uncertainty and employ dataset filtering techniques during\ntraining. Furthermore, we propose a simple, yet efficient Evidential Focal\nLoss, combining focal loss with evidential uncertainty, to train our model\neffectively. Our experiments demonstrate that the proposed method significantly\nimproves the Area Under ROC Curve (AUC) by over 20% compared to the previous\nwork. Our code is available at https://github.com/med-i-lab/DT_UE_PCa\n","authors":["Meng Zhou","Amoon Jamzad","Jason Izard","Alexandre Menard","Robert Siemens","Parvin Mousavi"],"pdf_url":"https://arxiv.org/pdf/2307.00479v2.pdf","comment":"Preprint. In Submission"},{"id":"http://arxiv.org/abs/2405.14959v2","updated":"2024-06-03T15:51:49Z","published":"2024-05-23T18:10:26Z","title":"EvGGS: A Collaborative Learning Framework for Event-based Generalizable\n Gaussian Splatting","summary":" Event cameras offer promising advantages such as high dynamic range and low\nlatency, making them well-suited for challenging lighting conditions and\nfast-moving scenarios. However, reconstructing 3D scenes from raw event streams\nis difficult because event data is sparse and does not carry absolute color\ninformation. To release its potential in 3D reconstruction, we propose the\nfirst event-based generalizable 3D reconstruction framework, called EvGGS,\nwhich reconstructs scenes as 3D Gaussians from only event input in a\nfeedforward manner and can generalize to unseen cases without any retraining.\nThis framework includes a depth estimation module, an intensity reconstruction\nmodule, and a Gaussian regression module. These submodules connect in a\ncascading manner, and we collaboratively train them with a designed joint loss\nto make them mutually promote. To facilitate related studies, we build a novel\nevent-based 3D dataset with various material objects and calibrated labels of\ngrayscale images, depth maps, camera poses, and silhouettes. Experiments show\nmodels that have jointly trained significantly outperform those trained\nindividually. Our approach performs better than all baselines in reconstruction\nquality, and depth/intensity predictions with satisfactory rendering speed.\n","authors":["Jiaxu Wang","Junhao He","Ziyi Zhang","Mingyuan Sun","Jingkai Sun","Renjing Xu"],"pdf_url":"https://arxiv.org/pdf/2405.14959v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20310v3","updated":"2024-06-03T15:13:55Z","published":"2024-05-30T17:52:52Z","title":"A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D\n Reconstruction","summary":" Learning 3D scene representation from a single-view image is a long-standing\nfundamental problem in computer vision, with the inherent ambiguity in\npredicting contents unseen from the input view. Built on the recently proposed\n3D Gaussian Splatting (3DGS), the Splatter Image method has made promising\nprogress on fast single-image novel view synthesis via learning a single 3D\nGaussian for each pixel based on the U-Net feature map of an input image.\nHowever, it has limited expressive power to represent occluded components that\nare not observable in the input view. To address this problem, this paper\npresents a Hierarchical Splatter Image method in which a pixel is worth more\nthan one 3D Gaussians. Specifically, each pixel is represented by a parent 3D\nGaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are\nlearned as done in the vanilla Splatter Image. Child 3D Gaussians are learned\nvia a lightweight Multi-Layer Perceptron (MLP) which takes as input the\nprojected image features of a parent 3D Gaussian and the embedding of a target\ncamera view. Both parent and child 3D Gaussians are learned end-to-end in a\nstage-wise way. The joint condition of input image features from eyes of the\nparent Gaussians and the target camera position facilitates learning to\nallocate child Gaussians to ``see the unseen'', recovering the occluded details\nthat are often missed by parent Gaussians.\n In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D\ndatasets with state-of-the-art performance obtained, especially showing\npromising capabilities of reconstructing occluded contents in the input view.\n","authors":["Jianghao Shen","Nan Xue","Tianfu Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20310v3.pdf","comment":"preprint, under review"},{"id":"http://arxiv.org/abs/2402.01516v2","updated":"2024-06-03T14:53:20Z","published":"2024-02-02T15:57:13Z","title":"Cross-view Masked Diffusion Transformers for Person Image Synthesis","summary":" We present X-MDPT ($\\underline{Cross}$-view $\\underline{M}$asked\n$\\underline{D}$iffusion $\\underline{P}$rediction $\\underline{T}$ransformers), a\nnovel diffusion model designed for pose-guided human image generation. X-MDPT\ndistinguishes itself by employing masked diffusion transformers that operate on\nlatent patches, a departure from the commonly-used Unet structures in existing\nworks. The model comprises three key modules: 1) a denoising diffusion\nTransformer, 2) an aggregation network that consolidates conditions into a\nsingle vector for the diffusion process, and 3) a mask cross-prediction module\nthat enhances representation learning with semantic information from the\nreference image. X-MDPT demonstrates scalability, improving FID, SSIM, and\nLPIPS with larger models. Despite its simple design, our model outperforms\nstate-of-the-art approaches on the DeepFashion dataset while exhibiting\nefficiency in terms of training parameters, training time, and inference speed.\nOur compact 33MB model achieves an FID of 7.42, surpassing a prior Unet latent\ndiffusion approach (FID 8.07) using only $11\\times$ fewer parameters. Our best\nmodel surpasses the pixel-based diffusion with $\\frac{2}{3}$ of the parameters\nand achieves $5.43 \\times$ faster inference. The code is available at\nhttps://github.com/trungpx/xmdpt.\n","authors":["Trung X. Pham","Zhang Kang","Chang D. Yoo"],"pdf_url":"https://arxiv.org/pdf/2402.01516v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2312.11538v2","updated":"2024-06-03T14:42:35Z","published":"2023-12-15T22:38:24Z","title":"Iterative Motion Editing with Natural Language","summary":" Text-to-motion diffusion models can generate realistic animations from text\nprompts, but do not support fine-grained motion editing controls. In this\npaper, we present a method for using natural language to iteratively specify\nlocal edits to existing character animations, a task that is common in most\ncomputer animation workflows. Our key idea is to represent a space of motion\nedits using a set of kinematic motion editing operators (MEOs) whose effects on\nthe source motion is well-aligned with user expectations. We provide an\nalgorithm that leverages pre-existing language models to translate textual\ndescriptions of motion edits into source code for programs that define and\nexecute sequences of MEOs on a source animation. We execute MEOs by first\ntranslating them into keyframe constraints, and then use diffusion-based motion\nmodels to generate output motions that respect these constraints. Through a\nuser study and quantitative evaluation, we demonstrate that our system can\nperform motion edits that respect the animator's editing intent, remain\nfaithful to the original animation (it edits the original animation, but does\nnot dramatically change it), and yield realistic character animation results.\n","authors":["Purvi Goel","Kuan-Chieh Wang","C. Karen Liu","Kayvon Fatahalian"],"pdf_url":"https://arxiv.org/pdf/2312.11538v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.04848v4","updated":"2024-06-03T14:18:29Z","published":"2023-06-08T00:56:33Z","title":"Interpreting and Improving Diffusion Models from an Optimization\n Perspective","summary":" Denoising is intuitively related to projection. Indeed, under the manifold\nhypothesis, adding random noise is approximately equivalent to orthogonal\nperturbation. Hence, learning to denoise is approximately learning to project.\nIn this paper, we use this observation to interpret denoising diffusion models\nas approximate gradient descent applied to the Euclidean distance function. We\nthen provide straight-forward convergence analysis of the DDIM sampler under\nsimple assumptions on the projection error of the denoiser. Finally, we propose\na new gradient-estimation sampler, generalizing DDIM using insights from our\ntheoretical results. In as few as 5-10 function evaluations, our sampler\nachieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models\nand can generate high quality samples on latent diffusion models.\n","authors":["Frank Permenter","Chenyang Yuan"],"pdf_url":"https://arxiv.org/pdf/2306.04848v4.pdf","comment":"24 pages, 9 figures, 4 tables. To appear in ICML 2024"},{"id":"http://arxiv.org/abs/2402.08567v2","updated":"2024-06-03T14:15:03Z","published":"2024-02-13T16:06:17Z","title":"Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM\n Agents Exponentially Fast","summary":" A multimodal large language model (MLLM) agent can receive instructions,\ncapture images, retrieve histories from memory, and decide which tools to use.\nNonetheless, red-teaming efforts have revealed that adversarial images/prompts\ncan jailbreak an MLLM and cause unaligned behaviors. In this work, we report an\neven more severe safety issue in multi-agent environments, referred to as\ninfectious jailbreak. It entails the adversary simply jailbreaking a single\nagent, and without any further intervention from the adversary, (almost) all\nagents will become infected exponentially fast and exhibit harmful behaviors.\nTo validate the feasibility of infectious jailbreak, we simulate multi-agent\nenvironments containing up to one million LLaVA-1.5 agents, and employ\nrandomized pair-wise chat as a proof-of-concept instantiation for multi-agent\ninteraction. Our results show that feeding an (infectious) adversarial image\ninto the memory of any randomly chosen agent is sufficient to achieve\ninfectious jailbreak. Finally, we derive a simple principle for determining\nwhether a defense mechanism can provably restrain the spread of infectious\njailbreak, but how to design a practical defense that meets this principle\nremains an open question to investigate. Our project page is available at\nhttps://sail-sg.github.io/Agent-Smith/.\n","authors":["Xiangming Gu","Xiaosen Zheng","Tianyu Pang","Chao Du","Qian Liu","Ye Wang","Jing Jiang","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2402.08567v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2402.04050v2","updated":"2024-06-03T13:22:12Z","published":"2024-02-06T14:53:19Z","title":"Connecting the Dots: Collaborative Fine-tuning for Black-Box\n Vision-Language Models","summary":" With the emergence of pretrained vision-language models (VLMs), considerable\nefforts have been devoted to fine-tuning them for downstream tasks. Despite the\nprogress made in designing efficient fine-tuning methods, such methods require\naccess to the model's parameters, which can be challenging as model owners\noften opt to provide their models as a black box to safeguard model ownership.\nThis paper proposes a \\textbf{C}ollabo\\textbf{ra}tive\n\\textbf{F}ine-\\textbf{T}uning (\\textbf{CraFT}) approach for fine-tuning\nblack-box VLMs to downstream tasks, where one only has access to the input\nprompts and the output predictions of the model. CraFT comprises two modules, a\nprompt generation module for learning text prompts and a prediction refinement\nmodule for enhancing output predictions in residual style. Additionally, we\nintroduce an auxiliary prediction-consistent loss to promote consistent\noptimization across these modules. These modules are optimized by a novel\ncollaborative training algorithm. Extensive experiments on few-shot\nclassification over 15 datasets demonstrate the superiority of CraFT. The\nresults show that CraFT achieves a decent gain of about 12\\% with 16-shot\ndatasets and only 8,000 queries. Moreover, CraFT trains faster and uses only\nabout 1/80 of the memory footprint for deployment, while sacrificing only\n1.62\\% compared to the white-box method. Our code is publicly available at\nhttps://github.com/mrflogs/CraFT .\n","authors":["Zhengbo Wang","Jian Liang","Ran He","Zilei Wang","Tieniu Tan"],"pdf_url":"https://arxiv.org/pdf/2402.04050v2.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2310.18651v5","updated":"2024-06-03T13:02:54Z","published":"2023-10-28T09:35:30Z","title":"Patch-Wise Self-Supervised Visual Representation Learning: A\n Fine-Grained Approach","summary":" Self-supervised visual representation learning traditionally focuses on\nimage-level instance discrimination. Our study introduces an innovative,\nfine-grained dimension by integrating patch-level discrimination into these\nmethodologies. This integration allows for the simultaneous analysis of local\nand global visual features, thereby enriching the quality of the learned\nrepresentations. Initially, the original images undergo spatial augmentation.\nSubsequently, we employ a distinctive photometric patch-level augmentation,\nwhere each patch is individually augmented, independent from other patches\nwithin the same view. This approach generates a diverse training dataset with\ndistinct color variations in each segment. The augmented images are then\nprocessed through a self-distillation learning framework, utilizing the Vision\nTransformer (ViT) as its backbone. The proposed method minimizes the\nrepresentation distances across both image and patch levels to capture details\nfrom macro to micro perspectives. To this end, we present a simple yet\neffective patch-matching algorithm to find the corresponding patches across the\naugmented views. Thanks to the efficient structure of the patch-matching\nalgorithm, our method reduces computational complexity compared to similar\napproaches. Consequently, we achieve an advanced understanding of the model\nwithout adding significant computational requirements. We have extensively\npretrained our method on datasets of varied scales, such as Cifar10,\nImageNet-100, and ImageNet-1K. It demonstrates superior performance over\nstate-of-the-art self-supervised representation learning methods in image\nclassification and downstream tasks, such as copy detection and image\nretrieval. The implementation of our method is accessible on GitHub.\n","authors":["Ali Javidani","Mohammad Amin Sadeghi","Babak Nadjar Araabi"],"pdf_url":"https://arxiv.org/pdf/2310.18651v5.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2303.06458v3","updated":"2024-06-03T12:47:12Z","published":"2023-03-11T17:14:33Z","title":"ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and\n Multilingual Natural Language Generation","summary":" Natural Language Generation (NLG) accepts input data in the form of images,\nvideos, or text and generates corresponding natural language text as output.\nExisting NLG methods mainly adopt a supervised approach and rely heavily on\ncoupled data-to-text pairs. However, for many targeted scenarios and for\nnon-English languages, sufficient quantities of labeled data are often not\navailable. To relax the dependency on labeled data of downstream tasks, we\npropose an intuitive and effective zero-shot learning framework, ZeroNLG, which\ncan deal with multiple NLG tasks, including image-to-text (image captioning),\nvideo-to-text (video captioning), and text-to-text (neural machine\ntranslation), across English, Chinese, German, and French within a unified\nframework. ZeroNLG does not require any labeled downstream pairs for training.\nDuring training, ZeroNLG (i) projects different domains (across modalities and\nlanguages) to corresponding coordinates in a shared common latent space; (ii)\nbridges different domains by aligning their corresponding coordinates in this\nspace; and (iii) builds an unsupervised multilingual auto-encoder to learn to\ngenerate text by reconstructing the input text given its coordinate in shared\nlatent space. Consequently, during inference, based on the data-to-text\npipeline, ZeroNLG can generate target sentences across different languages\ngiven the coordinate of input data in the common space. Within this unified\nframework, given visual (imaging or video) data as input, ZeroNLG can perform\nzero-shot visual captioning; given textual sentences as input, ZeroNLG can\nperform zero-shot machine translation. We present the results of extensive\nexperiments on twelve NLG tasks, showing that, without using any labeled\ndownstream pairs for training, ZeroNLG generates high-quality and believable\noutputs and significantly outperforms existing zero-shot methods.\n","authors":["Bang Yang","Fenglin Liu","Yuexian Zou","Xian Wu","Yaowei Wang","David A. Clifton"],"pdf_url":"https://arxiv.org/pdf/2303.06458v3.pdf","comment":"Accepted by TPAMI (Our code and data are available at\n https://github.com/yangbang18/ZeroNLG)"},{"id":"http://arxiv.org/abs/2403.13341v2","updated":"2024-06-03T12:11:52Z","published":"2024-03-20T06:48:48Z","title":"FissionFusion: Fast Geometric Generation and Hierarchical Souping for\n Medical Image Analysis","summary":" The scarcity of well-annotated medical datasets requires leveraging transfer\nlearning from broader datasets like ImageNet or pre-trained models like CLIP.\nModel soups averages multiple fine-tuned models aiming to improve performance\non In-Domain (ID) tasks and enhance robustness against Out-of-Distribution\n(OOD) datasets. However, applying these methods to the medical imaging domain\nfaces challenges and results in suboptimal performance. This is primarily due\nto differences in error surface characteristics that stem from data\ncomplexities such as heterogeneity, domain shift, class imbalance, and\ndistributional shifts between training and testing phases. To address this\nissue, we propose a hierarchical merging approach that involves local and\nglobal aggregation of models at various levels based on models' hyperparameter\nconfigurations. Furthermore, to alleviate the need for training a large number\nof models in the hyperparameter search, we introduce a computationally\nefficient method using a cyclical learning rate scheduler to produce multiple\nmodels for aggregation in the weight space. Our method demonstrates significant\nimprovements over the model souping approach across multiple datasets (around\n6% gain in HAM10000 and CheXpert datasets) while maintaining low computational\ncosts for model generation and selection. Moreover, we achieve better results\non OOD datasets than model soups. The code is available at\nhttps://github.com/BioMedIA-MBZUAI/FissionFusion.\n","authors":["Santosh Sanjeev","Nuren Zhaksylyk","Ibrahim Almakky","Anees Ur Rehman Hashmi","Mohammad Areeb Qazi","Mohammad Yaqub"],"pdf_url":"https://arxiv.org/pdf/2403.13341v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.08717v3","updated":"2024-06-03T12:06:06Z","published":"2022-03-16T16:14:19Z","title":"Weak Augmentation Guided Relational Self-Supervised Learning","summary":" Self-supervised Learning (SSL) including the mainstream contrastive learning\nhas achieved great success in learning visual representations without data\nannotations. However, most methods mainly focus on the instance level\ninformation (\\ie, the different augmented images of the same instance should\nhave the same feature or cluster into the same class), but there is a lack of\nattention on the relationships between different instances. In this paper, we\nintroduce a novel SSL paradigm, which we term as relational self-supervised\nlearning (ReSSL) framework that learns representations by modeling the\nrelationship between different instances. Specifically, our proposed method\nemploys sharpened distribution of pairwise similarities among different\ninstances as \\textit{relation} metric, which is thus utilized to match the\nfeature embeddings of different augmentations. To boost the performance, we\nargue that weak augmentations matter to represent a more reliable relation, and\nleverage momentum strategy for practical efficiency. The designed asymmetric\npredictor head and an InfoNCE warm-up strategy enhance the robustness to\nhyper-parameters and benefit the resulting performance. Experimental results\nshow that our proposed ReSSL substantially outperforms the state-of-the-art\nmethods across different network architectures, including various lightweight\nnetworks (\\eg, EfficientNet and MobileNet).\n","authors":["Mingkai Zheng","Shan You","Fei Wang","Chen Qian","Changshui Zhang","Xiaogang Wang","Chang Xu"],"pdf_url":"https://arxiv.org/pdf/2203.08717v3.pdf","comment":"Extended version of NeurIPS 2021 paper. arXiv admin note: substantial\n text overlap with arXiv:2107.09282"},{"id":"http://arxiv.org/abs/2311.17425v3","updated":"2024-06-03T11:58:25Z","published":"2023-11-29T07:57:30Z","title":"SpeechAct: Towards Generating Whole-body Motion from Speech","summary":" This paper addresses the problem of generating whole-body motion from speech.\nDespite great successes, prior methods still struggle to produce reasonable and\ndiverse whole-body motions from speech. This is due to their reliance on\nsuboptimal representations and a lack of strategies for generating diverse\nresults. To address these challenges, we present a novel hybrid point\nrepresentation to achieve accurate and continuous motion generation, e.g.,\navoiding foot skating, and this representation can be transformed into an\neasy-to-use representation, i.e., SMPL-X body mesh, for many applications. To\ngenerate whole-body motion from speech, for facial motion, closely tied to the\naudio signal, we introduce an encoder-decoder architecture to achieve\ndeterministic outcomes. However, for the body and hands, which have weaker\nconnections to the audio signal, we aim to generate diverse yet reasonable\nmotions. To boost diversity in motion generation, we propose a contrastive\nmotion learning method to encourage the model to produce more distinctive\nrepresentations. Specifically, we design a robust VQ-VAE to learn a quantized\nmotion codebook using our hybrid representation. Then, we regress the motion\nrepresentation from the audio signal by a translation model employing our\ncontrastive motion learning method. Experimental results validate the superior\nperformance and the correctness of our model. The project page is available for\nresearch purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.\n","authors":["Jinsong Zhang","Minjie Zhu","Yuxiang Zhang","Yebin Liu","Kun Li"],"pdf_url":"https://arxiv.org/pdf/2311.17425v3.pdf","comment":"the manuscript should be revised"},{"id":"http://arxiv.org/abs/2403.06807v2","updated":"2024-06-03T11:33:51Z","published":"2024-03-11T15:26:34Z","title":"Multistep Consistency Models","summary":" Diffusion models are relatively easy to train but require many steps to\ngenerate samples. Consistency models are far more difficult to train, but\ngenerate samples in a single step.\n In this paper we propose Multistep Consistency Models: A unification between\nConsistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that\ncan interpolate between a consistency model and a diffusion model: a trade-off\nbetween sampling speed and sampling quality. Specifically, a 1-step consistency\nmodel is a conventional consistency model whereas a $\\infty$-step consistency\nmodel is a diffusion model.\n Multistep Consistency Models work really well in practice. By increasing the\nsample budget from a single step to 2-8 steps, we can train models more easily\nthat generate higher quality samples, while retaining much of the sampling\nspeed benefits. Notable results are 1.4 FID on Imagenet 64 in 8 step and 2.1\nFID on Imagenet128 in 8 steps with consistency distillation, using simple\nlosses without adversarial training. We also show that our method scales to a\ntext-to-image diffusion model, generating samples that are close to the quality\nof the original model.\n","authors":["Jonathan Heek","Emiel Hoogeboom","Tim Salimans"],"pdf_url":"https://arxiv.org/pdf/2403.06807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19996v3","updated":"2024-06-03T11:32:40Z","published":"2024-05-30T12:32:35Z","title":"DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in\n the Wild","summary":" Image quality assessment (IQA) plays a critical role in selecting\nhigh-quality images and guiding compression and enhancement methods in a series\nof applications. The blind IQA, which assesses the quality of in-the-wild\nimages containing complex authentic distortions without reference images, poses\ngreater challenges. Existing methods are limited to modeling a uniform\ndistribution with local patches and are bothered by the gap between low and\nhigh-level visions (caused by widely adopted pre-trained classification\nnetworks). In this paper, we propose a novel IQA method called diffusion\npriors-based IQA (DP-IQA), which leverages the prior knowledge from the\npre-trained diffusion model with its excellent powers to bridge semantic gaps\nin the perception of the visual quality of images. Specifically, we use\npre-trained stable diffusion as the backbone, extract multi-level features from\nthe denoising U-Net during the upsampling process at a specified timestep, and\ndecode them to estimate the image quality score. The text and image adapters\nare adopted to mitigate the domain gap for downstream tasks and correct the\ninformation loss caused by the variational autoencoder bottleneck. Finally, we\ndistill the knowledge in the above model into a CNN-based student model,\nsignificantly reducing the parameter to enhance applicability, with the student\nmodel performing similarly or even better than the teacher model surprisingly.\nExperimental results demonstrate that our DP-IQA achieves state-of-the-art\nresults on various in-the-wild datasets with better generalization capability,\nwhich shows the superiority of our method in global modeling and utilizing the\nhierarchical feature clues of diffusion for evaluating image quality.\n","authors":["Honghao Fu","Yufei Wang","Wenhan Yang","Bihan Wen"],"pdf_url":"https://arxiv.org/pdf/2405.19996v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.14431v2","updated":"2024-06-03T11:06:53Z","published":"2023-02-28T09:21:12Z","title":"Efficient Masked Autoencoders with Self-Consistency","summary":" Inspired by the masked language modeling (MLM) in natural language processing\ntasks, the masked image modeling (MIM) has been recognized as a strong\nself-supervised pre-training method in computer vision. However, the high\nrandom mask ratio of MIM results in two serious problems: 1) the inadequate\ndata utilization of images within each iteration brings prolonged pre-training,\nand 2) the high inconsistency of predictions results in unreliable generations,\n$i.e.$, the prediction of the identical patch may be inconsistent in different\nmask rounds, leading to divergent semantics in the ultimately generated\noutcomes. To tackle these problems, we propose the efficient masked\nautoencoders with self-consistency (EMAE) to improve the pre-training\nefficiency and increase the consistency of MIM. In particular, we present a\nparallel mask strategy that divides the image into K non-overlapping parts,\neach of which is generated by a random mask with the same mask ratio. Then the\nMIM task is conducted parallelly on all parts in an iteration and the model\nminimizes the loss between the predictions and the masked patches. Besides, we\ndesign the self-consistency learning to further maintain the consistency of\npredictions of overlapping masked patches among parts. Overall, our method is\nable to exploit the data more efficiently and obtains reliable representations.\nExperiments on ImageNet show that EMAE achieves the best performance on\nViT-Large with only 13% of MAE pre-training time using NVIDIA A100 GPUs. After\npre-training on diverse datasets, EMAE consistently obtains state-of-the-art\ntransfer ability on a variety of downstream tasks, such as image\nclassification, object detection, and semantic segmentation.\n","authors":["Zhaowen Li","Yousong Zhu","Zhiyang Chen","Wei Li","Chaoyang Zhao","Rui Zhao","Ming Tang","Jinqiao Wang"],"pdf_url":"https://arxiv.org/pdf/2302.14431v2.pdf","comment":"Accept by IEEE Transactions on Pattern Analysis and Machine\n Intelligence (TPAMI)"},{"id":"http://arxiv.org/abs/2402.02085v3","updated":"2024-06-03T11:00:25Z","published":"2024-02-03T08:52:06Z","title":"DeCoF: Generated Video Detection via Frame Consistency: The First\n Benchmark Dataset","summary":" The escalating quality of video generated by advanced video generation\nmethods results in new security challenges, while there have been few relevant\nresearch efforts: 1) There is no open-source dataset for generated video\ndetection, 2) No generated video detection method has been proposed so far. To\nthis end, we propose an open-source dataset and a detection method for\ngenerated video for the first time. First, we propose a scalable dataset\nconsisting of 964 prompts, covering various forgery targets, scenes, behaviors,\nand actions, as well as various generation models with different architectures\nand generation methods, including the most popular commercial models like\nOpenAI's Sora and Google's Veo. Second, we found via probing experiments that\nspatial artifact-based detectors lack generalizability. Hence, we propose a\nsimple yet effective \\textbf{de}tection model based on \\textbf{f}rame\n\\textbf{co}nsistency (\\textbf{DeCoF}), which focuses on temporal artifacts by\neliminating the impact of spatial artifacts during feature learning. Extensive\nexperiments demonstrate the efficacy of DeCoF in detecting videos generated by\nunseen video generation models and confirm its powerful generalizability across\nseveral commercially proprietary models. Our code and dataset will be released\nat \\url{https://anonymous.4open.science/r/DeCoF-8394}.\n","authors":["Long Ma","Jiajia Zhang","Hongping Deng","Ningyu Zhang","Qinglang Guo","Haiyang Yu","Yong Liao","Pengyuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2402.02085v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.13106v2","updated":"2024-06-03T10:44:26Z","published":"2024-04-19T14:43:43Z","title":"Automatic Cranial Defect Reconstruction with Self-Supervised Deep\n Deformable Masked Autoencoders","summary":" Thousands of people suffer from cranial injuries every year. They require\npersonalized implants that need to be designed and manufactured before the\nreconstruction surgery. The manual design is expensive and time-consuming\nleading to searching for algorithms whose goal is to automatize the process.\nThe problem can be formulated as volumetric shape completion and solved by deep\nneural networks dedicated to supervised image segmentation. However, such an\napproach requires annotating the ground-truth defects which is costly and\ntime-consuming. Usually, the process is replaced with synthetic defect\ngeneration. However, even the synthetic ground-truth generation is\ntime-consuming and limits the data heterogeneity, thus the deep models'\ngeneralizability. In our work, we propose an alternative and simple approach to\nuse a self-supervised masked autoencoder to solve the problem. This approach by\ndesign increases the heterogeneity of the training set and can be seen as a\nform of data augmentation. We compare the proposed method with several\nstate-of-the-art deep neural networks and show both the quantitative and\nqualitative improvement on the SkullBreak and SkullFix datasets. The proposed\nmethod can be used to efficiently reconstruct the cranial defects in real time.\n","authors":["Marek Wodzinski","Daria Hemmerling","Mateusz Daniol"],"pdf_url":"https://arxiv.org/pdf/2404.13106v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.06116v2","updated":"2024-06-03T08:59:51Z","published":"2024-05-09T21:47:46Z","title":"Rethinking Efficient and Effective Point-based Networks for Event Camera\n Classification and Regression: EventMamba","summary":" Event cameras, drawing inspiration from biological systems, efficiently\ndetect changes in ambient light with low latency and high dynamic range while\nconsuming minimal power. The most current approach to processing event data\noften involves converting it into frame-based representations, which is\nwell-established in traditional vision. However, this approach neglects the\nsparsity of event data, loses fine-grained temporal information during the\ntransformation process, and increases the computational burden, making it\nineffective for characterizing event camera properties. In contrast, Point\nCloud is a popular representation for 3D processing and is better suited to\nmatch the sparse and asynchronous nature of the event camera. Nevertheless,\ndespite the theoretical compatibility of point-based methods with event\ncameras, the results show a performance gap that is not yet satisfactory\ncompared to frame-based methods. In order to bridge the performance gap, we\npropose EventMamba, an efficient and effective Point Cloud framework that\nachieves competitive results even compared to the state-of-the-art (SOTA)\nframe-based method in both classification and regression tasks. This notable\naccomplishment is facilitated by our rethinking of the distinction between\nEvent Cloud and Point Cloud, emphasizing effective temporal information\nextraction through optimized network structures. Specifically, EventMamba\nleverages temporal aggregation and State Space Model (SSM) based Mamba boasting\nenhanced temporal information extraction capabilities. Through a hierarchical\nstructure, EventMamba is adept at abstracting local and global spatial features\nand implicit and explicit temporal features. By adhering to the lightweight\ndesign principle, EventMamba delivers impressive results with minimal\ncomputational resource utilization, demonstrating its efficiency and\neffectiveness.\n","authors":["Hongwei Ren","Yue Zhou","Jiadong Zhu","Haotian Fu","Yulong Huang","Xiaopeng Lin","Yuetong Fang","Fei Ma","Hao Yu","Bojun Cheng"],"pdf_url":"https://arxiv.org/pdf/2405.06116v2.pdf","comment":"Extension Journal of TTPOINT and PEPNet"},{"id":"http://arxiv.org/abs/2405.16094v2","updated":"2024-06-03T08:27:09Z","published":"2024-05-25T06:58:20Z","title":"PLUG: Revisiting Amodal Segmentation with Foundation Model and\n Hierarchical Focus","summary":" Aiming to predict the complete shapes of partially occluded objects, amodal\nsegmentation is an important step towards visual intelligence. With crucial\nsignificance, practical prior knowledge derives from sufficient training, while\nlimited amodal annotations pose challenges to achieve better performance. To\ntackle this problem, utilizing the mighty priors accumulated in the foundation\nmodel, we propose the first SAM-based amodal segmentation approach, PLUG.\nMethodologically, a novel framework with hierarchical focus is presented to\nbetter adapt the task characteristics and unleash the potential capabilities of\nSAM. In the region level, due to the association and division in visible and\noccluded areas, inmodal and amodal regions are assigned as the focuses of\ndistinct branches to avoid mutual disturbance. In the point level, we introduce\nthe concept of uncertainty to explicitly assist the model in identifying and\nfocusing on ambiguous points. Guided by the uncertainty map, a\ncomputation-economic point loss is applied to improve the accuracy of predicted\nboundaries. Experiments are conducted on several prominent datasets, and the\nresults show that our proposed method outperforms existing methods with large\nmargins. Even with fewer total parameters, our method still exhibits remarkable\nadvantages.\n","authors":["Zhaochen Liu","Limeng Qiao","Xiangxiang Chu","Tingting Jiang"],"pdf_url":"https://arxiv.org/pdf/2405.16094v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.18729v2","updated":"2024-06-03T08:16:22Z","published":"2023-11-30T17:26:33Z","title":"Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic\n Data","summary":" Existing one-shot 4D head synthesis methods usually learn from monocular\nvideos with the aid of 3DMM reconstruction, yet the latter is evenly\nchallenging which restricts them from reasonable 4D head synthesis. We present\na method to learn one-shot 4D head synthesis via large-scale synthetic data.\nThe key is to first learn a part-wise 4D generative model from monocular images\nvia adversarial learning, to synthesize multi-view images of diverse identities\nand full motions as training data; then leverage a transformer-based animatable\ntriplane reconstructor to learn 4D head reconstruction using the synthetic\ndata. A novel learning strategy is enforced to enhance the generalizability to\nreal images by disentangling the learning process of 3D reconstruction and\nreenactment. Experiments demonstrate our superiority over the prior art.\n","authors":["Yu Deng","Duomin Wang","Xiaohang Ren","Xingyu Chen","Baoyuan Wang"],"pdf_url":"https://arxiv.org/pdf/2311.18729v2.pdf","comment":"CVPR24 camera ready version. Project page:\n https://yudeng.github.io/Portrait4D/"},{"id":"http://arxiv.org/abs/2402.05443v3","updated":"2024-06-03T08:12:13Z","published":"2024-02-08T06:45:03Z","title":"Scalable Wasserstein Gradient Flow for Generative Modeling through\n Unbalanced Optimal Transport","summary":" Wasserstein Gradient Flow (WGF) describes the gradient dynamics of\nprobability density within the Wasserstein space. WGF provides a promising\napproach for conducting optimization over the probability distributions.\nNumerically approximating the continuous WGF requires the time discretization\nmethod. The most well-known method for this is the JKO scheme. In this regard,\nprevious WGF models employ the JKO scheme and parametrize transport map for\neach JKO step. However, this approach results in quadratic training complexity\n$O(K^2)$ with the number of JKO step $K$. This severely limits the scalability\nof WGF models. In this paper, we introduce a scalable WGF-based generative\nmodel, called Semi-dual JKO (S-JKO). Our model is based on the semi-dual form\nof the JKO step, derived from the equivalence between the JKO step and the\nUnbalanced Optimal Transport. Our approach reduces the training complexity to\n$O(K)$. We demonstrate that our model significantly outperforms existing\nWGF-based generative models, achieving FID scores of 2.62 on CIFAR-10 and 5.46\non CelebA-HQ-256, which are comparable to state-of-the-art image generative\nmodels.\n","authors":["Jaemoo Choi","Jaewoong Choi","Myungjoo Kang"],"pdf_url":"https://arxiv.org/pdf/2402.05443v3.pdf","comment":"22 pages, 11 figures"},{"id":"http://arxiv.org/abs/2402.03161v3","updated":"2024-06-03T08:09:09Z","published":"2024-02-05T16:30:49Z","title":"Video-LaVIT: Unified Video-Language Pre-training with Decoupled\n Visual-Motional Tokenization","summary":" In light of recent advances in multimodal Large Language Models (LLMs), there\nis increasing attention to scaling them from image-text data to more\ninformative real-world videos. Compared to static images, video poses unique\nchallenges for effective large-scale pre-training due to the modeling of its\nspatiotemporal dynamics. In this paper, we address such limitations in\nvideo-language pre-training with an efficient video decomposition that\nrepresents each video as keyframes and temporal motions. These are then adapted\nto an LLM using well-designed tokenizers that discretize visual and temporal\ninformation as a few tokens, thus enabling unified generative pre-training of\nvideos, images, and text. At inference, the generated tokens from the LLM are\ncarefully recovered to the original continuous pixel space to create various\nvideo content. Our proposed framework is both capable of comprehending and\ngenerating image and video content, as demonstrated by its competitive\nperformance across 13 multimodal benchmarks in image and video understanding\nand generation. Our code and models are available at\nhttps://video-lavit.github.io.\n","authors":["Yang Jin","Zhicheng Sun","Kun Xu","Kun Xu","Liwei Chen","Hao Jiang","Quzhe Huang","Chengru Song","Yuliang Liu","Di Zhang","Yang Song","Kun Gai","Yadong Mu"],"pdf_url":"https://arxiv.org/pdf/2402.03161v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.16506v3","updated":"2024-06-03T08:01:25Z","published":"2024-02-26T11:41:28Z","title":"Stochastic Conditional Diffusion Models for Robust Semantic Image\n Synthesis","summary":" Semantic image synthesis (SIS) is a task to generate realistic images\ncorresponding to semantic maps (labels). However, in real-world applications,\nSIS often encounters noisy user inputs. To address this, we propose Stochastic\nConditional Diffusion Model (SCDM), which is a robust conditional diffusion\nmodel that features novel forward and generation processes tailored for SIS\nwith noisy labels. It enhances robustness by stochastically perturbing the\nsemantic label maps through Label Diffusion, which diffuses the labels with\ndiscrete diffusion. Through the diffusion of labels, the noisy and clean\nsemantic maps become similar as the timestep increases, eventually becoming\nidentical at $t=T$. This facilitates the generation of an image close to a\nclean image, enabling robust generation. Furthermore, we propose a class-wise\nnoise schedule to differentially diffuse the labels depending on the class. We\ndemonstrate that the proposed method generates high-quality samples through\nextensive experiments and analyses on benchmark datasets, including a novel\nexperimental setup simulating human errors during real-world applications. Code\nis available at https://github.com/mlvlab/SCDM.\n","authors":["Juyeon Ko","Inho Kong","Dogyun Park","Hyunwoo J. Kim"],"pdf_url":"https://arxiv.org/pdf/2402.16506v3.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2305.08389v2","updated":"2024-06-03T07:47:36Z","published":"2023-05-15T07:12:19Z","title":"Edit As You Wish: Video Caption Editing with Multi-grained User Control","summary":" Automatically narrating videos in natural language complying with user\nrequests, i.e. Controllable Video Captioning task, can help people manage\nmassive videos with desired intentions. However, existing works suffer from two\nshortcomings: 1) the control signal is single-grained which can not satisfy\ndiverse user intentions; 2) the video description is generated in a single\nround which can not be further edited to meet dynamic needs. In this paper, we\npropose a novel \\textbf{V}ideo \\textbf{C}aption \\textbf{E}diting \\textbf{(VCE)}\ntask to automatically revise an existing video description guided by\nmulti-grained user requests. Inspired by human writing-revision habits, we\ndesign the user command as a pivotal triplet \\{\\textit{operation, position,\nattribute}\\} to cover diverse user needs from coarse-grained to fine-grained.\nTo facilitate the VCE task, we \\textit{automatically} construct an open-domain\nbenchmark dataset named VATEX-EDIT and \\textit{manually} collect an e-commerce\ndataset called EMMAD-EDIT. We further propose a specialized small-scale model\n(i.e., OPA) compared with two generalist Large Multi-modal Models to perform an\nexhaustive analysis of the novel task. For evaluation, we adopt comprehensive\nmetrics considering caption fluency, command-caption consistency, and\nvideo-caption alignment. Experiments reveal the task challenges of fine-grained\nmulti-modal semantics understanding and processing. Our datasets, codes, and\nevaluation tools are ready to be open-sourced.\n","authors":["Linli Yao","Yuanmeng Zhang","Ziheng Wang","Xinglin Hou","Tiezheng Ge","Yuning Jiang","Xu Sun","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2305.08389v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.11549v2","updated":"2024-06-03T07:45:33Z","published":"2024-03-18T08:00:23Z","title":"Boosting Continual Learning of Vision-Language Models via\n Mixture-of-Experts Adapters","summary":" Continual learning can empower vision-language models to continuously acquire\nnew knowledge, without the need for access to the entire historical dataset.\nHowever, mitigating the performance degradation in large-scale models is\nnon-trivial due to (i) parameter shifts throughout lifelong learning and (ii)\nsignificant computational burdens associated with full-model tuning. In this\nwork, we present a parameter-efficient continual learning framework to\nalleviate long-term forgetting in incremental learning with vision-language\nmodels. Our approach involves the dynamic expansion of a pre-trained CLIP\nmodel, through the integration of Mixture-of-Experts (MoE) adapters in response\nto new tasks. To preserve the zero-shot recognition capability of\nvision-language models, we further introduce a Distribution Discriminative\nAuto-Selector (DDAS) that automatically routes in-distribution and\nout-of-distribution inputs to the MoE Adapter and the original CLIP,\nrespectively. Through extensive experiments across various settings, our\nproposed method consistently outperforms previous state-of-the-art approaches\nwhile concurrently reducing parameter training burdens by 60%. Our code locates\nat https://github.com/JiazuoYu/MoE-Adapters4CL\n","authors":["Jiazuo Yu","Yunzhi Zhuge","Lu Zhang","Ping Hu","Dong Wang","Huchuan Lu","You He"],"pdf_url":"https://arxiv.org/pdf/2403.11549v2.pdf","comment":"This work is accepted by CVPR2024. More modifications may be\n performed"},{"id":"http://arxiv.org/abs/2403.05874v2","updated":"2024-06-03T07:37:23Z","published":"2024-03-09T10:53:11Z","title":"SPAFormer: Sequential 3D Part Assembly with Transformers","summary":" We introduce SPAFormer, an innovative model designed to overcome the\ncombinatorial explosion challenge in the 3D Part Assembly (3D-PA) task. This\ntask requires accurate prediction of each part's pose and shape in sequential\nsteps, and as the number of parts increases, the possible assembly combinations\nincrease exponentially, leading to a combinatorial explosion that severely\nhinders the efficacy of 3D-PA. SPAFormer addresses this problem by leveraging\nweak constraints from assembly sequences, effectively reducing the solution\nspace's complexity. Since assembly part sequences convey construction rules\nsimilar to sentences being structured through words, our model explores both\nparallel and autoregressive generation. It further enhances assembly through\nknowledge enhancement strategies that utilize the attributes of parts and their\nsequence information, enabling it to capture the inherent assembly pattern and\nrelationships among sequentially ordered parts. We also construct a more\nchallenging benchmark named PartNet-Assembly covering 21 varied categories to\nmore comprehensively validate the effectiveness of SPAFormer. Extensive\nexperiments demonstrate the superior generalization capabilities of SPAFormer,\nparticularly with multi-tasking and in scenarios requiring long-horizon\nassembly. Codes and model weights will be released at\nhttps://github.com/xuboshen/SPAFormer.\n","authors":["Boshen Xu","Sipeng Zheng","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2403.05874v2.pdf","comment":"Code: https://github.com/xuboshen/SPAFormer"},{"id":"http://arxiv.org/abs/2405.17719v2","updated":"2024-06-03T07:29:18Z","published":"2024-05-28T00:27:29Z","title":"EgoNCE++: Do Egocentric Video-Language Models Really Understand\n Hand-Object Interactions?","summary":" Egocentric video-language pretraining is a crucial paradigm to advance the\nlearning of egocentric hand-object interactions (EgoHOI). Despite the great\nsuccess on existing testbeds, these benchmarks focus more on closed-set visual\nconcepts or limited scenarios. Due to the occurrence of diverse EgoHOIs in the\nreal world, we propose an open-vocabulary benchmark named EgoHOIBench to reveal\nthe diminished performance of current egocentric video-language models (EgoVLM)\non fined-grained concepts, indicating that these models still lack a full\nspectrum of egocentric understanding. We attribute this performance gap to\ninsufficient fine-grained supervision and strong bias towards understanding\nobjects rather than temporal dynamics in current methods. To tackle these\nissues, we introduce a novel asymmetric contrastive objective for EgoHOI named\nEgoNCE++. For video-to-text loss, we enhance text supervision through the\ngeneration of negative captions by leveraging the in-context learning of large\nlanguage models to perform HOI-related word substitution. For text-to-video\nloss, we propose an object-centric positive video sampling strategy that\naggregates video representations by the same nouns. Our extensive experiments\ndemonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition,\nmulti-instance retrieval, and action recognition tasks across various\negocentric models, with improvements of up to +26.55%. Our code is available at\nhttps://github.com/xuboshen/EgoNCEpp.\n","authors":["Boshen Xu","Ziheng Wang","Yang Du","Zhinan Song","Sipeng Zheng","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2405.17719v2.pdf","comment":"Code: https://github.com/xuboshen/EgoNCEpp"},{"id":"http://arxiv.org/abs/2402.09353v5","updated":"2024-06-03T07:27:15Z","published":"2024-02-14T17:59:34Z","title":"DoRA: Weight-Decomposed Low-Rank Adaptation","summary":" Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA\nand its variants have gained considerable popularity because of avoiding\nadditional inference costs. However, there still often exists an accuracy gap\nbetween these methods and full fine-tuning (FT). In this work, we first\nintroduce a novel weight decomposition analysis to investigate the inherent\ndifferences between FT and LoRA. Aiming to resemble the learning capacity of FT\nfrom the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA).\nDoRA decomposes the pre-trained weight into two components, magnitude and\ndirection, for fine-tuning, specifically employing LoRA for directional updates\nto efficiently minimize the number of trainable parameters. By employing \\ours,\nwe enhance both the learning capacity and training stability of LoRA while\navoiding any additional inference overhead. \\ours~consistently outperforms LoRA\non fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as\ncommonsense reasoning, visual instruction tuning, and image/video-text\nunderstanding. Code is available at https://github.com/NVlabs/DoRA.\n","authors":["Shih-Yang Liu","Chien-Yi Wang","Hongxu Yin","Pavlo Molchanov","Yu-Chiang Frank Wang","Kwang-Ting Cheng","Min-Hung Chen"],"pdf_url":"https://arxiv.org/pdf/2402.09353v5.pdf","comment":"Code available at https://github.com/NVlabs/DoRA"},{"id":"http://arxiv.org/abs/2404.18706v2","updated":"2024-06-03T07:19:35Z","published":"2024-04-29T13:57:02Z","title":"The Socface Project: Large-Scale Collection, Processing, and Analysis of\n a Century of French Censuses","summary":" This paper presents a complete processing workflow for extracting information\nfrom French census lists from 1836 to 1936. These lists contain information\nabout individuals living in France and their households. We aim at extracting\nall the information contained in these tables using automatic handwritten table\nrecognition. At the end of the Socface project, in which our work is taking\nplace, the extracted information will be redistributed to the departmental\narchives, and the nominative lists will be freely available to the public,\nallowing anyone to browse hundreds of millions of records. The extracted data\nwill be used by demographers to analyze social change over time, significantly\nimproving our understanding of French economic and social structures. For this\nproject, we developed a complete processing workflow: large-scale data\ncollection from French departmental archives, collaborative annotation of\ndocuments, training of handwritten table text and structure recognition models,\nand mass processing of millions of images. We present the tools we have\ndeveloped to easily collect and process millions of pages. We also show that it\nis possible to process such a wide variety of tables with a single table\nrecognition model that uses the image of the entire page to recognize\ninformation about individuals, categorize them and automatically group them\ninto households. The entire process has been successfully used to process the\ndocuments of a departmental archive, representing more than 450,000 images.\n","authors":["Mélodie Boillet","Solène Tarride","Manon Blanco","Valentin Rigal","Yoann Schneider","Bastien Abadie","Lionel Kesztenbaum","Christopher Kermorvant"],"pdf_url":"https://arxiv.org/pdf/2404.18706v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.00376v2","updated":"2024-06-03T07:09:39Z","published":"2024-03-01T09:01:53Z","title":"Spurious Feature Eraser: Stabilizing Test-Time Adaptation for\n Vision-Language Foundation Model","summary":" Vision-language foundation models have exhibited remarkable success across a\nmultitude of downstream tasks due to their scalability on extensive image-text\npaired data. However, these models also display significant limitations when\napplied to downstream tasks, such as fine-grained image classification, as a\nresult of ``decision shortcuts'' that hinder their generalization capabilities.\nIn this work, we find that the CLIP model possesses a rich set of features,\nencompassing both \\textit{desired invariant causal features} and\n\\textit{undesired decision shortcuts}. Moreover, the underperformance of CLIP\non downstream tasks originates from its inability to effectively utilize\npre-trained features in accordance with specific task requirements. To address\nthis challenge, we propose a simple yet effective method, Spurious Feature\nEraser (SEraser), to alleviate the decision shortcuts by erasing the spurious\nfeatures. Specifically, we introduce a test-time prompt tuning paradigm that\noptimizes a learnable prompt, thereby compelling the model to exploit invariant\nfeatures while disregarding decision shortcuts during the inference phase. The\nproposed method effectively alleviates excessive dependence on potentially\nmisleading spurious information. We conduct comparative analysis of the\nproposed method against various approaches which validates the significant\nsuperiority.\n","authors":["Huan Ma","Yan Zhu","Changqing Zhang","Peilin Zhao","Baoyuan Wu","Long-Kai Huang","Qinghua Hu","Bingzhe Wu"],"pdf_url":"https://arxiv.org/pdf/2403.00376v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20299v3","updated":"2024-06-03T06:36:22Z","published":"2024-05-30T17:46:23Z","title":"Scaling White-Box Transformers for Vision","summary":" CRATE, a white-box transformer architecture designed to learn compressed and\nsparse representations, offers an intriguing alternative to standard vision\ntransformers (ViTs) due to its inherent mathematical interpretability. Despite\nextensive investigations into the scaling behaviors of language and vision\ntransformers, the scalability of CRATE remains an open question which this\npaper aims to address. Specifically, we propose CRATE-$\\alpha$, featuring\nstrategic yet minimal modifications to the sparse coding block in the CRATE\narchitecture design, and a light training recipe designed to improve the\nscalability of CRATE. Through extensive experiments, we demonstrate that\nCRATE-$\\alpha$ can effectively scale with larger model sizes and datasets. For\nexample, our CRATE-$\\alpha$-B substantially outperforms the prior best CRATE-B\nmodel accuracy on ImageNet classification by 3.7%, achieving an accuracy of\n83.2%. Meanwhile, when scaling further, our CRATE-$\\alpha$-L obtains an\nImageNet classification accuracy of 85.1%. More notably, these model\nperformance improvements are achieved while preserving, and potentially even\nenhancing the interpretability of learned CRATE models, as we demonstrate\nthrough showing that the learned token representations of increasingly larger\ntrained CRATE-$\\alpha$ models yield increasingly higher-quality unsupervised\nobject segmentation of images. The project page is\nhttps://rayjryang.github.io/CRATE-alpha/.\n","authors":["Jinrui Yang","Xianhang Li","Druv Pai","Yuyin Zhou","Yi Ma","Yaodong Yu","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2405.20299v3.pdf","comment":"project page: https://rayjryang.github.io/CRATE-alpha/"},{"id":"http://arxiv.org/abs/2404.03653v2","updated":"2024-06-03T06:02:34Z","published":"2024-04-04T17:59:46Z","title":"CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept\n Matching","summary":" Diffusion models have demonstrated great success in the field of\ntext-to-image generation. However, alleviating the misalignment between the\ntext prompts and images is still challenging. The root reason behind the\nmisalignment has not been extensively investigated. We observe that the\nmisalignment is caused by inadequate token attention activation. We further\nattribute this phenomenon to the diffusion model's insufficient condition\nutilization, which is caused by its training paradigm. To address the issue, we\npropose CoMat, an end-to-end diffusion model fine-tuning strategy with an\nimage-to-text concept matching mechanism. We leverage an image captioning model\nto measure image-to-text alignment and guide the diffusion model to revisit\nignored tokens. A novel attribute concentration module is also proposed to\naddress the attribute binding problem. Without any image or human preference\ndata, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL.\nExtensive experiments show that CoMat-SDXL significantly outperforms the\nbaseline model SDXL in two text-to-image alignment benchmarks and achieves\nstart-of-the-art performance.\n","authors":["Dongzhi Jiang","Guanglu Song","Xiaoshi Wu","Renrui Zhang","Dazhong Shen","Zhuofan Zong","Yu Liu","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2404.03653v2.pdf","comment":"Project Page: https://caraj7.github.io/comat"},{"id":"http://arxiv.org/abs/2405.20786v2","updated":"2024-06-03T05:36:47Z","published":"2024-05-30T06:25:42Z","title":"Stratified Avatar Generation from Sparse Observations","summary":" Estimating 3D full-body avatars from AR/VR devices is essential for creating\nimmersive experiences in AR/VR applications. This task is challenging due to\nthe limited input from Head Mounted Devices, which capture only sparse\nobservations from the head and hands. Predicting the full-body avatars,\nparticularly the lower body, from these sparse observations presents\nsignificant difficulties. In this paper, we are inspired by the inherent\nproperty of the kinematic tree defined in the Skinned Multi-Person Linear\n(SMPL) model, where the upper body and lower body share only one common\nancestor node, bringing the potential of decoupled reconstruction. We propose a\nstratified approach to decouple the conventional full-body avatar\nreconstruction pipeline into two stages, with the reconstruction of the upper\nbody first and a subsequent reconstruction of the lower body conditioned on the\nprevious stage. To implement this straightforward idea, we leverage the latent\ndiffusion model as a powerful probabilistic generator, and train it to follow\nthe latent distribution of decoupled motions explored by a VQ-VAE\nencoder-decoder model. Extensive experiments on AMASS mocap dataset demonstrate\nour state-of-the-art performance in the reconstruction of full-body motions.\n","authors":["Han Feng","Wenchao Ma","Quankai Gao","Xianwei Zheng","Nan Xue","Huijuan Xu"],"pdf_url":"https://arxiv.org/pdf/2405.20786v2.pdf","comment":"Accepted by CVPR 2024 (Oral)"},{"id":"http://arxiv.org/abs/2306.08964v2","updated":"2024-06-03T04:50:18Z","published":"2023-06-15T08:56:58Z","title":"Exploring Multi-Timestep Multi-Stage Diffusion Features for\n Hyperspectral Image Classification","summary":" The effectiveness of spectral-spatial feature learning is crucial for the\nhyperspectral image (HSI) classification task. Diffusion models, as a new class\nof groundbreaking generative models, have the ability to learn both contextual\nsemantics and textual details from the distinct timestep dimension, enabling\nthe modeling of complex spectral-spatial relations in HSIs. However, existing\ndiffusion-based HSI classification methods only utilize manually selected\nsingle-timestep single-stage features, limiting the full exploration and\nexploitation of rich contextual semantics and textual information hidden in the\ndiffusion model. To address this issue, we propose a novel diffusion-based\nfeature learning framework that explores Multi-Timestep Multi-Stage Diffusion\nfeatures for HSI classification for the first time, called MTMSD. Specifically,\nthe diffusion model is first pretrained with unlabeled HSI patches to mine the\nconnotation of unlabeled data, and then is used to extract the multi-timestep\nmulti-stage diffusion features. To effectively and efficiently leverage\nmulti-timestep multi-stage features,two strategies are further developed. One\nstrategy is class & timestep-oriented multi-stage feature purification module\nwith the inter-class and inter-timestep prior for reducing the redundancy of\nmulti-stage features and alleviating memory constraints. The other one is\nselective timestep feature fusion module with the guidance of global features\nto adaptively select different timestep features for integrating texture and\nsemantics. Both strategies facilitate the generality and adaptability of the\nMTMSD framework for diverse patterns of different HSI data. Extensive\nexperiments are conducted on four public HSI datasets, and the results\ndemonstrate that our method outperforms state-of-the-art methods for HSI\nclassification, especially on the challenging Houston 2018 dataset.\n","authors":["Jingyi Zhou","Jiamu Sheng","Jiayuan Fan","Peng Ye","Tong He","Bin Wang","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2306.08964v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.05234v2","updated":"2024-06-03T04:39:51Z","published":"2024-03-08T11:48:44Z","title":"Benchmarking Micro-action Recognition: Dataset, Methods, and\n Applications","summary":" Micro-action is an imperceptible non-verbal behaviour characterised by\nlow-intensity movement. It offers insights into the feelings and intentions of\nindividuals and is important for human-oriented applications such as emotion\nrecognition and psychological assessment. However, the identification,\ndifferentiation, and understanding of micro-actions pose challenges due to the\nimperceptible and inaccessible nature of these subtle human behaviors in\neveryday life. In this study, we innovatively collect a new micro-action\ndataset designated as Micro-action-52 (MA-52), and propose a benchmark named\nmicro-action network (MANet) for micro-action recognition (MAR) task. Uniquely,\nMA-52 provides the whole-body perspective including gestures, upper- and\nlower-limb movements, attempting to reveal comprehensive micro-action cues. In\ndetail, MA-52 contains 52 micro-action categories along with seven body part\nlabels, and encompasses a full array of realistic and natural micro-actions,\naccounting for 205 participants and 22,422 video instances collated from the\npsychological interviews. Based on the proposed dataset, we assess MANet and\nother nine prevalent action recognition methods. MANet incorporates squeeze-and\nexcitation (SE) and temporal shift module (TSM) into the ResNet architecture\nfor modeling the spatiotemporal characteristics of micro-actions. Then a\njoint-embedding loss is designed for semantic matching between video and action\nlabels; the loss is used to better distinguish between visually similar yet\ndistinct micro-action categories. The extended application in emotion\nrecognition has demonstrated one of the important values of our proposed\ndataset and method. In the future, further exploration of human behaviour,\nemotion, and psychological assessment will be conducted in depth. The dataset\nand source code are released at https://github.com/VUT-HFUT/Micro-Action.\n","authors":["Dan Guo","Kun Li","Bin Hu","Yan Zhang","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2403.05234v2.pdf","comment":"Accepted by IEEE Transactions on Circuits and Systems for Video\n Technology"},{"id":"http://arxiv.org/abs/2405.20881v2","updated":"2024-06-03T04:38:42Z","published":"2024-05-31T14:55:31Z","title":"S4Fusion: Saliency-aware Selective State Space Model for Infrared\n Visible Image Fusion","summary":" As one of the tasks in Image Fusion, Infrared and Visible Image Fusion aims\nto integrate complementary information captured by sensors of different\nmodalities into a single image. The Selective State Space Model (SSSM), known\nfor its ability to capture long-range dependencies, has demonstrated its\npotential in the field of computer vision. However, in image fusion, current\nmethods underestimate the potential of SSSM in capturing the global spatial\ninformation of both modalities. This limitation prevents the simultaneous\nconsideration of the global spatial information from both modalities during\ninteraction, leading to a lack of comprehensive perception of salient targets.\nConsequently, the fusion results tend to bias towards one modality instead of\nadaptively preserving salient targets. To address this issue, we propose the\nSaliency-aware Selective State Space Fusion Model (S4Fusion). In our S4Fusion,\nthe designed Cross-Modal Spatial Awareness Module (CMSA) can simultaneously\nfocus on global spatial information from both modalities while facilitating\ntheir interaction, thereby comprehensively capturing complementary information.\nAdditionally, S4Fusion leverages a pre-trained network to perceive uncertainty\nin the fused images. By minimizing this uncertainty, S4Fusion adaptively\nhighlights salient targets from both images. Extensive experiments demonstrate\nthat our approach produces high-quality images and enhances performance in\ndownstream tasks.\n","authors":["Haolong Ma","Hui Li","Chunyang Cheng","Gaoang Wang","Xiaoning Song","Xiaojun Wu"],"pdf_url":"https://arxiv.org/pdf/2405.20881v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07586v5","updated":"2024-06-03T04:17:49Z","published":"2023-12-11T02:40:40Z","title":"Characteristic Guidance: Non-linear Correction for Diffusion Model at\n Large Guidance Scale","summary":" Popular guidance for denoising diffusion probabilistic model (DDPM) linearly\ncombines distinct conditional models together to provide enhanced control over\nsamples. However, this approach overlooks nonlinear effects that become\nsignificant when guidance scale is large. To address this issue, we propose\ncharacteristic guidance, a guidance method that provides first-principle\nnon-linear correction for classifier-free guidance. Such correction forces the\nguided DDPMs to respect the Fokker-Planck (FP) equation of diffusion process,\nin a way that is training-free and compatible with existing sampling methods.\nExperiments show that characteristic guidance enhances semantic characteristics\nof prompts and mitigate irregularities in image generation, proving effective\nin diverse applications ranging from simulating magnet phase transitions to\nlatent space sampling.\n","authors":["Candi Zheng","Yuan Lan"],"pdf_url":"https://arxiv.org/pdf/2312.07586v5.pdf","comment":"8 pages, 7 figures"},{"id":"http://arxiv.org/abs/2403.00476v3","updated":"2024-06-03T04:13:39Z","published":"2024-03-01T12:02:19Z","title":"TempCompass: Do Video LLMs Really Understand Videos?","summary":" Recently, there is a surge in interest surrounding video large language\nmodels (Video LLMs). However, existing benchmarks fail to provide a\ncomprehensive feedback on the temporal perception ability of Video LLMs. On the\none hand, most of them are unable to distinguish between different temporal\naspects (e.g., speed, direction) and thus cannot reflect the nuanced\nperformance on these specific aspects. On the other hand, they are limited in\nthe diversity of task formats (e.g., only multi-choice QA), which hinders the\nunderstanding of how temporal perception performance may vary across different\ntypes of tasks. Motivated by these two problems, we propose the\n\\textbf{TempCompass} benchmark, which introduces a diversity of temporal\naspects and task formats. To collect high-quality test data, we devise two\nnovel strategies: (1) In video collection, we construct conflicting videos that\nshare the same static content but differ in a specific temporal aspect, which\nprevents Video LLMs from leveraging single-frame bias or language priors. (2)\nTo collect the task instructions, we propose a paradigm where humans first\nannotate meta-information for a video and then an LLM generates the\ninstruction. We also design an LLM-based approach to automatically and\naccurately evaluate the responses from Video LLMs. Based on TempCompass, we\ncomprehensively evaluate 8 state-of-the-art (SOTA) Video LLMs and 3 Image LLMs,\nand reveal the discerning fact that these models exhibit notably poor temporal\nperception ability. Our data will be available at\nhttps://github.com/llyx97/TempCompass.\n","authors":["Yuanxin Liu","Shicheng Li","Yi Liu","Yuxiang Wang","Shuhuai Ren","Lei Li","Sishuo Chen","Xu Sun","Lu Hou"],"pdf_url":"https://arxiv.org/pdf/2403.00476v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.07847v2","updated":"2024-06-03T04:02:52Z","published":"2024-04-11T15:42:53Z","title":"The Effectiveness of a Simplified Model Structure for Crowd Counting","summary":" In the field of crowd counting research, many recent deep learning based\nmethods have demonstrated robust capabilities for accurately estimating crowd\nsizes. However, the enhancement in their performance often arises from an\nincrease in the complexity of the model structure. This paper discusses how to\nconstruct high-performance crowd counting models using only simple structures.\nWe proposes the Fuss-Free Network (FFNet) that is characterized by its simple\nand efficieny structure, consisting of only a backbone network and a\nmulti-scale feature fusion structure. The multi-scale feature fusion structure\nis a simple structure consisting of three branches, each only equipped with a\nfocus transition module, and combines the features from these branches through\nthe concatenation operation. Our proposed crowd counting model is trained and\nevaluated on four widely used public datasets, and it achieves accuracy that is\ncomparable to that of existing complex models. Furthermore, we conduct a\ncomprehensive evaluation by replacing the existing backbones of various models\nsuch as FFNet and CCTrans with different networks, including MobileNet-v3,\nConvNeXt-Tiny, and Swin-Transformer-Small. The experimental results further\nindicate that excellent crowd counting performance can be achieved with the\nsimplied structure proposed by us.\n","authors":["Lei Chen","Xinghang Gao","Fei Chao","Chih Min Lin","Xingen Gao","Hongyi Zhang","Juqiang Lin"],"pdf_url":"https://arxiv.org/pdf/2404.07847v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11973v4","updated":"2024-06-03T03:51:38Z","published":"2023-12-19T09:11:49Z","title":"Continual Learning: Forget-free Winning Subnetworks for Video\n Representations","summary":" Inspired by the Lottery Ticket Hypothesis (LTH), which highlights the\nexistence of efficient subnetworks within larger, dense networks, a\nhigh-performing Winning Subnetwork (WSN) in terms of task performance under\nappropriate sparsity conditions is considered for various continual learning\ntasks. It leverages pre-existing weights from dense networks to achieve\nefficient learning in Task Incremental Learning (TIL) and Task-agnostic\nIncremental Learning (TaIL) scenarios. In Few-Shot Class Incremental Learning\n(FSCIL), a variation of WSN referred to as the Soft subnetwork (SoftNet) is\ndesigned to prevent overfitting when the data samples are scarce. Furthermore,\nthe sparse reuse of WSN weights is considered for Video Incremental Learning\n(VIL). The use of Fourier Subneural Operator (FSO) within WSN is considered. It\nenables compact encoding of videos and identifies reusable subnetworks across\nvarying bandwidths. We have integrated FSO into different architectural\nframeworks for continual learning, including VIL, TIL, and FSCIL. Our\ncomprehensive experiments demonstrate FSO's effectiveness, significantly\nimproving task performance at various convolutional representational levels.\nSpecifically, FSO enhances higher-layer performance in TIL and FSCIL and\nlower-layer performance in VIL.\n","authors":["Haeyong Kang","Jaehong Yoon","Sung Ju Hwang","Chang D. Yoo"],"pdf_url":"https://arxiv.org/pdf/2312.11973v4.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2303.14962,\n arXiv:2306.11305"},{"id":"http://arxiv.org/abs/2312.12754v2","updated":"2024-06-03T03:17:01Z","published":"2023-12-20T04:27:13Z","title":"Spectral Prompt Tuning:Unveiling Unseen Classes for Zero-Shot Semantic\n Segmentation","summary":" Recently, CLIP has found practical utility in the domain of pixel-level\nzero-shot segmentation tasks. The present landscape features two-stage\nmethodologies beset by issues such as intricate pipelines and elevated\ncomputational costs. While current one-stage approaches alleviate these\nconcerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's\ngeneralization capacity, they still fall short in fully harnessing CLIP's\npotential for pixel-level unseen class demarcation and precise pixel\npredictions. To further stimulate CLIP's zero-shot dense prediction capability,\nwe propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from\nimage to pixel. Specifically, we initially introduce Spectral Prompt Tuning\n(SPT), incorporating spectral prompts into the CLIP visual encoder's shallow\nlayers to capture structural intricacies of images, thereby enhancing\ncomprehension of unseen classes. Subsequently, we introduce the Spectral Guided\nDecoder (SGD), utilizing both high and low-frequency information to steer the\nnetwork's spatial focus towards more prominent classification features,\nenabling precise pixel-level prediction outcomes. Through extensive experiments\non two public datasets, we demonstrate the superiority of our method over\nstate-of-the-art approaches, performing well across all classes and\nparticularly excelling in handling unseen classes. Code is available\nat:https://github.com/clearxu/SPT.\n","authors":["Wenhao Xu","Rongtao Xu","Changwei Wang","Shibiao Xu","Li Guo","Man Zhang","Xiaopeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12754v2.pdf","comment":"AAAI2024 Accepted"},{"id":"http://arxiv.org/abs/2403.02998v2","updated":"2024-06-03T03:04:42Z","published":"2024-03-04T11:23:40Z","title":"Towards Calibrated Deep Clustering Network","summary":" Deep clustering has exhibited remarkable performance; however, the\nover-confidence problem, i.e., the estimated confidence for a sample belonging\nto a particular cluster greatly exceeds its actual prediction accuracy, has\nbeen overlooked in prior research. To tackle this critical issue, we pioneer\nthe development of a calibrated deep clustering framework. Specifically, we\npropose a novel dual-head (calibration head and clustering head) deep\nclustering model that can effectively calibrate the estimated confidence and\nthe actual accuracy. The calibration head adjusts the overconfident predictions\nof the clustering head, generating prediction confidence that match the model\nlearning status. Then, the clustering head dynamically select reliable\nhigh-confidence samples estimated by the calibration head for pseudo-label\nself-training. Additionally, we introduce an effective network initialization\nstrategy that enhances both training speed and network robustness. The\neffectiveness of the proposed calibration approach and initialization strategy\nare both endorsed with solid theoretical guarantees. Extensive experiments\ndemonstrate the proposed calibrated deep clustering model not only surpasses\nstate-of-the-art deep clustering methods by 10 times in terms of expected\ncalibration error but also significantly outperforms them in terms of\nclustering accuracy.\n","authors":["Yuheng Jia","Jianhong Cheng","Hui Liu","Junhui Hou"],"pdf_url":"https://arxiv.org/pdf/2403.02998v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.11473v2","updated":"2024-06-03T03:04:12Z","published":"2024-05-19T07:48:41Z","title":"FIFO-Diffusion: Generating Infinite Videos from Text without Training","summary":" We propose a novel inference technique based on a pretrained diffusion model\nfor text-conditional video generation. Our approach, called FIFO-Diffusion, is\nconceptually capable of generating infinitely long videos without additional\ntraining. This is achieved by iteratively performing diagonal denoising, which\nconcurrently processes a series of consecutive frames with increasing noise\nlevels in a queue; our method dequeues a fully denoised frame at the head while\nenqueuing a new random noise frame at the tail. However, diagonal denoising is\na double-edged sword as the frames near the tail can take advantage of cleaner\nones by forward reference but such a strategy induces the discrepancy between\ntraining and inference. Hence, we introduce latent partitioning to reduce the\ntraining-inference gap and lookahead denoising to leverage the benefit of\nforward referencing. Practically, FIFO-Diffusion consumes a constant amount of\nmemory regardless of the target video length given a baseline model, while\nwell-suited for parallel inference on multiple GPUs. We have demonstrated the\npromising results and effectiveness of the proposed methods on existing\ntext-to-video generation baselines. Generated video samples and source codes\nare available at our project page.\n","authors":["Jihwan Kim","Junoh Kang","Jinyoung Choi","Bohyung Han"],"pdf_url":"https://arxiv.org/pdf/2405.11473v2.pdf","comment":"Project Page: https://jjihwan.github.io/projects/FIFO-Diffusion"},{"id":"http://arxiv.org/abs/2405.21013v2","updated":"2024-06-03T02:43:16Z","published":"2024-05-31T16:55:04Z","title":"StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image\n Perception, Comprehension, and Beyond","summary":" Text-rich images have significant and extensive value, deeply integrated into\nvarious aspects of human life. Notably, both visual cues and linguistic symbols\nin text-rich images play crucial roles in information transmission but are\naccompanied by diverse challenges. Therefore, the efficient and effective\nunderstanding of text-rich images is a crucial litmus test for the capability\nof Vision-Language Models. We have crafted an efficient vision-language model,\nStrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.\nThe significant design of StrucTexTv3 is presented in the following aspects:\nFirstly, we adopt a combination of an effective multi-scale reduced visual\ntransformer and a multi-granularity token sampler (MG-Sampler) as a visual\ntoken generator, successfully solving the challenges of high-resolution input\nand complex representation learning for text-rich images. Secondly, we enhance\nthe perception and comprehension abilities of StrucTexTv3 through instruction\nlearning, seamlessly integrating various text-oriented tasks into a unified\nframework. Thirdly, we have curated a comprehensive collection of high-quality\ntext-rich images, abbreviated as TIM-30M, encompassing diverse scenarios like\nincidental scenes, office documents, web pages, and screenshots, thereby\nimproving the robustness of our model. Our method achieved SOTA results in\ntext-rich image perception tasks, and significantly improved performance in\ncomprehension tasks. Among multimodal models with LLM decoder of approximately\n1.8B parameters, it stands out as a leader, which also makes the deployment of\nedge devices feasible. In summary, the StrucTexTv3 model, featuring efficient\nstructural design, outstanding performance, and broad adaptability, offers\nrobust support for diverse intelligent application tasks involving text-rich\nimages, thus exhibiting immense potential for widespread application.\n","authors":["Pengyuan Lyu","Yulin Li","Hao Zhou","Weihong Ma","Xingyu Wan","Qunyi Xie","Liang Wu","Chengquan Zhang","Kun Yao","Errui Ding","Jingdong Wang"],"pdf_url":"https://arxiv.org/pdf/2405.21013v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.02072v4","updated":"2024-06-03T02:15:03Z","published":"2024-04-02T16:20:02Z","title":"EGTR: Extracting Graph from Transformer for Scene Graph Generation","summary":" Scene Graph Generation (SGG) is a challenging task of detecting objects and\npredicting relationships between objects. After DETR was developed, one-stage\nSGG models based on a one-stage object detector have been actively studied.\nHowever, complex modeling is used to predict the relationship between objects,\nand the inherent relationship between object queries learned in the multi-head\nself-attention of the object detector has been neglected. We propose a\nlightweight one-stage SGG model that extracts the relation graph from the\nvarious relationships learned in the multi-head self-attention layers of the\nDETR decoder. By fully utilizing the self-attention by-products, the relation\ngraph can be extracted effectively with a shallow relation extraction head.\nConsidering the dependency of the relation extraction task on the object\ndetection task, we propose a novel relation smoothing technique that adjusts\nthe relation label adaptively according to the quality of the detected objects.\nBy the relation smoothing, the model is trained according to the continuous\ncurriculum that focuses on object detection task at the beginning of training\nand performs multi-task learning as the object detection performance gradually\nimproves. Furthermore, we propose a connectivity prediction task that predicts\nwhether a relation exists between object pairs as an auxiliary task of the\nrelation extraction. We demonstrate the effectiveness and efficiency of our\nmethod for the Visual Genome and Open Image V6 datasets. Our code is publicly\navailable at https://github.com/naver-ai/egtr.\n","authors":["Jinbae Im","JeongYeon Nam","Nokyung Park","Hyungmin Lee","Seunghyun Park"],"pdf_url":"https://arxiv.org/pdf/2404.02072v4.pdf","comment":"CVPR 2024 (Best paper award candidate)"},{"id":"http://arxiv.org/abs/2402.10717v2","updated":"2024-06-03T02:14:12Z","published":"2024-02-16T14:19:33Z","title":"BioFusionNet: Deep Learning-Based Survival Risk Stratification in ER+\n Breast Cancer Through Multifeature and Multimodal Data Fusion","summary":" Breast cancer is a significant health concern affecting millions of women\nworldwide. Accurate survival risk stratification plays a crucial role in\nguiding personalised treatment decisions and improving patient outcomes. Here\nwe present BioFusionNet, a deep learning framework that fuses image-derived\nfeatures with genetic and clinical data to obtain a holistic profile and\nachieve survival risk stratification of ER+ breast cancer patients. We employ\nmultiple self-supervised feature extractors (DINO and MoCoV3) pretrained on\nhistopathological patches to capture detailed image features. These features\nare then fused by a variational autoencoder and fed to a self-attention network\ngenerating patient-level features. A co-dual-cross-attention mechanism combines\nthe histopathological features with genetic data, enabling the model to capture\nthe interplay between them. Additionally, clinical data is incorporated using a\nfeed-forward network, further enhancing predictive performance and achieving\ncomprehensive multimodal feature integration. Furthermore, we introduce a\nweighted Cox loss function, specifically designed to handle imbalanced survival\ndata, which is a common challenge. Our model achieves a mean concordance index\nof 0.77 and a time-dependent area under the curve of 0.84, outperforming\nstate-of-the-art methods. It predicts risk (high versus low) with prognostic\nsignificance for overall survival in univariate analysis (HR=2.99, 95% CI:\n1.88--4.78, p<0.005), and maintains independent significance in multivariate\nanalysis incorporating standard clinicopathological variables (HR=2.91, 95\\%\nCI: 1.80--4.68, p<0.005).\n","authors":["Raktim Kumar Mondol","Ewan K. A. Millar","Arcot Sowmya","Erik Meijering"],"pdf_url":"https://arxiv.org/pdf/2402.10717v2.pdf","comment":"Keywords: Multimodal Fusion, Breast Cancer, Whole Slide Images, Deep\n Neural Network, Survival Prediction"},{"id":"http://arxiv.org/abs/2401.06127v2","updated":"2024-06-03T02:09:38Z","published":"2024-01-11T18:59:14Z","title":"E$^{2}$GAN: Efficient Training of Efficient GANs for Image-to-Image\n Translation","summary":" One highly promising direction for enabling flexible real-time on-device\nimage editing is utilizing data distillation by leveraging large-scale\ntext-to-image diffusion models to generate paired datasets used for training\ngenerative adversarial networks (GANs). This approach notably alleviates the\nstringent requirements typically imposed by high-end commercial GPUs for\nperforming image editing with diffusion models. However, unlike text-to-image\ndiffusion models, each distilled GAN is specialized for a specific image\nediting task, necessitating costly training efforts to obtain models for\nvarious concepts. In this work, we introduce and address a novel research\ndirection: can the process of distilling GANs from diffusion models be made\nsignificantly more efficient? To achieve this goal, we propose a series of\ninnovative techniques. First, we construct a base GAN model with generalized\nfeatures, adaptable to different concepts through fine-tuning, eliminating the\nneed for training from scratch. Second, we identify crucial layers within the\nbase GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet\neffective rank search process, rather than fine-tuning the entire base model.\nThird, we investigate the minimal amount of data necessary for fine-tuning,\nfurther reducing the overall training time. Extensive experiments show that we\ncan efficiently empower GANs with the ability to perform real-time high-quality\nimage editing on mobile devices with remarkably reduced training and storage\ncosts for each concept.\n","authors":["Yifan Gong","Zheng Zhan","Qing Jin","Yanyu Li","Yerlan Idelbayev","Xian Liu","Andrey Zharkov","Kfir Aberman","Sergey Tulyakov","Yanzhi Wang","Jian Ren"],"pdf_url":"https://arxiv.org/pdf/2401.06127v2.pdf","comment":"ICML 2024. Project Page: https://yifanfanfanfan.github.io/e2gan/"},{"id":"http://arxiv.org/abs/2404.16666v3","updated":"2024-06-03T02:07:14Z","published":"2024-04-25T15:06:58Z","title":"PhyRecon: Physically Plausible Neural Scene Reconstruction","summary":" Neural implicit representations have gained popularity in multi-view 3D\nreconstruction. However, most previous work struggles to yield physically\nplausible results, limiting their utility in domains requiring rigorous\nphysical accuracy, such as embodied AI and robotics. This lack of plausibility\nstems from the absence of physics modeling in existing methods and their\ninability to recover intricate geometrical structures. In this paper, we\nintroduce PhyRecon, the first approach to leverage both differentiable\nrendering and differentiable physics simulation to learn implicit surface\nrepresentations. PhyRecon features a novel differentiable particle-based\nphysical simulator built on neural implicit representations. Central to this\ndesign is an efficient transformation between SDF-based implicit\nrepresentations and explicit surface points via our proposed Surface Points\nMarching Cubes (SP-MC), enabling differentiable learning with both rendering\nand physical losses. Additionally, PhyRecon models both rendering and physical\nuncertainty to identify and compensate for inconsistent and inaccurate\nmonocular geometric priors. This physical uncertainty further facilitates a\nnovel physics-guided pixel sampling to enhance the learning of slender\nstructures. By integrating these techniques, our model supports differentiable\njoint modeling of appearance, geometry, and physics. Extensive experiments\ndemonstrate that PhyRecon significantly outperforms all state-of-the-art\nmethods. Our results also exhibit superior physical stability in physical\nsimulators, with at least a 40% improvement across all datasets, paving the way\nfor future physics-based applications.\n","authors":["Junfeng Ni","Yixin Chen","Bohan Jing","Nan Jiang","Bin Wang","Bo Dai","Puhao Li","Yixin Zhu","Song-Chun Zhu","Siyuan Huang"],"pdf_url":"https://arxiv.org/pdf/2404.16666v3.pdf","comment":"project page: https://phyrecon.github.io/"},{"id":"http://arxiv.org/abs/2403.16286v2","updated":"2024-06-03T01:43:08Z","published":"2024-03-24T20:31:42Z","title":"HemoSet: The First Blood Segmentation Dataset for Automation of\n Hemostasis Management","summary":" Hemorrhaging occurs in surgeries of all types, forcing surgeons to quickly\nadapt to the visual interference that results from blood rapidly filling the\nsurgical field. Introducing automation into the crucial surgical task of\nhemostasis management would offload mental and physical tasks from the surgeon\nand surgical assistants while simultaneously increasing the efficiency and\nsafety of the operation. The first step in automation of hemostasis management\nis detection of blood in the surgical field. To propel the development of blood\ndetection algorithms in surgeries, we present HemoSet, the first blood\nsegmentation dataset based on bleeding during a live animal robotic surgery.\nOur dataset features vessel hemorrhage scenarios where turbulent flow leads to\nabnormal pooling geometries in surgical fields. These pools are formed in\nconditions endemic to surgical procedures -- uneven heterogeneous tissue, under\nglossy lighting conditions and rapid tool movement. We benchmark several\nstate-of-the-art segmentation models and provide insight into the difficulties\nspecific to blood detection. We intend for HemoSet to spur development of\nautonomous blood suction tools by providing a platform for training and\nrefining blood segmentation models, addressing the precision needed for such\nrobotics.\n","authors":["Albert J. Miao","Shan Lin","Jingpei Lu","Florian Richter","Benjamin Ostrander","Emily K. Funk","Ryan K. Orosco","Michael C. Yip"],"pdf_url":"https://arxiv.org/pdf/2403.16286v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10754v2","updated":"2024-06-03T01:35:11Z","published":"2023-05-18T06:54:56Z","title":"Brain Imaging-to-Graph Generation using Adversarial Hierarchical\n Diffusion Models for MCI Causality Analysis","summary":" Effective connectivity can describe the causal patterns among brain regions.\nThese patterns have the potential to reveal the pathological mechanism and\npromote early diagnosis and effective drug development for cognitive disease.\nHowever, the current methods utilize software toolkits to extract empirical\nfeatures from brain imaging to estimate effective connectivity. These methods\nheavily rely on manual parameter settings and may result in large errors during\neffective connectivity estimation. In this paper, a novel brain\nimaging-to-graph generation (BIGG) framework is proposed to map functional\nmagnetic resonance imaging (fMRI) into effective connectivity for mild\ncognitive impairment (MCI) analysis. To be specific, the proposed BIGG\nframework is based on the diffusion denoising probabilistic models (DDPM),\nwhere each denoising step is modeled as a generative adversarial network (GAN)\nto progressively translate the noise and conditional fMRI to effective\nconnectivity. The hierarchical transformers in the generator are designed to\nestimate the noise at multiple scales. Each scale concentrates on both spatial\nand temporal information between brain regions, enabling good quality in noise\nremoval and better inference of causal relations. Meanwhile, the\ntransformer-based discriminator constrains the generator to further capture\nglobal and local patterns for improving high-quality and diversity generation.\nBy introducing the diffusive factor, the denoising inference with a large\nsampling step size is more efficient and can maintain high-quality results for\neffective connectivity generation. Evaluations of the ADNI dataset demonstrate\nthe feasibility and efficacy of the proposed model. The proposed model not only\nachieves superior prediction performance compared with other competing methods\nbut also predicts MCI-related causal connections that are consistent with\nclinical studies.\n","authors":["Qiankun Zuo","Hao Tian","Chi-Man Pun","Hongfei Wang","Yudong Zhang","Jin Hong"],"pdf_url":"https://arxiv.org/pdf/2305.10754v2.pdf","comment":"10 pages, 12 figures"},{"id":"http://arxiv.org/abs/2402.11058v3","updated":"2024-06-03T01:09:38Z","published":"2024-02-16T20:14:47Z","title":"II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in\n Visual Question Answering","summary":" Visual Question Answering (VQA) often involves diverse reasoning scenarios\nacross Vision and Language (V&L). Most prior VQA studies, however, have merely\nfocused on assessing the model's overall accuracy without evaluating it on\ndifferent reasoning cases. Furthermore, some recent works observe that\nconventional Chain-of-Thought (CoT) prompting fails to generate effective\nreasoning for VQA, especially for complex scenarios requiring multi-hop\nreasoning. In this paper, we propose II-MMR, a novel idea to identify and\nimprove multi-modal multi-hop reasoning in VQA. In specific, II-MMR takes a VQA\nquestion with an image and finds a reasoning path to reach its answer using two\nnovel language promptings: (i) answer prediction-guided CoT prompt, or (ii)\nknowledge triplet-guided prompt. II-MMR then analyzes this path to identify\ndifferent reasoning cases in current VQA benchmarks by estimating how many hops\nand what types (i.e., visual or beyond-visual) of reasoning are required to\nanswer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR\nobserves that most of their VQA questions are easy to answer, simply demanding\n\"single-hop\" reasoning, whereas only a few questions require \"multi-hop\"\nreasoning. Moreover, while the recent V&L model struggles with such complex\nmulti-hop reasoning questions even using the traditional CoT method, II-MMR\nshows its effectiveness across all reasoning cases in both zero-shot and\nfine-tuning settings.\n","authors":["Jihyung Kil","Farideh Tavazoee","Dongyeop Kang","Joo-Kyung Kim"],"pdf_url":"https://arxiv.org/pdf/2402.11058v3.pdf","comment":"Accepted to ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2401.05604v2","updated":"2024-06-03T23:49:45Z","published":"2024-01-11T00:30:28Z","title":"REBUS: A Robust Evaluation Benchmark of Understanding Symbols","summary":" We propose a new benchmark evaluating the performance of multimodal large\nlanguage models on rebus puzzles. The dataset covers 333 original examples of\nimage-based wordplay, cluing 13 categories such as movies, composers, major\ncities, and food. To achieve good performance on the benchmark of identifying\nthe clued word or phrase, models must combine image recognition and string\nmanipulation with hypothesis testing, multi-step reasoning, and an\nunderstanding of human cognition, making for a complex, multimodal evaluation\nof capabilities. We find that GPT-4o significantly outperforms all other\nmodels, followed by proprietary models outperforming all other evaluated\nmodels. However, even the best model has a final accuracy of only 42\\%, which\ngoes down to just 7\\% on hard puzzles, highlighting the need for substantial\nimprovements in reasoning. Further, models rarely understand all parts of a\npuzzle, and are almost always incapable of retroactively explaining the correct\nanswer. Our benchmark can therefore be used to identify major shortcomings in\nthe knowledge and reasoning of multimodal large language models.\n","authors":["Andrew Gritsevskiy","Arjun Panickssery","Aaron Kirtland","Derik Kauffman","Hans Gundlach","Irina Gritsevskaya","Joe Cavanagh","Jonathan Chiang","Lydia La Roux","Michelle Hung"],"pdf_url":"https://arxiv.org/pdf/2401.05604v2.pdf","comment":"20 pages, 5 figures. For code, see http://github.com/cvndsh/rebus"},{"id":"http://arxiv.org/abs/2402.01103v3","updated":"2024-06-03T23:30:33Z","published":"2024-02-02T02:40:51Z","title":"Compositional Generative Modeling: A Single Model is Not All You Need","summary":" Large monolithic generative models trained on massive amounts of data have\nbecome an increasingly dominant approach in AI research. In this paper, we\nargue that we should instead construct large generative systems by composing\nsmaller generative models together. We show how such a compositional generative\napproach enables us to learn distributions in a more data-efficient manner,\nenabling generalization to parts of the data distribution unseen at training\ntime. We further show how this enables us to program and construct new\ngenerative models for tasks completely unseen at training. Finally, we show\nthat in many cases, we can discover separate compositional components from\ndata.\n","authors":["Yilun Du","Leslie Kaelbling"],"pdf_url":"https://arxiv.org/pdf/2402.01103v3.pdf","comment":"ICML 2024 (Position Track)"},{"id":"http://arxiv.org/abs/2406.01843v1","updated":"2024-06-03T23:28:57Z","published":"2024-06-03T23:28:57Z","title":"L-MAGIC: Language Model Assisted Generation of Images with Coherence","summary":" In the current era of generative AI breakthroughs, generating panoramic\nscenes from a single input image remains a key challenge. Most existing methods\nuse diffusion-based iterative or simultaneous multi-view inpainting. However,\nthe lack of global scene layout priors leads to subpar outputs with duplicated\nobjects (e.g., multiple beds in a bedroom) or requires time-consuming human\ntext inputs for each view. We propose L-MAGIC, a novel method leveraging large\nlanguage models for guidance while diffusing multiple coherent views of 360\ndegree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language\nmodels without fine-tuning, ensuring zero-shot performance. The output quality\nis further enhanced by super-resolution and multi-view fusion techniques.\nExtensive experiments demonstrate that the resulting panoramic scenes feature\nbetter scene layouts and perspective view rendering quality compared to related\nworks, with >70% preference in human evaluations. Combined with conditional\ndiffusion models, L-MAGIC can accept various input modalities, including but\nnot limited to text, depth maps, sketches, and colored scripts. Applying depth\nestimation further enables 3D point cloud generation and dynamic scene\nexploration with fluid camera motion. Code is available at\nhttps://github.com/IntelLabs/MMPano. The video presentation is available at\nhttps://youtu.be/XDMNEzH4-Ec?list=PLG9Zyvu7iBa0-a7ccNLO8LjcVRAoMn57s.\n","authors":["Zhipeng Cai","Matthias Mueller","Reiner Birkl","Diana Wofk","Shao-Yen Tseng","JunDa Cheng","Gabriela Ben-Melech Stan","Vasudev Lal","Michael Paulitsch"],"pdf_url":"https://arxiv.org/pdf/2406.01843v1.pdf","comment":"accepted to CVPR 2024"},{"id":"http://arxiv.org/abs/2405.17698v3","updated":"2024-06-03T23:24:39Z","published":"2024-05-27T23:09:37Z","title":"BaboonLand Dataset: Tracking Primates in the Wild and Automating\n Behaviour Recognition from Drone Videos","summary":" Using drones to track multiple individuals simultaneously in their natural\nenvironment is a powerful approach for better understanding group primate\nbehavior. Previous studies have demonstrated that it is possible to automate\nthe classification of primate behavior from video data, but these studies have\nbeen carried out in captivity or from ground-based cameras. To understand group\nbehavior and the self-organization of a collective, the whole troop needs to be\nseen at a scale where behavior can be seen in relation to the natural\nenvironment in which ecological decisions are made. This study presents a novel\ndataset from drone videos for baboon detection, tracking, and behavior\nrecognition. The baboon detection dataset was created by manually annotating\nall baboons in drone videos with bounding boxes. A tiling method was\nsubsequently applied to create a pyramid of images at various scales from the\noriginal 5.3K resolution images, resulting in approximately 30K images used for\nbaboon detection. The tracking dataset is derived from the detection dataset,\nwhere all bounding boxes are assigned the same ID throughout the video. This\nprocess resulted in half an hour of very dense tracking data. The behavior\nrecognition dataset was generated by converting tracks into mini-scenes, a\nvideo subregion centered on each animal; each mini-scene was manually annotated\nwith 12 distinct behavior types, resulting in over 20 hours of data. Benchmark\nresults show mean average precision (mAP) of 92.62\\% for the YOLOv8-X detection\nmodel, multiple object tracking precision (MOTA) of 63.81\\% for the BotSort\ntracking algorithm, and micro top-1 accuracy of 63.97\\% for the X3D behavior\nrecognition model. Using deep learning to classify wildlife behavior from drone\nfootage facilitates non-invasive insight into the collective behavior of an\nentire group.\n","authors":["Isla Duporge","Maksim Kholiavchenko","Roi Harel","Scott Wolf","Dan Rubenstein","Meg Crofoot","Tanya Berger-Wolf","Stephen Lee","Julie Barreau","Jenna Kline","Michelle Ramirez","Charles Stewart"],"pdf_url":"https://arxiv.org/pdf/2405.17698v3.pdf","comment":"Dataset will be published shortly"},{"id":"http://arxiv.org/abs/2406.01837v1","updated":"2024-06-03T23:09:30Z","published":"2024-06-03T23:09:30Z","title":"Boosting Vision-Language Models with Transduction","summary":" Transduction is a powerful paradigm that leverages the structure of unlabeled\ndata to boost predictive accuracy. We present TransCLIP, a novel and\ncomputationally efficient transductive approach designed for Vision-Language\nModels (VLMs). TransCLIP is applicable as a plug-and-play module on top of\npopular inductive zero- and few-shot models, consistently improving their\nperformances. Our new objective function can be viewed as a regularized\nmaximum-likelihood estimation, constrained by a KL divergence penalty that\nintegrates the text-encoder knowledge and guides the transductive learning\nprocess. We further derive an iterative Block Majorize-Minimize (BMM) procedure\nfor optimizing our objective, with guaranteed convergence and decoupled\nsample-assignment updates, yielding computationally efficient transduction for\nlarge-scale datasets. We report comprehensive evaluations, comparisons, and\nablation studies that demonstrate: (i) Transduction can greatly enhance the\ngeneralization capabilities of inductive pretrained zero- and few-shot VLMs;\n(ii) TransCLIP substantially outperforms standard transductive few-shot\nlearning methods relying solely on vision features, notably due to the KL-based\nlanguage constraint.\n","authors":["Maxime Zanella","Benoît Gérin","Ismail Ben Ayed"],"pdf_url":"https://arxiv.org/pdf/2406.01837v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08400v2","updated":"2024-06-03T23:02:26Z","published":"2024-02-13T11:59:43Z","title":"Adaptive Hierarchical Certification for Segmentation using Randomized\n Smoothing","summary":" Certification for machine learning is proving that no adversarial sample can\nevade a model within a range under certain conditions, a necessity for\nsafety-critical domains. Common certification methods for segmentation use a\nflat set of fine-grained classes, leading to high abstain rates due to model\nuncertainty across many classes. We propose a novel, more practical setting,\nwhich certifies pixels within a multi-level hierarchy, and adaptively relaxes\nthe certification to a coarser level for unstable components classic methods\nwould abstain from, effectively lowering the abstain rate whilst providing more\ncertified semantically meaningful information. We mathematically formulate the\nproblem setup, introduce an adaptive hierarchical certification algorithm and\nprove the correctness of its guarantees. Since certified accuracy does not take\nthe loss of information into account for coarser classes, we introduce the\nCertified Information Gain ($\\mathrm{CIG}$) metric, which is proportional to\nthe class granularity level. Our extensive experiments on the datasets\nCityscapes, PASCAL-Context, ACDC and COCO-Stuff demonstrate that our adaptive\nalgorithm achieves a higher $\\mathrm{CIG}$ and lower abstain rate compared to\nthe current state-of-the-art certification method. Our code can be found here:\nhttps://github.com/AlaaAnani/adaptive-certify.\n","authors":["Alaa Anani","Tobias Lorenz","Bernt Schiele","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2402.08400v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14830v2","updated":"2024-06-03T22:59:54Z","published":"2023-12-22T17:06:08Z","title":"Dreaming of Electrical Waves: Generative Modeling of Cardiac Excitation\n Waves using Diffusion Models","summary":" Electrical waves in the heart form rotating spiral or scroll waves during\nlife-threatening arrhythmias such as atrial or ventricular fibrillation. The\nwave dynamics are typically modeled using coupled partial differential\nequations, which describe reaction-diffusion dynamics in excitable media. More\nrecently, data-driven generative modeling has emerged as an alternative to\ngenerate spatio-temporal patterns in physical and biological systems. Here, we\nexplore denoising diffusion probabilistic models for the generative modeling of\nelectrical wave patterns in cardiac tissue. We trained diffusion models with\nsimulated electrical wave patterns to be able to generate such wave patterns in\nunconditional and conditional generation tasks. For instance, we explored the\ndiffusion-based i) parameter-specific generation, ii) evolution and iii)\ninpainting of spiral wave dynamics, including reconstructing three-dimensional\nscroll wave dynamics from superficial two-dimensional measurements. Further, we\ngenerated arbitrarily shaped bi-ventricular geometries and simultaneously\ninitiated scroll wave patterns inside these geometries using diffusion. We\ncharacterized and compared the diffusion-generated solutions to solutions\nobtained with corresponding biophysical models and found that diffusion models\nlearn to replicate spiral and scroll waves dynamics so well that they could be\nused for data-driven modeling of excitation waves in cardiac tissue. For\ninstance, an ensemble of diffusion-generated spiral wave dynamics exhibits\nsimilar self-termination statistics as the corresponding ensemble simulated\nwith a biophysical model. However, we also found that diffusion models {produce\nartifacts if training data is lacking, e.g. during self-termination,} and\n`hallucinate' wave patterns when insufficiently constrained.\n","authors":["Tanish Baranwal","Jan Lebert","Jan Christoph"],"pdf_url":"https://arxiv.org/pdf/2312.14830v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01829v1","updated":"2024-06-03T22:56:40Z","published":"2024-06-03T22:56:40Z","title":"FacAID: A Transformer Model for Neuro-Symbolic Facade Reconstruction","summary":" We introduce a neuro-symbolic transformer-based model that converts flat,\nsegmented facade structures into procedural definitions using a custom-designed\nsplit grammar. To facilitate this, we first develop a semi-complex split\ngrammar tailored for architectural facades and then generate a dataset\ncomprising of facades alongside their corresponding procedural representations.\nThis dataset is used to train our transformer model to convert segmented, flat\nfacades into the procedural language of our grammar. During inference, the\nmodel applies this learned transformation to new facade segmentations,\nproviding a procedural representation that users can adjust to generate varied\nfacade designs. This method not only automates the conversion of static facade\nimages into dynamic, editable procedural formats but also enhances the design\nflexibility, allowing for easy modifications and variations by architects and\ndesigners. Our approach sets a new standard in facade design by combining the\nprecision of procedural generation with the adaptability of neuro-symbolic\nlearning.\n","authors":["Aleksander Płocharski","Jan Swidzinski","Joanna Porter-Sobieraj","Przemyslaw Musialski"],"pdf_url":"https://arxiv.org/pdf/2406.01829v1.pdf","comment":"11 pages, 10 figures, preprint"},{"id":"http://arxiv.org/abs/2405.07842v2","updated":"2024-06-03T22:51:32Z","published":"2024-05-13T15:30:41Z","title":"Ground-based image deconvolution with Swin Transformer UNet","summary":" As ground-based all-sky astronomical surveys will gather millions of images\nin the coming years, a critical requirement emerges for the development of fast\ndeconvolution algorithms capable of efficiently improving the spatial\nresolution of these images. By successfully recovering clean and\nhigh-resolution images from these surveys, the objective is to deepen the\nunderstanding of galaxy formation and evolution through accurate photometric\nmeasurements. We introduce a two-step deconvolution framework using a Swin\nTransformer architecture. Our study reveals that the deep learning-based\nsolution introduces a bias, constraining the scope of scientific analysis. To\naddress this limitation, we propose a novel third step relying on the active\ncoefficients in the sparsity wavelet framework. We conducted a performance\ncomparison between our deep learning-based method and Firedec, a classical\ndeconvolution algorithm, based on an analysis of a subset of the EDisCS cluster\nsamples. We demonstrate the advantage of our method in terms of resolution\nrecovery, generalisation to different noise properties, and computational\nefficiency. The analysis of this cluster sample not only allowed us to assess\nthe efficiency of our method, but it also enabled us to quantify the number of\nclumps within these galaxies in relation to their disc colour. This robust\ntechnique that we propose holds promise for identifying structures in the\ndistant universe through ground-based images.\n","authors":["Utsav Akhaury","Pascale Jablonka","Jean-Luc Starck","Frédéric Courbin"],"pdf_url":"https://arxiv.org/pdf/2405.07842v2.pdf","comment":"11 pages, 14 figures"},{"id":"http://arxiv.org/abs/2311.01623v4","updated":"2024-06-03T22:36:36Z","published":"2023-11-03T16:58:10Z","title":"VQPy: An Object-Oriented Approach to Modern Video Analytics","summary":" Video analytics is widely used in contemporary systems and services. At the\nforefront of video analytics are video queries that users develop to find\nobjects of particular interest. Building upon the insight that video objects\n(e.g., human, animals, cars, etc.), the center of video analytics, are similar\nin spirit to objects modeled by traditional object-oriented languages, we\npropose to develop an object-oriented approach to video analytics. This\napproach, named VQPy, consists of a frontend$\\unicode{x2015}$a Python variant\nwith constructs that make it easy for users to express video objects and their\ninteractions$\\unicode{x2015}$as well as an extensible backend that can\nautomatically construct and optimize pipelines based on video objects. We have\nimplemented and open-sourced VQPy, which has been productized in Cisco as part\nof its DeepVision framework.\n","authors":["Shan Yu","Zhenting Zhu","Yu Chen","Hanchen Xu","Pengzhan Zhao","Yang Wang","Arthi Padmanabhan","Hugo Latapie","Harry Xu"],"pdf_url":"https://arxiv.org/pdf/2311.01623v4.pdf","comment":"MLSys'24"},{"id":"http://arxiv.org/abs/2405.20510v2","updated":"2024-06-03T22:34:58Z","published":"2024-05-30T21:59:29Z","title":"Physically Compatible 3D Object Modeling from a Single Image","summary":" We present a computational framework that transforms single images into 3D\nphysical objects. The visual geometry of a physical object in an image is\ndetermined by three orthogonal attributes: mechanical properties, external\nforces, and rest-shape geometry. Existing single-view 3D reconstruction methods\noften overlook this underlying composition, presuming rigidity or neglecting\nexternal forces. Consequently, the reconstructed objects fail to withstand\nreal-world physical forces, resulting in instability or undesirable deformation\n-- diverging from their intended designs as depicted in the image. Our\noptimization framework addresses this by embedding physical compatibility into\nthe reconstruction process. We explicitly decompose the three physical\nattributes and link them through static equilibrium, which serves as a hard\nconstraint, ensuring that the optimized physical shapes exhibit desired\nphysical behaviors. Evaluations on a dataset collected from Objaverse\ndemonstrate that our framework consistently enhances the physical realism of 3D\nmodels over existing methods. The utility of our framework extends to practical\napplications in dynamic simulations and 3D printing, where adherence to\nphysical compatibility is paramount.\n","authors":["Minghao Guo","Bohan Wang","Pingchuan Ma","Tianyuan Zhang","Crystal Elaine Owens","Chuang Gan","Joshua B. Tenenbaum","Kaiming He","Wojciech Matusik"],"pdf_url":"https://arxiv.org/pdf/2405.20510v2.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2405.12473v2","updated":"2024-06-03T15:05:57Z","published":"2024-05-21T03:25:32Z","title":"Learning Partially Aligned Item Representation for Cross-Domain\n Sequential Recommendation","summary":" Cross-domain sequential recommendation (CDSR) aims to uncover and transfer\nusers' sequential preferences across multiple recommendation domains. While\nsignificant endeavors have been made, they primarily concentrated on developing\nadvanced transfer modules and aligning user representations using\nself-supervised learning techniques. However, the problem of aligning item\nrepresentations has received limited attention, and misaligned item\nrepresentations can potentially lead to sub-optimal sequential modeling and\nuser representation alignment. To this end, we propose a model-agnostic\nframework called \\textbf{C}ross-domain item representation \\textbf{A}lignment\nfor \\textbf{C}ross-\\textbf{D}omain \\textbf{S}equential \\textbf{R}ecommendation\n(\\textbf{CA-CDSR}), which achieves sequence-aware generation and adaptively\npartial alignment for item representations. Specifically, we first develop a\nsequence-aware feature augmentation strategy, which captures both collaborative\nand sequential item correlations, thus facilitating holistic item\nrepresentation generation. Next, we conduct an empirical study to investigate\nthe partial representation alignment problem from a spectrum perspective. It\nmotivates us to devise an adaptive spectrum filter, achieving partial alignment\nadaptively. Furthermore, the aligned item representations can be fed into\ndifferent sequential encoders to obtain user representations. The entire\nframework is optimized in a multi-task learning paradigm with an annealing\nstrategy. Extensive experiments have demonstrated that CA-CDSR can surpass\nstate-of-the-art baselines by a significant margin and can effectively align\nitems in representation spaces to enhance performance.\n","authors":["Mingjia Yin","Hao Wang","Wei Guo","Yong Liu","Zhi Li","Sirui Zhao","Defu Lian","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2405.12473v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17795v2","updated":"2024-06-03T15:02:52Z","published":"2024-05-28T03:45:34Z","title":"Dataset Regeneration for Sequential Recommendation","summary":" The sequential recommender (SR) system is a crucial component of modern\nrecommender systems, as it aims to capture the evolving preferences of users.\nSignificant efforts have been made to enhance the capabilities of SR systems.\nThese methods typically follow the model-centric paradigm, which involves\ndeveloping effective models based on fixed datasets. However, this approach\noften overlooks potential quality issues and flaws inherent in the data. Driven\nby the potential of data-centric AI, we propose a novel data-centric paradigm\nfor developing an ideal training dataset using a model-agnostic dataset\nregeneration framework called DR4SR. This framework enables the regeneration of\na dataset with exceptional cross-architecture generalizability. Additionally,\nwe introduce the DR4SR+ framework, which incorporates a model-aware dataset\npersonalizer to tailor the regenerated dataset specifically for a target model.\nTo demonstrate the effectiveness of the data-centric paradigm, we integrate our\nframework with various model-centric methods and observe significant\nperformance improvements across four widely adopted datasets. Furthermore, we\nconduct in-depth analyses to explore the potential of the data-centric paradigm\nand provide valuable insights. The code can be found at\nhttps://anonymous.4open.science/r/KDD2024-86EA\n","authors":["Mingjia Yin","Hao Wang","Wei Guo","Yong Liu","Suojuan Zhang","Sirui Zhao","Defu Lian","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2405.17795v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.15641v2","updated":"2024-06-03T11:11:13Z","published":"2024-01-28T12:33:14Z","title":"PRE: A Peer Review Based Large Language Model Evaluator","summary":" The impressive performance of large language models (LLMs) has attracted\nconsiderable attention from the academic and industrial communities. Besides\nhow to construct and train LLMs, how to effectively evaluate and compare the\ncapacity of LLMs has also been well recognized as an important yet difficult\nproblem. Existing paradigms rely on either human annotators or model-based\nevaluators to evaluate the performance of LLMs on different tasks. However,\nthese paradigms often suffer from high cost, low generalizability, and\ninherited biases in practice, which make them incapable of supporting the\nsustainable development of LLMs in long term. In order to address these issues,\ninspired by the peer review systems widely used in academic publication\nprocess, we propose a novel framework that can automatically evaluate LLMs\nthrough a peer-review process. Specifically, for the evaluation of a specific\ntask, we first construct a small qualification exam to select \"reviewers\" from\na couple of powerful LLMs. Then, to actually evaluate the \"submissions\" written\nby different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to\nrate or compare the submissions. The final ranking of evaluatee LLMs is\ngenerated based on the results provided by all reviewers. We conducted\nextensive experiments on text summarization tasks with eleven LLMs including\nGPT-4. The results demonstrate the existence of biasness when evaluating using\na single LLM. Also, our PRE model outperforms all the baselines, illustrating\nthe effectiveness of the peer review mechanism.\n","authors":["Zhumin Chu","Qingyao Ai","Yiteng Tu","Haitao Li","Yiqun Liu"],"pdf_url":"https://arxiv.org/pdf/2401.15641v2.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2401.04514v2","updated":"2024-06-03T06:50:26Z","published":"2024-01-09T12:12:50Z","title":"Rewriting the Code: A Simple Method for Large Language Model Augmented\n Code Search","summary":" In code search, the Generation-Augmented Retrieval (GAR) framework, which\ngenerates exemplar code snippets to augment queries, has emerged as a promising\nstrategy to address the principal challenge of modality misalignment between\ncode snippets and natural language queries, particularly with the demonstrated\ncode generation capabilities of Large Language Models (LLMs). Nevertheless, our\npreliminary investigations indicate that the improvements conferred by such an\nLLM-augmented framework are somewhat constrained. This limitation could\npotentially be ascribed to the fact that the generated codes, albeit\nfunctionally accurate, frequently display a pronounced stylistic deviation from\nthe ground truth code in the codebase. In this paper, we extend the\nfoundational GAR framework and propose a simple yet effective method that\nadditionally Rewrites the Code (ReCo) within the codebase for style\nnormalization. Experimental results demonstrate that ReCo significantly boosts\nretrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%),\nand fine-tuned dense (up to 23.6%) retrieval settings in diverse search\nscenarios. To further elucidate the advantages of ReCo and stimulate research\nin code style normalization, we introduce Code Style Similarity, the first\nmetric tailored to quantify stylistic similarities in code. Notably, our\nempirical findings reveal the inadequacy of existing metrics in capturing\nstylistic nuances. The source code and data are available at\n\\url{https://github.com/Alex-HaochenLi/ReCo}.\n","authors":["Haochen Li","Xin Zhou","Zhiqi Shen"],"pdf_url":"https://arxiv.org/pdf/2401.04514v2.pdf","comment":"Accepted to ACL 2024"},{"id":"http://arxiv.org/abs/2305.03972v3","updated":"2024-06-03T03:30:35Z","published":"2023-05-06T08:12:11Z","title":"Category-Oriented Representation Learning for Image to Multi-Modal\n Retrieval","summary":" The rise of multi-modal search requests from users has highlighted the\nimportance of multi-modal retrieval (i.e. image-to-text or text-to-image\nretrieval), yet the more complex task of image-to-multi-modal retrieval,\ncrucial for many industry applications, remains under-explored. To address this\ngap and promote further research, we introduce and define the concept of\nImage-to-Multi-Modal Retrieval (IMMR), a process designed to retrieve rich\nmulti-modal (i.e. image and text) documents based on image queries. We focus on\nrepresentation learning for IMMR and analyze three key challenges for it: 1)\nskewed data and noisy label in real-world industrial data, 2) the\ninformation-inequality between image and text modality of documents when\nlearning representations, 3) effective and efficient training in large-scale\nindustrial contexts. To tackle the above challenges, we propose a novel\nframework named organizing categories and learning by classification for\nretrieval (OCLEAR). It consists of three components: 1) a novel\ncategory-oriented data governance scheme coupled with a large-scale\nclassification-based learning paradigm, which handles the skewed and noisy data\nfrom a data perspective. 2) model architecture specially designed for\nmulti-modal learning, where information-inequality between image and text\nmodality of documents is considered for modality fusion. 3) a hybrid parallel\ntraining approach for tackling large-scale training in industrial scenario. The\nproposed framework achieves SOTA performance on public datasets and has been\ndeployed in a real-world industrial e-commence system, leading to significant\nbusiness growth. Code will be made publicly available.\n","authors":["Zida Cheng","Chen Ju","Xu Chen","Zhonghua Zhai","Shuai Xiao","Xiaoyi Zeng","Weilin Huang","Junchi Yan"],"pdf_url":"https://arxiv.org/pdf/2305.03972v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.11300v3","updated":"2024-06-03T19:35:25Z","published":"2023-04-22T03:13:05Z","title":"MAWSEO: Adversarial Wiki Search Poisoning for Illicit Online Promotion","summary":" As a prominent instance of vandalism edits, Wiki search poisoning for illicit\npromotion is a cybercrime in which the adversary aims at editing Wiki articles\nto promote illicit businesses through Wiki search results of relevant queries.\nIn this paper, we report a study that, for the first time, shows that such\nstealthy blackhat SEO on Wiki can be automated. Our technique, called MAWSEO,\nemploys adversarial revisions to achieve real-world cybercriminal objectives,\nincluding rank boosting, vandalism detection evasion, topic relevancy, semantic\nconsistency, user awareness (but not alarming) of promotional content, etc. Our\nevaluation and user study demonstrate that MAWSEO is capable of effectively and\nefficiently generating adversarial vandalism edits, which can bypass\nstate-of-the-art built-in Wiki vandalism detectors, and also get promotional\ncontent through to Wiki users without triggering their alarms. In addition, we\ninvestigated potential defense, including coherence based detection and\nadversarial training of vandalism detection, against our attack in the Wiki\necosystem.\n","authors":["Zilong Lin","Zhengyi Li","Xiaojing Liao","XiaoFeng Wang","Xiaozhong Liu"],"pdf_url":"https://arxiv.org/pdf/2304.11300v3.pdf","comment":"Accepted at the 45th IEEE Symposium on Security and Privacy (IEEE S&P\n 2024)"},{"id":"http://arxiv.org/abs/2406.01702v1","updated":"2024-06-03T18:02:13Z","published":"2024-06-03T18:02:13Z","title":"Session Context Embedding for Intent Understanding in Product Search","summary":" It is often noted that single query-item pair relevance training in search\ndoes not capture the customer intent. User intent can be better deduced from a\nseries of engagements (Clicks, ATCs, Orders) in a given search session. We\npropose a novel method for vectorizing session context for capturing and\nutilizing context in retrieval and rerank. In the runtime, session embedding is\nan alternative to query embedding, saved and updated after each request in the\nsession, it can be used for retrieval and ranking. We outline session\nembedding's solution to session-based intent understanding and its\narchitecture, the background to this line of thought in search and\nrecommendation, detail the methodologies implemented, and finally present the\nresults of an implementation of session embedding for query product type\nclassification. We demonstrate improvements over strategies ignoring session\ncontext in the runtime for user intent understanding.\n","authors":["Navid Mehrdad","Vishal Rathi","Sravanthi Rajanala"],"pdf_url":"https://arxiv.org/pdf/2406.01702v1.pdf","comment":"5 pages, 1 Figure, 5 Tables, SIGIR 2024, LLM for Individuals, Groups,\n and Society"},{"id":"http://arxiv.org/abs/2406.01363v1","updated":"2024-06-03T14:31:47Z","published":"2024-06-03T14:31:47Z","title":"Privacy in LLM-based Recommendation: Recent Advances and Future\n Directions","summary":" Nowadays, large language models (LLMs) have been integrated with conventional\nrecommendation models to improve recommendation performance. However, while\nmost of the existing works have focused on improving the model performance, the\nprivacy issue has only received comparatively less attention. In this paper, we\nreview recent advancements in privacy within LLM-based recommendation,\ncategorizing them into privacy attacks and protection mechanisms. Additionally,\nwe highlight several challenges and propose future directions for the community\nto address these critical problems.\n","authors":["Sichun Luo","Wei Shao","Yuxuan Yao","Jian Xu","Mingyang Liu","Qintong Li","Bowei He","Maolin Wang","Guanzhi Deng","Hanxu Hou","Xinyi Zhang","Linqi Song"],"pdf_url":"https://arxiv.org/pdf/2406.01363v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01285v1","updated":"2024-06-03T12:53:37Z","published":"2024-06-03T12:53:37Z","title":"Large Language Models as Recommender Systems: A Study of Popularity Bias","summary":" The issue of popularity bias -- where popular items are disproportionately\nrecommended, overshadowing less popular but potentially relevant items --\nremains a significant challenge in recommender systems. Recent advancements\nhave seen the integration of general-purpose Large Language Models (LLMs) into\nthe architecture of such systems. This integration raises concerns that it\nmight exacerbate popularity bias, given that the LLM's training data is likely\ndominated by popular items. However, it simultaneously presents a novel\nopportunity to address the bias via prompt tuning. Our study explores this\ndichotomy, examining whether LLMs contribute to or can alleviate popularity\nbias in recommender systems. We introduce a principled way to measure\npopularity bias by discussing existing metrics and proposing a novel metric\nthat fulfills a series of desiderata. Based on our new metric, we compare a\nsimple LLM-based recommender to traditional recommender systems on a movie\nrecommendation task. We find that the LLM recommender exhibits less popularity\nbias, even without any explicit mitigation.\n","authors":["Jan Malte Lichtenberg","Alexander Buchholz","Pola Schwöbel"],"pdf_url":"https://arxiv.org/pdf/2406.01285v1.pdf","comment":"Accepted at Gen-IR@SIGIR24 workshop"},{"id":"http://arxiv.org/abs/2406.01280v1","updated":"2024-06-03T12:48:38Z","published":"2024-06-03T12:48:38Z","title":"Demo: Soccer Information Retrieval via Natural Queries using SoccerRAG","summary":" The rapid evolution of digital sports media necessitates sophisticated\ninformation retrieval systems that can efficiently parse extensive multimodal\ndatasets. This paper demonstrates SoccerRAG, an innovative framework designed\nto harness the power of Retrieval Augmented Generation (RAG) and Large Language\nModels (LLMs) to extract soccer-related information through natural language\nqueries. By leveraging a multimodal dataset, SoccerRAG supports dynamic\nquerying and automatic data validation, enhancing user interaction and\naccessibility to sports archives. We present a novel interactive user interface\n(UI) based on the Chainlit framework which wraps around the core functionality,\nand enable users to interact with the SoccerRAG framework in a chatbot-like\nvisual manner.\n","authors":["Aleksander Theo Strand","Sushant Gautam","Cise Midoglu","Pål Halvorsen"],"pdf_url":"https://arxiv.org/pdf/2406.01280v1.pdf","comment":"accepted to CBMI 2024 as a demonstration;\n https://github.com/simula/soccer-rag"},{"id":"http://arxiv.org/abs/2406.01273v1","updated":"2024-06-03T12:39:04Z","published":"2024-06-03T12:39:04Z","title":"SoccerRAG: Multimodal Soccer Information Retrieval via Natural Queries","summary":" The rapid evolution of digital sports media necessitates sophisticated\ninformation retrieval systems that can efficiently parse extensive multimodal\ndatasets. This paper introduces SoccerRAG, an innovative framework designed to\nharness the power of Retrieval Augmented Generation (RAG) and Large Language\nModels (LLMs) to extract soccer-related information through natural language\nqueries. By leveraging a multimodal dataset, SoccerRAG supports dynamic\nquerying and automatic data validation, enhancing user interaction and\naccessibility to sports archives. Our evaluations indicate that SoccerRAG\neffectively handles complex queries, offering significant improvements over\ntraditional retrieval systems in terms of accuracy and user engagement. The\nresults underscore the potential of using RAG and LLMs in sports analytics,\npaving the way for future advancements in the accessibility and real-time\nprocessing of sports data.\n","authors":["Aleksander Theo Strand","Sushant Gautam","Cise Midoglu","Pål Halvorsen"],"pdf_url":"https://arxiv.org/pdf/2406.01273v1.pdf","comment":"accepted to CBMI 2024 as a regular paper;\n https://github.com/simula/soccer-rag"},{"id":"http://arxiv.org/abs/2406.01233v1","updated":"2024-06-03T11:52:52Z","published":"2024-06-03T11:52:52Z","title":"Multi-word Term Embeddings Improve Lexical Product Retrieval","summary":" Product search is uniquely different from search for documents, Internet\nresources or vacancies, therefore it requires the development of specialized\nsearch systems. The present work describes the H1 embdedding model, designed\nfor an offline term indexing of product descriptions at e-commerce platforms.\nThe model is compared to other state-of-the-art (SoTA) embedding models within\na framework of hybrid product search system that incorporates the advantages of\nlexical methods for product retrieval and semantic embedding-based methods. We\npropose an approach to building semantically rich term vocabularies for search\nindexes. Compared to other production semantic models, H1 paired with the\nproposed approach stands out due to its ability to process multi-word product\nterms as one token. As an example, for search queries \"new balance shoes\",\n\"gloria jeans kids wear\" brand entity will be represented as one token - \"new\nbalance\", \"gloria jeans\". This results in an increased precision of the system\nwithout affecting the recall. The hybrid search system with proposed model\nscores mAP@12 = 56.1% and R@1k = 86.6% on the WANDS public dataset, beating\nother SoTA analogues.\n","authors":["Viktor Shcherbakov","Fedor Krasnov"],"pdf_url":"https://arxiv.org/pdf/2406.01233v1.pdf","comment":"10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2406.01022v1","updated":"2024-06-03T06:08:02Z","published":"2024-06-03T06:08:02Z","title":"Poisoning Attacks and Defenses in Recommender Systems: A Survey","summary":" Modern recommender systems (RS) have profoundly enhanced user experience\nacross digital platforms, yet they face significant threats from poisoning\nattacks. These attacks, aimed at manipulating recommendation outputs for\nunethical gains, exploit vulnerabilities in RS through injecting malicious data\nor intervening model training. This survey presents a unique perspective by\nexamining these threats through the lens of an attacker, offering fresh\ninsights into their mechanics and impacts. Concretely, we detail a systematic\npipeline that encompasses four stages of a poisoning attack: setting attack\ngoals, assessing attacker capabilities, analyzing victim architecture, and\nimplementing poisoning strategies. The pipeline not only aligns with various\nattack tactics but also serves as a comprehensive taxonomy to pinpoint focuses\nof distinct poisoning attacks. Correspondingly, we further classify defensive\nstrategies into two main categories: poisoning data filtering and robust\ntraining from the defender's perspective. Finally, we highlight existing\nlimitations and suggest innovative directions for further exploration in this\nfield.\n","authors":["Zongwei Wang","Junliang Yu","Min Gao","Guanhua Ye","Shazia Sadiq","Hongzhi Yin"],"pdf_url":"https://arxiv.org/pdf/2406.01022v1.pdf","comment":"22 pages, 8 figures"},{"id":"http://arxiv.org/abs/2406.00973v1","updated":"2024-06-03T04:03:24Z","published":"2024-06-03T04:03:24Z","title":"Cold-start Recommendation by Personalized Embedding Region Elicitation","summary":" Rating elicitation is a success element for recommender systems to perform\nwell at cold-starting, in which the systems need to recommend items to a newly\narrived user with no prior knowledge about the user's preference. Existing\nelicitation methods employ a fixed set of items to learn the user's preference\nand then infer the users' preferences on the remaining items. Using a fixed\nseed set can limit the performance of the recommendation system since the seed\nset is unlikely optimal for all new users with potentially diverse preferences.\nThis paper addresses this challenge using a 2-phase, personalized elicitation\nscheme. First, the elicitation scheme asks users to rate a small set of popular\nitems in a ``burn-in'' phase. Second, it sequentially asks the user to rate\nadaptive items to refine the preference and the user's representation.\nThroughout the process, the system represents the user's embedding value not by\na point estimate but by a region estimate. The value of information obtained by\nasking the user's rating on an item is quantified by the distance from the\nregion center embedding space that contains with high confidence the true\nembedding value of the user. Finally, the recommendations are successively\ngenerated by considering the preference region of the user. We show that each\nsubproblem in the elicitation scheme can be efficiently implemented. Further,\nwe empirically demonstrate the effectiveness of the proposed method against\nexisting rating-elicitation methods on several prominent datasets.\n","authors":["Hieu Trung Nguyen","Duy Nguyen","Khoa Doan","Viet Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2406.00973v1.pdf","comment":"Accepted at UAI 2024"},{"id":"http://arxiv.org/abs/2406.00944v1","updated":"2024-06-03T02:56:14Z","published":"2024-06-03T02:56:14Z","title":"Unveil the Duality of Retrieval-Augmented Generation: Theoretical\n Analysis and Practical Solution","summary":" Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance\nlarge language models (LLMs). However, studies show that RAG is not\nconsistently effective and can even mislead LLMs due to noisy or incorrect\nretrieved texts. This suggests that RAG possesses a duality including both\nbenefit and detriment. Although many existing methods attempt to address this\nissue, they lack a theoretical explanation for the duality in RAG. The benefit\nand detriment within this duality remain a black box that cannot be quantified\nor compared in an explainable manner. This paper takes the first step in\ntheoretically giving the essential explanation of benefit and detriment in RAG\nby: (1) decoupling and formalizing them from RAG prediction, (2) approximating\nthe gap between their values by representation similarity and (3) establishing\nthe trade-off mechanism between them, to make them explainable, quantifiable,\nand comparable. We demonstrate that the distribution difference between\nretrieved texts and LLMs' knowledge acts as double-edged sword, bringing both\nbenefit and detriment. We also prove that the actual effect of RAG can be\npredicted at token level. Based on our theory, we propose a practical novel\nmethod, X-RAG, which achieves collaborative generation between pure LLM and RAG\nat token level to preserve benefit and avoid detriment. Experiments in\nreal-world tasks based on LLMs including OPT, LLaMA-2, and Mistral show the\neffectiveness of our method and support our theoretical results.\n","authors":["Shicheng Xu","Liang Pang","Huawei Shen","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2406.00944v1.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2406.00083v1","updated":"2024-06-03T02:25:33Z","published":"2024-06-03T02:25:33Z","title":"BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of\n Large Language Models","summary":" Large Language Models (LLMs) are constrained by outdated information and a\ntendency to generate incorrect data, commonly referred to as \"hallucinations.\"\nRetrieval-Augmented Generation (RAG) addresses these limitations by combining\nthe strengths of retrieval-based methods and generative models. This approach\ninvolves retrieving relevant information from a large, up-to-date dataset and\nusing it to enhance the generation process, leading to more accurate and\ncontextually appropriate responses. Despite its benefits, RAG introduces a new\nattack surface for LLMs, particularly because RAG databases are often sourced\nfrom public data, such as the web. In this paper, we propose \\TrojRAG{} to\nidentify the vulnerabilities and attacks on retrieval parts (RAG database) and\ntheir indirect attacks on generative parts (LLMs). Specifically, we identify\nthat poisoning several customized content passages could achieve a retrieval\nbackdoor, where the retrieval works well for clean queries but always returns\ncustomized poisoned adversarial queries. Triggers and poisoned passages can be\nhighly customized to implement various attacks. For example, a trigger could be\na semantic group like \"The Republican Party, Donald Trump, etc.\" Adversarial\npassages can be tailored to different contents, not only linked to the triggers\nbut also used to indirectly attack generative LLMs without modifying them.\nThese attacks can include denial-of-service attacks on RAG and semantic\nsteering attacks on LLM generations conditioned by the triggers. Our\nexperiments demonstrate that by just poisoning 10 adversarial passages can\ninduce 98.2\\% success rate to retrieve the adversarial passages. Then, these\npassages can increase the reject ratio of RAG-based GPT-4 from 0.01\\% to 74.6\\%\nor increase the rate of negative responses from 0.22\\% to 72\\% for targeted\nqueries.\n","authors":["Jiaqi Xue","Mengxin Zheng","Yebowen Hu","Fei Liu","Xun Chen","Qian Lou"],"pdf_url":"https://arxiv.org/pdf/2406.00083v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2312.00700v2","updated":"2024-06-03T17:57:39Z","published":"2023-12-01T16:33:57Z","title":"GIFT: Generative Interpretable Fine-Tuning","summary":" We present Generative Interpretable Fine-Tuning (GIFT) for\nparameter-efficient fine-tuning of pretrained Transformer backbones, which can\nbe formulated as a simple factorized matrix multiplication in the parameter\nspace or equivalently in the activation space, and thus embraces built-in\ninterpretability. For a pretrained layer with weights $\\omega\\in\n\\mathbb{R}^{d_{out}\\times d_{in}}$, our proposed GIFT learns the fine-tuned\nweights $\\hat{\\omega}$ directly from $\\omega$ as $\\hat{\\omega}=\\omega \\cdot\n(\\mathbb{I}+\\phi_{d_{in}\\times r}\\cdot \\psi_{r\\times d_{in}})$ where\n$\\mathbb{I}$ is an identity matrix. $\\Theta=(\\phi, \\psi)$ are the learnable\nparameters of the two linear layers of GIFT with $r$ being a hyper-parameter.\n$\\Theta$ is shared by all the layers selected for fine-tuning, resulting in\nsignificantly fewer trainable parameters compared to Low-Rank Adaptation\n(LoRA). We perform comprehensive evaluations on natural language tasks\n(commonsense reasoning and sequence classification) and computer vision tasks\n(visual fine-grained classification). We obtain the best accuracy and parameter\nefficiency among baselines both on the Commonsense170k reasoning benchmark\nusing LLaMA-1 (7B) and Llama-2 (7B)/-3 (8B) and on the FGVC and VTAB visual\nrecognition benchmarks using ImageNet-21k pretrained Vision Transformer\n(ViT-B/16). Notably, we obtain 5.9% absolute increase in average accuracy with\n53.8 times reduction of parameters on Commonsense170k using Llama-3 (8B)\ncompared to LoRA. We obtain performance comparable to LoRA on the GLUE\nbenchmark but with significantly fewer parameters using RoBERTa-Base/Large. We\nshow the output of the first linear layer (i.e., $\\omega\\cdot \\phi$) is\nsurprisingly interpretable, which can play the role of a token-clustering head\nas a by-product to localize meaningful objects/parts in images for computer\nvision tasks. Our code is publicly available.\n","authors":["Chinmay Savadikar","Xi Song","Tianfu Wu"],"pdf_url":"https://arxiv.org/pdf/2312.00700v2.pdf","comment":"Project page and code: https://savadikarc.github.io/gift"},{"id":"http://arxiv.org/abs/2401.03955v6","updated":"2024-06-03T17:57:22Z","published":"2024-01-08T15:21:21Z","title":"Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced\n Zero/Few-Shot Forecasting of Multivariate Time Series","summary":" Large pre-trained models excel in zero/few-shot learning for language and\nvision tasks but face challenges in multivariate time series (TS) forecasting\ndue to diverse data characteristics. Consequently, recent research efforts have\nfocused on developing pre-trained TS forecasting models. These models, whether\nbuilt from scratch or adapted from large language models (LLMs), excel in\nzero/few-shot forecasting tasks. However, they are limited by slow performance,\nhigh computational demands, and neglect of cross-channel and exogenous\ncorrelations. To address this, we introduce Tiny Time Mixers (TTM), a compact\nmodel (starting from 1M parameters) with effective transfer learning\ncapabilities, trained exclusively on public TS datasets. TTM, based on the\nlight-weight TSMixer architecture, incorporates innovations like adaptive\npatching, diverse resolution sampling, and resolution prefix tuning to handle\npre-training on varied dataset resolutions with minimal model capacity.\nAdditionally, it employs multi-level modeling to capture channel correlations\nand infuse exogenous signals during fine-tuning. TTM outperforms existing\npopular benchmarks in zero/few-shot forecasting by (4-40\\%), while reducing\ncomputational requirements significantly. Moreover, TTMs are lightweight and\ncan be executed even on CPU-only machines, enhancing usability and fostering\nwider adoption in resource-constrained environments. Model weights for our\ninitial variant (TTM-Q) are available at\nhttps://huggingface.co/ibm-granite/granite-timeseries-ttm-v1. Model weights for\nmore sophisticated variants (TTM-B, TTM-E, and TTM-A) will be shared soon. The\nsource code for TTM can be accessed at\nhttps://github.com/ibm-granite/granite-tsfm/tree/main/tsfm_public/models/tinytimemixer.\n","authors":["Vijay Ekambaram","Arindam Jati","Pankaj Dayama","Sumanta Mukherjee","Nam H. Nguyen","Wesley M. Gifford","Chandra Reddy","Jayant Kalagnanam"],"pdf_url":"https://arxiv.org/pdf/2401.03955v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10093v2","updated":"2024-06-03T17:51:58Z","published":"2024-02-15T16:46:16Z","title":"MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained\n Representations","summary":" We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning\nboost for pre-trained MIM models. MIM-Refiner is motivated by the insight that\nstrong representations within MIM models generally reside in intermediate\nlayers. Accordingly, MIM-Refiner leverages multiple contrastive heads that are\nconnected to different intermediate layers. In each head, a modified nearest\nneighbor objective constructs semantic clusters that capture semantic\ninformation which improves performance on downstream tasks, including\noff-the-shelf and fine-tuning settings.\n The refinement process is short and simple - yet highly effective. Within a\nfew epochs, we refine the features of MIM models from subpar to\nstate-of-the-art, off-the-shelf features. Refining a ViT-H, pre-trained with\ndata2vec 2.0 on ImageNet-1K, sets a new state-of-the-art in linear probing\n(84.7%) and low-shot classification among models that are pre-trained on\nImageNet-1K. At ImageNet-1K 1-shot classification, MIM-Refiner advances the\nstate-of-the-art to 64.2%, outperforming larger models that were trained on up\nto 2000 times more data such as DINOv2-g, OpenCLIP-G and MAWS-6.5B.\n","authors":["Benedikt Alkin","Lukas Miklautz","Sepp Hochreiter","Johannes Brandstetter"],"pdf_url":"https://arxiv.org/pdf/2402.10093v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.07193v2","updated":"2024-06-03T17:49:41Z","published":"2024-02-11T13:00:04Z","title":"Loss Symmetry and Noise Equilibrium of Stochastic Gradient Descent","summary":" Symmetries exist abundantly in the loss function of neural networks. We\ncharacterize the learning dynamics of stochastic gradient descent (SGD) when\nexponential symmetries, a broad subclass of continuous symmetries, exist in the\nloss function. We establish that when gradient noises do not balance, SGD has\nthe tendency to move the model parameters toward a point where noises from\ndifferent directions are balanced. Here, a special type of fixed point in the\nconstant directions of the loss function emerges as a candidate for solutions\nfor SGD. As the main theoretical result, we prove that every parameter $\\theta$\nconnects without loss function barrier to a unique noise-balanced fixed point\n$\\theta^*$. The theory implies that the balancing of gradient noise can serve\nas a novel alternative mechanism for relevant phenomena such as progressive\nsharpening and flattening and can be applied to understand common practical\nproblems such as representation normalization, matrix factorization, warmup,\nand formation of latent representations.\n","authors":["Liu Ziyin","Mingze Wang","Hongchao Li","Lei Wu"],"pdf_url":"https://arxiv.org/pdf/2402.07193v2.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2401.12179v2","updated":"2024-06-03T17:37:53Z","published":"2024-01-22T18:10:10Z","title":"DITTO: Diffusion Inference-Time T-Optimization for Music Generation","summary":" We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose\nframe-work for controlling pre-trained text-to-music diffusion models at\ninference-time via optimizing initial noise latents. Our method can be used to\noptimize through any differentiable feature matching loss to achieve a target\n(stylized) output and leverages gradient checkpointing for memory efficiency.\nWe demonstrate a surprisingly wide-range of applications for music generation\nincluding inpainting, outpainting, and looping as well as intensity, melody,\nand musical structure control - all without ever fine-tuning the underlying\nmodel. When we compare our approach against related training, guidance, and\noptimization-based methods, we find DITTO achieves state-of-the-art performance\non nearly all tasks, including outperforming comparable approaches on\ncontrollability, audio quality, and computational efficiency, thus opening the\ndoor for high-quality, flexible, training-free control of diffusion models.\nSound examples can be found at https://DITTO-Music.github.io/web/.\n","authors":["Zachary Novack","Julian McAuley","Taylor Berg-Kirkpatrick","Nicholas J. Bryan"],"pdf_url":"https://arxiv.org/pdf/2401.12179v2.pdf","comment":"Oral at ICML 2024"},{"id":"http://arxiv.org/abs/2401.17505v3","updated":"2024-06-03T17:35:04Z","published":"2024-01-30T23:46:35Z","title":"Arrows of Time for Large Language Models","summary":" We study the probabilistic modeling performed by Autoregressive Large\nLanguage Models (LLMs) through the angle of time directionality, addressing a\nquestion first raised in (Shannon, 1951). For large enough models, we\nempirically find a time asymmetry in their ability to learn natural language: a\ndifference in the average log-perplexity when trying to predict the next token\nversus when trying to predict the previous one. This difference is at the same\ntime subtle and very consistent across various modalities (language, model\nsize, training time, ...). Theoretically, this is surprising: from an\ninformation-theoretic point of view, there should be no such difference. We\nprovide a theoretical framework to explain how such an asymmetry can appear\nfrom sparsity and computational complexity considerations, and outline a number\nof perspectives opened by our results.\n","authors":["Vassilis Papadopoulos","Jérémie Wenger","Clément Hongler"],"pdf_url":"https://arxiv.org/pdf/2401.17505v3.pdf","comment":"Re-arranged and updated figures. Added experiments. 12 figures, 20\n pages"},{"id":"http://arxiv.org/abs/2309.11028v3","updated":"2024-06-03T17:22:24Z","published":"2023-09-20T03:15:11Z","title":"The Topology and Geometry of Neural Representations","summary":" A central question for neuroscience is how to characterize brain\nrepresentations of perceptual and cognitive content. An ideal characterization\nshould distinguish different functional regions with robustness to noise and\nidiosyncrasies of individual brains that do not correspond to computational\ndifferences. Previous studies have characterized brain representations by their\nrepresentational geometry, which is defined by the representational\ndissimilarity matrix (RDM), a summary statistic that abstracts from the roles\nof individual neurons (or responses channels) and characterizes the\ndiscriminability of stimuli. Here we explore a further step of abstraction:\nfrom the geometry to the topology of brain representations. We propose\ntopological representational similarity analysis (tRSA), an extension of\nrepresentational similarity analysis (RSA) that uses a family of\ngeo-topological summary statistics that generalizes the RDM to characterize the\ntopology while de-emphasizing the geometry. We evaluate this new family of\nstatistics in terms of the sensitivity and specificity for model selection\nusing both simulations and fMRI data. In the simulations, the ground truth is a\ndata-generating layer representation in a neural network model and the models\nare the same and other layers in different model instances (trained from\ndifferent random seeds). In fMRI, the ground truth is a visual area and the\nmodels are the same and other areas measured in different subjects. Results\nshow that topology-sensitive characterizations of population codes are robust\nto noise and interindividual variability and maintain excellent sensitivity to\nthe unique representational signatures of different neural network layers and\nbrain regions. These methods enable researchers to calibrate comparisons among\nrepresentations in brains and models to be sensitive to the geometry, the\ntopology, or a combination of both.\n","authors":["Baihan Lin","Nikolaus Kriegeskorte"],"pdf_url":"https://arxiv.org/pdf/2309.11028v3.pdf","comment":"codes: https://github.com/doerlbh/TopologicalRSA"},{"id":"http://arxiv.org/abs/2403.17846v2","updated":"2024-06-03T17:12:25Z","published":"2024-03-26T16:36:43Z","title":"Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot\n Navigation","summary":" Recent open-vocabulary robot mapping methods enrich dense geometric maps with\npre-trained visual-language features. While these maps allow for the prediction\nof point-wise saliency maps when queried for a certain language concept,\nlarge-scale environments and abstract queries beyond the object level still\npose a considerable hurdle, ultimately limiting language-grounded robotic\nnavigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D\nscene graph mapping approach for language-grounded robot navigation. Leveraging\nopen-vocabulary vision foundation models, we first obtain state-of-the-art\nopen-vocabulary segment-level maps in 3D and subsequently construct a 3D scene\ngraph hierarchy consisting of floor, room, and object concepts, each enriched\nwith open-vocabulary features. Our approach is able to represent multi-story\nbuildings and allows robotic traversal of those using a cross-floor Voronoi\ngraph. HOV-SG is evaluated on three distinct datasets and surpasses previous\nbaselines in open-vocabulary semantic accuracy on the object, room, and floor\nlevel while producing a 75% reduction in representation size compared to dense\nopen-vocabulary maps. In order to prove the efficacy and generalization\ncapabilities of HOV-SG, we showcase successful long-horizon\nlanguage-conditioned robot navigation within real-world multi-storage\nenvironments. We provide code and trial video data at http://hovsg.github.io/.\n","authors":["Abdelrhman Werby","Chenguang Huang","Martin Büchner","Abhinav Valada","Wolfram Burgard"],"pdf_url":"https://arxiv.org/pdf/2403.17846v2.pdf","comment":"Code and video are available at http://hovsg.github.io/"},{"id":"http://arxiv.org/abs/2310.17807v3","updated":"2024-06-03T16:59:37Z","published":"2023-10-26T22:58:19Z","title":"Clover: Closed-Loop Verifiable Code Generation","summary":" The use of large language models for code generation is a rapidly growing\ntrend in software development. However, without effective methods for ensuring\nthe correctness of generated code, this trend could lead to any number of\nundesirable outcomes. In this paper, we lay out a vision for addressing this\nchallenge: the Clover paradigm, short for Closed-Loop Verifiable Code\nGeneration, which reduces correctness checking to the more accessible problem\nof consistency checking. At the core of Clover lies a checker that performs\nconsistency checks among code, docstrings, and formal annotations. The checker\nis implemented using a novel integration of formal verification tools and large\nlanguage models. We provide a theoretical analysis to support our thesis that\nClover should be effective at consistency checking. We also empirically\ninvestigate its feasibility on a hand-designed dataset (CloverBench) featuring\nannotated Dafny programs at a textbook level of difficulty. Experimental\nresults show that for this dataset, (i) LLMs are reasonably successful at\nautomatically generating formal specifications; and (ii) our consistency\nchecker achieves a promising acceptance rate (up to 87%) for correct instances\nwhile maintaining zero tolerance for incorrect ones (no false positives).\n","authors":["Chuyue Sun","Ying Sheng","Oded Padon","Clark Barrett"],"pdf_url":"https://arxiv.org/pdf/2310.17807v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08277v5","updated":"2024-06-03T16:48:59Z","published":"2024-02-13T08:12:48Z","title":"Towards Faithful and Robust LLM Specialists for Evidence-Based\n Question-Answering","summary":" Advances towards more faithful and traceable answers of Large Language Models\n(LLMs) are crucial for various research and practical endeavors. One avenue in\nreaching this goal is basing the answers on reliable sources. However, this\nEvidence-Based QA has proven to work insufficiently with LLMs in terms of\nciting the correct sources (source quality) and truthfully representing the\ninformation within sources (answer attributability). In this work, we\nsystematically investigate how to robustly fine-tune LLMs for better source\nquality and answer attributability. Specifically, we introduce a data\ngeneration pipeline with automated data quality filters, which can synthesize\ndiversified high-quality training and testing data at scale. We further\nintroduce four test sets to benchmark the robustness of fine-tuned specialist\nmodels. Extensive evaluation shows that fine-tuning on synthetic data improves\nperformance on both in- and out-of-distribution. Furthermore, we show that data\nquality, which can be drastically improved by proposed quality filters, matters\nmore than quantity in improving Evidence-Based QA.\n","authors":["Tobias Schimanski","Jingwei Ni","Mathias Kraus","Elliott Ash","Markus Leippold"],"pdf_url":"https://arxiv.org/pdf/2402.08277v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.13784v3","updated":"2024-06-03T16:44:31Z","published":"2024-03-20T17:47:08Z","title":"The Model Openness Framework: Promoting Completeness and Openness for\n Reproducibility, Transparency, and Usability in Artificial Intelligence","summary":" Generative AI (GAI) offers unprecedented opportunities for research and\ninnovation, but its commercialization has raised concerns about transparency,\nreproducibility, and safety. Many open GAI models lack the necessary components\nfor full understanding and reproducibility, and some use restrictive licenses\nwhilst claiming to be ``open-source''. To address these concerns, we propose\nthe Model Openness Framework (MOF), a ranked classification system that rates\nmachine learning models based on their completeness and openness, following\nprinciples of open science, open source, open data, and open access. The MOF\nrequires specific components of the model development lifecycle to be included\nand released under appropriate open licenses. This framework aims to prevent\nmisrepresentation of models claiming to be open, guide researchers and\ndevelopers in providing all model components under permissive licenses, and\nhelp individuals and organizations identify models that can be safely adopted\nwithout restrictions. By promoting transparency and reproducibility, the MOF\ncombats ``openwashing'' practices and establishes completeness and openness as\nprimary criteria alongside the core tenets of responsible AI. Wide adoption of\nthe MOF will foster a more open AI ecosystem, benefiting research, innovation,\nand adoption of state-of-the-art models.\n","authors":["Matt White","Ibrahim Haddad","Cailean Osborne","Xiao-Yang Liu Yanglet","Ahmed Abdelmonsef","Sachin Varghese"],"pdf_url":"https://arxiv.org/pdf/2403.13784v3.pdf","comment":"22 pages"},{"id":"http://arxiv.org/abs/2405.14555v4","updated":"2024-06-03T16:43:16Z","published":"2024-05-23T13:35:34Z","title":"Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating\n Representative and Affinity Bias in Large Language Models","summary":" Research on Large Language Models (LLMs) has often neglected subtle biases\nthat, although less apparent, can significantly influence the models' outputs\ntoward particular social narratives. This study addresses two such biases\nwithin LLMs: representative bias, which denotes a tendency of LLMs to generate\noutputs that mirror the experiences of certain identity groups, and affinity\nbias, reflecting the models' evaluative preferences for specific narratives or\nviewpoints. We introduce two novel metrics to measure these biases: the\nRepresentative Bias Score (RBS) and the Affinity Bias Score (ABS), and present\nthe Creativity-Oriented Generation Suite (CoGS), a collection of open-ended\ntasks such as short story writing and poetry composition, designed with\ncustomized rubrics to detect these subtle biases. Our analysis uncovers marked\nrepresentative biases in prominent LLMs, with a preference for identities\nassociated with being white, straight, and men. Furthermore, our investigation\nof affinity bias reveals distinctive evaluative patterns within each model,\nakin to `bias fingerprints'. This trend is also seen in human evaluators,\nhighlighting a complex interplay between human and machine bias perceptions.\n","authors":["Abhishek Kumar","Sarfaroz Yunusov","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2405.14555v4.pdf","comment":"9 pages (excluding references), accepted to ACL 2024 Main Conference"},{"id":"http://arxiv.org/abs/2405.16277v3","updated":"2024-06-03T16:42:55Z","published":"2024-05-25T15:28:22Z","title":"Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge","summary":" Large Language Models (LLMs) have demonstrated remarkable success in tasks\nlike the Winograd Schema Challenge (WSC), showcasing advanced textual\ncommon-sense reasoning. However, applying this reasoning to multimodal domains,\nwhere understanding text and images together is essential, remains a\nsubstantial challenge. To address this, we introduce WinoVis, a novel dataset\nspecifically designed to probe text-to-image models on pronoun disambiguation\nwithin multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion\nAttentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel\nevaluation framework that isolates the models' ability in pronoun\ndisambiguation from other visual processing challenges. Evaluation of\nsuccessive model versions reveals that, despite incremental advancements,\nStable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally\nsurpassing random guessing. Further error analysis identifies important areas\nfor future research aimed at advancing text-to-image models in their ability to\ninterpret and interact with the complex visual world.\n","authors":["Brendan Park","Madeline Janecek","Naser Ezzati-Jivan","Yifeng Li","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2405.16277v3.pdf","comment":"9 pages (excluding references), accepted to ACL 2024 Main Conference"},{"id":"http://arxiv.org/abs/2405.16282v3","updated":"2024-06-03T16:41:53Z","published":"2024-05-25T15:42:04Z","title":"Confidence Under the Hood: An Investigation into the\n Confidence-Probability Alignment in Large Language Models","summary":" As the use of Large Language Models (LLMs) becomes more widespread,\nunderstanding their self-evaluation of confidence in generated responses\nbecomes increasingly important as it is integral to the reliability of the\noutput of these models. We introduce the concept of Confidence-Probability\nAlignment, that connects an LLM's internal confidence, quantified by token\nprobabilities, to the confidence conveyed in the model's response when\nexplicitly asked about its certainty. Using various datasets and prompting\ntechniques that encourage model introspection, we probe the alignment between\nmodels' internal and expressed confidence. These techniques encompass using\nstructured evaluation scales to rate confidence, including answer options when\nprompting, and eliciting the model's confidence level for outputs it does not\nrecognize as its own. Notably, among the models analyzed, OpenAI's GPT-4 showed\nthe strongest confidence-probability alignment, with an average Spearman's\n$\\hat{\\rho}$ of 0.42, across a wide range of tasks. Our work contributes to the\nongoing efforts to facilitate risk assessment in the application of LLMs and to\nfurther our understanding of model trustworthiness.\n","authors":["Abhishek Kumar","Robert Morabito","Sanzhar Umbet","Jad Kabbara","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2405.16282v3.pdf","comment":"9 pages (excluding references), accepted to ACL 2024 Main Conference"},{"id":"http://arxiv.org/abs/2402.08845v3","updated":"2024-06-03T16:29:05Z","published":"2024-02-13T23:25:01Z","title":"Feature Attribution with Necessity and Sufficiency via Dual-stage\n Perturbation Test for Causal Explanation","summary":" We investigate the problem of explainability for machine learning models,\nfocusing on Feature Attribution Methods (FAMs) that evaluate feature importance\nthrough perturbation tests. Despite their utility, FAMs struggle to distinguish\nthe contributions of different features, when their prediction changes are\nsimilar after perturbation. To enhance FAMs' discriminative power, we introduce\nFeature Attribution with Necessity and Sufficiency (FANS), which find a\nneighborhood of the input such that perturbing samples within this neighborhood\nhave a high Probability of being Necessity and Sufficiency (PNS) cause for the\nchange in predictions, and use this PNS as the importance of the feature.\nSpecifically, FANS compute this PNS via a heuristic strategy for estimating the\nneighborhood and a perturbation test involving two stages (factual and\ninterventional) for counterfactual reasoning. To generate counterfactual\nsamples, we use a resampling-based approach on the observed samples to\napproximate the required conditional distribution. We demonstrate that FANS\noutperforms existing attribution methods on six benchmarks. Please refer to the\nsource code via \\url{https://github.com/DMIRLAB-Group/FANS}.\n","authors":["Xuexin Chen","Ruichu Cai","Zhengting Huang","Yuxuan Zhu","Julien Horwood","Zhifeng Hao","Zijian Li","Jose Miguel Hernandez-Lobato"],"pdf_url":"https://arxiv.org/pdf/2402.08845v3.pdf","comment":"Accepted in the Proceedings of the 41st International Conference on\n Machine Learning (ICML2024)"},{"id":"http://arxiv.org/abs/2405.06270v3","updated":"2024-06-03T16:23:28Z","published":"2024-05-10T06:52:44Z","title":"XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced\n In-Context Learning in Healthcare","summary":" The integration of Large Language Models (LLMs) into healthcare diagnostics\noffers a promising avenue for clinical decision-making. This study outlines the\ndevelopment of a novel method for zero-shot/few-shot in-context learning (ICL)\nby integrating medical domain knowledge using a multi-layered structured\nprompt. We also explore the efficacy of two communication styles between the\nuser and LLMs: the Numerical Conversational (NC) style, which processes data\nincrementally, and the Natural Language Single-Turn (NL-ST) style, which\nemploys long narrative prompts.\n Our study systematically evaluates the diagnostic accuracy and risk factors,\nincluding gender bias and false negative rates, using a dataset of 920 patient\nrecords in various few-shot scenarios. Results indicate that traditional\nclinical machine learning (ML) models generally outperform LLMs in zero-shot\nand few-shot settings. However, the performance gap narrows significantly when\nemploying few-shot examples alongside effective explainable AI (XAI) methods as\nsources of domain knowledge. Moreover, with sufficient time and an increased\nnumber of examples, the conversational style (NC) nearly matches the\nperformance of ML models. Most notably, LLMs demonstrate comparable or superior\ncost-sensitive accuracy relative to ML models.\n This research confirms that, with appropriate domain knowledge and tailored\ncommunication strategies, LLMs can significantly enhance diagnostic processes.\nThe findings highlight the importance of optimizing the number of training\nexamples and communication styles to improve accuracy and reduce biases in LLM\napplications.\n","authors":["Fatemeh Nazary","Yashar Deldjoo","Tommaso Di Noia","Eugenio di Sciascio"],"pdf_url":"https://arxiv.org/pdf/2405.06270v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10046v2","updated":"2024-06-03T16:14:51Z","published":"2024-02-15T16:07:56Z","title":"How Flawed Is ECE? An Analysis via Logit Smoothing","summary":" Informally, a model is calibrated if its predictions are correct with a\nprobability that matches the confidence of the prediction. By far the most\ncommon method in the literature for measuring calibration is the expected\ncalibration error (ECE). Recent work, however, has pointed out drawbacks of\nECE, such as the fact that it is discontinuous in the space of predictors. In\nthis work, we ask: how fundamental are these issues, and what are their impacts\non existing results? Towards this end, we completely characterize the\ndiscontinuities of ECE with respect to general probability measures on Polish\nspaces. We then use the nature of these discontinuities to motivate a novel\ncontinuous, easily estimated miscalibration metric, which we term\nLogit-Smoothed ECE (LS-ECE). By comparing the ECE and LS-ECE of pre-trained\nimage classification models, we show in initial experiments that binned ECE\nclosely tracks LS-ECE, indicating that the theoretical pathologies of ECE may\nbe avoidable in practice.\n","authors":["Muthu Chidambaram","Holden Lee","Colin McSwiggen","Semon Rezchikov"],"pdf_url":"https://arxiv.org/pdf/2402.10046v2.pdf","comment":"23 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.20407v2","updated":"2024-06-03T16:11:03Z","published":"2024-05-30T18:25:19Z","title":"Convolutional L2LFlows: Generating Accurate Showers in Highly Granular\n Calorimeters Using Convolutional Normalizing Flows","summary":" In the quest to build generative surrogate models as computationally\nefficient alternatives to rule-based simulations, the quality of the generated\nsamples remains a crucial frontier. So far, normalizing flows have been among\nthe models with the best fidelity. However, as the latent space in such models\nis required to have the same dimensionality as the data space, scaling up\nnormalizing flows to high dimensional datasets is not straightforward. The\nprior L2LFlows approach successfully used a series of separate normalizing\nflows and sequence of conditioning steps to circumvent this problem. In this\nwork, we extend L2LFlows to simulate showers with a 9-times larger profile in\nthe lateral direction. To achieve this, we introduce convolutional layers and\nU-Net-type connections, move from masked autoregressive flows to coupling\nlayers, and demonstrate the successful modelling of showers in the ILD\nElectromagnetic Calorimeter as well as Dataset 3 from the public CaloChallenge\ndataset.\n","authors":["Thorsten Buss","Frank Gaede","Gregor Kasieczka","Claudius Krause","David Shih"],"pdf_url":"https://arxiv.org/pdf/2405.20407v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17212v4","updated":"2024-06-03T15:57:47Z","published":"2023-05-26T19:14:01Z","title":"Rotational Equilibrium: How Weight Decay Balances Learning Across Neural\n Networks","summary":" This study investigates how weight decay affects the update behavior of\nindividual neurons in deep neural networks through a combination of applied\nanalysis and experimentation. Weight decay can cause the expected magnitude and\nangular updates of a neuron's weight vector to converge to a steady state we\ncall rotational equilibrium. These states can be highly homogeneous,\neffectively balancing the average rotation -- a proxy for the effective\nlearning rate -- across different layers and neurons. Our work analyzes these\ndynamics across optimizers like Adam, Lion, and SGD with momentum, offering a\nnew simple perspective on training that elucidates the efficacy of widely used\nbut poorly understood methods in deep learning. We demonstrate how balanced\nrotation plays a key role in the effectiveness of normalization like Weight\nStandardization, as well as that of AdamW over Adam with L2-regularization.\nFinally, we show that explicitly controlling the rotation provides the benefits\nof weight decay while substantially reducing the need for learning rate warmup.\n","authors":["Atli Kosson","Bettina Messmer","Martin Jaggi"],"pdf_url":"https://arxiv.org/pdf/2305.17212v4.pdf","comment":"Accepted to ICML 2024; Code available at https://github.com/epfml/REQ"},{"id":"http://arxiv.org/abs/2402.14991v3","updated":"2024-06-03T15:42:55Z","published":"2024-02-22T22:03:16Z","title":"Quantum Theory and Application of Contextual Optimal Transport","summary":" Optimal Transport (OT) has fueled machine learning (ML) across many domains.\nWhen paired data measurements $(\\boldsymbol{\\mu}, \\boldsymbol{\\nu})$ are\ncoupled to covariates, a challenging conditional distribution learning setting\narises. Existing approaches for learning a $\\textit{global}$ transport map\nparameterized through a potentially unseen context utilize Neural OT and\nlargely rely on Brenier's theorem. Here, we propose a first-of-its-kind quantum\ncomputing formulation for amortized optimization of contextualized\ntransportation plans. We exploit a direct link between doubly stochastic\nmatrices and unitary operators thus unravelling a natural connection between OT\nand quantum computation. We verify our method (QontOT) on synthetic and real\ndata by predicting variations in cell type distributions conditioned on drug\ndosage. Importantly we conduct a 24-qubit hardware experiment on a task\nchallenging for classical computers and report a performance that cannot be\nmatched with our classical neural OT approach. In sum, this is a first step\ntoward learning to predict contextualized transportation plans through quantum\ncomputing.\n","authors":["Nicola Mariella","Albert Akhriev","Francesco Tacchino","Christa Zoufal","Juan Carlos Gonzalez-Espitia","Benedek Harsanyi","Eugene Koskin","Ivano Tavernelli","Stefan Woerner","Marianna Rapsomaniki","Sergiy Zhuk","Jannis Born"],"pdf_url":"https://arxiv.org/pdf/2402.14991v3.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2312.03654v2","updated":"2024-06-03T15:42:45Z","published":"2023-12-06T18:20:46Z","title":"Efficient Inverse Design Optimization through Multi-fidelity\n Simulations, Machine Learning, and Search Space Reduction Strategies","summary":" This paper introduces a methodology designed to augment the inverse design\noptimization process in scenarios constrained by limited compute, through the\nstrategic synergy of multi-fidelity evaluations, machine learning models, and\noptimization algorithms. The proposed methodology is analyzed on two distinct\nengineering inverse design problems: airfoil inverse design and the scalar\nfield reconstruction problem. It leverages a machine learning model trained\nwith low-fidelity simulation data, in each optimization cycle, thereby\nproficiently predicting a target variable and discerning whether a\nhigh-fidelity simulation is necessitated, which notably conserves computational\nresources. Additionally, the machine learning model is strategically deployed\nprior to optimization to compress the design space boundaries, thereby further\naccelerating convergence toward the optimal solution. The methodology has been\nemployed to enhance two optimization algorithms, namely Differential Evolution\nand Particle Swarm Optimization. Comparative analyses illustrate performance\nimprovements across both algorithms. Notably, this method is adaptable across\nany inverse design application, facilitating a synergy between a representative\nlow-fidelity ML model, and high-fidelity simulation, and can be seamlessly\napplied across any variety of population-based optimization algorithms.}\n","authors":["Luka Grbcic","Juliane Müller","Wibe Albert de Jong"],"pdf_url":"https://arxiv.org/pdf/2312.03654v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.01712v3","updated":"2024-06-03T15:35:12Z","published":"2024-04-02T07:54:18Z","title":"Efficient and Generalizable Certified Unlearning: A Hessian-free\n Recollection Approach","summary":" Machine unlearning strives to uphold the data owners' right to be forgotten\nby enabling models to selectively forget specific data. Recent advances suggest\nprecomputing and storing statistics extracted from second-order information and\nimplementing unlearning through Newton-style updates. However, the theoretical\nanalysis of these works often depends on restrictive assumptions of convexity\nand smoothness, and those mentioned operations on Hessian matrix are extremely\ncostly. As a result, applying these works to high-dimensional models becomes\nchallenging. In this paper, we propose an efficient Hessian-free certified\nunlearning. We propose to maintain a statistical vector for each data, computed\nthrough affine stochastic recursion approximation of the difference between\nretrained and learned models. Our analysis does not involve inverting Hessian\nand thus can be extended to non-convex non-smooth objectives. Under same\nassumptions, we demonstrate advancements of proposed method beyond the\nstate-of-the-art theoretical studies, in terms of generalization, unlearning\nguarantee, deletion capacity, and computation/storage complexity, and we show\nthat the unlearned model of our proposed approach is close to or same as the\nretrained model. Based on the strategy of recollecting statistics for\nforgetting data, we develop an algorithm that achieves near-instantaneous\nunlearning as it only requires a vector addition operation. Experiments\ndemonstrate that the proposed scheme surpasses existing results by orders of\nmagnitude in terms of time/storage costs, while also enhancing accuracy.\n","authors":["Xinbao Qiao","Meng Zhang","Ming Tang","Ermin Wei"],"pdf_url":"https://arxiv.org/pdf/2404.01712v3.pdf","comment":"31 pages, 10 figures"},{"id":"http://arxiv.org/abs/2403.15933v3","updated":"2024-06-03T15:30:52Z","published":"2024-03-23T21:16:56Z","title":"Understanding Domain-Size Generalization in Markov Logic Networks","summary":" We study the generalization behavior of Markov Logic Networks (MLNs) across\nrelational structures of different sizes. Multiple works have noticed that MLNs\nlearned on a given domain generalize poorly across domains of different sizes.\nThis behavior emerges from a lack of internal consistency within an MLN when\nused across different domain sizes. In this paper, we quantify this\ninconsistency and bound it in terms of the variance of the MLN parameters. The\nparameter variance also bounds the KL divergence between an MLN's marginal\ndistributions taken from different domain sizes. We use these bounds to show\nthat maximizing the data log-likelihood while simultaneously minimizing the\nparameter variance corresponds to two natural notions of generalization across\ndomain sizes. Our theoretical results apply to Exponential Random Graphs and\nother Markov network based relational models. Finally, we observe that\nsolutions known to decrease the variance of the MLN parameters, like\nregularization and Domain-Size Aware MLNs, increase the internal consistency of\nthe MLNs. We empirically verify our results on four different datasets, with\ndifferent methods to control parameter variance, showing that controlling\nparameter variance leads to better generalization.\n","authors":["Florian Chen","Felix Weitkämper","Sagar Malhotra"],"pdf_url":"https://arxiv.org/pdf/2403.15933v3.pdf","comment":"To Appear in Proceedings of ECML 2024-Research Track"},{"id":"http://arxiv.org/abs/2402.05724v2","updated":"2024-06-03T15:29:09Z","published":"2024-02-08T14:54:47Z","title":"Model-Based RL for Mean-Field Games is not Statistically Harder than\n Single-Agent RL","summary":" We study the sample complexity of reinforcement learning (RL) in Mean-Field\nGames (MFGs) with model-based function approximation that requires strategic\nexploration to find a Nash Equilibrium policy. We introduce the Partial\nModel-Based Eluder Dimension (P-MBED), a more effective notion to characterize\nthe model class complexity. Notably, P-MBED measures the complexity of the\nsingle-agent model class converted from the given mean-field model class, and\npotentially, can be exponentially lower than the MBED proposed by\n\\citet{huang2023statistical}. We contribute a model elimination algorithm\nfeaturing a novel exploration strategy and establish sample complexity results\npolynomial w.r.t.~P-MBED. Crucially, our results reveal that, under the basic\nrealizability and Lipschitz continuity assumptions, \\emph{learning Nash\nEquilibrium in MFGs is no more statistically challenging than solving a\nlogarithmic number of single-agent RL problems}. We further extend our results\nto Multi-Type MFGs, generalizing from conventional MFGs and involving multiple\ntypes of agents. This extension implies statistical tractability of a broader\nclass of Markov Games through the efficacy of mean-field approximation.\nFinally, inspired by our theoretical algorithm, we present a heuristic approach\nwith improved computational efficiency and empirically demonstrate its\neffectiveness.\n","authors":["Jiawei Huang","Niao He","Andreas Krause"],"pdf_url":"https://arxiv.org/pdf/2402.05724v2.pdf","comment":"ICML 2024; 55 Pages"},{"id":"http://arxiv.org/abs/2311.06103v2","updated":"2024-06-03T15:20:13Z","published":"2023-11-10T15:12:04Z","title":"1-Lipschitz Neural Networks are more expressive with N-Activations","summary":" A crucial property for achieving secure, trustworthy and interpretable deep\nlearning systems is their robustness: small changes to a system's inputs should\nnot result in large changes to its outputs. Mathematically, this means one\nstrives for networks with a small Lipschitz constant. Several recent works have\nfocused on how to construct such Lipschitz networks, typically by imposing\nconstraints on the weight matrices. In this work, we study an orthogonal\naspect, namely the role of the activation function. We show that commonly used\nactivation functions, such as MaxMin, as well as all piece-wise linear ones\nwith two segments unnecessarily restrict the class of representable functions,\neven in the simplest one-dimensional setting. We furthermore introduce the new\nN-activation function that is provably more expressive than currently popular\nactivation functions. We provide code at\nhttps://github.com/berndprach/NActivation.\n","authors":["Bernd Prach","Christoph H. Lampert"],"pdf_url":"https://arxiv.org/pdf/2311.06103v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.20233v2","updated":"2024-06-03T15:16:26Z","published":"2024-03-29T15:22:03Z","title":"Functional Bilevel Optimization for Machine Learning","summary":" In this paper, we introduce a new functional point of view on bilevel\noptimization problems for machine learning, where the inner objective is\nminimized over a function space. These types of problems are most often solved\nby using methods developed in the parametric setting, where the inner objective\nis strongly convex with respect to the parameters of the prediction function.\nThe functional point of view does not rely on this assumption and notably\nallows using over-parameterized neural networks as the inner prediction\nfunction. We propose scalable and efficient algorithms for the functional\nbilevel optimization problem and illustrate the benefits of our approach on\ninstrumental regression and reinforcement learning tasks.\n","authors":["Ieva Petrulionyte","Julien Mairal","Michael Arbel"],"pdf_url":"https://arxiv.org/pdf/2403.20233v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18293v2","updated":"2024-06-03T15:07:01Z","published":"2024-05-28T15:48:27Z","title":"CF-OPT: Counterfactual Explanations for Structured Prediction","summary":" Optimization layers in deep neural networks have enjoyed a growing popularity\nin structured learning, improving the state of the art on a variety of\napplications. Yet, these pipelines lack interpretability since they are made of\ntwo opaque layers: a highly non-linear prediction model, such as a deep neural\nnetwork, and an optimization layer, which is typically a complex black-box\nsolver. Our goal is to improve the transparency of such methods by providing\ncounterfactual explanations. We build upon variational autoencoders a\nprincipled way of obtaining counterfactuals: working in the latent space leads\nto a natural notion of plausibility of explanations. We finally introduce a\nvariant of the classic loss for VAE training that improves their performance in\nour specific structured context. These provide the foundations of CF-OPT, a\nfirst-order optimization algorithm that can find counterfactual explanations\nfor a broad class of structured learning architectures. Our numerical results\nshow that both close and plausible explanations can be obtained for problems\nfrom the recent literature.\n","authors":["Germain Vivier-Ardisson","Alexandre Forel","Axel Parmentier","Thibaut Vidal"],"pdf_url":"https://arxiv.org/pdf/2405.18293v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.15301v2","updated":"2024-06-03T14:56:28Z","published":"2024-03-22T15:51:39Z","title":"Planning with a Learned Policy Basis to Optimally Solve Complex Tasks","summary":" Conventional reinforcement learning (RL) methods can successfully solve a\nwide range of sequential decision problems. However, learning policies that can\ngeneralize predictably across multiple tasks in a setting with non-Markovian\nreward specifications is a challenging problem. We propose to use successor\nfeatures to learn a policy basis so that each (sub)policy in it solves a\nwell-defined subproblem. In a task described by a finite state automaton (FSA)\nthat involves the same set of subproblems, the combination of these\n(sub)policies can then be used to generate an optimal solution without\nadditional learning. In contrast to other methods that combine (sub)policies\nvia planning, our method asymptotically attains global optimality, even in\nstochastic environments.\n","authors":["Guillermo Infante","David Kuric","Anders Jonsson","Vicenç Gómez","Herke van Hoof"],"pdf_url":"https://arxiv.org/pdf/2403.15301v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20448v2","updated":"2024-06-03T14:40:28Z","published":"2024-05-30T19:47:34Z","title":"Knockout: A simple way to handle missing inputs","summary":" Deep learning models can extract predictive and actionable information from\ncomplex inputs. The richer the inputs, the better these models usually perform.\nHowever, models that leverage rich inputs (e.g., multi-modality) can be\ndifficult to deploy widely, because some inputs may be missing at inference.\nCurrent popular solutions to this problem include marginalization, imputation,\nand training multiple models. Marginalization can obtain calibrated predictions\nbut it is computationally costly and therefore only feasible for low\ndimensional inputs. Imputation may result in inaccurate predictions because it\nemploys point estimates for missing variables and does not work well for high\ndimensional inputs (e.g., images). Training multiple models whereby each model\ntakes different subsets of inputs can work well but requires knowing missing\ninput patterns in advance. Furthermore, training and retaining multiple models\ncan be costly. We propose an efficient way to learn both the conditional\ndistribution using full inputs and the marginal distributions. Our method,\nKnockout, randomly replaces input features with appropriate placeholder values\nduring training. We provide a theoretical justification of Knockout and show\nthat it can be viewed as an implicit marginalization strategy. We evaluate\nKnockout in a wide range of simulations and real-world datasets and show that\nit can offer strong empirical performance.\n","authors":["Minh Nguyen","Batuhan K. Karaman","Heejong Kim","Alan Q. Wang","Fengbei Liu","Mert R. Sabuncu"],"pdf_url":"https://arxiv.org/pdf/2405.20448v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.02416v3","updated":"2024-06-03T14:33:45Z","published":"2024-02-04T09:24:51Z","title":"Aligner: Efficient Alignment by Learning to Correct","summary":" With the rapid development of large language models (LLMs) and ever-evolving\npractical requirements, finding an efficient and effective alignment method has\nnever been more critical. However, the tension between the complexity of\ncurrent alignment methods and the need for rapid iteration in deployment\nscenarios necessitates the development of a model-agnostic alignment approach\nthat can operate under these constraints. In this paper, we introduce Aligner,\na novel and simple alignment paradigm that learns the correctional residuals\nbetween preferred and dispreferred answers using a small model. Designed as a\nmodel-agnostic, plug-and-play module, Aligner can be directly applied to\nvarious open-source and API-based models with only one-off training, making it\nsuitable for rapid iteration. Notably, Aligner can be applied to any powerful,\nlarge-scale upstream models. Moreover, it can even iteratively bootstrap the\nupstream models using corrected responses as synthetic human preference data,\nbreaking through the model's performance ceiling. Our experiments demonstrate\nperformance improvements by deploying the same Aligner model across 11\ndifferent LLMs, evaluated on the 3H dimensions (helpfulness, harmlessness, and\nhonesty). Specifically, Aligner-7B has achieved an average improvement of\n68.9\\% in helpfulness and 23.8\\% in harmlessness across the tested LLMs while\nalso effectively reducing hallucination. In the Alpaca-Eval leaderboard,\nstacking Aligner-2B on GPT-4 Turbo improved its LC Win Rate from 55.0\\% to\n58.3\\%, surpassing GPT-4 Omni's 57.5\\% Win Rate (community report).\n","authors":["Jiaming Ji","Boyuan Chen","Hantao Lou","Donghai Hong","Borong Zhang","Xuehai Pan","Juntao Dai","Tianyi Qiu","Yaodong Yang"],"pdf_url":"https://arxiv.org/pdf/2402.02416v3.pdf","comment":"29 pages"},{"id":"http://arxiv.org/abs/2405.06418v2","updated":"2024-06-03T14:27:59Z","published":"2024-05-10T12:03:53Z","title":"PAC-Bayesian Generalization Bounds for Knowledge Graph Representation\n Learning","summary":" While a number of knowledge graph representation learning (KGRL) methods have\nbeen proposed over the past decade, very few theoretical analyses have been\nconducted on them. In this paper, we present the first PAC-Bayesian\ngeneralization bounds for KGRL methods. To analyze a broad class of KGRL\nmodels, we propose a generic framework named ReED (Relation-aware\nEncoder-Decoder), which consists of a relation-aware message passing encoder\nand a triplet classification decoder. Our ReED framework can express at least\n15 different existing KGRL models, including not only graph neural\nnetwork-based models such as R-GCN and CompGCN but also shallow-architecture\nmodels such as RotatE and ANALOGY. Our generalization bounds for the ReED\nframework provide theoretical grounds for the commonly used tricks in KGRL,\ne.g., parameter-sharing and weight normalization schemes, and guide desirable\ndesign choices for practical KGRL methods. We empirically show that the\ncritical factors in our generalization bounds can explain actual generalization\nerrors on three real-world knowledge graphs.\n","authors":["Jaejun Lee","Minsung Hwang","Joyce Jiyoung Whang"],"pdf_url":"https://arxiv.org/pdf/2405.06418v2.pdf","comment":"32 pages, 3 figures, 4 tables, The 41st International Conference on\n Machine Learning (ICML 2024)"},{"id":"http://arxiv.org/abs/2402.07723v2","updated":"2024-06-03T14:20:34Z","published":"2024-02-12T15:35:32Z","title":"Generalization Bounds for Heavy-Tailed SDEs through the Fractional\n Fokker-Planck Equation","summary":" Understanding the generalization properties of heavy-tailed stochastic\noptimization algorithms has attracted increasing attention over the past years.\nWhile illuminating interesting aspects of stochastic optimizers by using\nheavy-tailed stochastic differential equations as proxies, prior works either\nprovided expected generalization bounds, or introduced non-computable\ninformation theoretic terms. Addressing these drawbacks, in this work, we prove\nhigh-probability generalization bounds for heavy-tailed SDEs which do not\ncontain any nontrivial information theoretic terms. To achieve this goal, we\ndevelop new proof techniques based on estimating the entropy flows associated\nwith the so-called fractional Fokker-Planck equation (a partial differential\nequation that governs the evolution of the distribution of the corresponding\nheavy-tailed SDE). In addition to obtaining high-probability bounds, we show\nthat our bounds have a better dependence on the dimension of parameters as\ncompared to prior art. Our results further identify a phase transition\nphenomenon, which suggests that heavy tails can be either beneficial or harmful\ndepending on the problem structure. We support our theory with experiments\nconducted in a variety of settings.\n","authors":["Benjamin Dupuis","Umut Şimşekli"],"pdf_url":"https://arxiv.org/pdf/2402.07723v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.21061v2","updated":"2024-06-03T14:20:27Z","published":"2024-05-31T17:50:27Z","title":"Graph External Attention Enhanced Transformer","summary":" The Transformer architecture has recently gained considerable attention in\nthe field of graph representation learning, as it naturally overcomes several\nlimitations of Graph Neural Networks (GNNs) with customized attention\nmechanisms or positional and structural encodings. Despite making some\nprogress, existing works tend to overlook external information of graphs,\nspecifically the correlation between graphs. Intuitively, graphs with similar\nstructures should have similar representations. Therefore, we propose Graph\nExternal Attention (GEA) -- a novel attention mechanism that leverages multiple\nexternal node/edge key-value units to capture inter-graph correlations\nimplicitly. On this basis, we design an effective architecture called Graph\nExternal Attention Enhanced Transformer (GEAET), which integrates local\nstructure and global interaction information for more comprehensive graph\nrepresentations. Extensive experiments on benchmark datasets demonstrate that\nGEAET achieves state-of-the-art empirical performance. The source code is\navailable for reproducibility at: https://github.com/icm1018/GEAET.\n","authors":["Jianqing Liang","Min Chen","Jiye Liang"],"pdf_url":"https://arxiv.org/pdf/2405.21061v2.pdf","comment":"In Proceedings of ICML 2024"},{"id":"http://arxiv.org/abs/2306.04848v4","updated":"2024-06-03T14:18:29Z","published":"2023-06-08T00:56:33Z","title":"Interpreting and Improving Diffusion Models from an Optimization\n Perspective","summary":" Denoising is intuitively related to projection. Indeed, under the manifold\nhypothesis, adding random noise is approximately equivalent to orthogonal\nperturbation. Hence, learning to denoise is approximately learning to project.\nIn this paper, we use this observation to interpret denoising diffusion models\nas approximate gradient descent applied to the Euclidean distance function. We\nthen provide straight-forward convergence analysis of the DDIM sampler under\nsimple assumptions on the projection error of the denoiser. Finally, we propose\na new gradient-estimation sampler, generalizing DDIM using insights from our\ntheoretical results. In as few as 5-10 function evaluations, our sampler\nachieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models\nand can generate high quality samples on latent diffusion models.\n","authors":["Frank Permenter","Chenyang Yuan"],"pdf_url":"https://arxiv.org/pdf/2306.04848v4.pdf","comment":"24 pages, 9 figures, 4 tables. To appear in ICML 2024"},{"id":"http://arxiv.org/abs/2402.08567v2","updated":"2024-06-03T14:15:03Z","published":"2024-02-13T16:06:17Z","title":"Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM\n Agents Exponentially Fast","summary":" A multimodal large language model (MLLM) agent can receive instructions,\ncapture images, retrieve histories from memory, and decide which tools to use.\nNonetheless, red-teaming efforts have revealed that adversarial images/prompts\ncan jailbreak an MLLM and cause unaligned behaviors. In this work, we report an\neven more severe safety issue in multi-agent environments, referred to as\ninfectious jailbreak. It entails the adversary simply jailbreaking a single\nagent, and without any further intervention from the adversary, (almost) all\nagents will become infected exponentially fast and exhibit harmful behaviors.\nTo validate the feasibility of infectious jailbreak, we simulate multi-agent\nenvironments containing up to one million LLaVA-1.5 agents, and employ\nrandomized pair-wise chat as a proof-of-concept instantiation for multi-agent\ninteraction. Our results show that feeding an (infectious) adversarial image\ninto the memory of any randomly chosen agent is sufficient to achieve\ninfectious jailbreak. Finally, we derive a simple principle for determining\nwhether a defense mechanism can provably restrain the spread of infectious\njailbreak, but how to design a practical defense that meets this principle\nremains an open question to investigate. Our project page is available at\nhttps://sail-sg.github.io/Agent-Smith/.\n","authors":["Xiangming Gu","Xiaosen Zheng","Tianyu Pang","Chao Du","Qian Liu","Ye Wang","Jing Jiang","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2402.08567v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2402.16801v2","updated":"2024-06-03T14:12:27Z","published":"2024-02-26T18:19:07Z","title":"Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement\n Learning","summary":" Benchmarks play a crucial role in the development and analysis of\nreinforcement learning (RL) algorithms. We identify that existing benchmarks\nused for research into open-ended learning fall into one of two categories.\nEither they are too slow for meaningful research to be performed without\nenormous computational resources, like Crafter, NetHack and Minecraft, or they\nare not complex enough to pose a significant challenge, like Minigrid and\nProcgen. To remedy this, we first present Craftax-Classic: a ground-up rewrite\nof Crafter in JAX that runs up to 250x faster than the Python-native original.\nA run of PPO using 1 billion environment interactions finishes in under an hour\nusing only a single GPU and averages 90% of the optimal reward. To provide a\nmore compelling challenge we present the main Craftax benchmark, a significant\nextension of the Crafter mechanics with elements inspired from NetHack. Solving\nCraftax requires deep exploration, long term planning and memory, as well as\ncontinual adaptation to novel situations as more of the world is discovered. We\nshow that existing methods including global and episodic exploration, as well\nas unsupervised environment design fail to make material progress on the\nbenchmark. We believe that Craftax can for the first time allow researchers to\nexperiment in a complex, open-ended environment with limited computational\nresources.\n","authors":["Michael Matthews","Michael Beukman","Benjamin Ellis","Mikayel Samvelyan","Matthew Jackson","Samuel Coward","Jakob Foerster"],"pdf_url":"https://arxiv.org/pdf/2402.16801v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.11742v2","updated":"2024-06-03T14:09:10Z","published":"2024-02-18T23:59:54Z","title":"Balanced Data, Imbalanced Spectra: Unveiling Class Disparities with\n Spectral Imbalance","summary":" Classification models are expected to perform equally well for different\nclasses, yet in practice, there are often large gaps in their performance. This\nissue of class bias is widely studied in cases of datasets with sample\nimbalance, but is relatively overlooked in balanced datasets. In this work, we\nintroduce the concept of spectral imbalance in features as a potential source\nfor class disparities and study the connections between spectral imbalance and\nclass bias in both theory and practice. To build the connection between\nspectral imbalance and class gap, we develop a theoretical framework for\nstudying class disparities and derive exact expressions for the per-class error\nin a high-dimensional mixture model setting. We then study this phenomenon in\n11 different state-of-the-art pretrained encoders and show how our proposed\nframework can be used to compare the quality of encoders, as well as evaluate\nand combine data augmentation strategies to mitigate the issue. Our work sheds\nlight on the class-dependent effects of learning, and provides new insights\ninto how state-of-the-art pretrained features may have unknown biases that can\nbe diagnosed through their spectra.\n","authors":["Chiraag Kaushik","Ran Liu","Chi-Heng Lin","Amrit Khera","Matthew Y Jin","Wenrui Ma","Vidya Muthukumar","Eva L Dyer"],"pdf_url":"https://arxiv.org/pdf/2402.11742v2.pdf","comment":"25 pages, 9 figures"},{"id":"http://arxiv.org/abs/2402.02827v2","updated":"2024-06-03T13:51:16Z","published":"2024-02-05T09:24:52Z","title":"PowerGraph: A power grid benchmark dataset for graph neural networks","summary":" Power grids are critical infrastructures of paramount importance to modern\nsociety and, therefore, engineered to operate under diverse conditions and\nfailures. The ongoing energy transition poses new challenges for the\ndecision-makers and system operators. Therefore, we must develop grid analysis\nalgorithms to ensure reliable operations. These key tools include power flow\nanalysis and system security analysis, both needed for effective operational\nand strategic planning. The literature review shows a growing trend of machine\nlearning (ML) models that perform these analyses effectively. In particular,\nGraph Neural Networks (GNNs) stand out in such applications because of the\ngraph-based structure of power grids. However, there is a lack of publicly\navailable graph datasets for training and benchmarking ML models in electrical\npower grid applications. First, we present PowerGraph, which comprises\nGNN-tailored datasets for i) power flows, ii) optimal power flows, and iii)\ncascading failure analyses of power grids. Second, we provide ground-truth\nexplanations for the cascading failure analysis. Finally, we perform a complete\nbenchmarking of GNN methods for node-level and graph-level tasks and\nexplainability. Overall, PowerGraph is a multifaceted GNN dataset for diverse\ntasks that includes power flow and fault scenarios with real-world\nexplanations, providing a valuable resource for developing improved GNN models\nfor node-level, graph-level tasks and explainability methods in power system\nmodeling. The dataset is available at\nhttps://figshare.com/articles/dataset/PowerGraph/22820534 and the code at\nhttps://github.com/PowerGraph-Datasets.\n","authors":["Anna Varbella","Kenza Amara","Blazhe Gjorgiev","Mennatallah El-Assady","Giovanni Sansavini"],"pdf_url":"https://arxiv.org/pdf/2402.02827v2.pdf","comment":"21 pages, 8 figures, conference paper"},{"id":"http://arxiv.org/abs/2403.19289v2","updated":"2024-06-03T13:49:20Z","published":"2024-03-28T10:19:36Z","title":"Uplift Modeling Under Limited Supervision","summary":" Estimating causal effects in e-commerce tends to involve costly treatment\nassignments which can be impractical in large-scale settings. Leveraging\nmachine learning to predict such treatment effects without actual intervention\nis a standard practice to diminish the risk. However, existing methods for\ntreatment effect prediction tend to rely on training sets of substantial size,\nwhich are built from real experiments and are thus inherently risky to create.\nIn this work we propose a graph neural network to diminish the required\ntraining set size, relying on graphs that are common in e-commerce data.\nSpecifically, we view the problem as node regression with a restricted number\nof labeled instances, develop a two-model neural architecture akin to previous\ncausal effect estimators, and test varying message-passing layers for encoding.\nFurthermore, as an extra step, we combine the model with an acquisition\nfunction to guide the creation of the training set in settings with extremely\nlow experimental budget. The framework is flexible since each step can be used\nseparately with other models or treatment policies. The experiments on real\nlarge-scale networks indicate a clear advantage of our methodology over the\nstate of the art, which in many cases performs close to random, underlining the\nneed for models that can generalize with limited supervision to reduce\nexperimental risks.\n","authors":["George Panagopoulos","Daniele Malitesta","Fragkiskos D. Malliaros","Jun Pang"],"pdf_url":"https://arxiv.org/pdf/2403.19289v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.10134v3","updated":"2024-06-03T13:48:59Z","published":"2022-08-22T08:23:53Z","title":"Machine Learning with Confidential Computing: A Systematization of\n Knowledge","summary":" Privacy and security challenges in Machine Learning (ML) have become\nincreasingly severe, along with ML's pervasive development and the recent\ndemonstration of large attack surfaces. As a mature system-oriented approach,\nConfidential Computing has been utilized in both academia and industry to\nmitigate privacy and security issues in various ML scenarios. In this paper,\nthe conjunction between ML and Confidential Computing is investigated. We\nsystematize the prior work on Confidential Computing-assisted ML techniques\nthat provide i) confidentiality guarantees and ii) integrity assurances, and\ndiscuss their advanced features and drawbacks. Key challenges are further\nidentified, and we provide dedicated analyses of the limitations in existing\nTrusted Execution Environment (TEE) systems for ML use cases. Finally,\nprospective works are discussed, including grounded privacy definitions for\nclosed-loop protection, partitioned executions of efficient ML, dedicated\nTEE-assisted designs for ML, TEE-aware ML, and ML full pipeline guarantees. By\nproviding these potential solutions in our systematization of knowledge, we aim\nto build the bridge to help achieve a much stronger TEE-enabled ML for privacy\nguarantees without introducing computation and system costs.\n","authors":["Fan Mo","Zahra Tarkhani","Hamed Haddadi"],"pdf_url":"https://arxiv.org/pdf/2208.10134v3.pdf","comment":"Survey paper, 37 pages, accepted to ACM Computing Surveys"},{"id":"http://arxiv.org/abs/2405.07839v2","updated":"2024-06-03T13:48:52Z","published":"2024-05-13T15:25:03Z","title":"Constrained Exploration via Reflected Replica Exchange Stochastic\n Gradient Langevin Dynamics","summary":" Replica exchange stochastic gradient Langevin dynamics (reSGLD) is an\neffective sampler for non-convex learning in large-scale datasets. However, the\nsimulation may encounter stagnation issues when the high-temperature chain\ndelves too deeply into the distribution tails. To tackle this issue, we propose\nreflected reSGLD (r2SGLD): an algorithm tailored for constrained non-convex\nexploration by utilizing reflection steps within a bounded domain.\nTheoretically, we observe that reducing the diameter of the domain enhances\nmixing rates, exhibiting a $\\textit{quadratic}$ behavior. Empirically, we test\nits performance through extensive experiments, including identifying dynamical\nsystems with physical constraints, simulations of constrained multi-modal\ndistributions, and image classification tasks. The theoretical and empirical\nfindings highlight the crucial role of constrained exploration in improving the\nsimulation efficiency.\n","authors":["Haoyang Zheng","Hengrong Du","Qi Feng","Wei Deng","Guang Lin"],"pdf_url":"https://arxiv.org/pdf/2405.07839v2.pdf","comment":"28 pages, 13 figures"},{"id":"http://arxiv.org/abs/2311.18639v2","updated":"2024-06-03T13:45:44Z","published":"2023-11-30T15:46:22Z","title":"Targeted Reduction of Causal Models","summary":" Why does a phenomenon occur? Addressing this question is central to most\nscientific inquiries and often relies on simulations of scientific models. As\nmodels become more intricate, deciphering the causes behind phenomena in\nhigh-dimensional spaces of interconnected variables becomes increasingly\nchallenging. Causal Representation Learning (CRL) offers a promising avenue to\nuncover interpretable causal patterns within these simulations through an\ninterventional lens. However, developing general CRL frameworks suitable for\npractical applications remains an open challenge. We introduce Targeted Causal\nReduction (TCR), a method for condensing complex intervenable models into a\nconcise set of causal factors that explain a specific target phenomenon. We\npropose an information theoretic objective to learn TCR from interventional\ndata of simulations, establish identifiability for continuous variables under\nshift interventions and present a practical algorithm for learning TCRs. Its\nability to generate interpretable high-level explanations from complex models\nis demonstrated on toy and mechanical systems, illustrating its potential to\nassist scientists in the study of complex phenomena in a broad range of\ndisciplines.\n","authors":["Armin Kekić","Bernhard Schölkopf","Michel Besserve"],"pdf_url":"https://arxiv.org/pdf/2311.18639v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.15479v2","updated":"2024-06-03T13:43:52Z","published":"2023-06-27T13:57:16Z","title":"Predictive Coding beyond Correlations","summary":" Recently, there has been extensive research on the capabilities of\nbiologically plausible algorithms. In this work, we show how one of such\nalgorithms, called predictive coding, is able to perform causal inference\ntasks. First, we show how a simple change in the inference process of\npredictive coding enables to compute interventions without the need to mutilate\nor redefine a causal graph. Then, we explore applications in cases where the\ngraph is unknown, and has to be inferred from observational data. Empirically,\nwe show how such findings can be used to improve the performance of predictive\ncoding in image classification tasks, and conclude that such models are able to\nperform simple end-to-end causal inference tasks.\n","authors":["Tommaso Salvatori","Luca Pinchetti","Amine M'Charrak","Beren Millidge","Thomas Lukasiewicz"],"pdf_url":"https://arxiv.org/pdf/2306.15479v2.pdf","comment":"44 Pages, 24 Figures. Changed title and abstract, following the ICML\n accepted version"},{"id":"http://arxiv.org/abs/2401.08381v2","updated":"2024-06-03T13:40:44Z","published":"2024-01-16T14:11:54Z","title":"Robotic Imitation of Human Actions","summary":" Imitation can allow us to quickly gain an understanding of a new task.\nThrough a demonstration, we can gain direct knowledge about which actions need\nto be performed and which goals they have. In this paper, we introduce a new\napproach to imitation learning that tackles the challenges of a robot imitating\na human, such as the change in perspective and body schema. Our approach can\nuse a single human demonstration to abstract information about the demonstrated\ntask, and use that information to generalise and replicate it. We facilitate\nthis ability by a new integration of two state-of-the-art methods: a diffusion\naction segmentation model to abstract temporal information from the\ndemonstration and an open vocabulary object detector for spatial information.\nFurthermore, we refine the abstracted information and use symbolic reasoning to\ncreate an action plan utilising inverse kinematics, to allow the robot to\nimitate the demonstrated action.\n","authors":["Josua Spisak","Matthias Kerzel","Stefan Wermter"],"pdf_url":"https://arxiv.org/pdf/2401.08381v2.pdf","comment":"Accepted at the ICDL 2024"},{"id":"http://arxiv.org/abs/2401.07039v2","updated":"2024-06-03T13:37:50Z","published":"2024-01-13T10:56:34Z","title":"Quantum Generative Diffusion Model: A Fully Quantum-Mechanical Model for\n Generating Quantum State Ensemble","summary":" Classical diffusion models have shown superior generative results and have\nbeen applied to many problems. Exploring these models in the quantum domain can\nadvance the field of quantum generative learning. In this paper, we introduce\nthe Quantum Generative Diffusion Model (QGDM), a simple and elegant quantum\ncounterpart of classical diffusion models.\n The core idea of QGDM is that any target quantum state can be transformed\ninto a completely mixed state, which has the highest entropy and maximum\nuncertainty about the system, through a non-unitary forward process.\nSubsequently, a trainable backward process can be used to recover the target\nstate from the completely mixed state. The design requirements for QGDM's\nbackward process include ensuring non-unitarity while maintaining a low number\nof parameters. To achieve this, we introduce partial trace operations in the\nbackward process to enforce non-unitary. Additionally, we control the number of\ntrainable parameters by using a parameter-sharing strategy and incorporating\ntemporal information as an input in the backward process. Furthermore, we\nintroduce a resource-efficient version of QGDM, which reduces the number of\nauxiliary qubits while preserving impressive generative capabilities.\n Our proposed models exhibit better convergence performance than Quantum\nGenerative Adversarial Networks (QGANs) because our models optimize a convex\ndistance function using gradient descent. Comparative results with QGANs\ndemonstrate the effectiveness of our models in generating both pure and mixed\nquantum states. Notably, our models achieve 53.03% higher fidelity in\nmixed-state generation tasks compared to QGANs. These results highlight the\npotential of the proposed models to tackle challenging quantum generation\ntasks.\n","authors":["Chuangtao Chen","Qinglin Zhao","MengChu Zhou","Zhimin He","Zhili Sun","Haozhen Situ"],"pdf_url":"https://arxiv.org/pdf/2401.07039v2.pdf","comment":"Comments are welcome"},{"id":"http://arxiv.org/abs/2306.04974v2","updated":"2024-06-03T13:30:28Z","published":"2023-06-08T07:05:36Z","title":"Conservative Prediction via Data-Driven Confidence Minimization","summary":" In safety-critical applications of machine learning, it is often desirable\nfor a model to be conservative, abstaining from making predictions on unknown\ninputs which are not well-represented in the training data. However, detecting\nunknown examples is challenging, as it is impossible to anticipate all\npotential inputs at test time. To address this, prior work (Hendrycks et al.,\n2018) minimizes model confidence on an auxiliary outlier dataset carefully\ncurated to be disjoint from the training distribution. We theoretically analyze\nthe choice of auxiliary dataset for confidence minimization, revealing two\nactionable insights: (1) if the auxiliary set contains unknown examples similar\nto those seen at test time, confidence minimization leads to provable detection\nof unknown test examples, and (2) if the first condition is satisfied, it is\nunnecessary to filter out known examples for out-of-distribution (OOD)\ndetection. Motivated by these guidelines, we propose the Data-Driven Confidence\nMinimization (DCM) framework, which minimizes confidence on an uncertainty\ndataset. We apply DCM to two problem settings in which conservative prediction\nis paramount -- selective classification and OOD detection -- and provide a\nrealistic way to gather uncertainty data for each setting. In our experiments,\nDCM consistently outperforms existing selective classification approaches on 4\ndatasets when tested on unseen distributions and outperforms state-of-the-art\nOOD detection methods on 12 ID-OOD dataset pairs, reducing FPR (at TPR $95\\%$)\nby $6.3\\%$ and $58.1\\%$ on CIFAR-10 and CIFAR-100 compared to Outlier Exposure.\n","authors":["Caroline Choi","Fahim Tajwar","Yoonho Lee","Huaxiu Yao","Ananya Kumar","Chelsea Finn"],"pdf_url":"https://arxiv.org/pdf/2306.04974v2.pdf","comment":"Transactions on Machine Learning Research (TMLR), 2024"},{"id":"http://arxiv.org/abs/2402.04050v2","updated":"2024-06-03T13:22:12Z","published":"2024-02-06T14:53:19Z","title":"Connecting the Dots: Collaborative Fine-tuning for Black-Box\n Vision-Language Models","summary":" With the emergence of pretrained vision-language models (VLMs), considerable\nefforts have been devoted to fine-tuning them for downstream tasks. Despite the\nprogress made in designing efficient fine-tuning methods, such methods require\naccess to the model's parameters, which can be challenging as model owners\noften opt to provide their models as a black box to safeguard model ownership.\nThis paper proposes a \\textbf{C}ollabo\\textbf{ra}tive\n\\textbf{F}ine-\\textbf{T}uning (\\textbf{CraFT}) approach for fine-tuning\nblack-box VLMs to downstream tasks, where one only has access to the input\nprompts and the output predictions of the model. CraFT comprises two modules, a\nprompt generation module for learning text prompts and a prediction refinement\nmodule for enhancing output predictions in residual style. Additionally, we\nintroduce an auxiliary prediction-consistent loss to promote consistent\noptimization across these modules. These modules are optimized by a novel\ncollaborative training algorithm. Extensive experiments on few-shot\nclassification over 15 datasets demonstrate the superiority of CraFT. The\nresults show that CraFT achieves a decent gain of about 12\\% with 16-shot\ndatasets and only 8,000 queries. Moreover, CraFT trains faster and uses only\nabout 1/80 of the memory footprint for deployment, while sacrificing only\n1.62\\% compared to the white-box method. Our code is publicly available at\nhttps://github.com/mrflogs/CraFT .\n","authors":["Zhengbo Wang","Jian Liang","Ran He","Zilei Wang","Tieniu Tan"],"pdf_url":"https://arxiv.org/pdf/2402.04050v2.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2402.14029v2","updated":"2024-06-03T13:12:18Z","published":"2024-02-20T03:14:45Z","title":"Partial Search in a Frozen Network is Enough to Find a Strong Lottery\n Ticket","summary":" Randomly initialized dense networks contain subnetworks that achieve high\naccuracy without weight learning -- strong lottery tickets (SLTs). Recently,\nGadhikar et al. (2023) demonstrated that SLTs can also be found within a\nrandomly pruned source network, thus reducing the SLT search space. However,\nthis limits the search to SLTs that are even sparser than the source, leading\nto worse accuracy due to unintentionally high sparsity. This paper proposes a\nmethod that reduces the SLT search space by an arbitrary ratio independent of\nthe desired SLT sparsity. A random subset of the initial weights is excluded\nfrom the search space by freezing it -- i.e., by either permanently pruning\nthem or locking them as a fixed part of the SLT. In addition to reducing search\nspace, the proposed random freezing can also provide the benefit of reducing\nthe model size for inference. Furthermore, experimental results show that the\nproposed method finds SLTs with better accuracy-to-model size trade-off than\nthe SLTs obtained from dense or randomly pruned source networks. In particular,\nthe SLTs found in Frozen ResNets on image classification using ImageNet\nsignificantly improve the accuracy-to-search space and accuracy-to-model size\ntrade-offs over SLTs within dense (non-freezing) or sparse (non-locking) random\nnetworks.\n","authors":["Hikari Otsuka","Daiki Chijiwa","Ángel López García-Arias","Yasuyuki Okoshi","Kazushi Kawamura","Thiem Van Chu","Daichi Fujiki","Susumu Takeuchi","Masato Motomura"],"pdf_url":"https://arxiv.org/pdf/2402.14029v2.pdf","comment":"v2: Updates include additional experiments and revisions of some\n experiments"},{"id":"http://arxiv.org/abs/2210.04872v3","updated":"2024-06-03T13:07:57Z","published":"2022-10-10T17:45:37Z","title":"Sequential Neural Score Estimation: Likelihood-Free Inference with\n Conditional Score Based Diffusion Models","summary":" We introduce Sequential Neural Posterior Score Estimation (SNPSE), a\nscore-based method for Bayesian inference in simulator-based models. Our\nmethod, inspired by the remarkable success of score-based methods in generative\nmodelling, leverages conditional score-based diffusion models to generate\nsamples from the posterior distribution of interest. The model is trained using\nan objective function which directly estimates the score of the posterior. We\nembed the model into a sequential training procedure, which guides simulations\nusing the current approximation of the posterior at the observation of\ninterest, thereby reducing the simulation cost. We also introduce several\nalternative sequential approaches, and discuss their relative merits. We then\nvalidate our method, as well as its amortised, non-sequential, variant on\nseveral numerical examples, demonstrating comparable or superior performance to\nexisting state-of-the-art methods such as Sequential Neural Posterior\nEstimation (SNPE).\n","authors":["Louis Sharrock","Jack Simons","Song Liu","Mark Beaumont"],"pdf_url":"https://arxiv.org/pdf/2210.04872v3.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2402.02805v2","updated":"2024-06-03T13:07:06Z","published":"2024-02-05T08:26:33Z","title":"Graph-enhanced Large Language Models in Asynchronous Plan Reasoning","summary":" Planning is a fundamental property of human intelligence. Reasoning about\nasynchronous plans is challenging since it requires sequential and parallel\nplanning to optimize time costs. Can large language models (LLMs) succeed at\nthis task? Here, we present the first large-scale study investigating this\nquestion. We find that a representative set of closed and open-source LLMs,\nincluding GPT-4 and LLaMA-2, behave poorly when not supplied with illustrations\nabout the task-solving process in our benchmark AsyncHow. We propose a novel\ntechnique called Plan Like a Graph (PLaG) that combines graphs with natural\nlanguage prompts and achieves state-of-the-art results. We show that although\nPLaG can boost model performance, LLMs still suffer from drastic degradation\nwhen task complexity increases, highlighting the limits of utilizing LLMs for\nsimulating digital devices. We see our study as an exciting step towards using\nLLMs as efficient autonomous agents. Our code and data are available at\nhttps://github.com/fangru-lin/graph-llm-asynchow-plan.\n","authors":["Fangru Lin","Emanuele La Malfa","Valentin Hofmann","Elle Michelle Yang","Anthony Cohn","Janet B. Pierrehumbert"],"pdf_url":"https://arxiv.org/pdf/2402.02805v2.pdf","comment":"Accepted at ICML-2024"},{"id":"http://arxiv.org/abs/2305.12639v2","updated":"2024-06-03T13:06:52Z","published":"2023-05-22T02:22:14Z","title":"Accelerating Graph Neural Networks via Edge Pruning for Power Allocation\n in Wireless Networks","summary":" Graph Neural Networks (GNNs) have recently emerged as a promising approach to\ntackling power allocation problems in wireless networks. Since unpaired\ntransmitters and receivers are often spatially distant, the distance-based\nthreshold is proposed to reduce the computation time by excluding or including\nthe channel state information in GNNs. In this paper, we are the first to\nintroduce a neighbour-based threshold approach to GNNs to reduce the time\ncomplexity. Furthermore, we conduct a comprehensive analysis of both\ndistance-based and neighbour-based thresholds and provide recommendations for\nselecting the appropriate value in different communication channel scenarios.\nWe design the corresponding neighbour-based Graph Neural Networks (N-GNN) with\nthe aim of allocating transmit powers to maximise the network throughput. Our\nresults show that our proposed N-GNN offer significant advantages in terms of\nreducing time complexity while preserving strong performance and generalisation\ncapacity. Besides, we show that by choosing a suitable threshold, the time\ncomplexity is reduced from O(|V|^2) to O(|V|), where |V| is the total number of\ntransceiver pairs.\n","authors":["Lili Chen","Jingge Zhu","Jamie Evans"],"pdf_url":"https://arxiv.org/pdf/2305.12639v2.pdf","comment":"Published in 2023 IEEE Global Communications Conference Workshops (GC\n Workshops)"},{"id":"http://arxiv.org/abs/2405.11349v2","updated":"2024-06-03T12:55:58Z","published":"2024-05-18T17:38:25Z","title":"Unlock the Power of Algorithm Features: A Generalization Analysis for\n Algorithm Selection","summary":" In the algorithm selection research, the discussion surrounding algorithm\nfeatures has been significantly overshadowed by the emphasis on problem\nfeatures. Although a few empirical studies have yielded evidence regarding the\neffectiveness of algorithm features, the potential benefits of incorporating\nalgorithm features into algorithm selection models and their suitability for\ndifferent scenarios remain unclear. In this paper, we address this gap by\nproposing the first provable guarantee for algorithm selection based on\nalgorithm features, taking a generalization perspective. We analyze the\nbenefits and costs associated with algorithm features and investigate how the\ngeneralization error is affected by different factors. Specifically, we examine\nadaptive and predefined algorithm features under transductive and inductive\nlearning paradigms, respectively, and derive upper bounds for the\ngeneralization error based on their model's Rademacher complexity. Our\ntheoretical findings not only provide tight upper bounds, but also offer\nanalytical insights into the impact of various factors, such as the training\nscale of problem instances and candidate algorithms, model parameters, feature\nvalues, and distributional differences between the training and test data.\nNotably, we demonstrate how models will benefit from algorithm features in\ncomplex scenarios involving many algorithms, and proves the positive\ncorrelation between generalization error bound and $\\chi^2$-divergence of\ndistributions.\n","authors":["Xingyu Wu","Yan Zhong","Jibin Wu","Yuxiao Huang","Sheng-hao Wu","Kay Chen Tan"],"pdf_url":"https://arxiv.org/pdf/2405.11349v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.09445v2","updated":"2024-06-03T12:42:50Z","published":"2024-01-31T16:53:50Z","title":"iMove: Exploring Bio-impedance Sensing for Fitness Activity Recognition","summary":" Automatic and precise fitness activity recognition can be beneficial in\naspects from promoting a healthy lifestyle to personalized preventative\nhealthcare. While IMUs are currently the prominent fitness tracking modality,\nthrough iMove, we show bio-impedence can help improve IMU-based fitness\ntracking through sensor fusion and contrastive learning.To evaluate our\nmethods, we conducted an experiment including six upper body fitness activities\nperformed by ten subjects over five days to collect synchronized data from\nbio-impedance across two wrists and IMU on the left wrist.The contrastive\nlearning framework uses the two modalities to train a better IMU-only\nclassification model, where bio-impedance is only required at the training\nphase, by which the average Macro F1 score with the input of a single IMU was\nimproved by 3.22 \\% reaching 84.71 \\% compared to the 81.49 \\% of the IMU\nbaseline model. We have also shown how bio-impedance can improve human activity\nrecognition (HAR) directly through sensor fusion, reaching an average Macro F1\nscore of 89.57 \\% (two modalities required for both training and inference)\neven if Bio-impedance alone has an average macro F1 score of 75.36 \\%, which is\noutperformed by IMU alone. In addition, similar results were obtained in an\nextended study on lower body fitness activity classification, demonstrating the\ngeneralisability of our approach.Our findings underscore the potential of\nsensor fusion and contrastive learning as valuable tools for advancing fitness\nactivity recognition, with bio-impedance playing a pivotal role in augmenting\nthe capabilities of IMU-based systems.\n","authors":["Mengxi Liu","Vitor Fortes Rey","Yu Zhang","Lala Shakti Swarup Ray","Bo Zhou","Paul Lukowicz"],"pdf_url":"https://arxiv.org/pdf/2402.09445v2.pdf","comment":"Accepted by percom2024"},{"id":"http://arxiv.org/abs/2405.11143v2","updated":"2024-06-03T12:19:18Z","published":"2024-05-20T01:04:40Z","title":"OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework","summary":" As large language models (LLMs) continue to grow by scaling laws,\nreinforcement learning from human feedback (RLHF) has gained significant\nattention due to its outstanding performance. However, unlike pretraining or\nfine-tuning a single model, scaling reinforcement learning from human feedback\n(RLHF) for training large language models poses coordination challenges across\nfour models. We present OpenRLHF, an open-source framework enabling efficient\nRLHF scaling. Unlike existing RLHF frameworks that co-locate four models on the\nsame GPUs, OpenRLHF re-designs scheduling for the models beyond 70B parameters\nusing Ray, vLLM, and DeepSpeed, leveraging improved resource utilization and\ndiverse training approaches. Integrating seamlessly with Hugging Face, OpenRLHF\nprovides an out-of-the-box solution with optimized algorithms and launch\nscripts, which ensures user-friendliness. OpenRLHF implements RLHF, DPO,\nrejection sampling, and other alignment techniques. Empowering state-of-the-art\nLLM development, OpenRLHF's code is available at\nhttps://github.com/OpenLLMAI/OpenRLHF.\n","authors":["Jian Hu","Xibin Wu","Weixun Wang"," Xianyu","Dehao Zhang","Yu Cao"],"pdf_url":"https://arxiv.org/pdf/2405.11143v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.07105v3","updated":"2024-06-03T12:14:34Z","published":"2024-01-13T16:09:49Z","title":"Graph Language Models","summary":" While Language Models (LMs) are the workhorses of NLP, their interplay with\nstructured knowledge graphs (KGs) is still actively researched. Current methods\nfor encoding such graphs typically either (i) linearize them for embedding with\nLMs -- which underutilize structural information, or (ii) use Graph Neural\nNetworks (GNNs) to preserve the graph structure -- but GNNs cannot represent\ntext features as well as pretrained LMs. In our work we introduce a novel LM\ntype, the Graph Language Model (GLM), that integrates the strengths of both\napproaches and mitigates their weaknesses. The GLM parameters are initialized\nfrom a pretrained LM to enhance understanding of individual graph concepts and\ntriplets. Simultaneously, we design the GLM's architecture to incorporate graph\nbiases, thereby promoting effective knowledge distribution within the graph.\nThis enables GLMs to process graphs, texts, and interleaved inputs of both.\nEmpirical evaluations on relation classification tasks show that GLM embeddings\nsurpass both LM- and GNN-based baselines in supervised and zero-shot setting,\ndemonstrating their versatility.\n","authors":["Moritz Plenz","Anette Frank"],"pdf_url":"https://arxiv.org/pdf/2401.07105v3.pdf","comment":"Accepted at ACL 2024. 9 pages, 10 figures, 9 tables"},{"id":"http://arxiv.org/abs/2403.17436v2","updated":"2024-06-03T12:12:25Z","published":"2024-03-26T07:05:06Z","title":"Particle identification with machine learning from incomplete data in\n the ALICE experiment","summary":" The ALICE experiment at the LHC measures properties of the strongly\ninteracting matter formed in ultrarelativistic heavy-ion collisions. Such\nstudies require accurate particle identification (PID). ALICE provides PID\ninformation via several detectors for particles with momentum from about 100\nMeV/c up to 20 GeV/c. Traditionally, particles are selected with rectangular\ncuts. A much better performance can be achieved with machine learning (ML)\nmethods. Our solution uses multiple neural networks (NN) serving as binary\nclassifiers. Moreover, we extended our particle classifier with Feature Set\nEmbedding and attention in order to train on data with incomplete samples. We\nalso present the integration of the ML project with the ALICE analysis\nsoftware, and we discuss domain adaptation, the ML technique needed to transfer\nthe knowledge between simulated and real experimental data.\n","authors":["Maja Karwowska","Łukasz Graczykowski","Kamil Deja","Miłosz Kasak","Małgorzata Janik"],"pdf_url":"https://arxiv.org/pdf/2403.17436v2.pdf","comment":"Proceedings of 3rd Artificial Intelligence for the Electron Ion\n Collider workshop -- AI4EIC2023, 28.11-1.12.2023. Accepted in JINST"},{"id":"http://arxiv.org/abs/2306.12330v2","updated":"2024-06-03T12:09:10Z","published":"2023-06-21T15:17:39Z","title":"ProtoGate: Prototype-based Neural Networks with Global-to-local Feature\n Selection for Tabular Biomedical Data","summary":" Tabular biomedical data poses challenges in machine learning because it is\noften high-dimensional and typically low-sample-size (HDLSS). Previous research\nhas attempted to address these challenges via local feature selection, but\nexisting approaches often fail to achieve optimal performance due to their\nlimitation in identifying globally important features and their susceptibility\nto the co-adaptation problem. In this paper, we propose ProtoGate, a\nprototype-based neural model for feature selection on HDLSS data. ProtoGate\nfirst selects instance-wise features via adaptively balancing global and local\nfeature selection. Furthermore, ProtoGate employs a non-parametric\nprototype-based prediction mechanism to tackle the co-adaptation problem,\nensuring the feature selection results and predictions are consistent with\nunderlying data clusters. We conduct comprehensive experiments to evaluate the\nperformance and interpretability of ProtoGate on synthetic and real-world\ndatasets. The results show that ProtoGate generally outperforms\nstate-of-the-art methods in prediction accuracy by a clear margin while\nproviding high-fidelity feature selection and explainable predictions. Code is\navailable at https://github.com/SilenceX12138/ProtoGate.\n","authors":["Xiangjian Jiang","Andrei Margeloiu","Nikola Simidjievski","Mateja Jamnik"],"pdf_url":"https://arxiv.org/pdf/2306.12330v2.pdf","comment":"Accepted by the Forty-first International Conference on Machine\n Learning (ICML2024)"},{"id":"http://arxiv.org/abs/2203.08717v3","updated":"2024-06-03T12:06:06Z","published":"2022-03-16T16:14:19Z","title":"Weak Augmentation Guided Relational Self-Supervised Learning","summary":" Self-supervised Learning (SSL) including the mainstream contrastive learning\nhas achieved great success in learning visual representations without data\nannotations. However, most methods mainly focus on the instance level\ninformation (\\ie, the different augmented images of the same instance should\nhave the same feature or cluster into the same class), but there is a lack of\nattention on the relationships between different instances. In this paper, we\nintroduce a novel SSL paradigm, which we term as relational self-supervised\nlearning (ReSSL) framework that learns representations by modeling the\nrelationship between different instances. Specifically, our proposed method\nemploys sharpened distribution of pairwise similarities among different\ninstances as \\textit{relation} metric, which is thus utilized to match the\nfeature embeddings of different augmentations. To boost the performance, we\nargue that weak augmentations matter to represent a more reliable relation, and\nleverage momentum strategy for practical efficiency. The designed asymmetric\npredictor head and an InfoNCE warm-up strategy enhance the robustness to\nhyper-parameters and benefit the resulting performance. Experimental results\nshow that our proposed ReSSL substantially outperforms the state-of-the-art\nmethods across different network architectures, including various lightweight\nnetworks (\\eg, EfficientNet and MobileNet).\n","authors":["Mingkai Zheng","Shan You","Fei Wang","Chen Qian","Changshui Zhang","Xiaogang Wang","Chang Xu"],"pdf_url":"https://arxiv.org/pdf/2203.08717v3.pdf","comment":"Extended version of NeurIPS 2021 paper. arXiv admin note: substantial\n text overlap with arXiv:2107.09282"},{"id":"http://arxiv.org/abs/2403.06833v2","updated":"2024-06-03T12:04:50Z","published":"2024-03-11T15:48:56Z","title":"Can LLMs Separate Instructions From Data? And What Do We Even Mean By\n That?","summary":" Instruction-tuned Large Language Models (LLMs) show impressive results in\nnumerous practical applications, but they lack essential safety features that\nare common in other areas of computer science, particularly an explicit\nseparation of instructions and data. This makes them vulnerable to\nmanipulations such as indirect prompt injections and generally unsuitable for\nsafety-critical tasks. Surprisingly, there is currently no established\ndefinition or benchmark to quantify this phenomenon. In this work, we close\nthis gap by introducing a formal measure for instruction-data separation and an\nempirical variant that is calculable from a model's outputs. We also present a\nnew dataset, SEP, that allows estimating the measure for real-world models. Our\nresults on various LLMs show that the problem of instruction-data separation is\nreal: all models fail to achieve high separation, and canonical mitigation\ntechniques, such as prompt engineering and fine-tuning, either fail to\nsubstantially improve separation or reduce model utility. The source code and\nSEP dataset are openly accessible at\nhttps://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.\n","authors":["Egor Zverev","Sahar Abdelnabi","Soroush Tabesh","Mario Fritz","Christoph H. Lampert"],"pdf_url":"https://arxiv.org/pdf/2403.06833v2.pdf","comment":"GitHub:\n https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed. 10 pages main\n text, 30 pages in total"},{"id":"http://arxiv.org/abs/2405.12807v5","updated":"2024-06-03T11:55:11Z","published":"2024-05-21T13:58:17Z","title":"FAdam: Adam is a natural gradient optimizer using diagonal empirical\n Fisher information","summary":" This paper establishes a mathematical foundation for the Adam optimizer,\nelucidating its connection to natural gradient descent through Riemannian and\ninformation geometry. We rigorously analyze the diagonal empirical Fisher\ninformation matrix (FIM) in Adam, clarifying all detailed approximations and\nadvocating for the use of log probability functions as loss, which should be\nbased on discrete distributions, due to the limitations of empirical FIM. Our\nanalysis uncovers flaws in the original Adam algorithm, leading to proposed\ncorrections such as enhanced momentum calculations, adjusted bias corrections,\nadaptive epsilon, and gradient clipping. We refine the weight decay term based\non our theoretical framework. Our modified algorithm, Fisher Adam (FAdam),\ndemonstrates superior performance across diverse domains including LLM, ASR,\nand VQ-VAE, achieving state-of-the-art results in ASR.\n","authors":["Dongseong Hwang"],"pdf_url":"https://arxiv.org/pdf/2405.12807v5.pdf","comment":"21 pages, 4 figures, 6 tables"},{"id":"http://arxiv.org/abs/2402.13891v2","updated":"2024-06-03T11:40:32Z","published":"2024-02-21T16:02:14Z","title":"Overcoming Saturation in Density Ratio Estimation by Iterated\n Regularization","summary":" Estimating the ratio of two probability densities from finitely many samples,\nis a central task in machine learning and statistics. In this work, we show\nthat a large class of kernel methods for density ratio estimation suffers from\nerror saturation, which prevents algorithms from achieving fast error\nconvergence rates on highly regular learning problems. To resolve saturation,\nwe introduce iterated regularization in density ratio estimation to achieve\nfast error rates. Our methods outperform its non-iteratively regularized\nversions on benchmarks for density ratio estimation as well as on large-scale\nevaluations for importance-weighted ensembling of deep unsupervised domain\nadaptation models.\n","authors":["Lukas Gruber","Markus Holzleitner","Johannes Lehner","Sepp Hochreiter","Werner Zellinger"],"pdf_url":"https://arxiv.org/pdf/2402.13891v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.08045v4","updated":"2024-06-03T11:34:05Z","published":"2023-11-14T10:10:31Z","title":"Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM\n Game","summary":" Human preference alignment is essential to improve the interaction quality of\nlarge language models (LLMs). Existing alignment methods depend on manually\nannotated preference data to guide the LLM optimization directions. However,\ncontinuously updating LLMs for alignment raises a distribution gap between\nmodel-generated samples and human-annotated responses, hindering training\neffectiveness. To mitigate this issue, previous methods require additional\npreference annotation on newly generated samples to adapt to the shifted\ndistribution, which consumes a large amount of annotation resources. Targeting\nmore efficient human preference optimization, we propose an Adversarial\nPreference Optimization (APO) framework, in which the LLM and the reward model\nupdate alternatively via a min-max game. Through adversarial training, the\nreward model can adapt to the shifted generation distribution of the LLM\nwithout any additional annotation. With comprehensive experiments, we find the\nproposed adversarial training framework further enhances existing alignment\nbaselines in terms of LLM helpfulness and harmlessness. The code is at\nhttps://github.com/Linear95/APO.\n","authors":["Pengyu Cheng","Yifan Yang","Jian Li","Yong Dai","Tianhao Hu","Peixin Cao","Nan Du","Xiaolong Li"],"pdf_url":"https://arxiv.org/pdf/2311.08045v4.pdf","comment":"Accepted by ACL2024 findings"},{"id":"http://arxiv.org/abs/2403.06807v2","updated":"2024-06-03T11:33:51Z","published":"2024-03-11T15:26:34Z","title":"Multistep Consistency Models","summary":" Diffusion models are relatively easy to train but require many steps to\ngenerate samples. Consistency models are far more difficult to train, but\ngenerate samples in a single step.\n In this paper we propose Multistep Consistency Models: A unification between\nConsistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that\ncan interpolate between a consistency model and a diffusion model: a trade-off\nbetween sampling speed and sampling quality. Specifically, a 1-step consistency\nmodel is a conventional consistency model whereas a $\\infty$-step consistency\nmodel is a diffusion model.\n Multistep Consistency Models work really well in practice. By increasing the\nsample budget from a single step to 2-8 steps, we can train models more easily\nthat generate higher quality samples, while retaining much of the sampling\nspeed benefits. Notable results are 1.4 FID on Imagenet 64 in 8 step and 2.1\nFID on Imagenet128 in 8 steps with consistency distillation, using simple\nlosses without adversarial training. We also show that our method scales to a\ntext-to-image diffusion model, generating samples that are close to the quality\nof the original model.\n","authors":["Jonathan Heek","Emiel Hoogeboom","Tim Salimans"],"pdf_url":"https://arxiv.org/pdf/2403.06807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17406v2","updated":"2024-06-03T11:32:11Z","published":"2024-05-27T17:55:05Z","title":"Deep Learning Calabi-Yau four folds with hybrid and recurrent neural\n network architectures","summary":" In this work, we report the results of applying deep learning based on hybrid\nconvolutional-recurrent and purely recurrent neural network architectures to\nthe dataset of almost one million complete intersection Calabi-Yau four-folds\n(CICY4) to machine-learn their four Hodge numbers $h^{1,1}, h^{2,1}, h^{3,1},\nh^{2,2}$. In particular, we explored and experimented with twelve different\nneural network models, nine of which are convolutional-recurrent (CNN-RNN)\nhybrids with the RNN unit being either GRU (Gated Recurrent Unit) or Long Short\nTerm Memory (LSTM). The remaining four models are purely recurrent neural\nnetworks based on LSTM. In terms of the $h^{1,1}, h^{2,1}, h^{3,1}, h^{2,2}$\nprediction accuracies, at 72% training ratio, our best performing individual\nmodel is CNN-LSTM-400, a hybrid CNN-LSTM with the LSTM hidden size of 400,\nwhich obtained 99.74%, 98.07%, 95.19%, 81.01%, our second best performing\nindividual model is LSTM-448, an LSTM-based model with the hidden size of 448,\nwhich obtained 99.74%, 97.51%, 94.24%, and 78.63%. These results were improved\nby forming ensembles of the top two, three or even four models. Our best\nensemble, consisting of the top four models, achieved the accuracies of 99.84%,\n98.71%, 96.26%, 85.03%. At 80% training ratio, the top two performing models\nLSTM-448 and LSTM-424 are both LSTM-based with the hidden sizes of 448 and 424.\nCompared with the 72% training ratio, there is a significant improvement of\naccuracies, which reached 99.85%, 98.66%, 96.26%, 84.77% for the best\nindividual model and 99.90%, 99.03%, 97.97%, 87.34% for the best ensemble.\n","authors":["H. L. Dao"],"pdf_url":"https://arxiv.org/pdf/2405.17406v2.pdf","comment":"v2: new (improved) results added, references added, typos corrected"},{"id":"http://arxiv.org/abs/2405.18983v2","updated":"2024-06-03T11:16:55Z","published":"2024-05-29T10:56:13Z","title":"Federated Learning under Partially Class-Disjoint Data via Manifold\n Reshaping","summary":" Statistical heterogeneity severely limits the performance of federated\nlearning (FL), motivating several explorations e.g., FedProx, MOON and FedDyn,\nto alleviate this problem. Despite effectiveness, their considered scenario\ngenerally requires samples from almost all classes during the local training of\neach client, although some covariate shifts may exist among clients. In fact,\nthe natural case of partially class-disjoint data (PCDD), where each client\ncontributes a few classes (instead of all classes) of samples, is practical yet\nunderexplored. Specifically, the unique collapse and invasion characteristics\nof PCDD can induce the biased optimization direction in local training, which\nprevents the efficiency of federated learning. To address this dilemma, we\npropose a manifold reshaping approach called FedMR to calibrate the feature\nspace of local training. Our FedMR adds two interplaying losses to the vanilla\nfederated learning: one is intra-class loss to decorrelate feature dimensions\nfor anti-collapse; and the other one is inter-class loss to guarantee the\nproper margin among categories in the feature expansion. We conduct extensive\nexperiments on a range of datasets to demonstrate that our FedMR achieves much\nhigher accuracy and better communication efficiency. Source code is available\nat: https://github.com/MediaBrain-SJTU/FedMR.git.\n","authors":["Ziqing Fan","Jiangchao Yao","Ruipeng Zhang","Lingjuan Lyu","Ya Zhang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2405.18983v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16444v2","updated":"2024-06-03T10:57:57Z","published":"2024-05-26T06:00:17Z","title":"CacheBlend: Fast Large Language Model Serving for RAG with Cached\n Knowledge Fusion","summary":" Large language models (LLMs) often incorporate multiple text chunks in their\ninputs to provide the necessary contexts. To speed up the prefill of the long\nLLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache\nwhen the context is reused as the prefix of another LLM input. However, the\nreused text chunks are not always the input prefix, and when they are not,\ntheir precomputed KV caches cannot be directly used since they ignore the\ntext's cross-attention with the preceding text in the LLM input. Thus, the\nbenefits of reusing KV caches remain largely unrealized.\n This paper tackles just one question: when an LLM input contains multiple\ntext chunks, how to quickly combine their precomputed KV caches in order to\nachieve the same generation quality as the expensive full prefill (i.e.,\nwithout reusing KV cache)? We present CacheBlend, a scheme that reuses the\npre-computed KV caches, regardless prefix or not, and selectively recomputes\nthe KV values of a small subset of tokens to partially update each reused KV\ncache. In the meantime,the small extra delay for recomputing some tokens can be\npipelined with the retrieval of KV caches within the same job,allowing\nCacheBlend to store KV caches in slower devices with more storage capacity\nwhile retrieving them without increasing the inference delay. By comparing\nCacheBlend with the state-of-the-art KV cache reusing schemes on three\nopen-source LLMs of various sizes and four popular benchmark datasets of\ndifferent tasks, we show that CacheBlend reduces time-to-first-token (TTFT) by\n2.2-3.3X and increases the inference throughput by 2.8-5X, compared with full\nKV recompute, without compromising generation quality or incurring more storage\ncost.\n","authors":["Jiayi Yao","Hanchen Li","Yuhan Liu","Siddhant Ray","Yihua Cheng","Qizheng Zhang","Kuntai Du","Shan Lu","Junchen Jiang"],"pdf_url":"https://arxiv.org/pdf/2405.16444v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06970v3","updated":"2024-06-03T10:50:08Z","published":"2023-10-10T19:47:58Z","title":"Flood and Echo Net: Algorithmically Aligned GNNs that Generalize","summary":" Most Graph Neural Networks follow the standard message-passing framework\nwhere, in each step, all nodes simultaneously communicate with each other. We\nwant to challenge this paradigm by aligning the computation more closely to the\nexecution of distributed algorithms and propose the Flood and Echo Net. A\nsingle round of a Flood and Echo Net consists of an origin node and a flooding\nphase followed by an echo phase. First, during the flooding, messages are sent\nfrom the origin and propagated outwards throughout the entire graph. Then,\nduring the echo, the message flow reverses and messages are sent back towards\nthe origin. As nodes are only sparsely activated upon receiving a message, this\nleads to a wave-like activation pattern that traverses the graph. Through these\nsparse but parallel activations, the Net becomes more expressive than\ntraditional MPNNs which are limited by the 1-WL test and also is provably more\nefficient in terms of message complexity. Moreover, the mechanism's inherent\nability to generalize across graphs of varying sizes positions it as a\npractical architecture for the task of algorithmic learning. We test the Flood\nand Echo Net on a variety of synthetic tasks and the SALSA-CLRS benchmark and\nfind that the algorithmic alignment of the execution improves generalization to\nlarger graph sizes.\n","authors":["Joël Mathys","Florian Grötschla","Kalyan Varma Nadimpalli","Roger Wattenhofer"],"pdf_url":"https://arxiv.org/pdf/2310.06970v3.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2405.16956v2","updated":"2024-06-03T10:42:50Z","published":"2024-05-27T08:46:57Z","title":"Functional Programming Paradigm of Python for Scientific Computation\n Pipeline Integration","summary":" The advent of modern data processing has led to an increasing tendency\ntowards interdisciplinarity, which frequently involves the importation of\ndifferent technical approaches. Consequently, there is an urgent need for a\nunified data control system to facilitate the integration of varying libraries.\nThis integration is of profound significance in accelerating prototype\nverification, optimising algorithm performance and minimising maintenance\ncosts. This paper presents a novel functional programming (FP) paradigm based\non the Python architecture and associated suites in programming practice,\ndesigned for the integration of pipelines of different data mapping operations.\nIn particular, the solution is intended for the integration of scientific\ncomputation flows, which affords a robust yet flexible solution for the\naforementioned challenges.\n","authors":["Chen Zhang","Lecheng Jia","Wei Zhang","Ning Wen"],"pdf_url":"https://arxiv.org/pdf/2405.16956v2.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2405.16088v2","updated":"2024-06-03T10:26:10Z","published":"2024-05-25T06:39:39Z","title":"Estimating the normal-inverse-Wishart distribution","summary":" The normal-inverse-Wishart (NIW) distribution is commonly used as a prior\ndistribution for the mean and covariance parameters of a multivariate normal\ndistribution. The family of NIW distributions is also a minimal exponential\nfamily. In this short note we describe a convergent procedure for converting\nfrom mean parameters to natural parameters in the NIW family, or --\nequivalently -- for performing maximum likelihood estimation of the natural\nparameters given observed sufficient statistics. This is needed, for example,\nwhen using a NIW base family in expectation propagation.\n","authors":["Jonathan So"],"pdf_url":"https://arxiv.org/pdf/2405.16088v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10634v2","updated":"2024-06-03T10:26:05Z","published":"2024-02-16T12:33:31Z","title":"Graph-based Forecasting with Missing Data through Spatiotemporal\n Downsampling","summary":" Given a set of synchronous time series, each associated with a sensor-point\nin space and characterized by inter-series relationships, the problem of\nspatiotemporal forecasting consists of predicting future observations for each\npoint. Spatiotemporal graph neural networks achieve striking results by\nrepresenting the relationships across time series as a graph. Nonetheless, most\nexisting methods rely on the often unrealistic assumption that inputs are\nalways available and fail to capture hidden spatiotemporal dynamics when part\nof the data is missing. In this work, we tackle this problem through\nhierarchical spatiotemporal downsampling. The input time series are\nprogressively coarsened over time and space, obtaining a pool of\nrepresentations that capture heterogeneous temporal and spatial dynamics.\nConditioned on observations and missing data patterns, such representations are\ncombined by an interpretable attention mechanism to generate the forecasts. Our\napproach outperforms state-of-the-art methods on synthetic and real-world\nbenchmarks under different missing data distributions, particularly in the\npresence of contiguous blocks of missing values.\n","authors":["Ivan Marisca","Cesare Alippi","Filippo Maria Bianchi"],"pdf_url":"https://arxiv.org/pdf/2402.10634v2.pdf","comment":"Accepted at ICML 2024"},{"id":"http://arxiv.org/abs/2402.09631v3","updated":"2024-06-03T10:24:22Z","published":"2024-02-15T00:20:30Z","title":"Representation Surgery: Theory and Practice of Affine Steering","summary":" Language models often exhibit undesirable behavior, e.g., generating toxic or\ngender-biased text. In the case of neural language models, an encoding of the\nundesirable behavior is often present in the model's representations. Thus, one\nnatural (and common) approach to prevent the model from exhibiting undesirable\nbehavior is to steer the model's representations in a manner that reduces the\nprobability of it generating undesirable text. This paper investigates the\nformal and empirical properties of steering functions, i.e., transformation of\nthe neural language model's representations that alter its behavior. First, we\nderive two optimal, in the least-squares sense, affine steering functions under\ndifferent constraints. Our theory provides justification for existing\napproaches and offers a novel, improved steering approach. Second, we offer a\nseries of experiments that demonstrate the empirical effectiveness of the\nmethods in mitigating bias and reducing toxic generation.\n","authors":["Shashwat Singh","Shauli Ravfogel","Jonathan Herzig","Roee Aharoni","Ryan Cotterell","Ponnurangam Kumaraguru"],"pdf_url":"https://arxiv.org/pdf/2402.09631v3.pdf","comment":"Accepted in ICML 2024"},{"id":"http://arxiv.org/abs/2212.05260v2","updated":"2024-06-03T10:16:12Z","published":"2022-12-10T10:34:35Z","title":"Examining properness in the external validation of survival models with\n squared and logarithmic losses","summary":" Scoring rules promote rational and honest decision-making, which is becoming\nincreasingly important for automated procedures in `auto-ML'. In this paper we\nsurvey common squared and logarithmic scoring rules for survival analysis and\ndetermine which losses are proper and improper. We prove that commonly utilised\nsquared and logarithmic scoring rules that are claimed to be proper are in fact\nimproper, such as the Integrated Survival Brier Score (ISBS). We further prove\nthat under a strict set of assumptions a class of scoring rules is strictly\nproper for, what we term, `approximate' survival losses. Despite the difference\nin properness, experiments in simulated and real-world datasets show there is\nno major difference between improper and proper versions of the widely-used\nISBS, ensuring that we can reasonably trust previous experiments utilizing the\noriginal score for evaluation purposes. We still advocate for the use of proper\nscoring rules, as even minor differences between losses can have important\nimplications in automated processes such as model tuning. We hope our findings\nencourage further research into the properties of survival measures so that\nrobust and honest evaluation of survival models can be achieved.\n","authors":["Raphael Sonabend","John Zobolas","Philipp Kopper","Lukas Burk","Andreas Bender"],"pdf_url":"https://arxiv.org/pdf/2212.05260v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.00463v2","updated":"2024-06-03T10:07:13Z","published":"2023-11-01T11:57:43Z","title":"Robust and Conjugate Gaussian Process Regression","summary":" To enable closed form conditioning, a common assumption in Gaussian process\n(GP) regression is independent and identically distributed Gaussian observation\nnoise. This strong and simplistic assumption is often violated in practice,\nwhich leads to unreliable inferences and uncertainty quantification.\nUnfortunately, existing methods for robustifying GPs break closed-form\nconditioning, which makes them less attractive to practitioners and\nsignificantly more computationally expensive. In this paper, we demonstrate how\nto perform provably robust and conjugate Gaussian process (RCGP) regression at\nvirtually no additional cost using generalised Bayesian inference. RCGP is\nparticularly versatile as it enables exact conjugate closed form updates in all\nsettings where standard GPs admit them. To demonstrate its strong empirical\nperformance, we deploy RCGP for problems ranging from Bayesian optimisation to\nsparse variational Gaussian processes.\n","authors":["Matias Altamirano","François-Xavier Briol","Jeremias Knoblauch"],"pdf_url":"https://arxiv.org/pdf/2311.00463v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03945v2","updated":"2024-06-03T09:55:44Z","published":"2024-03-06T18:52:39Z","title":"SPEAR:Exact Gradient Inversion of Batches in Federated Learning","summary":" Federated learning is a framework for collaborative machine learning where\nclients only share gradient updates and not their private data with a server.\nHowever, it was recently shown that gradient inversion attacks can reconstruct\nthis data from the shared gradients. In the important honest-but-curious\nsetting, existing attacks enable exact reconstruction only for a batch size of\n$b=1$, with larger batches permitting only approximate reconstruction. In this\nwork, we propose SPEAR, the first algorithm reconstructing whole batches with\n$b >1$ exactly. SPEAR combines insights into the explicit low-rank structure of\ngradients with a sampling-based algorithm. Crucially, we leverage ReLU-induced\ngradient sparsity to precisely filter out large numbers of incorrect samples,\nmaking a final reconstruction step tractable. We provide an efficient GPU\nimplementation for fully connected networks and show that it recovers\nhigh-dimensional ImageNet inputs in batches of up to $b \\lesssim 25$ exactly\nwhile scaling to large networks. Finally, we show theoretically that much\nlarger batches can be reconstructed with high probability given exponential\ntime.\n","authors":["Dimitar I. Dimitrov","Maximilian Baader","Mark Niklas Müller","Martin Vechev"],"pdf_url":"https://arxiv.org/pdf/2403.03945v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.16275v2","updated":"2024-06-03T09:53:22Z","published":"2022-11-29T15:04:09Z","title":"A survey on multi-player bandits","summary":" Due mostly to its application to cognitive radio networks, multiplayer\nbandits gained a lot of interest in the last decade. A considerable progress\nhas been made on its theoretical aspect. However, the current algorithms are\nfar from applicable and many obstacles remain between these theoretical results\nand a possible implementation of multiplayer bandits algorithms in real\ncognitive radio networks. This survey contextualizes and organizes the rich\nmultiplayer bandits literature. In light of the existing works, some clear\ndirections for future research appear. We believe that a further study of these\ndifferent directions might lead to theoretical algorithms adapted to real-world\nsituations.\n","authors":["Etienne Boursier","Vianney Perchet"],"pdf_url":"https://arxiv.org/pdf/2211.16275v2.pdf","comment":"final version, accepted at JMLR"},{"id":"http://arxiv.org/abs/2402.03819v2","updated":"2024-06-03T09:53:06Z","published":"2024-02-06T09:07:41Z","title":"Do we need rebalancing strategies? A theoretical and empirical study\n around SMOTE and its variants","summary":" Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing\nstrategy for handling imbalanced tabular data sets. However, few works analyze\nSMOTE theoretically. In this paper, we prove that SMOTE (with default\nparameter) simply copies the original minority samples asymptotically. We also\nprove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE\nvariants. Then we introduce two new SMOTE-related strategies, and compare them\nwith state-of-the-art rebalancing procedures. Surprisingly, for most data sets,\nwe observe that applying no rebalancing strategy is competitive in terms of\npredictive performances, with tuned random forests. For highly imbalanced data\nsets, our new method, named Multivariate Gaussian SMOTE, is competitive.\nBesides, our analysis sheds some lights on the behavior of common rebalancing\nstrategies, when used in conjunction with random forests.\n","authors":["Abdoulaye Sakho","Emmanuel Malherbe","Erwan Scornet"],"pdf_url":"https://arxiv.org/pdf/2402.03819v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09930v3","updated":"2024-06-03T09:46:32Z","published":"2024-03-15T00:09:47Z","title":"Quality-Diversity Actor-Critic: Learning High-Performing and Diverse\n Behaviors via Value and Successor Features Critics","summary":" A key aspect of intelligence is the ability to demonstrate a broad spectrum\nof behaviors for adapting to unexpected situations. Over the past decade,\nadvancements in deep reinforcement learning have led to groundbreaking\nachievements to solve complex continuous control tasks. However, most\napproaches return only one solution specialized for a specific problem. We\nintroduce Quality-Diversity Actor-Critic (QDAC), an off-policy actor-critic\ndeep reinforcement learning algorithm that leverages a value function critic\nand a successor features critic to learn high-performing and diverse behaviors.\nIn this framework, the actor optimizes an objective that seamlessly unifies\nboth critics using constrained optimization to (1) maximize return, while (2)\nexecuting diverse skills. Compared with other Quality-Diversity methods, QDAC\nachieves significantly higher performance and more diverse behaviors on six\nchallenging continuous control locomotion tasks. We also demonstrate that we\ncan harness the learned skills to adapt better than other baselines to five\nperturbed environments. Finally, qualitative analyses showcase a range of\nremarkable behaviors: adaptive-intelligent-robotics.github.io/QDAC.\n","authors":["Luca Grillotti","Maxence Faldor","Borja G. León","Antoine Cully"],"pdf_url":"https://arxiv.org/pdf/2403.09930v3.pdf","comment":"The first two authors contributed equally to this work. Accepted at\n ICML 2024"},{"id":"http://arxiv.org/abs/2405.06582v2","updated":"2024-06-03T09:27:03Z","published":"2024-05-10T16:36:59Z","title":"The Role of Learning Algorithms in Collective Action","summary":" Collective action in machine learning is the study of the control that a\ncoordinated group can have over machine learning algorithms. While previous\nresearch has concentrated on assessing the impact of collectives against\nBayes~(sub)-optimal classifiers, this perspective is limited in that it does\nnot account for the choice of learning algorithm. Classifiers seldom behave\nlike Bayes classifiers and are influenced by the choice of learning algorithms\nalong with their inherent biases. In this work, we initiate the study of how\nthe choice of the learning algorithm plays a role in the success of a\ncollective in practical settings. Specifically, we focus on distributionally\nrobust optimization (DRO), popular for improving a worst group error, and on\nthe ubiquitous stochastic gradient descent (SGD), due to its inductive bias for\n\"simpler\" functions. Our empirical results, supported by a theoretical\nfoundation, show that the effective size and success of the collective are\nhighly dependent on properties of the learning algorithm. This highlights the\nnecessity of taking the learning algorithm into account when studying the\nimpact of collective action in machine learning.\n","authors":["Omri Ben-Dov","Jake Fawkes","Samira Samadi","Amartya Sanyal"],"pdf_url":"https://arxiv.org/pdf/2405.06582v2.pdf","comment":"Accepted at the International Conference in Machine Learning (ICML),\n 2024"},{"id":"http://arxiv.org/abs/2311.18741v2","updated":"2024-06-03T09:15:29Z","published":"2023-11-30T17:38:54Z","title":"VREM-FL: Mobility-Aware Computation-Scheduling Co-Design for Vehicular\n Federated Learning","summary":" Assisted and autonomous driving are rapidly gaining momentum and will soon\nbecome a reality. Artificial intelligence and machine learning are regarded as\nkey enablers thanks to the massive amount of data that smart vehicles will\ncollect from onboard sensors. Federated learning is one of the most promising\ntechniques for training global machine learning models while preserving data\nprivacy of vehicles and optimizing communications resource usage. In this\narticle, we propose vehicular radio environment map federated learning\n(VREM-FL), a computation-scheduling co-design for vehicular federated learning\nthat combines mobility of vehicles with 5G radio environment maps. VREM-FL\njointly optimizes learning performance of the global model and wisely allocates\ncommunication and computation resources. This is achieved by orchestrating\nlocal computations at the vehicles in conjunction with transmission of their\nlocal models in an adaptive and predictive fashion, by exploiting radio channel\nmaps. The proposed algorithm can be tuned to trade training time for radio\nresource usage. Experimental results demonstrate that VREM-FL outperforms\nliterature benchmarks for both a linear regression model (learning time reduced\nby 28%) and a deep neural network for semantic image segmentation (doubling the\nnumber of model updates within the same time window).\n","authors":["Luca Ballotta","Nicolò Dal Fabbro","Giovanni Perin","Luca Schenato","Michele Rossi","Giuseppe Piro"],"pdf_url":"https://arxiv.org/pdf/2311.18741v2.pdf","comment":"This work has been submitted to IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2403.09389v2","updated":"2024-06-03T09:10:27Z","published":"2024-03-14T13:40:26Z","title":"Learning to optimize with convergence guarantees using nonlinear system\n theory","summary":" The increasing reliance on numerical methods for controlling dynamical\nsystems and training machine learning models underscores the need to devise\nalgorithms that dependably and efficiently navigate complex optimization\nlandscapes. Classical gradient descent methods offer strong theoretical\nguarantees for convex problems; however, they demand meticulous hyperparameter\ntuning for non-convex ones. The emerging paradigm of learning to optimize (L2O)\nautomates the discovery of algorithms with optimized performance leveraging\nlearning models and data - yet, it lacks a theoretical framework to analyze\nconvergence of the learned algorithms. In this paper, we fill this gap by\nharnessing nonlinear system theory. Specifically, we propose an unconstrained\nparametrization of all convergent algorithms for smooth non-convex objective\nfunctions. Notably, our framework is directly compatible with automatic\ndifferentiation tools, ensuring convergence by design while learning to\noptimize.\n","authors":["Andrea Martin","Luca Furieri"],"pdf_url":"https://arxiv.org/pdf/2403.09389v2.pdf","comment":"Published in the IEEE Control Systems Letters"},{"id":"http://arxiv.org/abs/2404.00074v2","updated":"2024-06-03T09:03:10Z","published":"2024-03-28T19:57:48Z","title":"A finite operator learning technique for mapping the elastic properties\n of microstructures to their mechanical deformations","summary":" To obtain fast solutions for governing physical equations in solid mechanics,\nwe introduce a method that integrates the core ideas of the finite element\nmethod with physics-informed neural networks and concept of neural operators.\nThis approach generalizes and enhances each method, learning the parametric\nsolution for mechanical problems without relying on data from other resources\n(e.g. other numerical solvers). We propose directly utilizing the available\ndiscretized weak form in finite element packages to construct the loss\nfunctions algebraically, thereby demonstrating the ability to find solutions\neven in the presence of sharp discontinuities. Our focus is on micromechanics\nas an example, where knowledge of deformation and stress fields for a given\nheterogeneous microstructure is crucial for further design applications. The\nprimary parameter under investigation is the Young's modulus distribution\nwithin the heterogeneous solid system. Our investigations reveal that\nphysics-based training yields higher accuracy compared to purely data-driven\napproaches for unseen microstructures. Additionally, we offer two methods to\ndirectly improve the process of obtaining high-resolution solutions, avoiding\nthe need to use basic interpolation techniques. First is based on an\nautoencoder approach to enhance the efficiency for calculation on high\nresolution grid point. Next, Fourier-based parametrization is utilized to\naddress complex 2D and 3D problems in micromechanics. The latter idea aims to\nrepresent complex microstructures efficiently using Fourier coefficients.\nComparisons with other well-known operator learning algorithms, further\nemphasize the advantages of the newly proposed method.\n","authors":["Shahed Rezaei","Reza Najian Asl","Shirko Faroughi","Mahdi Asgharzadeh","Ali Harandi","Rasoul Najafi Koopas","Gottfried Laschet","Stefanie Reese","Markus Apel"],"pdf_url":"https://arxiv.org/pdf/2404.00074v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.06395v3","updated":"2024-06-03T08:54:38Z","published":"2024-04-09T15:36:50Z","title":"MiniCPM: Unveiling the Potential of Small Language Models with Scalable\n Training Strategies","summary":" The burgeoning interest in developing Large Language Models (LLMs) with up to\ntrillion parameters has been met with concerns regarding resource efficiency\nand practical expense, particularly given the immense cost of experimentation.\nThis scenario underscores the importance of exploring the potential of Small\nLanguage Models (SLMs) as a resource-efficient alternative. In this context, we\nintroduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter\nvariants, not only excel in their respective categories but also demonstrate\ncapabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach\nexhibits scalability in both model and data dimensions for future LLM research.\nRegarding model scaling, we employ extensive model wind tunnel experiments for\nstable and optimal scaling. For data scaling, we introduce a\nWarmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to\ncontinuous training and domain adaptation. We present an in-depth analysis of\nthe intriguing training dynamics that occurred in the WSD LRS. With WSD LRS, we\nare now able to efficiently study data-model scaling law without extensive\nretraining experiments on both axes of model and data, from which we derive the\nmuch higher compute optimal data-model ratio than Chinchilla Optimal.\nAdditionally, we introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE\nand MiniCPM-128K, whose excellent performance further cementing MiniCPM's\nfoundation in diverse SLM applications. MiniCPM models are available publicly\nat https://github.com/OpenBMB/MiniCPM .\n","authors":["Shengding Hu","Yuge Tu","Xu Han","Chaoqun He","Ganqu Cui","Xiang Long","Zhi Zheng","Yewei Fang","Yuxiang Huang","Weilin Zhao","Xinrong Zhang","Zheng Leng Thai","Kaihuo Zhang","Chongyi Wang","Yuan Yao","Chenyang Zhao","Jie Zhou","Jie Cai","Zhongwu Zhai","Ning Ding","Chao Jia","Guoyang Zeng","Dahai Li","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2404.06395v3.pdf","comment":"revise according to peer review"},{"id":"http://arxiv.org/abs/2201.05745v4","updated":"2024-06-03T08:51:23Z","published":"2022-01-15T03:13:02Z","title":"Deep Optimal Transport for Domain Adaptation on SPD Manifolds","summary":" The machine learning community has shown increasing interest in addressing\nthe domain adaptation problem on symmetric positive definite (SPD) manifolds.\nThis interest is primarily driven by the complexities of neuroimaging data\ngenerated from brain signals, which often exhibit shifts in data distribution\nacross recording sessions. These neuroimaging data, represented by signal\ncovariance matrices, possess the mathematical properties of symmetry and\npositive definiteness. However, applying conventional domain adaptation methods\nis challenging because these mathematical properties can be disrupted when\noperating on covariance matrices. In this study, we introduce a novel geometric\ndeep learning-based approach utilizing optimal transport on SPD manifolds to\nmanage discrepancies in both marginal and conditional distributions between the\nsource and target domains. We evaluate the effectiveness of this approach in\nthree cross-session brain-computer interface scenarios and provide visualized\nresults for further insights. The GitHub repository of this study can be\naccessed at\nhttps://github.com/GeometricBCI/Deep-Optimal-Transport-for-Domain-Adaptation-on-SPD-Manifolds.\n","authors":["Ce Ju","Cuntai Guan"],"pdf_url":"https://arxiv.org/pdf/2201.05745v4.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2405.04657v2","updated":"2024-06-03T08:50:31Z","published":"2024-05-07T20:30:14Z","title":"ACEGEN: Reinforcement learning of generative chemical agents for drug\n discovery","summary":" In recent years, reinforcement learning (RL) has emerged as a valuable tool\nin drug design, offering the potential to propose and optimize molecules with\ndesired properties. However, striking a balance between capabilities,\nflexibility, reliability, and efficiency remains challenging due to the\ncomplexity of advanced RL algorithms and the significant reliance on\nspecialized code. In this work, we introduce ACEGEN, a comprehensive and\nstreamlined toolkit tailored for generative drug design, built using TorchRL, a\nmodern RL library that offers thoroughly tested reusable components. We\nvalidate ACEGEN by benchmarking against other published generative modeling\nalgorithms and show comparable or improved performance. We also show examples\nof ACEGEN applied in multiple drug discovery case studies. ACEGEN is accessible\nat \\url{https://github.com/acellera/acegen-open} and available for use under\nthe MIT license.\n","authors":["Albert Bou","Morgan Thomas","Sebastian Dittert","Carles Navarro Ramírez","Maciej Majewski","Ye Wang","Shivam Patel","Gary Tresadern","Mazen Ahmad","Vincent Moens","Woody Sherman","Simone Sciabola","Gianni De Fabritiis"],"pdf_url":"https://arxiv.org/pdf/2405.04657v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.05440v3","updated":"2024-06-03T08:49:54Z","published":"2023-11-09T15:24:44Z","title":"A Practical Approach to Novel Class Discovery in Tabular Data","summary":" The problem of Novel Class Discovery (NCD) consists in extracting knowledge\nfrom a labeled set of known classes to accurately partition an unlabeled set of\nnovel classes. While NCD has recently received a lot of attention from the\ncommunity, it is often solved on computer vision problems and under unrealistic\nconditions. In particular, the number of novel classes is usually assumed to be\nknown in advance, and their labels are sometimes used to tune hyperparameters.\nMethods that rely on these assumptions are not applicable in real-world\nscenarios. In this work, we focus on solving NCD in tabular data when no prior\nknowledge of the novel classes is available. To this end, we propose to tune\nthe hyperparameters of NCD methods by adapting the $k$-fold cross-validation\nprocess and hiding some of the known classes in each fold. Since we have found\nthat methods with too many hyperparameters are likely to overfit these hidden\nclasses, we define a simple deep NCD model. This method is composed of only the\nessential elements necessary for the NCD problem and performs impressively well\nunder realistic conditions. Furthermore, we find that the latent space of this\nmethod can be used to reliably estimate the number of novel classes.\nAdditionally, we adapt two unsupervised clustering algorithms ($k$-means and\nSpectral Clustering) to leverage the knowledge of the known classes. Extensive\nexperiments are conducted on 7 tabular datasets and demonstrate the\neffectiveness of the proposed method and hyperparameter tuning process, and\nshow that the NCD problem can be solved without relying on knowledge from the\nnovel classes.\n","authors":["Colin Troisemaine","Alexandre Reiffers-Masson","Stéphane Gosselin","Vincent Lemaire","Sandrine Vaton"],"pdf_url":"https://arxiv.org/pdf/2311.05440v3.pdf","comment":"30 pages, including 7 pages of annexes"},{"id":"http://arxiv.org/abs/2405.21027v2","updated":"2024-06-03T08:43:51Z","published":"2024-05-31T17:16:29Z","title":"Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles","summary":" A popular approach for solving zero-sum games is to maintain populations of\npolicies to approximate the Nash Equilibrium (NE). Previous studies have shown\nthat Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent\nreinforcement learning framework for solving such games. However, repeatedly\ntraining new policies from scratch to approximate Best Response (BR) to\nopponents' mixed policies at each iteration is both inefficient and costly.\nWhile some PSRO variants initialize a new policy by inheriting from past BR\npolicies, this approach limits the exploration of new policies, especially\nagainst challenging opponents. To address this issue, we propose Fusion-PSRO,\nwhich employs policy fusion to initialize policies for better approximation to\nBR. By selecting high-quality base policies from meta-NE, policy fusion fuses\nthe base policies into a new policy through model averaging. This approach\nallows the initialized policies to incorporate multiple expert policies, making\nit easier to handle difficult opponents compared to inheriting from past BR\npolicies or initializing from scratch. Moreover, our method only modifies the\npolicy initialization phase, allowing its application to nearly all PSRO\nvariants without additional training overhead. Our experiments on\nnon-transitive matrix games, Leduc Poker, and the more complex Liars Dice\ndemonstrate that Fusion-PSRO enhances the performance of nearly all PSRO\nvariants, achieving lower exploitability.\n","authors":["Jiesong Lian","Yucong Huang","Mingzhi Wang","Chengdong Ma","Yixue Hao","Ying Wen","Yaodong Yang"],"pdf_url":"https://arxiv.org/pdf/2405.21027v2.pdf","comment":"20 pages, 5 figures"},{"id":"http://arxiv.org/abs/2308.13049v3","updated":"2024-06-03T08:23:19Z","published":"2023-08-24T19:35:58Z","title":"Bayesian Exploration Networks","summary":" Bayesian reinforcement learning (RL) offers a principled and elegant approach\nfor sequential decision making under uncertainty. Most notably, Bayesian agents\ndo not face an exploration/exploitation dilemma, a major pathology of\nfrequentist methods. However theoretical understanding of model-free approaches\nis lacking. In this paper, we introduce a novel Bayesian model-free formulation\nand the first analysis showing that model-free approaches can yield\nBayes-optimal policies. We show all existing model-free approaches make\napproximations that yield policies that can be arbitrarily Bayes-suboptimal. As\na first step towards model-free Bayes optimality, we introduce the Bayesian\nexploration network (BEN) which uses normalising flows to model both the\naleatoric uncertainty (via density estimation) and epistemic uncertainty (via\nvariational inference) in the Bellman operator. In the limit of complete\noptimisation, BEN learns true Bayes-optimal policies, but like in variational\nexpectation-maximisation, partial optimisation renders our approach tractable.\nEmpirical results demonstrate that BEN can learn true Bayes-optimal policies in\ntasks where existing model-free approaches fail.\n","authors":["Mattie Fellows","Brandon Kaplowitz","Christian Schroeder de Witt","Shimon Whiteson"],"pdf_url":"https://arxiv.org/pdf/2308.13049v3.pdf","comment":"ICML 2024 Version Update"},{"id":"http://arxiv.org/abs/2302.09826v3","updated":"2024-06-03T08:20:31Z","published":"2023-02-20T08:19:19Z","title":"On the Expressivity of Persistent Homology in Graph Learning","summary":" Persistent homology, a technique from computational topology, has recently\nshown strong empirical performance in the context of graph classification.\nBeing able to capture long range graph properties via higher-order topological\nfeatures, such as cycles of arbitrary length, in combination with multi-scale\ntopological descriptors, has improved predictive performance for data sets with\nprominent topological structures, such as molecules. At the same time, the\ntheoretical properties of persistent homology have not been formally assessed\nin this context. This paper intends to bridge the gap between computational\ntopology and graph machine learning by providing a brief introduction to\npersistent homology in the context of graphs, as well as a theoretical\ndiscussion and empirical analysis of its expressivity for graph learning tasks.\n","authors":["Rubén Ballester","Bastian Rieck"],"pdf_url":"https://arxiv.org/pdf/2302.09826v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.05890v3","updated":"2024-06-03T08:14:56Z","published":"2024-03-09T12:04:56Z","title":"Towards Efficient Replay in Federated Incremental Learning","summary":" In Federated Learning (FL), the data in each client is typically assumed\nfixed or static. However, data often comes in an incremental manner in\nreal-world applications, where the data domain may increase dynamically. In\nthis work, we study catastrophic forgetting with data heterogeneity in\nFederated Incremental Learning (FIL) scenarios where edge clients may lack\nenough storage space to retain full data. We propose to employ a simple,\ngeneric framework for FIL named Re-Fed, which can coordinate each client to\ncache important samples for replay. More specifically, when a new task arrives,\neach client first caches selected previous samples based on their global and\nlocal importance. Then, the client trains the local model with both the cached\nsamples and the samples from the new task. Theoretically, we analyze the\nability of Re-Fed to discover important samples for replay thus alleviating the\ncatastrophic forgetting problem. Moreover, we empirically show that Re-Fed\nachieves competitive performance compared to state-of-the-art methods.\n","authors":["Yichen Li","Qunwei Li","Haozhao Wang","Ruixuan Li","Wenliang Zhong","Guannan Zhang"],"pdf_url":"https://arxiv.org/pdf/2403.05890v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.05443v3","updated":"2024-06-03T08:12:13Z","published":"2024-02-08T06:45:03Z","title":"Scalable Wasserstein Gradient Flow for Generative Modeling through\n Unbalanced Optimal Transport","summary":" Wasserstein Gradient Flow (WGF) describes the gradient dynamics of\nprobability density within the Wasserstein space. WGF provides a promising\napproach for conducting optimization over the probability distributions.\nNumerically approximating the continuous WGF requires the time discretization\nmethod. The most well-known method for this is the JKO scheme. In this regard,\nprevious WGF models employ the JKO scheme and parametrize transport map for\neach JKO step. However, this approach results in quadratic training complexity\n$O(K^2)$ with the number of JKO step $K$. This severely limits the scalability\nof WGF models. In this paper, we introduce a scalable WGF-based generative\nmodel, called Semi-dual JKO (S-JKO). Our model is based on the semi-dual form\nof the JKO step, derived from the equivalence between the JKO step and the\nUnbalanced Optimal Transport. Our approach reduces the training complexity to\n$O(K)$. We demonstrate that our model significantly outperforms existing\nWGF-based generative models, achieving FID scores of 2.62 on CIFAR-10 and 5.46\non CelebA-HQ-256, which are comparable to state-of-the-art image generative\nmodels.\n","authors":["Jaemoo Choi","Jaewoong Choi","Myungjoo Kang"],"pdf_url":"https://arxiv.org/pdf/2402.05443v3.pdf","comment":"22 pages, 11 figures"},{"id":"http://arxiv.org/abs/2205.09622v5","updated":"2024-06-03T08:02:04Z","published":"2022-05-19T15:37:26Z","title":"What Is Fairness? On the Role of Protected Attributes and Fictitious\n Worlds","summary":" A growing body of literature in fairness-aware machine learning (fairML) aims\nto mitigate machine learning (ML)-related unfairness in automated\ndecision-making (ADM) by defining metrics that measure fairness of an ML model\nand by proposing methods to ensure that trained ML models achieve low scores on\nthese metrics. However, the underlying concept of fairness, i.e., the question\nof what fairness is, is rarely discussed, leaving a significant gap between\ncenturies of philosophical discussion and the recent adoption of the concept in\nthe ML community. In this work, we try to bridge this gap by formalizing a\nconsistent concept of fairness and by translating the philosophical\nconsiderations into a formal framework for the training and evaluation of ML\nmodels in ADM systems. We argue that fairness problems can arise even without\nthe presence of protected attributes (PAs), and point out that fairness and\npredictive performance are not irreconcilable opposites, but that the latter is\nnecessary to achieve the former. Furthermore, we argue why and how causal\nconsiderations are necessary when assessing fairness in the presence of PAs by\nproposing a fictitious, normatively desired (FiND) world in which PAs have no\ncausal effects. In practice, this FiND world must be approximated by a warped\nworld in which the causal effects of the PAs are removed from the real-world\ndata. Finally, we achieve greater linguistic clarity in the discussion of\nfairML. We outline algorithms for practical applications and present\nillustrative experiments on COMPAS data.\n","authors":["Ludwig Bothmann","Kristina Peters","Bernd Bischl"],"pdf_url":"https://arxiv.org/pdf/2205.09622v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.06709v2","updated":"2024-06-03T07:52:21Z","published":"2023-05-11T10:34:27Z","title":"NUBO: A Transparent Python Package for Bayesian Optimization","summary":" NUBO, short for Newcastle University Bayesian Optimization, is a Bayesian\noptimization framework for optimizing expensive-to-evaluate black-box\nfunctions, such as physical experiments and computer simulators. Bayesian\noptimization is a cost-efficient optimization strategy that uses surrogate\nmodeling via Gaussian processes to represent an objective function and\nacquisition functions to guide the selection of candidate points to approximate\nthe global optimum of the objective function. NUBO focuses on transparency and\nuser experience to make Bayesian optimization accessible to researchers from\nall disciplines. Clean and understandable code, precise references, and\nthorough documentation ensure transparency, while a modular and flexible\ndesign, easy-to-write syntax, and careful selection of Bayesian optimization\nalgorithms ensure a good user experience. NUBO allows users to tailor Bayesian\noptimization to their problem by writing a custom optimization loop using the\nprovided building blocks. It supports sequential single-point, parallel\nmulti-point, and asynchronous optimization of bounded, constrained, and mixed\n(discrete and continuous) parameter input spaces. Only algorithms and methods\nextensively tested and validated to perform well are included in NUBO. This\nensures that the package remains compact and does not overwhelm the user with\nan unnecessarily large number of options. The package is written in Python but\ndoes not require expert knowledge of Python to optimize simulators and\nexperiments. NUBO is distributed as open-source software under the BSD 3-Clause\nlicense.\n","authors":["Mike Diessner","Kevin J. Wilson","Richard D. Whalley"],"pdf_url":"https://arxiv.org/pdf/2305.06709v2.pdf","comment":"Accepted for publication by the Journal of Statistical Software"},{"id":"http://arxiv.org/abs/2303.01140v2","updated":"2024-06-03T07:51:54Z","published":"2023-03-02T10:39:13Z","title":"Cardinality Estimation over Knowledge Graphs with Embeddings and Graph\n Neural Networks","summary":" Cardinality Estimation over Knowledge Graphs (KG) is crucial for query\noptimization, yet remains a challenging task due to the semi-structured nature\nand complex correlations of typical Knowledge Graphs. In this work, we propose\nGNCE, a novel approach that leverages knowledge graph embeddings and Graph\nNeural Networks (GNN) to accurately predict the cardinality of conjunctive\nqueries. GNCE first creates semantically meaningful embeddings for all entities\nin the KG, which are then integrated into the given query, which is processed\nby a GNN to estimate the cardinality of the query. We evaluate GNCE on several\nKGs in terms of q-Error and demonstrate that it outperforms state-of-the-art\napproaches based on sampling, summaries, and (machine) learning in terms of\nestimation accuracy while also having lower execution time and less parameters.\nAdditionally, we show that GNCE can inductively generalise to unseen entities,\nmaking it suitable for use in dynamic query processing scenarios. Our proposed\napproach has the potential to significantly improve query optimization and\nrelated applications that rely on accurate cardinality estimates of conjunctive\nqueries.\n","authors":["Tim Schwabe","Maribel Acosta"],"pdf_url":"https://arxiv.org/pdf/2303.01140v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04854v3","updated":"2024-06-03T07:48:19Z","published":"2024-02-07T13:54:06Z","title":"Hierarchical Tree-structured Knowledge Graph For Academic Insight Survey","summary":" Research surveys have always posed a challenge for beginner researchers who\nlack of research training. These researchers struggle to understand the\ndirections within their research topic, and the discovery of new research\nfindings within a short time. One way to provide intuitive assistance to\nbeginner researchers is by offering relevant knowledge graphs(KG) and\nrecommending related academic papers. However, existing navigation knowledge\ngraphs primarily rely on keywords in the research field and often fail to\npresent the logical hierarchy among multiple related papers clearly. Moreover,\nmost recommendation systems for academic papers simply rely on high text\nsimilarity, which can leave researchers confused as to why a particular article\nis being recommended. They may lack of grasp important information about the\ninsight connection between \"Issue resolved\" and \"Issue finding\" that they hope\nto obtain. To address these issues, this study aims to support research insight\nsurveys for beginner researchers by establishing a hierarchical tree-structured\nknowledge graph that reflects the inheritance insight of research topics and\nthe relevance insight among the academic papers.\n","authors":["Jinghong Li","Huy Phan","Wen Gu","Koichi Ota","Shinobu Hasegawa"],"pdf_url":"https://arxiv.org/pdf/2402.04854v3.pdf","comment":"This paper will be submitted to 'The 18TH International Conference on\n INnovations in Intelligent SysTems and Applications (INISTA 2024)'"},{"id":"http://arxiv.org/abs/2402.02801v2","updated":"2024-06-03T07:35:25Z","published":"2024-02-05T08:19:56Z","title":"KS-Lottery: Finding Certified Lottery Tickets for Multilingual Language\n Models","summary":" The lottery ticket hypothesis posits the existence of ``winning tickets''\nwithin a randomly initialized neural network. Do winning tickets exist for LLMs\nin fine-tuning scenarios? How can we find such winning tickets? In this paper,\nwe propose KS-Lottery, a method to identify a small subset of LLM parameters\nhighly effective in multilingual fine-tuning. Our key idea is to use\nKolmogorov-Smirnov Test to analyze the distribution shift of parameters before\nand after fine-tuning. We further theoretically prove that KS-Lottery can find\nthe certified winning tickets in the embedding layer, fine-tuning on the found\nparameters is guaranteed to perform as well as full fine-tuning. Comparing\nKS-Lottery with other parameter-efficient tuning algorithms on translation\ntasks, the experimental results show that KS-Lottery finds a much smaller set\nof parameters for fine-tuning while achieving the comparable performance as\nfull fine-tuning LLM. Surprisingly, we find that fine-tuning 18 tokens'\nembedding of LLaMA suffices to reach the fine-tuning translation\nperformance~\\footnote{https://github.com/CONE-MT/KS-Lottery.}.\n","authors":["Fei Yuan","Chang Ma","Shuai Yuan","Qiushi Sun","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2402.02801v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10198v3","updated":"2024-06-03T07:34:37Z","published":"2024-02-15T18:55:05Z","title":"SAMformer: Unlocking the Potential of Transformers in Time Series\n Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention","summary":" Transformer-based architectures achieved breakthrough performance in natural\nlanguage processing and computer vision, yet they remain inferior to simpler\nlinear baselines in multivariate long-term forecasting. To better understand\nthis phenomenon, we start by studying a toy linear forecasting problem for\nwhich we show that transformers are incapable of converging to their true\nsolution despite their high expressive power. We further identify the attention\nof transformers as being responsible for this low generalization capacity.\nBuilding upon this insight, we propose a shallow lightweight transformer model\nthat successfully escapes bad local minima when optimized with sharpness-aware\noptimization. We empirically demonstrate that this result extends to all\ncommonly used real-world multivariate time series datasets. In particular,\nSAMformer surpasses current state-of-the-art methods and is on par with the\nbiggest foundation model MOIRAI while having significantly fewer parameters.\nThe code is available at https://github.com/romilbert/samformer.\n","authors":["Romain Ilbert","Ambroise Odonnat","Vasilii Feofanov","Aladin Virmaux","Giuseppe Paolo","Themis Palpanas","Ievgen Redko"],"pdf_url":"https://arxiv.org/pdf/2402.10198v3.pdf","comment":"Accepted as an Oral at ICML 2024, Vienna. The first two authors\n contributed equally"},{"id":"http://arxiv.org/abs/2405.00946v2","updated":"2024-06-03T07:13:37Z","published":"2024-05-02T02:15:23Z","title":"SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters","summary":" This paper introduces SparseTSF, a novel, extremely lightweight model for\nLong-term Time Series Forecasting (LTSF), designed to address the challenges of\nmodeling complex temporal dependencies over extended horizons with minimal\ncomputational resources. At the heart of SparseTSF lies the Cross-Period Sparse\nForecasting technique, which simplifies the forecasting task by decoupling the\nperiodicity and trend in time series data. This technique involves downsampling\nthe original sequences to focus on cross-period trend prediction, effectively\nextracting periodic features while minimizing the model's complexity and\nparameter count. Based on this technique, the SparseTSF model uses fewer than\n*1k* parameters to achieve competitive or superior performance compared to\nstate-of-the-art models. Furthermore, SparseTSF showcases remarkable\ngeneralization capabilities, making it well-suited for scenarios with limited\ncomputational resources, small samples, or low-quality data. The code is\npublicly available at this repository: https://github.com/lss-1138/SparseTSF.\n","authors":["Shengsheng Lin","Weiwei Lin","Wentai Wu","Haojun Chen","Junjie Yang"],"pdf_url":"https://arxiv.org/pdf/2405.00946v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.00376v2","updated":"2024-06-03T07:09:39Z","published":"2024-03-01T09:01:53Z","title":"Spurious Feature Eraser: Stabilizing Test-Time Adaptation for\n Vision-Language Foundation Model","summary":" Vision-language foundation models have exhibited remarkable success across a\nmultitude of downstream tasks due to their scalability on extensive image-text\npaired data. However, these models also display significant limitations when\napplied to downstream tasks, such as fine-grained image classification, as a\nresult of ``decision shortcuts'' that hinder their generalization capabilities.\nIn this work, we find that the CLIP model possesses a rich set of features,\nencompassing both \\textit{desired invariant causal features} and\n\\textit{undesired decision shortcuts}. Moreover, the underperformance of CLIP\non downstream tasks originates from its inability to effectively utilize\npre-trained features in accordance with specific task requirements. To address\nthis challenge, we propose a simple yet effective method, Spurious Feature\nEraser (SEraser), to alleviate the decision shortcuts by erasing the spurious\nfeatures. Specifically, we introduce a test-time prompt tuning paradigm that\noptimizes a learnable prompt, thereby compelling the model to exploit invariant\nfeatures while disregarding decision shortcuts during the inference phase. The\nproposed method effectively alleviates excessive dependence on potentially\nmisleading spurious information. We conduct comparative analysis of the\nproposed method against various approaches which validates the significant\nsuperiority.\n","authors":["Huan Ma","Yan Zhu","Changqing Zhang","Peilin Zhao","Baoyuan Wu","Long-Kai Huang","Qinghua Hu","Bingzhe Wu"],"pdf_url":"https://arxiv.org/pdf/2403.00376v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.03316v2","updated":"2024-06-03T07:01:54Z","published":"2022-11-07T05:36:30Z","title":"Accented Text-to-Speech Synthesis with a Conditional Variational\n Autoencoder","summary":" Accent plays a significant role in speech communication, influencing one's\ncapability to understand as well as conveying a person's identity. This paper\nintroduces a novel and efficient framework for accented Text-to-Speech (TTS)\nsynthesis based on a Conditional Variational Autoencoder. It has the ability to\nsynthesize a selected speaker's voice, which is converted to any desired target\naccent. Our thorough experiments validate the effectiveness of the proposed\nframework using both objective and subjective evaluations. The results also\nshow remarkable performance in terms of the ability to manipulate accents in\nthe synthesized speech and provide a promising avenue for future accented TTS\nresearch.\n","authors":["Jan Melechovsky","Ambuj Mehrish","Berrak Sisman","Dorien Herremans"],"pdf_url":"https://arxiv.org/pdf/2211.03316v2.pdf","comment":"preprint submitted to a conference, under review"},{"id":"http://arxiv.org/abs/2401.04679v7","updated":"2024-06-03T06:59:31Z","published":"2024-01-09T17:09:01Z","title":"RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation","summary":" We investigate parameter-efficient fine-tuning (PEFT) methods that can\nprovide good accuracy under limited computational and memory budgets in the\ncontext of large language models (LLMs). We present a new PEFT method called\nRobust Adaptation (RoSA) inspired by robust principal component analysis that\njointly trains $\\textit{low-rank}$ and $\\textit{highly-sparse}$ components on\ntop of a set of fixed pretrained weights to efficiently approximate the\nperformance of a full-fine-tuning (FFT) solution. Across a series of\nchallenging generative tasks such as grade-school math and SQL query\ngeneration, which require fine-tuning for good performance, we show that RoSA\noutperforms LoRA, pure sparse fine-tuning, and alternative hybrid methods at\nthe same parameter budget, and can even recover the performance of FFT on some\ntasks. We provide system support for RoSA to complement the training algorithm,\nspecifically in the form of sparse GPU kernels which enable memory- and\ncomputationally-efficient training, and show that it is also compatible with\nlow-precision base weights, resulting in the first joint representation\ncombining quantization, low-rank and sparse approximations. Our code is\navailable at https://github.com/IST-DASLab/RoSA.\n","authors":["Mahdi Nikdan","Soroush Tabesh","Elvir Crnčević","Dan Alistarh"],"pdf_url":"https://arxiv.org/pdf/2401.04679v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.18018v4","updated":"2024-06-03T06:52:58Z","published":"2024-01-31T17:28:24Z","title":"On Prompt-Driven Safeguarding for Large Language Models","summary":" Prepending model inputs with safety prompts is a common practice for\nsafeguarding large language models (LLMs) against queries with harmful intents.\nHowever, the underlying working mechanisms of safety prompts have not been\nunraveled yet, restricting the possibility of automatically optimizing them to\nimprove LLM safety. In this work, we investigate how LLMs' behavior (i.e.,\ncomplying with or refusing user queries) is affected by safety prompts from the\nperspective of model representation. We find that in the representation space,\nthe input queries are typically moved by safety prompts in a \"higher-refusal\"\ndirection, in which models become more prone to refusing to provide assistance,\neven when the queries are harmless. On the other hand, LLMs are naturally\ncapable of distinguishing harmful and harmless queries without safety prompts.\nInspired by these findings, we propose a method for safety prompt optimization,\nnamely DRO (Directed Representation Optimization). Treating a safety prompt as\ncontinuous, trainable embeddings, DRO learns to move the queries'\nrepresentations along or opposite the refusal direction, depending on their\nharmfulness. Experiments with eight LLMs on out-of-domain and jailbreak\nbenchmarks demonstrate that DRO remarkably improves the safeguarding\nperformance of human-crafted safety prompts, without compromising the models'\ngeneral performance.\n","authors":["Chujie Zheng","Fan Yin","Hao Zhou","Fandong Meng","Jie Zhou","Kai-Wei Chang","Minlie Huang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2401.18018v4.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2401.04514v2","updated":"2024-06-03T06:50:26Z","published":"2024-01-09T12:12:50Z","title":"Rewriting the Code: A Simple Method for Large Language Model Augmented\n Code Search","summary":" In code search, the Generation-Augmented Retrieval (GAR) framework, which\ngenerates exemplar code snippets to augment queries, has emerged as a promising\nstrategy to address the principal challenge of modality misalignment between\ncode snippets and natural language queries, particularly with the demonstrated\ncode generation capabilities of Large Language Models (LLMs). Nevertheless, our\npreliminary investigations indicate that the improvements conferred by such an\nLLM-augmented framework are somewhat constrained. This limitation could\npotentially be ascribed to the fact that the generated codes, albeit\nfunctionally accurate, frequently display a pronounced stylistic deviation from\nthe ground truth code in the codebase. In this paper, we extend the\nfoundational GAR framework and propose a simple yet effective method that\nadditionally Rewrites the Code (ReCo) within the codebase for style\nnormalization. Experimental results demonstrate that ReCo significantly boosts\nretrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%),\nand fine-tuned dense (up to 23.6%) retrieval settings in diverse search\nscenarios. To further elucidate the advantages of ReCo and stimulate research\nin code style normalization, we introduce Code Style Similarity, the first\nmetric tailored to quantify stylistic similarities in code. Notably, our\nempirical findings reveal the inadequacy of existing metrics in capturing\nstylistic nuances. The source code and data are available at\n\\url{https://github.com/Alex-HaochenLi/ReCo}.\n","authors":["Haochen Li","Xin Zhou","Zhiqi Shen"],"pdf_url":"https://arxiv.org/pdf/2401.04514v2.pdf","comment":"Accepted to ACL 2024"},{"id":"http://arxiv.org/abs/2312.12558v3","updated":"2024-06-03T06:17:33Z","published":"2023-12-19T19:53:58Z","title":"Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge","summary":" The problem of sample complexity of online reinforcement learning is often\nstudied in the literature without taking into account any partial knowledge\nabout the system dynamics that could potentially accelerate the learning\nprocess. In this paper, we study the sample complexity of online Q-learning\nmethods when some prior knowledge about the dynamics is available or can be\nlearned efficiently. We focus on systems that evolve according to an additive\ndisturbance model of the form $S_{h+1} = f(S_h, A_h) + W_h$, where $f$\nrepresents the underlying system dynamics, and $W_h$ are unknown disturbances\nindependent of states and actions. In the setting of finite episodic Markov\ndecision processes with $S$ states, $A$ actions, and episode length $H$, we\npresent an optimistic Q-learning algorithm that achieves\n$\\tilde{\\mathcal{O}}(\\text{Poly}(H)\\sqrt{T})$ regret under perfect knowledge of\n$f$, where $T$ is the total number of interactions with the system. This is in\ncontrast to the typical $\\tilde{\\mathcal{O}}(\\text{Poly}(H)\\sqrt{SAT})$ regret\nfor existing Q-learning methods. Further, if only a noisy estimate $\\hat{f}$ of\n$f$ is available, our method can learn an approximately optimal policy in a\nnumber of samples that is independent of the cardinalities of state and action\nspaces. The sub-optimality gap depends on the approximation error $\\hat{f}-f$,\nas well as the Lipschitz constant of the corresponding optimal value function.\nOur approach does not require modeling of the transition probabilities and\nenjoys the same memory complexity as model-free methods.\n","authors":["Meshal Alharbi","Mardavij Roozbehani","Munther Dahleh"],"pdf_url":"https://arxiv.org/pdf/2312.12558v3.pdf","comment":"Published in the 38th Annual AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2402.00138v2","updated":"2024-06-03T06:05:29Z","published":"2024-01-31T19:32:33Z","title":"Decomposable Submodular Maximization in Federated Setting","summary":" Submodular functions, as well as the sub-class of decomposable submodular\nfunctions, and their optimization appear in a wide range of applications in\nmachine learning, recommendation systems, and welfare maximization. However,\noptimization of decomposable submodular functions with millions of component\nfunctions is computationally prohibitive. Furthermore, the component functions\nmay be private (they might represent user preference function, for example) and\ncannot be widely shared. To address these issues, we propose a {\\em federated\noptimization} setting for decomposable submodular optimization. In this\nsetting, clients have their own preference functions, and a weighted sum of\nthese preferences needs to be maximized. We implement the popular {\\em\ncontinuous greedy} algorithm in this setting where clients take parallel small\nlocal steps towards the local solution and then the local changes are\naggregated at a central server. To address the large number of clients, the\naggregation is performed only on a subsampled set. Further, the aggregation is\nperformed only intermittently between stretches of parallel local steps, which\nreduces communication cost significantly. We show that our federated algorithm\nis guaranteed to provide a good approximate solution, even in the presence of\nabove cost-cutting measures. Finally, we show how the federated setting can be\nincorporated in solving fundamental discrete submodular optimization problems\nsuch as Maximum Coverage and Facility Location.\n","authors":["Akbar Rafiey"],"pdf_url":"https://arxiv.org/pdf/2402.00138v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.15043v2","updated":"2024-06-03T06:02:39Z","published":"2024-02-23T01:30:39Z","title":"KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large\n Language Models","summary":" Automatic evaluation methods for large language models (LLMs) are hindered by\ndata contamination, leading to inflated assessments of their effectiveness.\nExisting strategies, which aim to detect contaminated texts, focus on\nquantifying contamination status instead of accurately gauging model\nperformance. In this paper, we introduce KIEval, a Knowledge-grounded\nInteractive Evaluation framework, which incorporates an LLM-powered\n\"interactor\" role for the first time to accomplish a dynamic\ncontamination-resilient evaluation. Starting with a question in a conventional\nLLM benchmark involving domain-specific knowledge, KIEval utilizes dynamically\ngenerated, multi-round, and knowledge-focused dialogues to determine whether a\nmodel's response is merely a recall of benchmark answers or demonstrates a deep\ncomprehension to apply knowledge in more complex conversations. Extensive\nexperiments on seven leading LLMs across five datasets validate KIEval's\neffectiveness and generalization. We also reveal that data contamination brings\nno contribution or even negative effect to models' real-world applicability and\nunderstanding, and existing contamination detection methods for LLMs can only\nidentify contamination in pre-training but not during supervised fine-tuning.\n","authors":["Zhuohao Yu","Chang Gao","Wenjin Yao","Yidong Wang","Wei Ye","Jindong Wang","Xing Xie","Yue Zhang","Shikun Zhang"],"pdf_url":"https://arxiv.org/pdf/2402.15043v2.pdf","comment":"Accepted to ACL 2024 (main conference); 19 pages, 5 figures, 19\n tables, code is available at: https://github.com/zhuohaoyu/KIEval"},{"id":"http://arxiv.org/abs/2402.11078v3","updated":"2024-06-03T05:39:10Z","published":"2024-02-16T21:10:33Z","title":"Model Editing by Standard Fine-Tuning","summary":" Standard fine-tuning is considered not as effective as specialized methods\nfor model editing due to its comparatively poor performance. However, it is\nsimple, agnostic to the architectural details of the model being edited, and\nable to leverage advances in standard training techniques with no additional\nwork (e.g., black-box PEFT for computational efficiency), making it an\nappealing choice for a model editor. In this work, we show that standard\nfine-tuning alone can yield competitive model editing performance with two\nminor modifications. First, we optimize the conditional likelihood rather than\nthe full likelihood. Second, in addition to the typical practice of training on\nrandomly paraphrased edit prompts to encourage generalization, we also train on\nrandom or similar unedited facts to encourage locality. Our experiments on the\nZsRE and CounterFact datasets demonstrate that these simple modifications allow\nstandard fine-tuning to match or outperform highly specialized editors in terms\nof edit score.\n","authors":["Govind Gangadhar","Karl Stratos"],"pdf_url":"https://arxiv.org/pdf/2402.11078v3.pdf","comment":"Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2405.11930v2","updated":"2024-06-03T05:21:54Z","published":"2024-05-20T10:12:23Z","title":"Data Contamination Calibration for Black-box LLMs","summary":" The rapid advancements of Large Language Models (LLMs) tightly associate with\nthe expansion of the training data size. However, the unchecked\nultra-large-scale training sets introduce a series of potential risks like data\ncontamination, i.e. the benchmark data is used for training. In this work, we\npropose a holistic method named Polarized Augment Calibration (PAC) along with\na new to-be-released dataset to detect the contaminated data and diminish the\ncontamination effect. PAC extends the popular MIA (Membership Inference Attack)\n-- from machine learning community -- by forming a more global target at\ndetecting training data to Clarify invisible training data. As a pioneering\nwork, PAC is very much plug-and-play that can be integrated with most (if not\nall) current white- and black-box LLMs. By extensive experiments, PAC\noutperforms existing methods by at least 4.5%, towards data contamination\ndetection on more 4 dataset formats, with more than 10 base LLMs. Besides, our\napplication in real-world scenarios highlights the prominent presence of\ncontamination and related issues.\n","authors":["Wentao Ye","Jiaqi Hu","Liyao Li","Haobo Wang","Gang Chen","Junbo Zhao"],"pdf_url":"https://arxiv.org/pdf/2405.11930v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.01773v2","updated":"2024-06-03T05:05:24Z","published":"2024-03-04T07:03:10Z","title":"Improving out-of-distribution generalization in graphs via hierarchical\n semantic environments","summary":" Out-of-distribution (OOD) generalization in the graph domain is challenging\ndue to complex distribution shifts and a lack of environmental contexts. Recent\nmethods attempt to enhance graph OOD generalization by generating flat\nenvironments. However, such flat environments come with inherent limitations to\ncapture more complex data distributions. Considering the DrugOOD dataset, which\ncontains diverse training environments (e.g., scaffold, size, etc.), flat\ncontexts cannot sufficiently address its high heterogeneity. Thus, a new\nchallenge is posed to generate more semantically enriched environments to\nenhance graph invariant learning for handling distribution shifts. In this\npaper, we propose a novel approach to generate hierarchical semantic\nenvironments for each graph. Firstly, given an input graph, we explicitly\nextract variant subgraphs from the input graph to generate proxy predictions on\nlocal environments. Then, stochastic attention mechanisms are employed to\nre-extract the subgraphs for regenerating global environments in a hierarchical\nmanner. In addition, we introduce a new learning objective that guides our\nmodel to learn the diversity of environments within the same hierarchy while\nmaintaining consistency across different hierarchies. This approach enables our\nmodel to consider the relationships between environments and facilitates robust\ngraph invariant learning. Extensive experiments on real-world graph data have\ndemonstrated the effectiveness of our framework. Particularly, in the\nchallenging dataset DrugOOD, our method achieves up to 1.29% and 2.83%\nimprovement over the best baselines on IC50 and EC50 prediction tasks,\nrespectively.\n","authors":["Yinhua Piao","Sangseon Lee","Yijingxiu Lu","Sun Kim"],"pdf_url":"https://arxiv.org/pdf/2403.01773v2.pdf","comment":"CVPR 2024"},{"id":"http://arxiv.org/abs/2310.08540v5","updated":"2024-06-03T04:18:11Z","published":"2023-10-12T17:32:09Z","title":"Do pretrained Transformers Learn In-Context by Gradient Descent?","summary":" The emergence of In-Context Learning (ICL) in LLMs remains a remarkable\nphenomenon that is partially understood. To explain ICL, recent studies have\ncreated theoretical connections to Gradient Descent (GD). We ask, do such\nconnections hold up in actual pre-trained language models? We highlight the\nlimiting assumptions in prior works that make their setup considerably\ndifferent from the practical setup in which language models are trained. For\nexample, their experimental verification uses \\emph{ICL objective} (training\nmodels explicitly for ICL), which differs from the emergent ICL in the wild.\nFurthermore, the theoretical hand-constructed weights used in these studies\nhave properties that don't match those of real LLMs. We also look for evidence\nin real models. We observe that ICL and GD have different sensitivity to the\norder in which they observe demonstrations. Finally, we probe and compare the\nICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical\nanalyses on language models pre-trained on natural data (LLaMa-7B). Our\ncomparisons of three performance metrics highlight the inconsistent behavior of\nICL and GD as a function of various factors such as datasets, models, and the\nnumber of demonstrations. We observe that ICL and GD modify the output\ndistribution of language models differently. These results indicate that\n\\emph{the equivalence between ICL and GD remains an open hypothesis} and calls\nfor further studies.\n","authors":["Lingfeng Shen","Aayush Mishra","Daniel Khashabi"],"pdf_url":"https://arxiv.org/pdf/2310.08540v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07586v5","updated":"2024-06-03T04:17:49Z","published":"2023-12-11T02:40:40Z","title":"Characteristic Guidance: Non-linear Correction for Diffusion Model at\n Large Guidance Scale","summary":" Popular guidance for denoising diffusion probabilistic model (DDPM) linearly\ncombines distinct conditional models together to provide enhanced control over\nsamples. However, this approach overlooks nonlinear effects that become\nsignificant when guidance scale is large. To address this issue, we propose\ncharacteristic guidance, a guidance method that provides first-principle\nnon-linear correction for classifier-free guidance. Such correction forces the\nguided DDPMs to respect the Fokker-Planck (FP) equation of diffusion process,\nin a way that is training-free and compatible with existing sampling methods.\nExperiments show that characteristic guidance enhances semantic characteristics\nof prompts and mitigate irregularities in image generation, proving effective\nin diverse applications ranging from simulating magnet phase transitions to\nlatent space sampling.\n","authors":["Candi Zheng","Yuan Lan"],"pdf_url":"https://arxiv.org/pdf/2312.07586v5.pdf","comment":"8 pages, 7 figures"},{"id":"http://arxiv.org/abs/2405.18395v2","updated":"2024-06-03T03:53:16Z","published":"2024-05-28T17:35:05Z","title":"MC-GTA: Metric-Constrained Model-Based Clustering using Goodness-of-fit\n Tests with Autocorrelations","summary":" A wide range of (multivariate) temporal (1D) and spatial (2D) data analysis\ntasks, such as grouping vehicle sensor trajectories, can be formulated as\nclustering with given metric constraints. Existing metric-constrained\nclustering algorithms overlook the rich correlation between feature similarity\nand metric distance, i.e., metric autocorrelation. The model-based variations\nof these clustering algorithms (e.g. TICC and STICC) achieve SOTA performance,\nyet suffer from computational instability and complexity by using a\nmetric-constrained Expectation-Maximization procedure. In order to address\nthese two problems, we propose a novel clustering algorithm, MC-GTA\n(Model-based Clustering via Goodness-of-fit Tests with Autocorrelations). Its\nobjective is only composed of pairwise weighted sums of feature similarity\nterms (square Wasserstein-2 distance) and metric autocorrelation terms (a novel\nmultivariate generalization of classic semivariogram). We show that MC-GTA is\neffectively minimizing the total hinge loss for intra-cluster observation pairs\nnot passing goodness-of-fit tests, i.e., statistically not originating from the\nsame distribution. Experiments on 1D/2D synthetic and real-world datasets\ndemonstrate that MC-GTA successfully incorporates metric autocorrelation. It\noutperforms strong baselines by large margins (up to 14.3% in ARI and 32.1% in\nNMI) with faster and stabler optimization (>10x speedup).\n","authors":["Zhangyu Wang","Gengchen Mai","Krzysztof Janowicz","Ni Lao"],"pdf_url":"https://arxiv.org/pdf/2405.18395v2.pdf","comment":"ICML-2024 Proceedings"},{"id":"http://arxiv.org/abs/2312.11973v4","updated":"2024-06-03T03:51:38Z","published":"2023-12-19T09:11:49Z","title":"Continual Learning: Forget-free Winning Subnetworks for Video\n Representations","summary":" Inspired by the Lottery Ticket Hypothesis (LTH), which highlights the\nexistence of efficient subnetworks within larger, dense networks, a\nhigh-performing Winning Subnetwork (WSN) in terms of task performance under\nappropriate sparsity conditions is considered for various continual learning\ntasks. It leverages pre-existing weights from dense networks to achieve\nefficient learning in Task Incremental Learning (TIL) and Task-agnostic\nIncremental Learning (TaIL) scenarios. In Few-Shot Class Incremental Learning\n(FSCIL), a variation of WSN referred to as the Soft subnetwork (SoftNet) is\ndesigned to prevent overfitting when the data samples are scarce. Furthermore,\nthe sparse reuse of WSN weights is considered for Video Incremental Learning\n(VIL). The use of Fourier Subneural Operator (FSO) within WSN is considered. It\nenables compact encoding of videos and identifies reusable subnetworks across\nvarying bandwidths. We have integrated FSO into different architectural\nframeworks for continual learning, including VIL, TIL, and FSCIL. Our\ncomprehensive experiments demonstrate FSO's effectiveness, significantly\nimproving task performance at various convolutional representational levels.\nSpecifically, FSO enhances higher-layer performance in TIL and FSCIL and\nlower-layer performance in VIL.\n","authors":["Haeyong Kang","Jaehong Yoon","Sung Ju Hwang","Chang D. Yoo"],"pdf_url":"https://arxiv.org/pdf/2312.11973v4.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2303.14962,\n arXiv:2306.11305"},{"id":"http://arxiv.org/abs/2310.07160v3","updated":"2024-06-03T03:35:01Z","published":"2023-10-11T03:12:47Z","title":"LLark: A Multimodal Instruction-Following Language Model for Music","summary":" Music has a unique and complex structure which is challenging for both expert\nhumans and existing AI systems to understand, and presents unique challenges\nrelative to other forms of audio. We present LLark, an instruction-tuned\nmultimodal model for \\emph{music} understanding. We detail our process for\ndataset creation, which involves augmenting the annotations of diverse\nopen-source music datasets and converting them to a unified instruction-tuning\nformat. We propose a multimodal architecture for LLark, integrating a\npretrained generative model for music with a pretrained language model. In\nevaluations on three types of tasks (music understanding, captioning,\nreasoning), we show that LLark matches or outperforms existing baselines in\nmusic understanding, and that humans show a high degree of agreement with its\nresponses in captioning and reasoning tasks. LLark is trained entirely from\nopen-source music data and models, and we make our training code available\nalong with the release of this paper. Additional results and audio examples are\nat https://bit.ly/llark, and our source code is available at\nhttps://github.com/spotify-research/llark .\n","authors":["Josh Gardner","Simon Durand","Daniel Stoller","Rachel M. Bittner"],"pdf_url":"https://arxiv.org/pdf/2310.07160v3.pdf","comment":"ICML camera-ready version"},{"id":"http://arxiv.org/abs/2405.07374v2","updated":"2024-06-03T03:32:56Z","published":"2024-05-12T20:27:34Z","title":"Conformalized Survival Distributions: A Generic Post-Process to Increase\n Calibration","summary":" Discrimination and calibration represent two important properties of survival\nanalysis, with the former assessing the model's ability to accurately rank\nsubjects and the latter evaluating the alignment of predicted outcomes with\nactual events. With their distinct nature, it is hard for survival models to\nsimultaneously optimize both of them especially as many previous results found\nimproving calibration tends to diminish discrimination performance. This paper\nintroduces a novel approach utilizing conformal regression that can improve a\nmodel's calibration without degrading discrimination. We provide theoretical\nguarantees for the above claim, and rigorously validate the efficiency of our\napproach across 11 real-world datasets, showcasing its practical applicability\nand robustness in diverse scenarios.\n","authors":["Shi-ang Qi","Yakun Yu","Russell Greiner"],"pdf_url":"https://arxiv.org/pdf/2405.07374v2.pdf","comment":"Accepted to ICML 2024; 37 pages, 19 figures"},{"id":"http://arxiv.org/abs/2402.14859v2","updated":"2024-06-03T03:29:07Z","published":"2024-02-20T23:08:21Z","title":"The Wolf Within: Covert Injection of Malice into MLLM Societies via an\n MLLM Operative","summary":" Due to their unprecedented ability to process and respond to various types of\ndata, Multimodal Large Language Models (MLLMs) are constantly defining the new\nboundary of Artificial General Intelligence (AGI). As these advanced generative\nmodels increasingly form collaborative networks for complex tasks, the\nintegrity and security of these systems are crucial. Our paper, ``The Wolf\nWithin'', explores a novel vulnerability in MLLM societies - the indirect\npropagation of malicious content. Unlike direct harmful output generation for\nMLLMs, our research demonstrates how a single MLLM agent can be subtly\ninfluenced to generate prompts that, in turn, induce other MLLM agents in the\nsociety to output malicious content. Our findings reveal that, an MLLM agent,\nwhen manipulated to produce specific prompts or instructions, can effectively\n``infect'' other agents within a society of MLLMs. This infection leads to the\ngeneration and circulation of harmful outputs, such as dangerous instructions\nor misinformation, across the society. We also show the transferability of\nthese indirectly generated prompts, highlighting their possibility in\npropagating malice through inter-agent communication. This research provides a\ncritical insight into a new dimension of threat posed by MLLMs, where a single\nagent can act as a catalyst for widespread malevolent influence. Our work\nunderscores the urgent need for developing robust mechanisms to detect and\nmitigate such covert manipulations within MLLM societies, ensuring their safe\nand ethical utilization in societal applications.\n","authors":["Zhen Tan","Chengshuai Zhao","Raha Moraffah","Yifan Li","Yu Kong","Tianlong Chen","Huan Liu"],"pdf_url":"https://arxiv.org/pdf/2402.14859v2.pdf","comment":"Accepted to workshop on ReGenAI@CVPR 2024"},{"id":"http://arxiv.org/abs/2405.12489v2","updated":"2024-06-03T03:26:59Z","published":"2024-05-21T04:18:57Z","title":"Exploring and Exploiting the Asymmetric Valley of Deep Neural Networks","summary":" Exploring the loss landscape offers insights into the inherent principles of\ndeep neural networks (DNNs). Recent work suggests an additional asymmetry of\nthe valley beyond the flat and sharp ones, yet without thoroughly examining its\ncauses or implications. Our study methodically explores the factors affecting\nthe symmetry of DNN valleys, encompassing (1) the dataset, network\narchitecture, initialization, and hyperparameters that influence the\nconvergence point; and (2) the magnitude and direction of the noise for 1D\nvisualization. Our major observation shows that the {\\it degree of sign\nconsistency} between the noise and the convergence point is a critical\nindicator of valley symmetry. Theoretical insights from the aspects of ReLU\nactivation and softmax function could explain the interesting phenomenon. Our\ndiscovery propels novel understanding and applications in the scenario of Model\nFusion: (1) the efficacy of interpolating separate models significantly\ncorrelates with their sign consistency ratio, and (2) imposing sign alignment\nduring federated learning emerges as an innovative approach for model parameter\nalignment.\n","authors":["Xin-Chun Li","Jin-Lin Tang","Bo Zhang","Lan Li","De-Chuan Zhan"],"pdf_url":"https://arxiv.org/pdf/2405.12489v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.13536v2","updated":"2024-06-03T03:13:35Z","published":"2023-09-24T03:19:40Z","title":"Tackling the Unlimited Staleness in Federated Learning with Intertwined\n Data and Device Heterogeneities","summary":" The efficiency of Federated Learning (FL) is often affected by both data and\ndevice heterogeneities. Data heterogeneity is defined as the heterogeneity of\ndata distributions on different clients. Device heterogeneity is defined as the\nclients' variant latencies in uploading their local model updates due to\nheterogeneous conditions of local hardware resources, and causes the problem of\nstaleness when being addressed by asynchronous FL. Traditional schemes of\ntackling the impact of staleness consider data and device heterogeneities as\ntwo separate and independent aspects in FL, but this assumption is unrealistic\nin many practical FL scenarios where data and device heterogeneities are\nintertwined. In these cases, traditional schemes of weighted aggregation in FL\nhave been proved to be ineffective, and a better approach is to convert a stale\nmodel update into a non-stale one. In this paper, we present a new FL framework\nthat leverages the gradient inversion technique for such conversion, hence\nefficiently tackling unlimited staleness in clients' model updates. Our basic\nidea is to use gradient inversion to get estimations of clients' local training\ndata from their uploaded stale model updates, and use these estimations to\ncompute non-stale client model updates. In this way, we address the problem of\npossible data quality drop when using gradient inversion, while still\npreserving the clients' local data privacy. We compared our approach with the\nexisting FL strategies on mainstream datasets and models, and experiment\nresults demonstrate that when tackling unlimited staleness, our approach can\nsignificantly improve the trained model accuracy by up to 20% and speed up the\nFL training progress by up to 35%.\n","authors":["Haoming Wang","Wei Gao"],"pdf_url":"https://arxiv.org/pdf/2309.13536v2.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2405.20630v2","updated":"2024-06-03T03:11:45Z","published":"2024-05-31T05:42:47Z","title":"Stochastic Optimal Control for Diffusion Bridges in Function Spaces","summary":" Recent advancements in diffusion models and diffusion bridges primarily focus\non finite-dimensional spaces, yet many real-world problems necessitate\noperations in infinite-dimensional function spaces for more natural and\ninterpretable formulations. In this paper, we present a theory of stochastic\noptimal control (SOC) tailored to infinite-dimensional spaces, aiming to extend\ndiffusion-based algorithms to function spaces. Specifically, we demonstrate how\nDoob's $h$-transform, the fundamental tool for constructing diffusion bridges,\ncan be derived from the SOC perspective and expanded to infinite dimensions.\nThis expansion presents a challenge, as infinite-dimensional spaces typically\nlack closed-form densities. Leveraging our theory, we establish that solving\nthe optimal control problem with a specific objective function choice is\nequivalent to learning diffusion-based generative models. We propose two\napplications: (1) learning bridges between two infinite-dimensional\ndistributions and (2) generative models for sampling from an\ninfinite-dimensional distribution. Our approach proves effective for diverse\nproblems involving continuous function space representations, such as\nresolution-free images, time-series data, and probability density functions.\n","authors":["Byoungwoo Park","Jungwon Choi","Sungbin Lim","Juho Lee"],"pdf_url":"https://arxiv.org/pdf/2405.20630v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15524v3","updated":"2024-06-03T03:02:54Z","published":"2023-10-24T05:07:31Z","title":"On the Inherent Privacy Properties of Discrete Denoising Diffusion\n Models","summary":" Privacy concerns have led to a surge in the creation of synthetic datasets,\nwith diffusion models emerging as a promising avenue. Although prior studies\nhave performed empirical evaluations on these models, there has been a gap in\nproviding a mathematical characterization of their privacy-preserving\ncapabilities. To address this, we present the pioneering theoretical\nexploration of the privacy preservation inherent in discrete diffusion models\n(DDMs) for discrete dataset generation. Focusing on per-instance differential\nprivacy (pDP), our framework elucidates the potential privacy leakage for each\ndata point in a given training dataset, offering insights into how the privacy\nloss of each point correlates with the dataset's distribution. Our bounds also\nshow that training with $s$-sized data points leads to a surge in privacy\nleakage from $(\\epsilon, O(\\frac{1}{s^2\\epsilon}))$-pDP to $(\\epsilon,\nO(\\frac{1}{s\\epsilon}))$-pDP of the DDM during the transition from the pure\nnoise to the synthetic clean data phase, and a faster decay in diffusion\ncoefficients amplifies the privacy guarantee. Finally, we empirically verify\nour theoretical findings on both synthetic and real-world datasets.\n","authors":["Rongzhe Wei","Eleonora Kreačić","Haoyu Wang","Haoteng Yin","Eli Chien","Vamsi K. Potluru","Pan Li"],"pdf_url":"https://arxiv.org/pdf/2310.15524v3.pdf","comment":"58 pages"},{"id":"http://arxiv.org/abs/2404.09005v4","updated":"2024-06-03T02:51:46Z","published":"2024-04-13T13:18:40Z","title":"Proof-of-Learning with Incentive Security","summary":" Most concurrent blockchain systems rely heavily on the Proof-of-Work (PoW) or\nProof-of-Stake (PoS) mechanisms for decentralized consensus and security\nassurance. However, the substantial energy expenditure stemming from\ncomputationally intensive yet meaningless tasks has raised considerable\nconcerns surrounding traditional PoW approaches, The PoS mechanism, while free\nof energy consumption, is subject to security and economic issues. Addressing\nthese issues, the paradigm of Proof-of-Useful-Work (PoUW) seeks to employ\nchallenges of practical significance as PoW, thereby imbuing energy consumption\nwith tangible value. While previous efforts in Proof of Learning (PoL) explored\nthe utilization of deep learning model training SGD tasks as PoUW challenges,\nrecent research has revealed its vulnerabilities to adversarial attacks and the\ntheoretical hardness in crafting a byzantine-secure PoL mechanism. In this\npaper, we introduce the concept of incentive-security that incentivizes\nrational provers to behave honestly for their best interest, bypassing the\nexisting hardness to design a PoL mechanism with computational efficiency, a\nprovable incentive-security guarantee and controllable difficulty.\nParticularly, our work is secure against two attacks to the recent work of Jia\net al. [2021], and also improves the computational overhead from $\\Theta(1)$ to\n$O(\\frac{\\log E}{E})$. Furthermore, while most recent research assumes trusted\nproblem providers and verifiers, our design also guarantees frontend\nincentive-security even when problem providers are untrusted, and verifier\nincentive-security that bypasses the Verifier's Dilemma. By incorporating ML\ntraining into blockchain consensus mechanisms with provable guarantees, our\nresearch not only proposes an eco-friendly solution to blockchain systems, but\nalso provides a proposal for a completely decentralized computing power market\nin the new AI age.\n","authors":["Zishuo Zhao","Zhixuan Fang","Xuechao Wang","Xi Chen","Yuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2404.09005v4.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2405.03664v2","updated":"2024-06-03T02:50:35Z","published":"2024-05-06T17:41:13Z","title":"A New Robust Partial $p$-Wasserstein-Based Metric for Comparing\n Distributions","summary":" The $2$-Wasserstein distance is sensitive to minor geometric differences\nbetween distributions, making it a very powerful dissimilarity metric. However,\ndue to this sensitivity, a small outlier mass can also cause a significant\nincrease in the $2$-Wasserstein distance between two similar distributions.\nSimilarly, sampling discrepancy can cause the empirical $2$-Wasserstein\ndistance on $n$ samples in $\\mathbb{R}^2$ to converge to the true distance at a\nrate of $n^{-1/4}$, which is significantly slower than the rate of $n^{-1/2}$\nfor $1$-Wasserstein distance. We introduce a new family of distances\nparameterized by $k \\ge 0$, called $k$-RPW that is based on computing the\npartial $2$-Wasserstein distance. We show that (1) $k$-RPW satisfies the metric\nproperties, (2) $k$-RPW is robust to small outlier mass while retaining the\nsensitivity of $2$-Wasserstein distance to minor geometric differences, and (3)\nwhen $k$ is a constant, $k$-RPW distance between empirical distributions on $n$\nsamples in $\\mathbb{R}^2$ converges to the true distance at a rate of\n$n^{-1/3}$, which is faster than the convergence rate of $n^{-1/4}$ for the\n$2$-Wasserstein distance. Using the partial $p$-Wasserstein distance, we extend\nour distance to any $p \\in [1,\\infty]$. By setting parameters $k$ or $p$\nappropriately, we can reduce our distance to the total variation,\n$p$-Wasserstein, and the L\\'evy-Prokhorov distances. Experiments show that our\ndistance function achieves higher accuracy in comparison to the\n$1$-Wasserstein, $2$-Wasserstein, and TV distances for image retrieval tasks on\nnoisy real-world data sets.\n","authors":["Sharath Raghvendra","Pouyan Shirzadian","Kaiyi Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.03664v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.14285v3","updated":"2024-06-03T02:47:27Z","published":"2024-02-22T04:55:58Z","title":"Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion","summary":" We study the problem of symbolic music generation (e.g., generating piano\nrolls), with a technical focus on non-differentiable rule guidance. Musical\nrules are often expressed in symbolic form on note characteristics, such as\nnote density or chord progression, many of which are non-differentiable which\npose a challenge when using them for guided diffusion. We propose \\oursfull\n(\\ours), a novel guidance method that only requires forward evaluation of rule\nfunctions that can work with pre-trained diffusion models in a plug-and-play\nway, thus achieving training-free guidance for non-differentiable rules for the\nfirst time. Additionally, we introduce a latent diffusion architecture for\nsymbolic music generation with high time resolution, which can be composed with\nSCG in a plug-and-play fashion. Compared to standard strong baselines in\nsymbolic music generation, this framework demonstrates marked advancements in\nmusic quality and rule-based controllability, outperforming current\nstate-of-the-art generators in a variety of settings. For detailed\ndemonstrations, code and model checkpoints, please visit our project website:\nhttps://scg-rule-guided-music.github.io/.\n","authors":["Yujia Huang","Adishree Ghatare","Yuanzhe Liu","Ziniu Hu","Qinsheng Zhang","Chandramouli S Sastry","Siddharth Gururani","Sageev Oore","Yisong Yue"],"pdf_url":"https://arxiv.org/pdf/2402.14285v3.pdf","comment":"ICML 2024 (Oral)"},{"id":"http://arxiv.org/abs/2405.17233v2","updated":"2024-06-03T02:46:53Z","published":"2024-05-27T14:49:39Z","title":"CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs","summary":" Parameter quantization for Large Language Models (LLMs) has attracted\nincreasing attentions recently in reducing memory costs and improving\ncomputational efficiency. Early approaches have been widely adopted. However,\nthe existing methods suffer from poor performance in low-bit (such as 2 to 3\nbits) scenarios. In this paper, we present a novel and effective Column-Level\nAdaptive weight Quantization (CLAQ) framework by introducing three different\ntypes of adaptive strategies for LLM quantization. Firstly, a K-Means\nclustering based algorithm is proposed that allows dynamic generation of\nquantization centroids for each column of a parameter matrix. Secondly, we\ndesign an outlier-guided adaptive precision search strategy which can\ndynamically assign varying bit-widths to different columns. Finally, a dynamic\noutlier reservation scheme is developed to retain some parameters in their\noriginal float point precision, in trade off of boosted model performance.\nExperiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2\nand Yi demonstrate that our methods achieve the state-of-the-art results across\ndifferent bit settings, especially in extremely low-bit scenarios. Code is\navailable at https://github.com/fayuge/CLAQ.\n","authors":["Haoyu Wang","Bei Liu","Hang Shao","Bo Xiao","Ke Zeng","Guanglu Wan","Yanmin Qian"],"pdf_url":"https://arxiv.org/pdf/2405.17233v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.08147v3","updated":"2024-06-03T02:43:24Z","published":"2024-03-13T00:19:06Z","title":"Representing Molecules as Random Walks Over Interpretable Grammars","summary":" Recent research in molecular discovery has primarily been devoted to small,\ndrug-like molecules, leaving many similarly important applications in material\ndesign without adequate technology. These applications often rely on more\ncomplex molecular structures with fewer examples that are carefully designed\nusing known substructures. We propose a data-efficient and interpretable model\nfor representing and reasoning over such molecules in terms of graph grammars\nthat explicitly describe the hierarchical design space featuring motifs to be\nthe design basis. We present a novel representation in the form of random walks\nover the design space, which facilitates both molecule generation and property\nprediction. We demonstrate clear advantages over existing methods in terms of\nperformance, efficiency, and synthesizability of predicted molecules, and we\nprovide detailed insights into the method's chemical interpretability.\n","authors":["Michael Sun","Minghao Guo","Weize Yuan","Veronika Thost","Crystal Elaine Owens","Aristotle Franklin Grosz","Sharvaa Selvan","Katelyn Zhou","Hassan Mohiuddin","Benjamin J Pedretti","Zachary P Smith","Jie Chen","Wojciech Matusik"],"pdf_url":"https://arxiv.org/pdf/2403.08147v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.16324v2","updated":"2024-06-03T02:37:28Z","published":"2024-02-26T06:08:25Z","title":"Achieving $\\tilde{O}(1/ε)$ Sample Complexity for Constrained\n Markov Decision Process","summary":" We consider the reinforcement learning problem for the constrained Markov\ndecision process (CMDP), which plays a central role in satisfying safety or\nresource constraints in sequential learning and decision-making. In this\nproblem, we are given finite resources and a MDP with unknown transition\nprobabilities. At each stage, we take an action, collecting a reward and\nconsuming some resources, all assumed to be unknown and need to be learned over\ntime. In this work, we take the first step towards deriving optimal\nproblem-dependent guarantees for the CMDP problems. We derive a logarithmic\nregret bound, which translates into a\n$O(\\frac{1}{\\Delta\\cdot\\eps}\\cdot\\log^2(1/\\eps))$ sample complexity bound, with\n$\\Delta$ being a problem-dependent parameter, yet independent of $\\eps$. Our\nsample complexity bound improves upon the state-of-art $O(1/\\eps^2)$ sample\ncomplexity for CMDP problems established in the previous literature, in terms\nof the dependency on $\\eps$. To achieve this advance, we develop a new\nframework for analyzing CMDP problems. To be specific, our algorithm operates\nin the primal space and we resolve the primal LP for the CMDP problem at each\nperiod in an online manner, with \\textit{adaptive} remaining resource\ncapacities. The key elements of our algorithm are: i) a characterization of the\ninstance hardness via LP basis, ii) an eliminating procedure that identifies\none optimal basis of the primal LP, and; iii) a resolving procedure that is\nadaptive to the remaining resources and sticks to the characterized optimal\nbasis.\n","authors":["Jiashuo Jiang","Yinyu Ye"],"pdf_url":"https://arxiv.org/pdf/2402.16324v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.04795v2","updated":"2024-06-03T02:36:51Z","published":"2024-05-08T04:01:40Z","title":"Variational Schrödinger Diffusion Models","summary":" Schr\\\"odinger bridge (SB) has emerged as the go-to method for optimizing\ntransportation plans in diffusion models. However, SB requires estimating the\nintractable forward score functions, inevitably resulting in the costly\nimplicit training loss based on simulated trajectories. To improve the\nscalability while preserving efficient transportation plans, we leverage\nvariational inference to linearize the forward score functions (variational\nscores) of SB and restore simulation-free properties in training backward\nscores. We propose the variational Schr\\\"odinger diffusion model (VSDM), where\nthe forward process is a multivariate diffusion and the variational scores are\nadaptively optimized for efficient transport. Theoretically, we use stochastic\napproximation to prove the convergence of the variational scores and show the\nconvergence of the adaptively generated samples based on the optimal\nvariational scores. Empirically, we test the algorithm in simulated examples\nand observe that VSDM is efficient in generations of anisotropic shapes and\nyields straighter sample trajectories compared to the single-variate diffusion.\nWe also verify the scalability of the algorithm in real-world data and achieve\ncompetitive unconditional generation performance in CIFAR10 and conditional\ngeneration in time series modeling. Notably, VSDM no longer depends on warm-up\ninitializations and has become tuning-friendly in training large-scale\nexperiments.\n","authors":["Wei Deng","Weijian Luo","Yixin Tan","Marin Biloš","Yu Chen","Yuriy Nevmyvaka","Ricky T. Q. Chen"],"pdf_url":"https://arxiv.org/pdf/2405.04795v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2402.01306v2","updated":"2024-06-03T02:36:09Z","published":"2024-02-02T10:53:36Z","title":"KTO: Model Alignment as Prospect Theoretic Optimization","summary":" Kahneman & Tversky's $\\textit{prospect theory}$ tells us that humans perceive\nrandom variables in a biased but well-defined manner (1992); for example,\nhumans are famously loss-averse. We show that objectives for aligning LLMs with\nhuman feedback implicitly incorporate many of these biases -- the success of\nthese objectives (e.g., DPO) over cross-entropy minimization can partly be\nascribed to them belonging to a family of loss functions that we call\n$\\textit{human-aware losses}$ (HALOs). However, the utility functions these\nmethods attribute to humans still differ from those in the prospect theory\nliterature. Using a Kahneman-Tversky model of human utility, we propose a HALO\nthat directly maximizes the utility of generations instead of maximizing the\nlog-likelihood of preferences, as current methods do. We call this approach\nKTO, and it matches or exceeds the performance of preference-based methods at\nscales from 1B to 30B, despite only learning from a binary signal of whether an\noutput is desirable. More broadly, our work suggests that there is no one HALO\nthat is universally superior; the best loss depends on the inductive biases\nmost appropriate for a given setting, an oft-overlooked consideration.\n","authors":["Kawin Ethayarajh","Winnie Xu","Niklas Muennighoff","Dan Jurafsky","Douwe Kiela"],"pdf_url":"https://arxiv.org/pdf/2402.01306v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2302.14509v2","updated":"2024-06-03T02:18:44Z","published":"2023-02-28T11:58:39Z","title":"Policy Dispersion in Non-Markovian Environment","summary":" Markov Decision Process (MDP) presents a mathematical framework to formulate\nthe learning processes of agents in reinforcement learning. MDP is limited by\nthe Markovian assumption that a reward only depends on the immediate state and\naction. However, a reward sometimes depends on the history of states and\nactions, which may result in the decision process in a non-Markovian\nenvironment. In such environments, agents receive rewards via\ntemporally-extended behaviors sparsely, and the learned policies may be\nsimilar. This leads the agents acquired with similar policies generally overfit\nto the given task and can not quickly adapt to perturbations of environments.\nTo resolve this problem, this paper tries to learn the diverse policies from\nthe history of state-action pairs under a non-Markovian environment, in which a\npolicy dispersion scheme is designed for seeking diverse policy representation.\nSpecifically, we first adopt a transformer-based method to learn policy\nembeddings. Then, we stack the policy embeddings to construct a dispersion\nmatrix to induce a set of diverse policies. Finally, we prove that if the\ndispersion matrix is positive definite, the dispersed embeddings can\neffectively enlarge the disagreements across policies, yielding a diverse\nexpression for the original policy embedding distribution. Experimental results\nshow that this dispersion scheme can obtain more expressive diverse policies,\nwhich then derive more robust performance than recent learning baselines under\nvarious learning environments.\n","authors":["Bohao Qu","Xiaofeng Cao","Jielong Yang","Hechang Chen","Chang Yi","Ivor W. Tsang","Yew-Soon Ong"],"pdf_url":"https://arxiv.org/pdf/2302.14509v2.pdf","comment":"In further research, we found that the core content of the paper\n requires significant modification and that the entire paper needs to be\n restructured. To enhance the scientific quality and contributions of the\n paper, we have decided to resubmit it after completing the necessary\n revisions and improvements"},{"id":"http://arxiv.org/abs/2404.02072v4","updated":"2024-06-03T02:15:03Z","published":"2024-04-02T16:20:02Z","title":"EGTR: Extracting Graph from Transformer for Scene Graph Generation","summary":" Scene Graph Generation (SGG) is a challenging task of detecting objects and\npredicting relationships between objects. After DETR was developed, one-stage\nSGG models based on a one-stage object detector have been actively studied.\nHowever, complex modeling is used to predict the relationship between objects,\nand the inherent relationship between object queries learned in the multi-head\nself-attention of the object detector has been neglected. We propose a\nlightweight one-stage SGG model that extracts the relation graph from the\nvarious relationships learned in the multi-head self-attention layers of the\nDETR decoder. By fully utilizing the self-attention by-products, the relation\ngraph can be extracted effectively with a shallow relation extraction head.\nConsidering the dependency of the relation extraction task on the object\ndetection task, we propose a novel relation smoothing technique that adjusts\nthe relation label adaptively according to the quality of the detected objects.\nBy the relation smoothing, the model is trained according to the continuous\ncurriculum that focuses on object detection task at the beginning of training\nand performs multi-task learning as the object detection performance gradually\nimproves. Furthermore, we propose a connectivity prediction task that predicts\nwhether a relation exists between object pairs as an auxiliary task of the\nrelation extraction. We demonstrate the effectiveness and efficiency of our\nmethod for the Visual Genome and Open Image V6 datasets. Our code is publicly\navailable at https://github.com/naver-ai/egtr.\n","authors":["Jinbae Im","JeongYeon Nam","Nokyung Park","Hyungmin Lee","Seunghyun Park"],"pdf_url":"https://arxiv.org/pdf/2404.02072v4.pdf","comment":"CVPR 2024 (Best paper award candidate)"},{"id":"http://arxiv.org/abs/2401.06127v2","updated":"2024-06-03T02:09:38Z","published":"2024-01-11T18:59:14Z","title":"E$^{2}$GAN: Efficient Training of Efficient GANs for Image-to-Image\n Translation","summary":" One highly promising direction for enabling flexible real-time on-device\nimage editing is utilizing data distillation by leveraging large-scale\ntext-to-image diffusion models to generate paired datasets used for training\ngenerative adversarial networks (GANs). This approach notably alleviates the\nstringent requirements typically imposed by high-end commercial GPUs for\nperforming image editing with diffusion models. However, unlike text-to-image\ndiffusion models, each distilled GAN is specialized for a specific image\nediting task, necessitating costly training efforts to obtain models for\nvarious concepts. In this work, we introduce and address a novel research\ndirection: can the process of distilling GANs from diffusion models be made\nsignificantly more efficient? To achieve this goal, we propose a series of\ninnovative techniques. First, we construct a base GAN model with generalized\nfeatures, adaptable to different concepts through fine-tuning, eliminating the\nneed for training from scratch. Second, we identify crucial layers within the\nbase GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet\neffective rank search process, rather than fine-tuning the entire base model.\nThird, we investigate the minimal amount of data necessary for fine-tuning,\nfurther reducing the overall training time. Extensive experiments show that we\ncan efficiently empower GANs with the ability to perform real-time high-quality\nimage editing on mobile devices with remarkably reduced training and storage\ncosts for each concept.\n","authors":["Yifan Gong","Zheng Zhan","Qing Jin","Yanyu Li","Yerlan Idelbayev","Xian Liu","Andrey Zharkov","Kfir Aberman","Sergey Tulyakov","Yanzhi Wang","Jian Ren"],"pdf_url":"https://arxiv.org/pdf/2401.06127v2.pdf","comment":"ICML 2024. Project Page: https://yifanfanfanfan.github.io/e2gan/"},{"id":"http://arxiv.org/abs/2405.20445v2","updated":"2024-06-03T02:08:54Z","published":"2024-05-30T19:43:29Z","title":"GraphAny: A Foundation Model for Node Classification on Any Graph","summary":" Foundation models that can perform inference on any new task without\nrequiring specific training have revolutionized machine learning in vision and\nlanguage applications. However, applications involving graph-structured data\nremain a tough nut for foundation models, due to challenges in the unique\nfeature- and label spaces associated with each graph. Traditional graph ML\nmodels such as graph neural networks (GNNs) trained on graphs cannot perform\ninference on a new graph with feature and label spaces different from the\ntraining ones. Furthermore, existing models learn functions specific to the\ntraining graph and cannot generalize to new graphs. In this work, we tackle\nthese two challenges with a new foundational architecture for inductive node\nclassification named GraphAny. GraphAny models inference on a new graph as an\nanalytical solution to a LinearGNN, thereby solving the first challenge. To\nsolve the second challenge, we learn attention scores for each node to fuse the\npredictions of multiple LinearGNNs. Specifically, the attention module is\ncarefully parameterized as a function of the entropy-normalized\ndistance-features between multiple LinearGNNs predictions to ensure\ngeneralization to new graphs. Empirically, GraphAny trained on the Wisconsin\ndataset with only 120 labeled nodes can effectively generalize to 30 new graphs\nwith an average accuracy of 67.26\\% in an inductive manner, surpassing GCN and\nGAT trained in the supervised regime, as well as other inductive baselines.\n","authors":["Jianan Zhao","Hesham Mostafa","Mikhail Galkin","Michael Bronstein","Zhaocheng Zhu","Jian Tang"],"pdf_url":"https://arxiv.org/pdf/2405.20445v2.pdf","comment":"Preprint. Work in progress"},{"id":"http://arxiv.org/abs/2401.07463v2","updated":"2024-06-03T01:55:52Z","published":"2024-01-15T04:20:39Z","title":"Consistency of semi-supervised learning, stochastic tug-of-war games,\n and the p-Laplacian","summary":" In this paper we give a broad overview of the intersection of partial\ndifferential equations (PDEs) and graph-based semi-supervised learning. The\noverview is focused on a large body of recent work on PDE continuum limits of\ngraph-based learning, which have been used to prove well-posedness of\nsemi-supervised learning algorithms in the large data limit. We highlight some\ninteresting research directions revolving around consistency of graph-based\nsemi-supervised learning, and present some new results on the consistency of\n$p$-Laplacian semi-supervised learning using the stochastic tug-of-war game\ninterpretation of the $p$-Laplacian. We also present the results of some\nnumerical experiments that illustrate our results and suggest directions for\nfuture work.\n","authors":["Jeff Calder","Nadejda Drenska"],"pdf_url":"https://arxiv.org/pdf/2401.07463v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04411v2","updated":"2024-06-03T01:40:46Z","published":"2024-02-06T21:14:45Z","title":"DFA-RAG: Conversational Semantic Router for Large Language Model with\n Definite Finite Automaton","summary":" This paper introduces the retrieval-augmented large language model with\nDefinite Finite Automaton (DFA-RAG), a novel framework designed to enhance the\ncapabilities of conversational agents using large language models (LLMs).\nTraditional LLMs face challenges in generating regulated and compliant\nresponses in special scenarios with predetermined response guidelines, like\nemotional support and customer service. Our framework addresses these\nchallenges by embedding a Definite Finite Automaton (DFA), learned from\ntraining dialogues, within the LLM. This structured approach acts as a semantic\nrouter which enables the LLM to adhere to a deterministic response pathway. The\nrouting is achieved by the retrieval-augmentation generation (RAG) strategy,\nwhich carefully selects dialogue examples aligned with the current\nconversational context. The advantages of DFA-RAG include an interpretable\nstructure through human-readable DFA, context-aware retrieval for responses in\nconversations, and plug-and-play compatibility with existing LLMs. Extensive\nbenchmarks validate DFA-RAG's effectiveness, indicating its potential as a\nvaluable contribution to the conversational agent.\n","authors":["Yiyou Sun","Junjie Hu","Wei Cheng","Haifeng Chen"],"pdf_url":"https://arxiv.org/pdf/2402.04411v2.pdf","comment":"Accepted to ICML 2024"},{"id":"http://arxiv.org/abs/2211.13316v3","updated":"2024-06-03T01:24:38Z","published":"2022-11-23T21:34:35Z","title":"Understanding Sample Generation Strategies for Learning Heuristic\n Functions in Classical Planning","summary":" We study the problem of learning good heuristic functions for classical\nplanning tasks with neural networks based on samples represented by states with\ntheir cost-to-goal estimates. The heuristic function is learned for a state\nspace and goal condition with the number of samples limited to a fraction of\nthe size of the state space, and must generalize well for all states of the\nstate space with the same goal condition. Our main goal is to better understand\nthe influence of sample generation strategies on the performance of a greedy\nbest-first heuristic search (GBFS) guided by a learned heuristic function. In a\nset of controlled experiments, we find that two main factors determine the\nquality of the learned heuristic: the algorithm used to generate the sample set\nand how close the sample estimates to the perfect cost-to-goal are. These two\nfactors are dependent: having perfect cost-to-goal estimates is insufficient if\nthe samples are not well distributed across the state space. We also study\nother effects, such as adding samples with high-value estimates. Based on our\nfindings, we propose practical strategies to improve the quality of learned\nheuristics: three strategies that aim to generate more representative states\nand two strategies that improve the cost-to-goal estimates. Our practical\nstrategies result in a learned heuristic that, when guiding a GBFS algorithm,\nincreases by more than 30% the mean coverage compared to a baseline learned\nheuristic.\n","authors":["R. V. Bettker","P. P. Minini","A. G. Pereira","M. Ritt"],"pdf_url":"https://arxiv.org/pdf/2211.13316v3.pdf","comment":"29 pages"},{"id":"http://arxiv.org/abs/2404.18239v3","updated":"2024-06-03T01:10:53Z","published":"2024-04-28T16:31:32Z","title":"SOUL: Unlocking the Power of Second-Order Optimization for LLM\n Unlearning","summary":" Large Language Models (LLMs) have highlighted the necessity of effective\nunlearning mechanisms to comply with data regulations and ethical AI practices.\nLLM unlearning aims at removing undesired data influences and associated model\ncapabilities without compromising utility out of the scope of unlearning. While\ninterest in studying LLM unlearning is growing,the impact of the optimizer\nchoice for LLM unlearning remains under-explored. In this work, we shed light\non the significance of optimizer selection in LLM unlearning for the first\ntime, establishing a clear connection between {second-order optimization} and\ninfluence unlearning (a classical approach using influence functions to update\nthe model for data influence removal). This insight propels us to develop a\nsecond-order unlearning framework, termed SOUL, built upon the second-order\nclipped stochastic optimization (Sophia)-based LLM training method. SOUL\nextends the static, one-shot model update using influence unlearning to a\ndynamic, iterative unlearning process. Our extensive experiments show that SOUL\nconsistently outperforms conventional first-order methods across various\nunlearning tasks, models, and metrics, suggesting the promise of second-order\noptimization in providing a scalable and easily implementable solution for LLM\nunlearning.\n","authors":["Jinghan Jia","Yihua Zhang","Yimeng Zhang","Jiancheng Liu","Bharat Runwal","James Diffenderfer","Bhavya Kailkhura","Sijia Liu"],"pdf_url":"https://arxiv.org/pdf/2404.18239v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.06470v2","updated":"2024-06-03T01:03:49Z","published":"2022-12-13T10:41:12Z","title":"Position: Considerations for Differentially Private Learning with\n Large-Scale Public Pretraining","summary":" The performance of differentially private machine learning can be boosted\nsignificantly by leveraging the transfer learning capabilities of non-private\nmodels pretrained on large public datasets. We critically review this approach.\n We primarily question whether the use of large Web-scraped datasets should be\nviewed as differential-privacy-preserving. We caution that publicizing these\nmodels pretrained on Web data as \"private\" could lead to harm and erode the\npublic's trust in differential privacy as a meaningful definition of privacy.\n Beyond the privacy considerations of using public data, we further question\nthe utility of this paradigm. We scrutinize whether existing machine learning\nbenchmarks are appropriate for measuring the ability of pretrained models to\ngeneralize to sensitive domains, which may be poorly represented in public Web\ndata. Finally, we notice that pretraining has been especially impactful for the\nlargest available models -- models sufficiently large to prohibit end users\nrunning them on their own devices. Thus, deploying such models today could be a\nnet loss for privacy, as it would require (private) data to be outsourced to a\nmore compute-powerful third party.\n We conclude by discussing potential paths forward for the field of private\nlearning, as public pretraining becomes more popular and powerful.\n","authors":["Florian Tramèr","Gautam Kamath","Nicholas Carlini"],"pdf_url":"https://arxiv.org/pdf/2212.06470v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2401.15800v2","updated":"2024-06-03T00:49:43Z","published":"2024-01-28T23:14:51Z","title":"Provably Stable Feature Rankings with SHAP and LIME","summary":" Feature attributions are ubiquitous tools for understanding the predictions\nof machine learning models. However, the calculation of popular methods for\nscoring input variables such as SHAP and LIME suffers from high instability due\nto random sampling. Leveraging ideas from multiple hypothesis testing, we\ndevise attribution methods that ensure the most important features are ranked\ncorrectly with high probability. Given SHAP estimates from KernelSHAP or\nShapley Sampling, we demonstrate how to retrospectively verify the number of\nstable rankings. Further, we introduce efficient sampling algorithms for SHAP\nand LIME that guarantee the $K$ highest-ranked features have the proper\nordering. Finally, we show how to adapt these local feature attribution methods\nfor the global importance setting.\n","authors":["Jeremy Goldwasser","Giles Hooker"],"pdf_url":"https://arxiv.org/pdf/2401.15800v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.05100v2","updated":"2024-06-03T00:32:53Z","published":"2023-06-08T10:58:46Z","title":"Communication-Efficient Gradient Descent-Accent Methods for Distributed\n Variational Inequalities: Unified Analysis and Local Updates","summary":" Distributed and federated learning algorithms and techniques associated\nprimarily with minimization problems. However, with the increase of minimax\noptimization and variational inequality problems in machine learning, the\nnecessity of designing efficient distributed/federated learning approaches for\nthese problems is becoming more apparent. In this paper, we provide a unified\nconvergence analysis of communication-efficient local training methods for\ndistributed variational inequality problems (VIPs). Our approach is based on a\ngeneral key assumption on the stochastic estimates that allows us to propose\nand analyze several novel local training algorithms under a single framework\nfor solving a class of structured non-monotone VIPs. We present the first local\ngradient descent-accent algorithms with provable improved communication\ncomplexity for solving distributed variational inequalities on heterogeneous\ndata. The general algorithmic framework recovers state-of-the-art algorithms\nand their sharp convergence guarantees when the setting is specialized to\nminimization or minimax optimization problems. Finally, we demonstrate the\nstrong performance of the proposed algorithms compared to state-of-the-art\nmethods when solving federated minimax optimization problems.\n","authors":["Siqi Zhang","Sayantan Choudhury","Sebastian U Stich","Nicolas Loizou"],"pdf_url":"https://arxiv.org/pdf/2306.05100v2.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2405.18749v2","updated":"2024-06-03T00:17:05Z","published":"2024-05-29T04:22:18Z","title":"A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody\n Language Models","summary":" Antibodies are crucial proteins produced by the immune system to eliminate\nharmful foreign substances and have become pivotal therapeutic agents for\ntreating human diseases. To accelerate the discovery of antibody therapeutics,\nthere is growing interest in constructing language models using antibody\nsequences. However, the applicability of pre-trained language models for\nantibody discovery has not been thoroughly evaluated due to the scarcity of\nlabeled datasets. To overcome these limitations, we introduce AVIDa-SARS-CoV-2,\na dataset featuring the antigen-variable domain of heavy chain of heavy chain\nantibody (VHH) interactions obtained from two alpacas immunized with severe\nacute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins.\nAVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding\nof diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and\nOmicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset\nfor antibody language models, containing over two million VHH sequences. We\nreport benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT\npre-trained on VHHCorpus-2M and existing general protein and antibody-specific\npre-trained language models. These results confirm that AVIDa-SARS-CoV-2\nprovides valuable benchmarks for evaluating the representation capabilities of\nantibody language models for binding prediction, thereby facilitating the\ndevelopment of AI-driven antibody discovery. The datasets are available at\nhttps://datasets.cognanous.com.\n","authors":["Hirofumi Tsuruta","Hiroyuki Yamazaki","Ryota Maeda","Ryotaro Tamura","Akihiro Imura"],"pdf_url":"https://arxiv.org/pdf/2405.18749v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2312.14867v2","updated":"2024-06-03T16:59:20Z","published":"2023-12-22T17:45:19Z","title":"VIEScore: Towards Explainable Metrics for Conditional Image Synthesis\n Evaluation","summary":" In the rapidly advancing field of conditional image generation research,\nchallenges such as limited explainability lie in effectively evaluating the\nperformance and capabilities of various models. This paper introduces VIEScore,\na Visual Instruction-guided Explainable metric for evaluating any conditional\nimage generation tasks. VIEScore leverages general knowledge from Multimodal\nLarge Language Models (MLLMs) as the backbone and does not require training or\nfine-tuning. We evaluate VIEScore on seven prominent tasks in conditional image\ntasks and found: (1) VIEScore (GPT4-o) achieves a high Spearman correlation of\n0.4 with human evaluations, while the human-to-human correlation is 0.45. (2)\nVIEScore (with open-source MLLM) is significantly weaker than GPT-4o and GPT-4v\nin evaluating synthetic images. (3) VIEScore achieves a correlation on par with\nhuman ratings in the generation tasks but struggles in editing tasks. With\nthese results, we believe VIEScore shows its great potential to replace human\njudges in evaluating image synthesis tasks.\n","authors":["Max Ku","Dongfu Jiang","Cong Wei","Xiang Yue","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14867v2.pdf","comment":"Accepted to ACL2024 main"},{"id":"http://arxiv.org/abs/2401.01163v3","updated":"2024-06-03T16:09:55Z","published":"2024-01-02T11:46:42Z","title":"NU-Class Net: A Novel Approach for Video Quality Enhancement","summary":" Video content has experienced a surge in popularity, asserting its dominance\nover internet traffic and Internet of Things (IoT) networks. Video compression\nhas long been regarded as the primary means of efficiently managing the\nsubstantial multimedia traffic generated by video-capturing devices.\nNevertheless, video compression algorithms entail significant computational\ndemands in order to achieve substantial compression ratios. This complexity\npresents a formidable challenge when implementing efficient video coding\nstandards in resource-constrained embedded systems, such as IoT edge node\ncameras. To tackle this challenge, this paper introduces NU-Class Net, an\ninnovative deep-learning model designed to mitigate compression artifacts\nstemming from lossy compression codecs. This enhancement significantly elevates\nthe perceptible quality of low-bit-rate videos. By employing the NU-Class Net,\nthe video encoder within the video-capturing node can reduce output quality,\nthereby generating low-bit-rate videos and effectively curtailing both\ncomputation and bandwidth requirements at the edge. On the decoder side, which\nis typically less encumbered by resource limitations, NU-Class Net is applied\nafter the video decoder to compensate for artifacts and approximate the quality\nof the original video. Experimental results affirm the efficacy of the proposed\nmodel in enhancing the perceptible quality of videos, especially those streamed\nat low bit rates.\n","authors":["Parham Zilouchian Moghaddam","Mehdi Modarressi","Mohammad Amin Sadeghi"],"pdf_url":"https://arxiv.org/pdf/2401.01163v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19802v2","updated":"2024-06-03T12:55:12Z","published":"2024-05-30T08:12:08Z","title":"Exploring the Robustness of Decision-Level Through Adversarial Attacks\n on LLM-Based Embodied Models","summary":" Embodied intelligence empowers agents with a profound sense of perception,\nenabling them to respond in a manner closely aligned with real-world\nsituations. Large Language Models (LLMs) delve into language instructions with\ndepth, serving a crucial role in generating plans for intricate tasks. Thus,\nLLM-based embodied models further enhance the agent's capacity to comprehend\nand process information. However, this amalgamation also ushers in new\nchallenges in the pursuit of heightened intelligence. Specifically, attackers\ncan manipulate LLMs to produce irrelevant or even malicious outputs by altering\ntheir prompts. Confronted with this challenge, we observe a notable absence of\nmulti-modal datasets essential for comprehensively evaluating the robustness of\nLLM-based embodied models. Consequently, we construct the Embodied Intelligent\nRobot Attack Dataset (EIRAD), tailored specifically for robustness evaluation.\nAdditionally, two attack strategies are devised, including untargeted attacks\nand targeted attacks, to effectively simulate a range of diverse attack\nscenarios. At the same time, during the attack process, to more accurately\nascertain whether our method is successful in attacking the LLM-based embodied\nmodel, we devise a new attack success evaluation method utilizing the BLIP2\nmodel. Recognizing the time and cost-intensive nature of the GCG algorithm in\nattacks, we devise a scheme for prompt suffix initialization based on various\ntarget tasks, thus expediting the convergence process. Experimental results\ndemonstrate that our method exhibits a superior attack success rate when\ntargeting LLM-based embodied models, indicating a lower level of decision-level\nrobustness in these models.\n","authors":["Shuyuan Liu","Jiawei Chen","Shouwei Ruan","Hang Su","Zhaoxia Yin"],"pdf_url":"https://arxiv.org/pdf/2405.19802v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.08389v2","updated":"2024-06-03T07:47:36Z","published":"2023-05-15T07:12:19Z","title":"Edit As You Wish: Video Caption Editing with Multi-grained User Control","summary":" Automatically narrating videos in natural language complying with user\nrequests, i.e. Controllable Video Captioning task, can help people manage\nmassive videos with desired intentions. However, existing works suffer from two\nshortcomings: 1) the control signal is single-grained which can not satisfy\ndiverse user intentions; 2) the video description is generated in a single\nround which can not be further edited to meet dynamic needs. In this paper, we\npropose a novel \\textbf{V}ideo \\textbf{C}aption \\textbf{E}diting \\textbf{(VCE)}\ntask to automatically revise an existing video description guided by\nmulti-grained user requests. Inspired by human writing-revision habits, we\ndesign the user command as a pivotal triplet \\{\\textit{operation, position,\nattribute}\\} to cover diverse user needs from coarse-grained to fine-grained.\nTo facilitate the VCE task, we \\textit{automatically} construct an open-domain\nbenchmark dataset named VATEX-EDIT and \\textit{manually} collect an e-commerce\ndataset called EMMAD-EDIT. We further propose a specialized small-scale model\n(i.e., OPA) compared with two generalist Large Multi-modal Models to perform an\nexhaustive analysis of the novel task. For evaluation, we adopt comprehensive\nmetrics considering caption fluency, command-caption consistency, and\nvideo-caption alignment. Experiments reveal the task challenges of fine-grained\nmulti-modal semantics understanding and processing. Our datasets, codes, and\nevaluation tools are ready to be open-sourced.\n","authors":["Linli Yao","Yuanmeng Zhang","Ziheng Wang","Xinglin Hou","Tiezheng Ge","Yuning Jiang","Xu Sun","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2305.08389v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.18063v2","updated":"2024-06-03T18:22:30Z","published":"2024-03-26T19:29:21Z","title":"Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and\n Time-Series Analysis","summary":" Transformers have revolutionized image modeling tasks with adaptations like\nDeIT, Swin, SVT, Biformer, STVit, and FDVIT. However, these models often face\nchallenges with inductive bias and high quadratic complexity, making them less\nefficient for high-resolution images. State space models (SSMs) such as Mamba,\nV-Mamba, ViM, and SiMBA offer an alternative to handle high resolution images\nin computer vision tasks. These SSMs encounter two major issues. First, they\nbecome unstable when scaled to large network sizes. Second, although they\nefficiently capture global information in images, they inherently struggle with\nhandling local information. To address these challenges, we introduce Heracles,\na novel SSM that integrates a local SSM, a global SSM, and an attention-based\ntoken interaction module. Heracles leverages a Hartely kernel-based state space\nmodel for global image information, a localized convolutional network for local\ndetails, and attention mechanisms in deeper layers for token interactions. Our\nextensive experiments demonstrate that Heracles-C-small achieves\nstate-of-the-art performance on the ImageNet dataset with 84.5\\% top-1\naccuracy. Heracles-C-Large and Heracles-C-Huge further improve accuracy to\n85.9\\% and 86.4\\%, respectively. Additionally, Heracles excels in transfer\nlearning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and\nStanford Cars, and in instance segmentation on the MSCOCO dataset. Heracles\nalso proves its versatility by achieving state-of-the-art results on seven\ntime-series datasets, showcasing its ability to generalize across domains with\nspectral data, capturing both local and global information. The project page is\navailable at this link.\\url{https://github.com/badripatro/heracles}\n","authors":["Badri N. Patro","Suhas Ranganath","Vinay P. Namboodiri","Vijay S. Agneeswaran"],"pdf_url":"https://arxiv.org/pdf/2403.18063v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01451v1","updated":"2024-06-03T15:42:30Z","published":"2024-06-03T15:42:30Z","title":"SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised\n Referring Expression Segmentation","summary":" In this paper, we introduce SemiRES, a semi-supervised framework that\neffectively leverages a combination of labeled and unlabeled data to perform\nRES. A significant hurdle in applying semi-supervised techniques to RES is the\nprevalence of noisy pseudo-labels, particularly at the boundaries of objects.\nSemiRES incorporates the Segment Anything Model (SAM), renowned for its precise\nboundary demarcation, to improve the accuracy of these pseudo-labels. Within\nSemiRES, we offer two alternative matching strategies: IoU-based Optimal\nMatching (IOM) and Composite Parts Integration (CPI). These strategies are\ndesigned to extract the most accurate masks from SAM's output, thus guiding the\ntraining of the student model with enhanced precision. In instances where a\nprecise mask cannot be matched from the available candidates, we develop the\nPixel-Wise Adjustment (PWA) strategy, guiding the student model's training\ndirectly by the pseudo-labels. Extensive experiments on three RES\nbenchmarks--RefCOCO, RefCOCO+, and G-Ref reveal its superior performance\ncompared to fully supervised methods. Remarkably, with only 1% labeled data,\nour SemiRES outperforms the supervised baseline by a large margin, e.g. +18.64%\ngains on RefCOCO val set. The project code is available at\n\\url{https://github.com/nini0919/SemiRES}.\n","authors":["Danni Yang","Jiayi Ji","Yiwei Ma","Tianyu Guo","Haowei Wang","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2406.01451v1.pdf","comment":"Accepted by ICML2024"},{"id":"http://arxiv.org/abs/2406.01321v1","updated":"2024-06-03T13:42:10Z","published":"2024-06-03T13:42:10Z","title":"Sequence-to-Sequence Multi-Modal Speech In-Painting","summary":" Speech in-painting is the task of regenerating missing audio contents using\nreliable context information. Despite various recent studies in multi-modal\nperception of audio in-painting, there is still a need for an effective\ninfusion of visual and auditory information in speech in-painting. In this\npaper, we introduce a novel sequence-to-sequence model that leverages the\nvisual information to in-paint audio signals via an encoder-decoder\narchitecture. The encoder plays the role of a lip-reader for facial recordings\nand the decoder takes both encoder outputs as well as the distorted audio\nspectrograms to restore the original speech. Our model outperforms an\naudio-only speech in-painting model and has comparable results with a recent\nmulti-modal speech in-painter in terms of speech quality and intelligibility\nmetrics for distortions of 300 ms to 1500 ms duration, which proves the\neffectiveness of the introduced multi-modality in speech in-painting.\n","authors":["Mahsa Kadkhodaei Elyaderani","Shahram Shirani"],"pdf_url":"https://arxiv.org/pdf/2406.01321v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01280v1","updated":"2024-06-03T12:48:38Z","published":"2024-06-03T12:48:38Z","title":"Demo: Soccer Information Retrieval via Natural Queries using SoccerRAG","summary":" The rapid evolution of digital sports media necessitates sophisticated\ninformation retrieval systems that can efficiently parse extensive multimodal\ndatasets. This paper demonstrates SoccerRAG, an innovative framework designed\nto harness the power of Retrieval Augmented Generation (RAG) and Large Language\nModels (LLMs) to extract soccer-related information through natural language\nqueries. By leveraging a multimodal dataset, SoccerRAG supports dynamic\nquerying and automatic data validation, enhancing user interaction and\naccessibility to sports archives. We present a novel interactive user interface\n(UI) based on the Chainlit framework which wraps around the core functionality,\nand enable users to interact with the SoccerRAG framework in a chatbot-like\nvisual manner.\n","authors":["Aleksander Theo Strand","Sushant Gautam","Cise Midoglu","Pål Halvorsen"],"pdf_url":"https://arxiv.org/pdf/2406.01280v1.pdf","comment":"accepted to CBMI 2024 as a demonstration;\n https://github.com/simula/soccer-rag"},{"id":"http://arxiv.org/abs/2406.01273v1","updated":"2024-06-03T12:39:04Z","published":"2024-06-03T12:39:04Z","title":"SoccerRAG: Multimodal Soccer Information Retrieval via Natural Queries","summary":" The rapid evolution of digital sports media necessitates sophisticated\ninformation retrieval systems that can efficiently parse extensive multimodal\ndatasets. This paper introduces SoccerRAG, an innovative framework designed to\nharness the power of Retrieval Augmented Generation (RAG) and Large Language\nModels (LLMs) to extract soccer-related information through natural language\nqueries. By leveraging a multimodal dataset, SoccerRAG supports dynamic\nquerying and automatic data validation, enhancing user interaction and\naccessibility to sports archives. Our evaluations indicate that SoccerRAG\neffectively handles complex queries, offering significant improvements over\ntraditional retrieval systems in terms of accuracy and user engagement. The\nresults underscore the potential of using RAG and LLMs in sports analytics,\npaving the way for future advancements in the accessibility and real-time\nprocessing of sports data.\n","authors":["Aleksander Theo Strand","Sushant Gautam","Cise Midoglu","Pål Halvorsen"],"pdf_url":"https://arxiv.org/pdf/2406.01273v1.pdf","comment":"accepted to CBMI 2024 as a regular paper;\n https://github.com/simula/soccer-rag"},{"id":"http://arxiv.org/abs/2406.01033v1","updated":"2024-06-03T06:35:11Z","published":"2024-06-03T06:35:11Z","title":"Generalized Jersey Number Recognition Using Multi-task Learning With\n Orientation-guided Weight Refinement","summary":" Jersey number recognition (JNR) has always been an important task in sports\nanalytics. Improving recognition accuracy remains an ongoing challenge because\nimages are subject to blurring, occlusion, deformity, and low resolution.\nRecent research has addressed these problems using number localization and\noptical character recognition. Some approaches apply player identification\nschemes to image sequences, ignoring the impact of human body rotation angles\non jersey digit identification. Accurately predicting the number of jersey\ndigits by using a multi-task scheme to recognize each individual digit enables\nmore robust results. Based on the above considerations, this paper proposes a\nmulti-task learning method called the angle-digit refine scheme (ADRS), which\ncombines human body orientation angles and digit number clues to recognize\nathletic jersey numbers. Based on our experimental results, our approach\nincreases inference information, significantly improving prediction accuracy.\nCompared to state-of-the-art methods, which can only handle a single type of\nsport, the proposed method produces a more diverse and practical JNR\napplication. The incorporation of diverse types of team sports such as soccer,\nfootball, basketball, volleyball, and baseball into our dataset contributes\ngreatly to generalized JNR in sports analytics. Our accuracy achieves 64.07% on\nTop-1 and 89.97% on Top-2, with corresponding F1 scores of 67.46% and 90.64%,\nrespectively.\n","authors":["Yung-Hui Lin","Yu-Wen Chang","Huang-Chia Shih","Takahiro Ogawa"],"pdf_url":"https://arxiv.org/pdf/2406.01033v1.pdf","comment":"10 pages, 6 figures, 5 tables"},{"id":"http://arxiv.org/abs/2406.00919v1","updated":"2024-06-03T01:09:15Z","published":"2024-06-03T01:09:15Z","title":"Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise\n Pseudo Labeling","summary":" The Audio-Visual Video Parsing task aims to identify and temporally localize\nthe events that occur in either or both the audio and visual streams of audible\nvideos. It often performs in a weakly-supervised manner, where only video event\nlabels are provided, \\ie, the modalities and the timestamps of the labels are\nunknown. Due to the lack of densely annotated labels, recent work attempts to\nleverage pseudo labels to enrich the supervision. A commonly used strategy is\nto generate pseudo labels by categorizing the known video event labels for each\nmodality. However, the labels are still confined to the video level, and the\ntemporal boundaries of events remain unlabeled. In this paper, we propose a new\npseudo label generation strategy that can explicitly assign labels to each\nvideo segment by utilizing prior knowledge learned from the open world.\nSpecifically, we exploit the large-scale pretrained models, namely CLIP and\nCLAP, to estimate the events in each video segment and generate segment-level\nvisual and audio pseudo labels, respectively. We then propose a new loss\nfunction to exploit these pseudo labels by taking into account their\ncategory-richness and segment-richness. A label denoising strategy is also\nadopted to further improve the visual pseudo labels by flipping them whenever\nabnormally large forward losses occur. We perform extensive experiments on the\nLLP dataset and demonstrate the effectiveness of each proposed design and we\nachieve state-of-the-art video parsing performance on all types of event\nparsing, \\ie, audio event, visual event, and audio-visual event. We also\nexamine the proposed pseudo label generation strategy on a relevant\nweakly-supervised audio-visual event localization task and the experimental\nresults again verify the benefits and generalization of our method.\n","authors":["Jinxing Zhou","Dan Guo","Yiran Zhong","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2406.00919v1.pdf","comment":"IJCV 2024 Accepted. arXiv admin note: substantial text overlap with\n arXiv:2303.02344"}]},"2024-06-02T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2405.18672v2","updated":"2024-06-02T23:30:46Z","published":"2024-05-29T00:36:56Z","title":"LLM-based Hierarchical Concept Decomposition for Interpretable\n Fine-Grained Image Classification","summary":" (Renyi Qu's Master's Thesis) Recent advancements in interpretable models for\nvision-language tasks have achieved competitive performance; however, their\ninterpretability often suffers due to the reliance on unstructured text outputs\nfrom large language models (LLMs). This introduces randomness and compromises\nboth transparency and reliability, which are essential for addressing safety\nissues in AI systems. We introduce \\texttt{Hi-CoDe} (Hierarchical Concept\nDecomposition), a novel framework designed to enhance model interpretability\nthrough structured concept analysis. Our approach consists of two main\ncomponents: (1) We use GPT-4 to decompose an input image into a structured\nhierarchy of visual concepts, thereby forming a visual concept tree. (2) We\nthen employ an ensemble of simple linear classifiers that operate on\nconcept-specific features derived from CLIP to perform classification. Our\napproach not only aligns with the performance of state-of-the-art models but\nalso advances transparency by providing clear insights into the decision-making\nprocess and highlighting the importance of various concepts. This allows for a\ndetailed analysis of potential failure modes and improves model compactness,\ntherefore setting a new benchmark in interpretability without compromising the\naccuracy.\n","authors":["Renyi Qu","Mark Yatskar"],"pdf_url":"https://arxiv.org/pdf/2405.18672v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13673v3","updated":"2024-06-02T23:16:50Z","published":"2023-05-23T04:28:16Z","title":"Physics of Language Models: Part 1, Learning Hierarchical Language\n Structures","summary":" Transformer-based language models are effective but complex, and\nunderstanding their inner workings is a significant challenge. Previous\nresearch has primarily explored how these models handle simple tasks like name\ncopying or selection, and we extend this by investigating how these models\ngrasp complex, recursive language structures defined by context-free grammars\n(CFGs). We introduce a family of synthetic CFGs that produce hierarchical\nrules, capable of generating lengthy sentences (e.g., hundreds of tokens) that\nare locally ambiguous and require dynamic programming to parse. Despite this\ncomplexity, we demonstrate that generative models like GPT can accurately learn\nthis CFG language and generate sentences based on it. We explore the model's\ninternals, revealing that its hidden states precisely capture the structure of\nCFGs, and its attention patterns resemble the information passing in a dynamic\nprogramming algorithm.\n This paper also presents several corollaries, including showing why\npositional embedding is inferior to relative attention or rotary embedding;\ndemonstrating that encoder-based models (e.g., BERT, deBERTa) cannot learn very\ndeeply nested CFGs as effectively as generative models (e.g., GPT); and\nhighlighting the necessity of adding structural and syntactic errors to the\npretraining data to make the model more robust to corrupted language prefixes.\n","authors":["Zeyuan Allen-Zhu","Yuanzhi Li"],"pdf_url":"https://arxiv.org/pdf/2305.13673v3.pdf","comment":"V2+V3 polishes writing; V3 includes Figures 6 and 10 for better\n illustrations of our results"},{"id":"http://arxiv.org/abs/2404.14454v2","updated":"2024-06-02T22:42:00Z","published":"2024-04-21T09:20:16Z","title":"Reinforcement of Explainability of ChatGPT Prompts by Embedding Breast\n Cancer Self-Screening Rules into AI Responses","summary":" Addressing the global challenge of breast cancer, this research explores the\nfusion of generative AI, focusing on ChatGPT 3.5 turbo model, and the\nintricacies of breast cancer risk assessment. The research aims to evaluate\nChatGPT's reasoning capabilities, emphasizing its potential to process rules\nand provide explanations for screening recommendations. The study seeks to\nbridge the technology gap between intelligent machines and clinicians by\ndemonstrating ChatGPT's unique proficiency in natural language reasoning. The\nmethodology employs a supervised prompt-engineering approach to enforce\ndetailed explanations for ChatGPT's recommendations. Synthetic use cases,\ngenerated algorithmically, serve as the testing ground for the encoded rules,\nevaluating the model's processing prowess. Findings highlight ChatGPT's\npromising capacity in processing rules comparable to Expert System Shells, with\na focus on natural language reasoning. The research introduces the concept of\nreinforcement explainability, showcasing its potential in elucidating outcomes\nand facilitating user-friendly interfaces for breast cancer risk assessment.\n","authors":["Yousef Khan","Ahmed Abdeen Hamed"],"pdf_url":"https://arxiv.org/pdf/2404.14454v2.pdf","comment":"9 pages, 5 figures, 3 algorithms, 1 table, submitted to the IEEE\n MedAI'24 Conference"},{"id":"http://arxiv.org/abs/2401.11356v3","updated":"2024-06-02T21:05:36Z","published":"2024-01-21T00:58:31Z","title":"ProLex: A Benchmark for Language Proficiency-oriented Lexical\n Substitution","summary":" Lexical Substitution discovers appropriate substitutes for a given target\nword in a context sentence. However, the task fails to consider substitutes\nthat are of equal or higher proficiency than the target, an aspect that could\nbe beneficial for language learners looking to improve their writing. To bridge\nthis gap, we propose a new task, language proficiency-oriented lexical\nsubstitution. We also introduce ProLex, a novel benchmark designed to assess\nsystems' ability to generate not only appropriate substitutes but also\nsubstitutes that demonstrate better language proficiency. Besides the\nbenchmark, we propose models that can automatically perform the new task. We\nshow that our best model, a Llama2-13B model fine-tuned with task-specific\nsynthetic data, outperforms ChatGPT by an average of 3.2% in F-score and\nachieves comparable results with GPT-4 on ProLex.\n","authors":["Xuanming Zhang","Zixun Chen","Zhou Yu"],"pdf_url":"https://arxiv.org/pdf/2401.11356v3.pdf","comment":"In ACL 2024 Findings, 19 pages, 4 figures, 14 tables"},{"id":"http://arxiv.org/abs/2306.12916v3","updated":"2024-06-02T20:38:10Z","published":"2023-06-22T14:31:18Z","title":"Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation","summary":" While summarization has been extensively researched in natural language\nprocessing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a\nlargely unexplored area that has the potential to improve cross-cultural\naccessibility and understanding. This paper comprehensively addresses the CLCTS\ntask, including dataset creation, modeling, and evaluation. We (1) build the\nfirst CLCTS corpus with 328 instances for hDe-En (extended version with 455\ninstances) and 289 for hEn-De (extended version with 501 instances), leveraging\nhistorical fiction texts and Wikipedia summaries in English and German; (2)\nexamine the effectiveness of popular transformer end-to-end models with\ndifferent intermediate finetuning tasks; (3) explore the potential of GPT-3.5\nas a summarizer; (4) report evaluations from humans, GPT-4, and several recent\nautomatic evaluation metrics. Our results indicate that intermediate task\nfinetuned end-to-end models generate bad to moderate quality summaries while\nGPT-3.5, as a zero-shot summarizer, provides moderate to good quality outputs.\nGPT-3.5 also seems very adept at normalizing historical text. To assess data\ncontamination in GPT-3.5, we design an adversarial attack scheme in which we\nfind that GPT-3.5 performs slightly worse for unseen source documents compared\nto seen documents. Moreover, it sometimes hallucinates when the source\nsentences are inverted against its prior knowledge with a summarization\naccuracy of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot\nnegation. Overall, our regression results of model performances suggest that\nlonger, older, and more complex source texts (all of which are more\ncharacteristic for historical language variants) are harder to summarize for\nall models, indicating the difficulty of the CLCTS task.\n","authors":["Ran Zhang","Jihed Ouni","Steffen Eger"],"pdf_url":"https://arxiv.org/pdf/2306.12916v3.pdf","comment":"Computational Linguistics. Submitted manuscript.\n https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00519/121095/Cross-lingual-Cross-temporal-Summarization-Dataset"},{"id":"http://arxiv.org/abs/2404.18400v2","updated":"2024-06-02T20:17:59Z","published":"2024-04-29T03:30:06Z","title":"LLM-SR: Scientific Equation Discovery via Programming with Large\n Language Models","summary":" Mathematical equations have been unreasonably effective in describing complex\nnatural phenomena across various scientific disciplines. However, discovering\nsuch insightful equations from data presents significant challenges due to the\nnecessity of navigating extremely high-dimensional combinatorial and nonlinear\nhypothesis spaces. Traditional methods of equation discovery, commonly known as\nsymbolic regression, largely focus on extracting equations from data alone,\noften neglecting the rich domain-specific prior knowledge that scientists\ntypically depend on. To bridge this gap, we introduce LLM-SR, a novel approach\nthat leverages the extensive scientific knowledge and robust code generation\ncapabilities of Large Language Models (LLMs) to discover scientific equations\nfrom data in an efficient manner. Specifically, LLM-SR treats equations as\nprograms with mathematical operators and combines LLMs' scientific priors with\nevolutionary search over equation programs. The LLM iteratively proposes new\nequation skeleton hypotheses, drawing from its physical understanding, which\nare then optimized against data to estimate skeleton parameters. We demonstrate\nLLM-SR's effectiveness across three diverse scientific domains, where it\ndiscovers physically accurate equations that provide significantly better fits\nto in-domain and out-of-domain data compared to the well-established symbolic\nregression baselines. Incorporating scientific prior knowledge also enables\nLLM-SR to search the equation space more efficiently than baselines. Code is\navailable at: https://github.com/deep-symbolic-mathematics/LLM-SR\n","authors":["Parshin Shojaee","Kazem Meidani","Shashank Gupta","Amir Barati Farimani","Chandan K Reddy"],"pdf_url":"https://arxiv.org/pdf/2404.18400v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17394v2","updated":"2024-06-02T19:43:55Z","published":"2024-05-27T17:46:57Z","title":"The Expressive Capacity of State Space Models: A Formal Language\n Perspective","summary":" Recently, recurrent models based on linear state space models (SSMs) have\nshown promising performance in language modeling (LM), competititve with\ntransformers. However, there is little understanding of the in-principle\nabilities of such models, which could provide useful guidance to the search for\nbetter LM architectures. We present a comprehensive theoretical study of the\ncapacity of such SSMs as it compares to that of transformers and traditional\nRNNs. We find that SSMs and transformers have overlapping but distinct\nstrengths. In star-free state tracking, SSMs implement straightforward and\nexact solutions to problems that transformers struggle to represent exactly.\nThey can also model bounded hierarchical structure with optimal memory even\nwithout simulating a stack. On the other hand, we identify a design choice in\ncurrent SSMs that limits their expressive power. We discuss implications for\nSSM and LM research, and verify results empirically on a recent SSM, Mamba.\n","authors":["Yash Sarrof","Yana Veitsman","Michael Hahn"],"pdf_url":"https://arxiv.org/pdf/2405.17394v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15051v2","updated":"2024-06-02T19:17:34Z","published":"2023-05-24T11:41:33Z","title":"A Monte Carlo Language Model Pipeline for Zero-Shot Sociopolitical Event\n Extraction","summary":" Current social science efforts automatically populate event databases of \"who\ndid what to whom?\" tuples, by applying event extraction (EE) to text such as\nnews. The event databases are used to analyze sociopolitical dynamics between\nactor pairs (dyads) in, e.g., international relations. While most EE methods\nheavily rely on rules or supervised learning, \\emph{zero-shot} event extraction\ncould potentially allow researchers to flexibly specify arbitrary event classes\nfor new research questions. Unfortunately, we find that current zero-shot EE\nmethods, as well as a naive zero-shot approach of simple generative language\nmodel (LM) prompting, perform poorly for dyadic event extraction; most suffer\nfrom word sense ambiguity, modality sensitivity, and computational\ninefficiency. We address these challenges with a new fine-grained, multi-stage\ninstruction-following generative LM pipeline, proposing a Monte Carlo approach\nto deal with, and even take advantage of, nondeterminism of generative outputs.\nOur pipeline includes explicit stages of linguistic analysis (synonym\ngeneration, contextual disambiguation, argument realization, event modality),\n\\textit{improving control and interpretability} compared to purely neural\nmethods. This method outperforms other zero-shot EE approaches, and outperforms\nnaive applications of generative LMs by at least 17 F1 percent points. The\npipeline's filtering mechanism greatly improves computational efficiency,\nallowing it to perform as few as 12% of queries that a previous zero-shot\nmethod uses. Finally, we demonstrate our pipeline's application to dyadic\ninternational relations analysis.\n","authors":["Erica Cai","Brendan O'Connor"],"pdf_url":"https://arxiv.org/pdf/2305.15051v2.pdf","comment":"Accepted at NeurIPS 2023 Workshop on Instruction Tuning and\n Instruction Following; oral presentation at New England Natural Language\n Processing, 2023; 17 pages of text including references and appendix"},{"id":"http://arxiv.org/abs/2308.00264v4","updated":"2024-06-02T19:12:57Z","published":"2023-08-01T03:54:27Z","title":"Multimodal Multi-loss Fusion Network for Sentiment Analysis","summary":" This paper investigates the optimal selection and fusion of feature encoders\nacross multiple modalities and combines these in one neural network to improve\nsentiment detection. We compare different fusion methods and examine the impact\nof multi-loss training within the multi-modality fusion network, identifying\nsurprisingly important findings relating to subnet performance. We have also\nfound that integrating context significantly enhances model performance. Our\nbest model achieves state-of-the-art performance for three datasets (CMU-MOSI,\nCMU-MOSEI and CH-SIMS). These results suggest a roadmap toward an optimized\nfeature selection and fusion approach for enhancing sentiment detection in\nneural networks.\n","authors":["Zehui Wu","Ziwei Gong","Jaywon Koo","Julia Hirschberg"],"pdf_url":"https://arxiv.org/pdf/2308.00264v4.pdf","comment":"First two authors contributed equally to the paper"},{"id":"http://arxiv.org/abs/2405.12933v2","updated":"2024-06-02T18:48:56Z","published":"2024-05-21T17:04:44Z","title":"Skin-in-the-Game: Decision Making via Multi-Stakeholder Alignment in\n LLMs","summary":" Large Language Models (LLMs) have shown remarkable capabilities in tasks such\nas summarization, arithmetic reasoning, and question answering. However, they\nencounter significant challenges in the domain of moral reasoning and ethical\ndecision-making, especially in complex scenarios with multiple stakeholders.\nThis paper introduces the Skin-in-the-Game (SKIG) framework, aimed at enhancing\nmoral reasoning in LLMs by exploring decisions' consequences from multiple\nstakeholder perspectives. Central to SKIG's mechanism is simulating\naccountability for actions, which, alongside empathy exercises and risk\nassessment, is pivotal to its effectiveness. We validate SKIG's performance\nacross various moral reasoning benchmarks with proprietary and opensource LLMs,\nand investigate its crucial components through extensive ablation analyses.\n","authors":["Bilgehan Sel","Priya Shanmugasundaram","Mohammad Kachuee","Kun Zhou","Ruoxi Jia","Ming Jin"],"pdf_url":"https://arxiv.org/pdf/2405.12933v2.pdf","comment":"ACL 2024, long paper"},{"id":"http://arxiv.org/abs/2405.05189v2","updated":"2024-06-02T18:47:44Z","published":"2024-05-08T16:25:42Z","title":"MIDGARD: Self-Consistency Using Minimum Description Length for\n Structured Commonsense Reasoning","summary":" We study the task of conducting structured reasoning as generating a\nreasoning graph from natural language input using large language models (LLMs).\nPrevious approaches have explored various prompting schemes, yet they suffer\nfrom error propagation due to the autoregressive nature and single-pass-based\ndecoding, which lack error correction capability. Additionally, relying solely\non a single sample may result in the omission of true nodes and edges. To\ncounter this, we draw inspiration from self-consistency (SC), which involves\nsampling a diverse set of reasoning chains and taking the majority vote as the\nfinal answer. To tackle the substantial challenge of applying SC on generated\ngraphs, we propose MIDGARD (MInimum Description length Guided Aggregation of\nReasoning in Directed acyclic graph) that leverages Minimum Description Length\n(MDL)-based formulation to identify consistent properties among the different\ngraph samples generated by an LLM. This formulation helps reject properties\nthat appear in only a few samples, which are likely to be erroneous, while\nenabling the inclusion of missing elements without compromising precision. Our\nmethod demonstrates superior performance than comparisons across various\nstructured reasoning tasks, including argument structure extraction,\nexplanation graph generation, inferring dependency relations among actions for\neveryday tasks, and semantic graph generation from natural texts.\n","authors":["Inderjeet Nair","Lu Wang"],"pdf_url":"https://arxiv.org/pdf/2405.05189v2.pdf","comment":"Accepted at ACL 2024(main)"},{"id":"http://arxiv.org/abs/2402.11073v3","updated":"2024-06-02T18:35:25Z","published":"2024-02-16T20:59:57Z","title":"AFaCTA: Assisting the Annotation of Factual Claim Detection with\n Reliable LLM Annotators","summary":" With the rise of generative AI, automated fact-checking methods to combat\nmisinformation are becoming more and more important. However, factual claim\ndetection, the first step in a fact-checking pipeline, suffers from two key\nissues that limit its scalability and generalizability: (1) inconsistency in\ndefinitions of the task and what a claim is, and (2) the high cost of manual\nannotation. To address (1), we review the definitions in related work and\npropose a unifying definition of factual claims that focuses on verifiability.\nTo address (2), we introduce AFaCTA (Automatic Factual Claim deTection\nAnnotator), a novel framework that assists in the annotation of factual claims\nwith the help of large language models (LLMs). AFaCTA calibrates its annotation\nconfidence with consistency along three predefined reasoning paths. Extensive\nevaluation and experiments in the domain of political speech reveal that AFaCTA\ncan efficiently assist experts in annotating factual claims and training\nhigh-quality classifiers, and can work with or without expert supervision. Our\nanalyses also result in PoliClaim, a comprehensive claim detection dataset\nspanning diverse political topics.\n","authors":["Jingwei Ni","Minjing Shi","Dominik Stammbach","Mrinmaya Sachan","Elliott Ash","Markus Leippold"],"pdf_url":"https://arxiv.org/pdf/2402.11073v3.pdf","comment":"ACL2024 Main Conference"},{"id":"http://arxiv.org/abs/2205.04355v2","updated":"2024-06-02T16:40:26Z","published":"2022-05-09T14:58:34Z","title":"XSTEM: An exemplar-based stemming algorithm","summary":" Stemming is the process of reducing related words to a standard form by\nremoving affixes from them. Existing algorithms vary with respect to their\ncomplexity, configurability, handling of unknown words, and ability to avoid\nunder- and over-stemming. This paper presents a fast, simple, configurable,\nhigh-precision, high-recall stemming algorithm that combines the simplicity and\nperformance of word-based lookup tables with the strong generalizability of\nrule-based methods to avert problems with out-of-vocabulary words.\n","authors":["Kirk Baker"],"pdf_url":"https://arxiv.org/pdf/2205.04355v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14259v3","updated":"2024-06-02T16:30:00Z","published":"2024-05-23T07:39:42Z","title":"Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with\n LLMs for Multi-modal Text Recognition","summary":" We introduce \"Generative Fusion Decoding\" (GFD), a novel shallow fusion\nframework, utilized to integrate Large Language Models (LLMs) into multi-modal\ntext recognition systems such as automatic speech recognition (ASR) and optical\ncharacter recognition (OCR). We derive the formulas necessary to enable GFD to\noperate across mismatched token spaces of different models by mapping text\ntoken space to byte token space, enabling seamless fusion during the decoding\nprocess. The framework is plug-and-play, compatible with various\nauto-regressive models, and does not require re-training for feature alignment,\nthus overcoming limitations of previous fusion techniques. We highlight three\nmain advantages of GFD: First, by simplifying the complexity of aligning\ndifferent model sample spaces, GFD allows LLMs to correct errors in tandem with\nthe recognition model, reducing computation latencies. Second, the in-context\nlearning ability of LLMs is fully capitalized by GFD, increasing robustness in\nlong-form speech recognition and instruction aware speech recognition. Third,\nGFD enables fusing recognition models deficient in Chinese text recognition\nwith LLMs extensively trained on Chinese. Our evaluation demonstrates that GFD\nsignificantly improves performance in ASR and OCR tasks, with ASR reaching\nstate-of-the-art in the NTUML2021 benchmark. GFD provides a significant step\nforward in model integration, offering a unified solution that could be widely\napplicable to leveraging existing pre-trained models through step by step\nfusion.\n","authors":["Chan-Jan Hsu","Yi-Chang Chen","Feng-Ting Liao","Pei-Chen Ho","Yu-Hsiang Wang","Po-Chun Hsu","Da-shan Shiu"],"pdf_url":"https://arxiv.org/pdf/2405.14259v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.11999v3","updated":"2024-06-02T16:21:59Z","published":"2024-04-18T08:49:38Z","title":"Token-level Direct Preference Optimization","summary":" Fine-tuning pre-trained Large Language Models (LLMs) is essential to align\nthem with human values and intentions. This process often utilizes methods like\npairwise comparisons and KL divergence against a reference LLM, focusing on the\nevaluation of full answers generated by the models. However, the generation of\nthese responses occurs in a token level, following a sequential,\nauto-regressive fashion. In this paper, we introduce Token-level Direct\nPreference Optimization (TDPO), a novel approach to align LLMs with human\npreferences by optimizing policy at the token level. Unlike previous methods,\nwhich face challenges in divergence efficiency, TDPO incorporates forward KL\ndivergence constraints for each token, improving alignment and diversity.\nUtilizing the Bradley-Terry model for a token-based reward system, TDPO\nenhances the regulation of KL divergence, while preserving simplicity without\nthe need for explicit reward modeling. Experimental results across various text\ntasks demonstrate TDPO's superior performance in balancing alignment with\ngeneration diversity. Notably, fine-tuning with TDPO strikes a better balance\nthan DPO in the controlled sentiment generation and single-turn dialogue\ndatasets, and significantly improves the quality of generated responses\ncompared to both DPO and PPO-based RLHF methods. Our code is open-sourced at\nhttps://github.com/Vance0124/Token-level-Direct-Preference-Optimization.\n","authors":["Yongcheng Zeng","Guoqing Liu","Weiyu Ma","Ning Yang","Haifeng Zhang","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2404.11999v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.10379v3","updated":"2024-06-02T16:01:35Z","published":"2023-08-20T22:36:23Z","title":"Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language\n Models","summary":" Current literature, aiming to surpass the \"Chain-of-Thought\" approach, often\nresorts to external modi operandi involving halting, modifying, and then\nresuming the generation process to boost Large Language Models' (LLMs)\nreasoning capacities. Due to their myopic perspective, they escalate the number\nof query requests, leading to increased costs, memory, and computational\noverheads. Addressing this, we propose the Algorithm of Thoughts -- a novel\nstrategy that propels LLMs through algorithmic reasoning pathways. By employing\nalgorithmic examples fully in-context, this overarching view of the whole\nprocess exploits the innate recurrence dynamics of LLMs, expanding their idea\nexploration with merely one or a few queries. Our technique outperforms earlier\nsingle-query methods and even more recent multi-query strategies that employ an\nextensive tree search algorithms while using significantly fewer tokens.\nIntriguingly, our results suggest that instructing an LLM using an algorithm\ncan lead to performance surpassing that of the algorithm itself, hinting at\nLLM's inherent ability to weave its intuition into optimized searches. We probe\ninto the underpinnings of our method's efficacy and its nuances in application.\nThe code and related content can be found in:\nhttps://algorithm-of-thoughts.github.io.\n","authors":["Bilgehan Sel","Ahmad Al-Tawaha","Vanshaj Khattar","Ruoxi Jia","Ming Jin"],"pdf_url":"https://arxiv.org/pdf/2308.10379v3.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2402.04601v2","updated":"2024-06-02T15:50:40Z","published":"2024-02-07T05:56:54Z","title":"Alirector: Alignment-Enhanced Chinese Grammatical Error Corrector","summary":" Chinese grammatical error correction (CGEC) faces serious overcorrection\nchallenges when employing autoregressive generative models such as\nsequence-to-sequence (Seq2Seq) models and decoder-only large language models\n(LLMs). While previous methods aim to address overcorrection in Seq2Seq models,\nthey are difficult to adapt to decoder-only LLMs. In this paper, we propose an\nalignment-enhanced corrector for the overcorrection problem that applies to\nboth Seq2Seq models and decoder-only LLMs. Our method first trains a correction\nmodel to generate an initial correction of the source sentence. Then, we\ncombine the source sentence with the initial correction and feed it through an\nalignment model for another round of correction, aiming to enforce the\nalignment model to focus on potential overcorrection. Moreover, to enhance the\nmodel's ability to identify nuances, we further explore the reverse alignment\nof the source sentence and the initial correction. Finally, we transfer the\nalignment knowledge from two alignment models to the correction model,\ninstructing it on how to avoid overcorrection. Experimental results on three\nCGEC datasets demonstrate the effectiveness of our approach in alleviating\novercorrection and improving overall performance. Our code has been made\npublicly available.\n","authors":["Haihui Yang","Xiaojun Quan"],"pdf_url":"https://arxiv.org/pdf/2402.04601v2.pdf","comment":"Accepted to Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2403.00231v3","updated":"2024-06-02T15:47:16Z","published":"2024-03-01T02:21:30Z","title":"Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of\n Large Vision-Language Models","summary":" Large vision-language models (LVLMs) excel across diverse tasks involving\nconcrete images from natural scenes. However, their ability to interpret\nabstract figures, such as geometry shapes and scientific plots, remains limited\ndue to a scarcity of training datasets in scientific domains. To fill this gap,\nwe introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for\nenhancing LVLMs scientific comprehension. ArXivCap is a figure-caption dataset\ncomprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers\nspanning various scientific domains. Drawing from ArXivCap, we introduce\nArXivQA, a question-answering dataset generated by prompting GPT-4V based on\nscientific figures. ArXivQA greatly enhances open-sourced LVLMs' mathematical\nreasoning capabilities, achieving a 10.4\\% absolute accuracy gain on a\nmultimodal mathematical reasoning benchmark. Furthermore, employing ArXivCap,\nwe devise four vision-to-text tasks for benchmarking LVLMs. Evaluation results\nwith state-of-the-art LVLMs underscore their struggle with the nuanced\nsemantics of academic figures, while domain-specific training yields\nsubstantial performance gains. Our error analysis uncovers misinterpretations\nof visual context, recognition errors, and the production of overly simplified\ncaptions by current LVLMs, shedding light on future improvements.\n","authors":["Lei Li","Yuqi Wang","Runxin Xu","Peiyi Wang","Xiachong Feng","Lingpeng Kong","Qi Liu"],"pdf_url":"https://arxiv.org/pdf/2403.00231v3.pdf","comment":"Project page: https://mm-arxiv.github.io, Camera Ready Version of ACL\n 2024"},{"id":"http://arxiv.org/abs/2405.19076v2","updated":"2024-06-02T15:03:24Z","published":"2024-05-29T13:34:32Z","title":"Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials\n Analysis and Design","summary":" We present Cephalo, a series of multimodal vision large language models\n(V-LLMs) designed for materials science applications, integrating visual and\nlinguistic data for enhanced understanding and interaction within human-AI and\nmulti-agent AI frameworks. A key innovation of Cephalo is its advanced dataset\ngeneration method, which employs a sophisticated algorithm to accurately detect\nand separate images and their corresponding textual descriptions from PDF\ndocuments, such as scientific papers. The method includes a careful refinement\nof image-text pairs through integrated vision and language processing, ensuring\nhigh-quality, contextually relevant, and well reasoned training data. Cephalo\nis trained on integrated image and text data extracted from thousands of\nscientific papers and science-focused Wikipedia pages demonstrates can\ninterpret complex visual scenes, generate precise language descriptions, and\nanswer queries about images effectively. The combination of a vision encoder\nwith an autoregressive transformer supports complex natural language\nunderstanding in an integrated model, which can be coupled with other\ngenerative methods to create an image-to-text-to-image or image-to-text-to-3D\npipeline. To explore the development of larger models from smaller ones, we\nreport both mixture-of-expert methods and model merging. These hybrid\napproaches allow us to leverage the domain-specific expertise and general\nconversational capabilities to harness the strengths of multiple models. We\nexamine the models in diverse use cases that incorporate biological materials,\nfracture and engineering analysis, protein biophysics, and bio-inspired design\nbased on insect behavior. Generative applications include bio-inspired designs,\nincluding pollen-inspired architected materials, as well as the synthesis of\nbio-inspired material microstructures from a photograph of a solar eclipse.\n","authors":["Markus J. Buehler"],"pdf_url":"https://arxiv.org/pdf/2405.19076v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18628v2","updated":"2024-06-02T14:58:48Z","published":"2024-05-28T22:19:30Z","title":"Hardware-Aware Parallel Prompt Decoding for Memory-Efficient\n Acceleration of LLM Inference","summary":" The auto-regressive decoding of Large Language Models (LLMs) results in\nsignificant overheads in their hardware performance. While recent research has\ninvestigated various speculative decoding techniques for multi-token\ngeneration, these efforts have primarily focused on improving processing speed\nsuch as throughput. Crucially, they often neglect other metrics essential for\nreal-life deployments, such as memory consumption and training cost. To\novercome these limitations, we propose a novel parallel prompt decoding that\nrequires only $0.0002$% trainable parameters, enabling efficient training on a\nsingle A100-40GB GPU in just 16 hours. Inspired by the human natural language\ngeneration process, $PPD$ approximates outputs generated at future timesteps in\nparallel by using multiple prompt tokens. This approach partially recovers the\nmissing conditional dependency information necessary for multi-token\ngeneration, resulting in up to a 28% higher acceptance rate for long-range\npredictions. Furthermore, we present a hardware-aware dynamic sparse tree\ntechnique that adaptively optimizes this decoding scheme to fully leverage the\ncomputational capacities on different GPUs. Through extensive experiments\nacross LLMs ranging from MobileLlama to Vicuna-13B on a wide range of\nbenchmarks, our approach demonstrates up to 2.49$\\times$ speedup and maintains\na minimal runtime memory overhead of just $0.0004$%. More importantly, our\nparallel prompt decoding can serve as an orthogonal optimization for\nsynergistic integration with existing speculative decoding, showing up to\n$1.22\\times$ further speed improvement. Our code is available at\nhttps://github.com/hmarkc/parallel-prompt-decoding.\n","authors":["Hao Mark Chen","Wayne Luk","Ka Fai Cedric Yiu","Rui Li","Konstantin Mishchenko","Stylianos I. Venieris","Hongxiang Fan"],"pdf_url":"https://arxiv.org/pdf/2405.18628v2.pdf","comment":"The code for this implementation is available at\n https://github.com/hmarkc/parallel-prompt-decoding"},{"id":"http://arxiv.org/abs/2402.13113v2","updated":"2024-06-02T14:48:13Z","published":"2024-02-20T16:09:49Z","title":"When Only Time Will Tell: Interpreting How Transformers Process Local\n Ambiguities Through the Lens of Restart-Incrementality","summary":" Incremental models that process sentences one token at a time will sometimes\nencounter points where more than one interpretation is possible. Causal models\nare forced to output one interpretation and continue, whereas models that can\nrevise may edit their previous output as the ambiguity is resolved. In this\nwork, we look at how restart-incremental Transformers build and update internal\nstates, in an effort to shed light on what processes cause revisions not viable\nin autoregressive models. We propose an interpretable way to analyse the\nincremental states, showing that their sequential structure encodes information\non the garden path effect and its resolution. Our method brings insights on\nvarious bidirectional encoders for contextualised meaning representation and\ndependency parsing, contributing to show their advantage over causal models\nwhen it comes to revisions.\n","authors":["Brielen Madureira","Patrick Kahardipraja","David Schlangen"],"pdf_url":"https://arxiv.org/pdf/2402.13113v2.pdf","comment":"Accepted to ACL 2024"},{"id":"http://arxiv.org/abs/2402.10659v3","updated":"2024-06-02T13:50:14Z","published":"2024-02-16T13:10:14Z","title":"Network Formation and Dynamics Among Multi-LLMs","summary":" Social networks shape opinions, behaviors, and information dissemination in\nhuman societies. As large language models (LLMs) increasingly integrate into\nsocial and professional environments, understanding their behavior within the\ncontext of social interactions and networks becomes essential. Our study\nanalyzes LLMs' network formation behavior to examine whether the dynamics of\nmultiple LLMs are similar to or different from human social dynamics. We\nobserve that LLMs exhibit key social network principles, including preferential\nattachment, triadic closure, homophily, community structure, and the\nsmall-world phenomenon, when asked about their preferences in network\nformation. We also investigate LLMs' decision-making based on real-world\nnetworks, revealing that triadic closure and homophily have a stronger\ninfluence than preferential attachment and that LLMs perform well in network\nformation predictions. Overall, our study opens up new possibilities for using\nLLMs in network science research and helps develop socially aware LLMs by\nshedding light on their social interaction behaviors and exploring their\nimpacts on social dynamics.\n","authors":["Marios Papachristou","Yuan Yuan"],"pdf_url":"https://arxiv.org/pdf/2402.10659v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.15799v2","updated":"2024-06-02T13:49:32Z","published":"2023-06-27T20:58:41Z","title":"FLuRKA: Fast and accurate unified Low-Rank & Kernel Attention","summary":" Many efficient $\\textit{approximate}$ self-attention techniques have become\nprevalent since the inception of the transformer architecture. Two popular\nclasses of these techniques are low-rank and kernel methods. Each of these\nmethods has its strengths. We observe these strengths synergistically\ncomplement each other and exploit them to fuse low-rank and kernel methods,\nproducing a new class of transformers: FLuRKA ($\\textbf{F}$ast\n$\\textbf{L}$ow-$\\textbf{R}$ank & $\\textbf{K}$ernel$ \\textbf{A}$ttention).\nFLuRKA are highly $\\textit{training-efficient}$ with faster model speeds\n$\\textit{and}$ similar model qualities compared to constituent low-rank and\nkernel methods. We theoretically and empirically evaluate the speed and quality\nof FLuRKA. Our model speed analysis posits a variety of parameter\nconfigurations where FLuRKA exhibit speedups over low-rank and kernel\napproximations and our model quality analysis bounds the error of FLuRKA with\nrespect to full-attention. Empirically, we instantiate three FLuRKA variants\nwhich experience speedups of up to 3.3x and 1.7x over low-rank and kernel\nmethods respectively. This translates to speedups of up to 20x over models with\nflash-attention. Across a diverse set of tasks spanning language modeling,\nlanguage understanding, long sequence modeling, machine translation, and image\nclassification, FLuRKA achieve comparable accuracy with underlying low-rank and\nkernel approximations, occasionally surpassing both.\n","authors":["Ahan Gupta","Hao Guo","Yueming Yuan","Yanqi Zhou","Charith Mendis"],"pdf_url":"https://arxiv.org/pdf/2306.15799v2.pdf","comment":"21 pages, 5 figures"},{"id":"http://arxiv.org/abs/2403.00862v3","updated":"2024-06-02T13:38:01Z","published":"2024-02-29T21:05:14Z","title":"NewsBench: A Systematic Evaluation Framework for Assessing Editorial\n Capabilities of Large Language Models in Chinese Journalism","summary":" We present NewsBench, a novel evaluation framework to systematically assess\nthe capabilities of Large Language Models (LLMs) for editorial capabilities in\nChinese journalism. Our constructed benchmark dataset is focused on four facets\nof writing proficiency and six facets of safety adherence, and it comprises\nmanually and carefully designed 1,267 test samples in the types of multiple\nchoice questions and short answer questions for five editorial tasks in 24 news\ndomains. To measure performances, we propose different GPT-4 based automatic\nevaluation protocols to assess LLM generations for short answer questions in\nterms of writing proficiency and safety adherence, and both are validated by\nthe high correlations with human evaluations. Based on the systematic\nevaluation framework, we conduct a comprehensive analysis of ten popular LLMs\nwhich can handle Chinese. The experimental results highlight GPT-4 and ERNIE\nBot as top performers, yet reveal a relative deficiency in journalistic safety\nadherence in creative writing tasks. Our findings also underscore the need for\nenhanced ethical guidance in machine-generated journalistic content, marking a\nstep forward in aligning LLMs with journalistic standards and safety\nconsiderations.\n","authors":["Miao Li","Ming-Bin Chen","Bo Tang","Shengbin Hou","Pengyu Wang","Haiying Deng","Zhiyu Li","Feiyu Xiong","Keming Mao","Peng Cheng","Yi Luo"],"pdf_url":"https://arxiv.org/pdf/2403.00862v3.pdf","comment":"Long paper, ACL 2024 Main"},{"id":"http://arxiv.org/abs/2402.11941v3","updated":"2024-06-02T13:25:05Z","published":"2024-02-19T08:29:03Z","title":"CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI\n Automation","summary":" Multimodal large language models (MLLMs) have shown remarkable potential as\nhuman-like autonomous language agents to interact with real-world environments,\nespecially for graphical user interface (GUI) automation. However, those GUI\nagents require comprehensive cognition ability including exhaustive perception\nand reliable action response. We propose a Comprehensive Cognitive LLM Agent,\nCoCo-Agent, with two novel approaches, comprehensive environment perception\n(CEP) and conditional action prediction (CAP), to systematically improve the\nGUI automation performance. First, CEP facilitates the GUI perception through\ndifferent aspects and granularity, including screenshots and complementary\ndetailed layouts for the visual channel and historical actions for the textual\nchannel. Second, CAP decomposes the action prediction into sub-problems: action\ntype prediction and action target conditioned on the action type. With our\ntechnical design, our agent achieves new state-of-the-art performance on AITW\nand META-GUI benchmarks, showing promising abilities in realistic scenarios.\nCode is available at https://github.com/xbmxb/CoCo-Agent.\n","authors":["Xinbei Ma","Zhuosheng Zhang","Hai Zhao"],"pdf_url":"https://arxiv.org/pdf/2402.11941v3.pdf","comment":"ACL'2024 Findings"},{"id":"http://arxiv.org/abs/2402.17959v2","updated":"2024-06-02T10:46:13Z","published":"2024-02-28T00:49:06Z","title":"An Iterative Associative Memory Model for Empathetic Response Generation","summary":" Empathetic response generation aims to comprehend the cognitive and emotional\nstates in dialogue utterances and generate proper responses. Psychological\ntheories posit that comprehending emotional and cognitive states necessitates\niteratively capturing and understanding associated words across dialogue\nutterances. However, existing approaches regard dialogue utterances as either a\nlong sequence or independent utterances for comprehension, which are prone to\noverlook the associated words between them. To address this issue, we propose\nan Iterative Associative Memory Model (IAMM) for empathetic response\ngeneration. Specifically, we employ a novel second-order interaction attention\nmechanism to iteratively capture vital associated words between dialogue\nutterances and situations, dialogue history, and a memory module (for storing\nassociated words), thereby accurately and nuancedly comprehending the\nutterances. We conduct experiments on the Empathetic-Dialogue dataset. Both\nautomatic and human evaluations validate the efficacy of the model. Variant\nexperiments on LLMs also demonstrate that attending to associated words\nimproves empathetic comprehension and expression.\n","authors":["Zhou Yang","Zhaochun Ren","Yufeng Wang","Chao Chen","Haizhou Sun","Xiaofei Zhu","Xiangwen Liao"],"pdf_url":"https://arxiv.org/pdf/2402.17959v2.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2402.12026v3","updated":"2024-06-02T10:28:13Z","published":"2024-02-19T10:34:48Z","title":"Acquiring Clean Language Models from Backdoor Poisoned Datasets by\n Downscaling Frequency Space","summary":" Despite the notable success of language models (LMs) in various natural\nlanguage processing (NLP) tasks, the reliability of LMs is susceptible to\nbackdoor attacks. Prior research attempts to mitigate backdoor learning while\ntraining the LMs on the poisoned dataset, yet struggles against complex\nbackdoor attacks in real-world scenarios. In this paper, we investigate the\nlearning mechanisms of backdoor LMs in the frequency space by Fourier analysis.\nOur findings indicate that the backdoor mapping presented on the poisoned\ndatasets exhibits a more discernible inclination towards lower frequency\ncompared to clean mapping, resulting in the faster convergence of backdoor\nmapping. To alleviate this dilemma, we propose Multi-Scale Low-Rank Adaptation\n(MuScleLoRA), which deploys multiple radial scalings in the frequency space\nwith low-rank adaptation to the target model and further aligns the gradients\nwhen updating parameters. Through downscaling in the frequency space,\nMuScleLoRA encourages the model to prioritize the learning of relatively\nhigh-frequency clean mapping, consequently mitigating backdoor learning.\nExperimental results demonstrate that MuScleLoRA outperforms baselines\nsignificantly. Notably, MuScleLoRA reduces the average success rate of diverse\nbackdoor attacks to below 15\\% across multiple datasets and generalizes to\nvarious backbone LMs, including BERT, RoBERTa, GPT2-XL, and Llama2. The codes\nare publicly available at https://github.com/ZrW00/MuScleLoRA.\n","authors":["Zongru Wu","Zhuosheng Zhang","Pengzhou Cheng","Gongshen Liu"],"pdf_url":"https://arxiv.org/pdf/2402.12026v3.pdf","comment":"Accepted at ACL 2024 (Long Paper. Main Conference)"},{"id":"http://arxiv.org/abs/2308.08758v3","updated":"2024-06-02T10:09:01Z","published":"2023-08-17T03:10:17Z","title":"Discrete Prompt Compression with Reinforcement Learning","summary":" Compressed prompts aid instruction-tuned language models (LMs) in overcoming\ncontext window limitations and reducing computational costs. Existing methods,\nwhich primarily based on training embeddings, face various challenges\nassociated with interpretability, the fixed number of embedding tokens,\nreusability across different LMs, and inapplicability when interacting with\nblack-box APIs. This study proposes prompt compression with reinforcement\nlearning (PCRL), which is a discrete prompt compression method that addresses\nthese issues. The proposed PCRL method utilizes a computationally efficient\npolicy network that edits prompts directly. The training approach employed in\nthe proposed PCRLs can be applied flexibly to various types of LMs, including\nboth decoder-only and encoder-decoder architecture and it can be trained\nwithout gradient access to the LMs or labeled data. The proposed PCRL achieves\nan average reduction of 24.6% in terms of the token count across various\ninstruction prompts while maintaining sufficient performance. In addition, we\ndemonstrate that the learned policy can be transferred to larger LMs, and\nthrough a comprehensive analysis, we explore the token importance within the\nprompts. Our code is accessible at\nhttps://github.com/nenomigami/PromptCompressor.\n","authors":["Hoyoun Jung","Kyung-Joong Kim"],"pdf_url":"https://arxiv.org/pdf/2308.08758v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.13963v4","updated":"2024-06-02T10:02:00Z","published":"2024-02-21T17:47:20Z","title":"Towards Building Multilingual Language Model for Medicine","summary":" The development of open-source, multilingual medical language models can\nbenefit a wide, linguistically diverse audience from different regions. To\npromote this domain, we present contributions from the following: First, we\nconstruct a multilingual medical corpus, containing approximately 25.5B tokens\nencompassing 6 main languages, termed as MMedC, enabling auto-regressive domain\nadaptation for general LLMs; Second, to monitor the development of multilingual\nmedical LLMs, we propose a multilingual medical multi-choice question-answering\nbenchmark with rationale, termed as MMedBench; Third, we have assessed a number\nof open-source large language models (LLMs) on our benchmark, along with those\nfurther auto-regressive trained on MMedC. Our final model, MMed-Llama 3, with\nonly 8B parameters, achieves superior performance compared to all other\nopen-source models on both MMedBench and English benchmarks, even rivaling\nGPT-4. In conclusion, in this work, we present a large-scale corpus, a\nbenchmark and a series of models to support the development of multilingual\nmedical LLMs.\n","authors":["Pengcheng Qiu","Chaoyi Wu","Xiaoman Zhang","Weixiong Lin","Haicheng Wang","Ya Zhang","Yanfeng Wang","Weidi Xie"],"pdf_url":"https://arxiv.org/pdf/2402.13963v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.11900v2","updated":"2024-06-02T09:17:37Z","published":"2024-02-19T07:34:10Z","title":"Investigating Multi-Hop Factual Shortcuts in Knowledge Editing of Large\n Language Models","summary":" Recent work has showcased the powerful capability of large language models\n(LLMs) in recalling knowledge and reasoning. However, the reliability of LLMs\nin combining these two capabilities into reasoning through multi-hop facts has\nnot been widely explored. This paper systematically investigates the\npossibilities for LLMs to utilize shortcuts based on direct connections between\nthe initial and terminal entities of multi-hop knowledge. We first explore the\nexistence of factual shortcuts through Knowledge Neurons, revealing that: (i)\nthe strength of factual shortcuts is highly correlated with the frequency of\nco-occurrence of initial and terminal entities in the pre-training corpora;\n(ii) few-shot prompting leverage more shortcuts in answering multi-hop\nquestions compared to chain-of-thought prompting. Then, we analyze the risks\nposed by factual shortcuts from the perspective of multi-hop knowledge editing.\nAnalysis shows that approximately 20% of the failures are attributed to\nshortcuts, and the initial and terminal entities in these failure instances\nusually have higher co-occurrences in the pre-training corpus. Finally, we\npropose erasing shortcut neurons to mitigate the associated risks and find that\nthis approach significantly reduces failures in multiple-hop knowledge editing\ncaused by shortcuts.\n","authors":["Tianjie Ju","Yijin Chen","Xinwei Yuan","Zhuosheng Zhang","Wei Du","Yubin Zheng","Gongshen Liu"],"pdf_url":"https://arxiv.org/pdf/2402.11900v2.pdf","comment":"Accepted at ACL 2024 (Long Paper. Main Conference)"},{"id":"http://arxiv.org/abs/2402.15179v3","updated":"2024-06-02T09:05:31Z","published":"2024-02-23T08:21:02Z","title":"Advancing Parameter Efficiency in Fine-tuning via Representation Editing","summary":" Parameter Efficient Fine-Tuning (PEFT) techniques have drawn significant\nattention due to their ability to yield competitive results while updating only\na small portion of the adjustable parameters. However, existing PEFT methods\npose challenges in hyperparameter selection, such as choosing the rank for LoRA\nor Adapter, or specifying the length of soft prompts. To address these\nchallenges, we propose a novel fine-tuning approach for neural models, named\nRepresentation EDiting (RED), which modifies the representations generated at\nsome layers through the application of scaling and biasing operations. While\nexisting PEFT methods still demonstrate over-parameterization that could\npotentially undermine the generalization ability acquired from pre-training,\nRED can substantially reduce the number of trainable parameters by a factor of\n25, 700 compared to full parameter fine-tuning and by a factor of 32 relative\nto LoRA. Remarkably, RED achieves results comparable or superior to both full\nparameter fine-tuning and other PEFT methods. Extensive experiments across\nvarious model architectures and scales, including RoBERTa, GPT-2, T5, and\nLLaMA-2, have demonstrated the effectiveness and efficiency of RED1, thereby\npositioning it as a promising PEFT strategy for large-scale neural models.\n","authors":["Muling Wu","Wenhao Liu","Xiaohua Wang","Tianlong Li","Changze Lv","Zixuan Ling","Jianhao Zhu","Cenyuan Zhang","Xiaoqing Zheng","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2402.15179v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17653v2","updated":"2024-06-02T08:50:02Z","published":"2024-05-27T20:53:22Z","title":"InversionView: A General-Purpose Method for Reading Information from\n Neural Activations","summary":" The inner workings of neural networks can be better understood if we can\nfully decipher the information encoded in neural activations. In this paper, we\nargue that this information is embodied by the subset of inputs that give rise\nto similar activations. Computing such subsets is nontrivial as the input space\nis exponentially large. We propose InversionView, which allows us to\npractically inspect this subset by sampling from a trained decoder model\nconditioned on activations. This helps uncover the information content of\nactivation vectors, and facilitates understanding of the algorithms implemented\nby transformer models. We present three case studies where we investigate\nmodels ranging from small transformers to GPT-2. In these studies, we\ndemonstrate the characteristics of our method, show the distinctive advantages\nit offers, and provide causally verified circuits.\n","authors":["Xinting Huang","Madhur Panwar","Navin Goyal","Michael Hahn"],"pdf_url":"https://arxiv.org/pdf/2405.17653v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.10288v2","updated":"2024-06-02T08:04:26Z","published":"2024-05-16T17:48:21Z","title":"Timeline-based Sentence Decomposition with In-Context Learning for\n Temporal Fact Extraction","summary":" Facts extraction is pivotal for constructing knowledge graphs. Recently, the\nincreasing demand for temporal facts in downstream tasks has led to the\nemergence of the task of temporal fact extraction. In this paper, we\nspecifically address the extraction of temporal facts from natural language\ntext. Previous studies fail to handle the challenge of establishing\ntime-to-fact correspondences in complex sentences. To overcome this hurdle, we\npropose a timeline-based sentence decomposition strategy using large language\nmodels (LLMs) with in-context learning, ensuring a fine-grained understanding\nof the timeline associated with various facts. In addition, we evaluate the\nperformance of LLMs for direct temporal fact extraction and get unsatisfactory\nresults. To this end, we introduce TSDRE, a method that incorporates the\ndecomposition capabilities of LLMs into the traditional fine-tuning of smaller\npre-trained language models (PLMs). To support the evaluation, we construct\nComplexTRED, a complex temporal fact extraction dataset. Our experiments show\nthat TSDRE achieves state-of-the-art results on both HyperRED-Temporal and\nComplexTRED datasets.\n","authors":["Jianhao Chen","Haoyuan Ouyang","Junyang Ren","Wentao Ding","Wei Hu","Yuzhong Qu"],"pdf_url":"https://arxiv.org/pdf/2405.10288v2.pdf","comment":"Accepted to ACL2024 main conference"},{"id":"http://arxiv.org/abs/2401.17244v2","updated":"2024-06-02T07:50:21Z","published":"2024-01-30T18:37:45Z","title":"LLaMP: Large Language Model Made Powerful for High-fidelity Materials\n Knowledge Retrieval and Distillation","summary":" Reducing hallucination of Large Language Models (LLMs) is imperative for use\nin the sciences, where reliability and reproducibility are crucial. However,\nLLMs inherently lack long-term memory, making it a nontrivial, ad hoc, and\ninevitably biased task to fine-tune them on domain-specific literature and\ndata. Here we introduce LLaMP, a multimodal retrieval-augmented generation\n(RAG) framework of hierarchical reasoning-and-acting (ReAct) agents that can\ndynamically and recursively interact with computational and experimental data\non Materials Project (MP) and run atomistic simulations via high-throughput\nworkflow interface. Without fine-tuning, LLaMP demonstrates strong tool usage\nability to comprehend and integrate various modalities of materials science\nconcepts, fetch relevant data stores on the fly, process higher-order data\n(such as crystal structure and elastic tensor), and streamline complex tasks in\ncomputational materials and chemistry. We propose a simple metric combining\nuncertainty and confidence estimates to evaluate the self-consistency of\nresponses by LLaMP and vanilla LLMs. Our benchmark shows that LLaMP effectively\nmitigates the intrinsic bias in LLMs, counteracting the errors on bulk moduli,\nelectronic bandgaps, and formation energies that seem to derive from mixed data\nsources. We also demonstrate LLaMP's capability to edit crystal structures and\nrun annealing molecular dynamics simulations using pre-trained machine-learning\nforce fields. The framework offers an intuitive and nearly hallucination-free\napproach to exploring and scaling materials informatics, and establishes a\npathway for knowledge distillation and fine-tuning other language models. Code\nand live demo are available at https://github.com/chiang-yuan/llamp\n","authors":["Yuan Chiang","Elvis Hsieh","Chia-Hong Chou","Janosh Riebesell"],"pdf_url":"https://arxiv.org/pdf/2401.17244v2.pdf","comment":"31 pages, 5 figures"},{"id":"http://arxiv.org/abs/2405.19327v3","updated":"2024-06-02T06:09:49Z","published":"2024-05-29T17:57:16Z","title":"MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model\n Series","summary":" Large Language Models (LLMs) have made great strides in recent years to\nachieve unprecedented performance across different tasks. However, due to\ncommercial interest, the most competitive models like GPT, Gemini, and Claude\nhave been gated behind proprietary interfaces without disclosing the training\ndetails. Recently, many institutions have open-sourced several strong LLMs like\nLLaMA-3, comparable to existing closed-source LLMs. However, only the model's\nweights are provided with most details (e.g., intermediate checkpoints,\npre-training corpus, and training code, etc.) being undisclosed. To improve the\ntransparency of LLMs, the research community has formed to open-source truly\nopen LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training\ncorpus and training code) are being provided. These models have greatly\nadvanced the scientific study of these large models including their strengths,\nweaknesses, biases and risks. However, we observe that the existing truly open\nLLMs on reasoning, knowledge, and coding tasks are still inferior to existing\nstate-of-the-art LLMs with similar model sizes. To this end, we open-source\nMAP-Neo, a highly capable and transparent bilingual language model with 7B\nparameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the\nfirst fully open-sourced bilingual LLM with comparable performance compared to\nexisting state-of-the-art LLMs. Moreover, we open-source all details to\nreproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning\npipeline, checkpoints, and well-optimized training/evaluation framework are\nprovided. Finally, we hope our MAP-Neo will enhance and strengthen the open\nresearch community and inspire more innovations and creativities to facilitate\nthe further improvements of LLMs.\n","authors":["Ge Zhang","Scott Qu","Jiaheng Liu","Chenchen Zhang","Chenghua Lin","Chou Leuang Yu","Danny Pan","Esther Cheng","Jie Liu","Qunshu Lin","Raven Yuan","Tuney Zheng","Wei Pang","Xinrun Du","Yiming Liang","Yinghao Ma","Yizhi Li","Ziyang Ma","Bill Lin","Emmanouil Benetos","Huan Yang","Junting Zhou","Kaijing Ma","Minghao Liu","Morry Niu","Noah Wang","Quehry Que","Ruibo Liu","Sine Liu","Shawn Guo","Soren Gao","Wangchunshu Zhou","Xinyue Zhang","Yizhi Zhou","Yubo Wang","Yuelin Bai","Yuhan Zhang","Yuxiang Zhang","Zenith Wang","Zhenzhu Yang","Zijian Zhao","Jiajun Zhang","Wanli Ouyang","Wenhao Huang","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2405.19327v3.pdf","comment":"https://map-neo.github.io/"},{"id":"http://arxiv.org/abs/2402.02619v5","updated":"2024-06-02T05:56:31Z","published":"2024-02-04T21:33:18Z","title":"Increasing Trust in Language Models through the Reuse of Verified\n Circuits","summary":" Language Models (LMs) are increasingly used for a wide range of prediction\ntasks, but their training can often neglect rare edge cases, reducing their\nreliability. Here, we define a stringent standard of trustworthiness whereby\nthe task algorithm and circuit implementation must be verified, accounting for\nedge cases, with no known failure modes. We show that a transformer model can\nbe trained to meet this standard if built using mathematically and logically\nspecified frameworks. In this paper, we fully verify a model for n-digit\ninteger addition. To exhibit the reusability of verified modules, we insert the\ntrained integer addition model into an untrained model and train the combined\nmodel to perform both addition and subtraction. We find extensive reuse of the\naddition circuits for both tasks, easing verification of the more complex\nsubtractor model. We discuss how inserting verified task modules into LMs can\nleverage model reuse to improve verifiability and trustworthiness of language\nmodels built using them. The reuse of verified circuits reduces the effort to\nverify more complex composite models which we believe to be a significant step\ntowards safety of language models.\n","authors":["Philip Quirke","Clement Neo","Fazl Barez"],"pdf_url":"https://arxiv.org/pdf/2402.02619v5.pdf","comment":"8 pages, 6 figures"},{"id":"http://arxiv.org/abs/2404.03635v4","updated":"2024-06-02T04:56:32Z","published":"2024-04-04T17:54:33Z","title":"WorDepth: Variational Language Prior for Monocular Depth Estimation","summary":" Three-dimensional (3D) reconstruction from a single image is an ill-posed\nproblem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text\ndescription(s) is similarly ill-posed, i.e. spatial arrangements of objects\ndescribed. We investigate the question of whether two inherently ambiguous\nmodalities can be used in conjunction to produce metric-scaled reconstructions.\nTo test this, we focus on monocular depth estimation, the problem of predicting\na dense depth map from a single image, but with an additional text caption\ndescribing the scene. To this end, we begin by encoding the text caption as a\nmean and standard deviation; using a variational framework, we learn the\ndistribution of the plausible metric reconstructions of 3D scenes corresponding\nto the text captions as a prior. To \"select\" a specific reconstruction or depth\nmap, we encode the given image through a conditional sampler that samples from\nthe latent space of the variational text encoder, which is then decoded to the\noutput depth map. Our approach is trained alternatingly between the text and\nimage branches: in one optimization step, we predict the mean and standard\ndeviation from the text description and sample from a standard Gaussian, and in\nthe other, we sample using a (image) conditional sampler. Once trained, we\ndirectly predict depth from the encoded text using the conditional sampler. We\ndemonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where\nwe show that language can consistently improve performance in both.\n","authors":["Ziyao Zeng","Daniel Wang","Fengyu Yang","Hyoungseob Park","Yangchao Wu","Stefano Soatto","Byung-Woo Hong","Dong Lao","Alex Wong"],"pdf_url":"https://arxiv.org/pdf/2404.03635v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.01789v2","updated":"2024-06-02T04:48:36Z","published":"2024-02-02T02:43:10Z","title":"The Political Preferences of LLMs","summary":" I report here a comprehensive analysis about the political preferences\nembedded in Large Language Models (LLMs). Namely, I administer 11 political\norientation tests, designed to identify the political preferences of the test\ntaker, to 24 state-of-the-art conversational LLMs, both closed and open source.\nWhen probed with questions/statements with political connotations, most\nconversational LLMs tend to generate responses that are diagnosed by most\npolitical test instruments as manifesting preferences for left-of-center\nviewpoints. This does not appear to be the case for five additional base (i.e.\nfoundation) models upon which LLMs optimized for conversation with humans are\nbuilt. However, the weak performance of the base models at coherently answering\nthe tests' questions makes this subset of results inconclusive. Finally, I\ndemonstrate that LLMs can be steered towards specific locations in the\npolitical spectrum through Supervised Fine-Tuning (SFT) with only modest\namounts of politically aligned data, suggesting SFT's potential to embed\npolitical orientation in LLMs. With LLMs beginning to partially displace\ntraditional information sources like search engines and Wikipedia, the societal\nimplications of political biases embedded in LLMs are substantial.\n","authors":["David Rozado"],"pdf_url":"https://arxiv.org/pdf/2402.01789v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12999v4","updated":"2024-06-02T03:57:06Z","published":"2023-12-20T12:59:31Z","title":"Machine Mindset: An MBTI Exploration of Large Language Models","summary":" We present a novel approach for integrating Myers-Briggs Type Indicator\n(MBTI) personality traits into large language models (LLMs), addressing the\nchallenges of personality consistency in personalized AI. Our method, \"Machine\nMindset,\" involves a two-phase fine-tuning and Direct Preference Optimization\n(DPO) to embed MBTI traits into LLMs. This approach ensures that models\ninternalize these traits, offering a stable and consistent personality profile.\nWe demonstrate the effectiveness of our models across various domains, showing\nalignment between model performance and their respective MBTI traits. The paper\nhighlights significant contributions in the development of personality datasets\nand a new training methodology for personality integration in LLMs, enhancing\nthe potential for personalized AI applications. We also open-sourced our model\nand part of the data at \\url{https://github.com/PKU-YuanGroup/Machine-Mindset}.\n","authors":["Jiaxi Cui","Liuzhenghao Lv","Jing Wen","Rongsheng Wang","Jing Tang","YongHong Tian","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2312.12999v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.17820v4","updated":"2024-06-02T03:48:21Z","published":"2023-06-30T17:38:10Z","title":"Meta-Reasoning: Semantics-Symbol Deconstruction for Large Language\n Models","summary":" Neural-symbolic methods have demonstrated efficiency in enhancing the\nreasoning abilities of large language models (LLMs). However, existing methods\nmainly rely on syntactically mapping natural languages to complete formal\nlanguages like Python and SQL. Those methods require that reasoning tasks be\nconvertible into programs, which cater to the computer execution mindset and\ndeviate from human reasoning habits. To broaden symbolic methods' applicability\nand adaptability in the real world, we propose the Meta-Reasoning from a\nlinguistic perspective. This method empowers LLMs to deconstruct\nreasoning-independent semantic information into generic symbolic\nrepresentations, thereby efficiently capturing more generalized reasoning\nknowledge. We conduct extensive experiments on more than ten datasets\nencompassing conventional reasoning tasks like arithmetic, symbolic, and\nlogical reasoning, and the more complex interactive reasoning tasks like\ntheory-of-mind reasoning. Experimental results demonstrate that Meta-Reasoning\nsignificantly enhances in-context reasoning accuracy, learning efficiency,\nout-of-domain generalization, and output stability compared to the\nChain-of-Thought technique. Code and data are publicly available at\n\\url{https://github.com/Alsace08/Meta-Reasoning}.\n","authors":["Yiming Wang","Zhuosheng Zhang","Pei Zhang","Baosong Yang","Rui Wang"],"pdf_url":"https://arxiv.org/pdf/2306.17820v4.pdf","comment":"Accepted by ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2403.09732v4","updated":"2024-06-02T02:58:53Z","published":"2024-03-13T02:32:41Z","title":"PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with\n Cross-consistency","summary":" Recent advancements in Text-to-SQL (Text2SQL) emphasize stimulating the large\nlanguage models (LLM) on in-context learning, achieving significant results.\nNevertheless, they face challenges when dealing with verbose database\ninformation and complex user intentions. This paper presents a two-stage\nframework to enhance the performance of current LLM-based natural language to\nSQL systems. We first introduce a novel prompt representation, called\nreference-enhanced representation, which includes schema information and\nrandomly sampled cell values from tables to instruct LLMs in generating SQL\nqueries. Then, in the first stage, question-SQL pairs are retrieved as few-shot\ndemonstrations, prompting the LLM to generate a preliminary SQL (PreSQL). After\nthat, the mentioned entities in PreSQL are parsed to conduct schema linking,\nwhich can significantly compact the useful information. In the second stage,\nwith the linked schema, we simplify the prompt's schema information and\ninstruct the LLM to produce the final SQL. Finally, as the post-refinement\nmodule, we propose using cross-consistency across different LLMs rather than\nself-consistency within a particular LLM. Our methods achieve new SOTA results\non the Spider benchmark, with an execution accuracy of 87.6%.\n","authors":["Zhishuai Li","Xiang Wang","Jingjing Zhao","Sun Yang","Guoqing Du","Xiaoru Hu","Bin Zhang","Yuxiao Ye","Ziyue Li","Rui Zhao","Hangyu Mao"],"pdf_url":"https://arxiv.org/pdf/2403.09732v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.14168v2","updated":"2024-06-02T02:44:32Z","published":"2024-03-21T06:43:59Z","title":"M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual\n Academic Lecture Dataset","summary":" Publishing open-source academic video recordings is an emergent and prevalent\napproach to sharing knowledge online. Such videos carry rich multimodal\ninformation including speech, the facial and body movements of the speakers, as\nwell as the texts and pictures in the slides and possibly even the papers.\nAlthough multiple academic video datasets have been constructed and released,\nfew of them support both multimodal content recognition and understanding\ntasks, which is partially due to the lack of high-quality human annotations. In\nthis paper, we propose a novel multimodal, multigenre, and multipurpose\naudio-visual academic lecture dataset (M$^3$AV), which has almost 367 hours of\nvideos from five sources covering computer science, mathematics, and medical\nand biology topics. With high-quality human annotations of the slide text and\nspoken words, in particular high-valued name entities, the dataset can be used\nfor multiple audio-visual recognition and understanding tasks. Evaluations\nperformed on contextual speech recognition, speech synthesis, and slide and\nscript generation tasks demonstrate that the diversity of M$^3$AV makes it a\nchallenging dataset.\n","authors":["Zhe Chen","Heyang Liu","Wenyi Yu","Guangzhi Sun","Hongcheng Liu","Ji Wu","Chao Zhang","Yu Wang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2403.14168v2.pdf","comment":"ACL 2024 Main Conference. Project website:\n https://jack-zc8.github.io/M3AV-dataset-page"},{"id":"http://arxiv.org/abs/2405.19086v2","updated":"2024-06-02T02:32:31Z","published":"2024-05-29T13:49:44Z","title":"MEMoE: Enhancing Model Editing with Mixture of Experts Adaptors","summary":" Model editing aims to efficiently alter the behavior of Large Language Models\n(LLMs) within a desired scope, while ensuring no adverse impact on other\ninputs. Recent years have witnessed various model editing methods been\nproposed. However, these methods either exhibit poor overall performance or\nstruggle to strike a balance between generalization and locality. We propose\nMEMoE, a model editing adapter utilizing a Mixture of Experts (MoE)\narchitecture with a knowledge anchor routing strategy. MEMoE updates knowledge\nusing a bypass MoE structure, keeping the original parameters unchanged to\npreserve the general ability of LLMs. And, the knowledge anchor routing ensures\nthat inputs requiring similar knowledge are routed to the same expert, thereby\nenhancing the generalization of the updated knowledge. Experimental results\nshow the superiority of our approach over both batch editing and sequential\nbatch editing tasks, exhibiting exceptional overall performance alongside\noutstanding balance between generalization and locality. Our code will be\navailable.\n","authors":["Renzhi Wang","Piji Li"],"pdf_url":"https://arxiv.org/pdf/2405.19086v2.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2405.18672v2","updated":"2024-06-02T23:30:46Z","published":"2024-05-29T00:36:56Z","title":"LLM-based Hierarchical Concept Decomposition for Interpretable\n Fine-Grained Image Classification","summary":" (Renyi Qu's Master's Thesis) Recent advancements in interpretable models for\nvision-language tasks have achieved competitive performance; however, their\ninterpretability often suffers due to the reliance on unstructured text outputs\nfrom large language models (LLMs). This introduces randomness and compromises\nboth transparency and reliability, which are essential for addressing safety\nissues in AI systems. We introduce \\texttt{Hi-CoDe} (Hierarchical Concept\nDecomposition), a novel framework designed to enhance model interpretability\nthrough structured concept analysis. Our approach consists of two main\ncomponents: (1) We use GPT-4 to decompose an input image into a structured\nhierarchy of visual concepts, thereby forming a visual concept tree. (2) We\nthen employ an ensemble of simple linear classifiers that operate on\nconcept-specific features derived from CLIP to perform classification. Our\napproach not only aligns with the performance of state-of-the-art models but\nalso advances transparency by providing clear insights into the decision-making\nprocess and highlighting the importance of various concepts. This allows for a\ndetailed analysis of potential failure modes and improves model compactness,\ntherefore setting a new benchmark in interpretability without compromising the\naccuracy.\n","authors":["Renyi Qu","Mark Yatskar"],"pdf_url":"https://arxiv.org/pdf/2405.18672v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03458v2","updated":"2024-06-02T23:04:43Z","published":"2024-03-06T04:49:02Z","title":"Slot Abstractors: Toward Scalable Abstract Visual Reasoning","summary":" Abstract visual reasoning is a characteristically human ability, allowing the\nidentification of relational patterns that are abstracted away from object\nfeatures, and the systematic generalization of those patterns to unseen\nproblems. Recent work has demonstrated strong systematic generalization in\nvisual reasoning tasks involving multi-object inputs, through the integration\nof slot-based methods used for extracting object-centric representations\ncoupled with strong inductive biases for relational abstraction. However, this\napproach was limited to problems containing a single rule, and was not scalable\nto visual reasoning problems containing a large number of objects. Other recent\nwork proposed Abstractors, an extension of Transformers that incorporates\nstrong relational inductive biases, thereby inheriting the Transformer's\nscalability and multi-head architecture, but it has yet to be demonstrated how\nthis approach might be applied to multi-object visual inputs. Here we combine\nthe strengths of the above approaches and propose Slot Abstractors, an approach\nto abstract visual reasoning that can be scaled to problems involving a large\nnumber of objects and multiple relations among them. The approach displays\nstate-of-the-art performance across four abstract visual reasoning tasks, as\nwell as an abstract reasoning task involving real-world images.\n","authors":["Shanka Subhra Mondal","Jonathan D. Cohen","Taylor W. Webb"],"pdf_url":"https://arxiv.org/pdf/2403.03458v2.pdf","comment":"18 pages, 9 figures"},{"id":"http://arxiv.org/abs/2405.16517v2","updated":"2024-06-02T22:05:39Z","published":"2024-05-26T11:01:39Z","title":"Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion\n Priors","summary":" We aim to tackle sparse-view reconstruction of a 360 3D scene using priors\nfrom latent diffusion models (LDM). The sparse-view setting is ill-posed and\nunderconstrained, especially for scenes where the camera rotates 360 degrees\naround a point, as no visual information is available beyond some frontal views\nfocused on the central object(s) of interest. In this work, we show that\npretrained 2D diffusion models can strongly improve the reconstruction of a\nscene with low-cost fine-tuning. Specifically, we present SparseSplat360\n(Sp2360), a method that employs a cascade of in-painting and artifact removal\nmodels to fill in missing details and clean novel views. Due to superior\ntraining and rendering speeds, we use an explicit scene representation in the\nform of 3D Gaussians over NeRF-based implicit representations. We propose an\niterative update strategy to fuse generated pseudo novel views with existing 3D\nGaussians fitted to the initial sparse inputs. As a result, we obtain a\nmulti-view consistent scene representation with details coherent with the\nobserved inputs. Our evaluation on the challenging Mip-NeRF360 dataset shows\nthat our proposed 2D to 3D distillation algorithm considerably improves the\nperformance of a regularized version of 3DGS adapted to a sparse-view setting\nand outperforms existing sparse-view reconstruction methods in 360 scene\nreconstruction. Qualitatively, our method generates entire 360 scenes from as\nfew as 9 input views, with a high degree of foreground and background detail.\n","authors":["Soumava Paul","Christopher Wewer","Bernt Schiele","Jan Eric Lenssen"],"pdf_url":"https://arxiv.org/pdf/2405.16517v2.pdf","comment":"18 pages, 11 figures, 4 tables"},{"id":"http://arxiv.org/abs/2307.03887v3","updated":"2024-06-02T21:30:13Z","published":"2023-07-08T03:42:54Z","title":"Improving Prototypical Part Networks with Reward Reweighing,\n Reselection, and Retraining","summary":" In recent years, work has gone into developing deep interpretable methods for\nimage classification that clearly attributes a model's output to specific\nfeatures of the data. One such of these methods is the Prototypical Part\nNetwork (ProtoPNet), which attempts to classify images based on meaningful\nparts of the input. While this architecture is able to produce visually\ninterpretable classifications, it often learns to classify based on parts of\nthe image that are not semantically meaningful. To address this problem, we\npropose the Reward Reweighing, Reselecting, and Retraining (R3) post-processing\nframework, which performs three additional corrective updates to a pretrained\nProtoPNet in an offline and efficient manner. The first two steps involve\nlearning a reward model based on collected human feedback and then aligning the\nprototypes with human preferences. The final step is retraining, which realigns\nthe base features and the classifier layer of the original model with the\nupdated prototypes. We find that our R3 framework consistently improves both\nthe interpretability and the predictive accuracy of ProtoPNet and its variants.\n","authors":["Aaron J. Li","Robin Netzorg","Zhihan Cheng","Zhuoqin Zhang","Bin Yu"],"pdf_url":"https://arxiv.org/pdf/2307.03887v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.18316v3","updated":"2024-06-02T20:57:53Z","published":"2024-04-28T20:57:55Z","title":"Position: Do Not Explain Vision Models Without Context","summary":" Does the stethoscope in the picture make the adjacent person a doctor or a\npatient? This, of course, depends on the contextual relationship of the two\nobjects. If it's obvious, why don't explanation methods for vision models use\ncontextual information? In this paper, we (1) review the most popular methods\nof explaining computer vision models by pointing out that they do not take into\naccount context information, (2) show examples of failures of popular XAI\nmethods, (3) provide examples of real-world use cases where spatial context\nplays a significant role, (4) propose new research directions that may lead to\nbetter use of context information in explaining computer vision models, (5)\nargue that a change in approach to explanations is needed from 'where' to\n'how'.\n","authors":["Paulina Tomaszewska","Przemysław Biecek"],"pdf_url":"https://arxiv.org/pdf/2404.18316v3.pdf","comment":"Accepted at International Conference on Machine Learning (ICML) 2024"},{"id":"http://arxiv.org/abs/2310.06085v3","updated":"2024-06-02T20:12:11Z","published":"2023-08-20T22:27:54Z","title":"Quantile-based Maximum Likelihood Training for Outlier Detection","summary":" Discriminative learning effectively predicts true object class for image\nclassification. However, it often results in false positives for outliers,\nposing critical concerns in applications like autonomous driving and video\nsurveillance systems. Previous attempts to address this challenge involved\ntraining image classifiers through contrastive learning using actual outlier\ndata or synthesizing outliers for self-supervised learning. Furthermore,\nunsupervised generative modeling of inliers in pixel space has shown limited\nsuccess for outlier detection. In this work, we introduce a quantile-based\nmaximum likelihood objective for learning the inlier distribution to improve\nthe outlier separation during inference. Our approach fits a normalizing flow\nto pre-trained discriminative features and detects the outliers according to\nthe evaluated log-likelihood. The experimental evaluation demonstrates the\neffectiveness of our method as it surpasses the performance of the\nstate-of-the-art unsupervised methods for outlier detection. The results are\nalso competitive compared with a recent self-supervised approach for outlier\ndetection. Our work allows to reduce dependency on well-sampled negative\ntraining data, which is especially important for domains like medical\ndiagnostics or remote sensing.\n","authors":["Masoud Taghikhah","Nishant Kumar","Siniša Šegvić","Abouzar Eslami","Stefan Gumhold"],"pdf_url":"https://arxiv.org/pdf/2310.06085v3.pdf","comment":"Camera Ready Version. Accepted at AAAI 2024. Code available at\n https://github.com/taghikhah/QuantOD"},{"id":"http://arxiv.org/abs/2311.17833v2","updated":"2024-06-02T19:18:37Z","published":"2023-11-29T17:35:29Z","title":"DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering\n Classifier Differences, Neuron Visualisations, and Visual Counterfactual\n Explanations","summary":" While deep learning has led to huge progress in complex image classification\ntasks like ImageNet, unexpected failure modes, e.g. via spurious features, call\ninto question how reliably these classifiers work in the wild. Furthermore, for\nsafety-critical tasks the black-box nature of their decisions is problematic,\nand explanations or at least methods which make decisions plausible are needed\nurgently. In this paper, we address these problems by generating images that\noptimize a classifier-derived objective using a framework for guided image\ngeneration. We analyze the decisions of image classifiers by visual\ncounterfactual explanations (VCEs), detection of systematic mistakes by\nanalyzing images where classifiers maximally disagree, and visualization of\nneurons and spurious features. In this way, we validate existing observations,\ne.g. the shape bias of adversarially robust models, as well as novel failure\nmodes, e.g. systematic errors of zero-shot CLIP classifiers. Moreover, our VCEs\noutperform previous work while being more versatile.\n","authors":["Maximilian Augustin","Yannic Neuhaus","Matthias Hein"],"pdf_url":"https://arxiv.org/pdf/2311.17833v2.pdf","comment":"CVPR 2024"},{"id":"http://arxiv.org/abs/2402.12843v3","updated":"2024-06-02T18:08:19Z","published":"2024-02-20T09:13:11Z","title":"Solar Panel Segmentation :Self-Supervised Learning Solutions for\n Imperfect Datasets","summary":" The increasing adoption of solar energy necessitates advanced methodologies\nfor monitoring and maintenance to ensure optimal performance of solar panel\ninstallations. A critical component in this context is the accurate\nsegmentation of solar panels from aerial or satellite imagery, which is\nessential for identifying operational issues and assessing efficiency. This\npaper addresses the significant challenges in panel segmentation, particularly\nthe scarcity of annotated data and the labour-intensive nature of manual\nannotation for supervised learning. We explore and apply Self-Supervised\nLearning (SSL) to solve these challenges. We demonstrate that SSL significantly\nenhances model generalization under various conditions and reduces dependency\non manually annotated data, paving the way for robust and adaptable solar panel\nsegmentation solutions.\n","authors":["Sankarshanaa Sagaram","Krish Didwania","Laven Srivastava","Aditya Kasliwal","Pallavi Kailas","Ujjwal Verma"],"pdf_url":"https://arxiv.org/pdf/2402.12843v3.pdf","comment":"Published at ICLR Tiny Paper 2024"},{"id":"http://arxiv.org/abs/2402.11248v4","updated":"2024-06-02T17:34:18Z","published":"2024-02-17T11:03:02Z","title":"CoLLaVO: Crayon Large Language and Vision mOdel","summary":" The remarkable success of Large Language Models (LLMs) and instruction tuning\ndrives the evolution of Vision Language Models (VLMs) towards a versatile\ngeneral-purpose model. Yet, it remains unexplored whether current VLMs\ngenuinely possess quality object-level image understanding capabilities\ndetermined from 'what objects are in the image?' or 'which object corresponds\nto a specified bounding box?'. Our findings reveal that the image understanding\ncapabilities of current VLMs are strongly correlated with their zero-shot\nperformance on vision language (VL) tasks. This suggests that prioritizing\nbasic image understanding is crucial for VLMs to excel at VL tasks. To enhance\nobject-level image understanding, we propose Crayon Large Language and Vision\nmOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a\nnew visual prompt tuning scheme based on panoptic color maps. Furthermore, we\npresent a learning strategy of Dual QLoRA to preserve object-level image\nunderstanding without forgetting it during visual instruction tuning, thereby\nachieving a significant leap in numerous VL benchmarks in a zero-shot setting.\n","authors":["Byung-Kwan Lee","Beomchan Park","Chae Won Kim","Yong Man Ro"],"pdf_url":"https://arxiv.org/pdf/2402.11248v4.pdf","comment":"ACL 2024 Findings. Code available:\n https://github.com/ByungKwanLee/CoLLaVO"},{"id":"http://arxiv.org/abs/2403.03173v7","updated":"2024-06-02T16:35:23Z","published":"2024-03-05T18:08:29Z","title":"Solving the bongard-logo problem by modeling a probabilistic model","summary":" Abstract reasoning problems pose challenges to the perception and cognition\nabilities of AI algorithms, demanding deeper pattern recognition and inductive\nreasoning beyond mere identification of explicit image features. In this study,\nwe introduce PMoC, a probabilistic model tailored for the Bongard-Logo problem,\nachieving high reasoning accuracy through the construction of an conditional\nprobabilistic model. Additionally, we have designed the Pose-Transformer, an\nenhanced Transformer-Encoder specifically crafted for complex abstract\nreasoning tasks, including Bongard-Logo, RAVEN, I-RAVEN, and PGM. Inspired by\nthe pose matrix in capsule networks, Pose-Transformer strengthens the focus on\npositional relationships between local features when processing image data.\nWhen combined with PMoC, it can further enhance reasoning accuracy. Our\nPose-Transformer effectively addresses reasoning difficulties associated with\nchanges in the position of abstract entities, outperforming previous models on\nRAVEN's OIG, D3$\\times$3 subsets, and the PGM dataset. Finally, considering the\ndeployment difficulties arising from the large number of Pose-Transformer\nparameters, this paper presents a lightweight version, Straw-Pose-Transformer,\nwhich maintains performance while significantly reducing the parameter count.\nThis study contributes to enhancing AI capabilities in abstract reasoning and\ncognitive pattern recognition.\n","authors":["Ruizhuo Song","Beiming Yuan"],"pdf_url":"https://arxiv.org/pdf/2403.03173v7.pdf","comment":"14 pages, 11 figures, 4 tables"},{"id":"http://arxiv.org/abs/2404.13756v2","updated":"2024-06-02T16:29:39Z","published":"2024-04-21T19:42:28Z","title":"BC-MRI-SEG: A Breast Cancer MRI Tumor Segmentation Benchmark","summary":" Binary breast cancer tumor segmentation with Magnetic Resonance Imaging (MRI)\ndata is typically trained and evaluated on private medical data, which makes\ncomparing deep learning approaches difficult. We propose a benchmark\n(BC-MRI-SEG) for binary breast cancer tumor segmentation based on publicly\navailable MRI datasets. The benchmark consists of four datasets in total, where\ntwo datasets are used for supervised training and evaluation, and two are used\nfor zero-shot evaluation. Additionally we compare state-of-the-art (SOTA)\napproaches on our benchmark and provide an exhaustive list of available public\nbreast cancer MRI datasets. The source code has been made available at\nhttps://irulenot.github.io/BC_MRI_SEG_Benchmark.\n","authors":["Anthony Bilic","Chen Chen"],"pdf_url":"https://arxiv.org/pdf/2404.13756v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13331v2","updated":"2024-06-02T16:08:17Z","published":"2024-05-22T04:20:30Z","title":"Comparative Analysis of Hyperspectral Image Reconstruction Using Deep\n Learning for Agricultural and Biological Applications","summary":" Hyperspectral imaging (HSI) has become a key technology for non-invasive\nquality evaluation in various fields, offering detailed insights through\nspatial and spectral data. Despite its efficacy, the complexity and high cost\nof HSI systems have hindered their widespread adoption. This study addressed\nthese challenges by exploring deep learning-based hyperspectral image\nreconstruction from RGB (Red, Green, Blue) images, particularly for\nagricultural products. Specifically, different hyperspectral reconstruction\nalgorithms, such as Hyperspectral Convolutional Neural Network - Dense\n(HSCNN-D), High-Resolution Network (HRNET), and Multi-Scale Transformer Plus\nPlus (MST++), were compared to assess the dry matter content of sweet potatoes.\nAmong the tested reconstruction methods, HRNET demonstrated superior\nperformance, achieving the lowest mean relative absolute error (MRAE) of 0.07,\nroot mean square error (RMSE) of 0.03, and the highest peak signal-to-noise\nratio (PSNR) of 32.28 decibels (dB). Some key features were selected using the\ngenetic algorithm (GA), and their importance was interpreted using explainable\nartificial intelligence (XAI). Partial least squares regression (PLSR) models\nwere developed using the RGB, reconstructed, and ground truth (GT) data. The\nvisual and spectra quality of these reconstructed methods was compared with GT\ndata, and predicted maps were generated. The results revealed the prospect of\ndeep learning-based hyperspectral image reconstruction as a cost-effective and\nefficient quality assessment tool for agricultural and biological applications.\n","authors":["Md. Toukir Ahmed","Arthur Villordon","Mohammed Kamruzzaman"],"pdf_url":"https://arxiv.org/pdf/2405.13331v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2303.12130v2","updated":"2024-06-02T15:50:37Z","published":"2023-03-21T18:40:59Z","title":"MV-MR: multi-views and multi-representations for self-supervised\n learning and knowledge distillation","summary":" We present a new method of self-supervised learning and knowledge\ndistillation based on the multi-views and multi-representations (MV-MR). The\nMV-MR is based on the maximization of dependence between learnable embeddings\nfrom augmented and non-augmented views, jointly with the maximization of\ndependence between learnable embeddings from augmented view and multiple\nnon-learnable representations from non-augmented view. We show that the\nproposed method can be used for efficient self-supervised classification and\nmodel-agnostic knowledge distillation. Unlike other self-supervised techniques,\nour approach does not use any contrastive learning, clustering, or stop\ngradients. MV-MR is a generic framework allowing the incorporation of\nconstraints on the learnable embeddings via the usage of image\nmulti-representations as regularizers. Along this line, knowledge distillation\nis considered a particular case of such a regularization. MV-MR provides the\nstate-of-the-art performance on the STL10 and ImageNet-1K datasets among\nnon-contrastive and clustering-free methods. We show that a lower complexity\nResNet50 model pretrained using proposed knowledge distillation based on the\nCLIP ViT model achieves state-of-the-art performance on STL10 linear\nevaluation. The code is available at: https://github.com/vkinakh/mv-mr\n","authors":["Vitaliy Kinakh","Mariia Drozdova","Slava Voloshynovskiy"],"pdf_url":"https://arxiv.org/pdf/2303.12130v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10182v3","updated":"2024-06-02T15:47:23Z","published":"2023-07-02T11:09:08Z","title":"Enhancing Super-Resolution Networks through Realistic Thick-Slice CT\n Simulation","summary":" Deep learning-based Generative Models have the potential to convert\nlow-resolution CT images into high-resolution counterparts without long\nacquisition times and increased radiation exposure in thin-slice CT imaging.\nHowever, procuring appropriate training data for these Super-Resolution (SR)\nmodels is challenging. Previous SR research has simulated thick-slice CT images\nfrom thin-slice CT images to create training pairs. However, these methods\neither rely on simplistic interpolation techniques that lack realism or\nsinogram reconstruction, which require the release of raw data and complex\nreconstruction algorithms. Thus, we introduce a simple yet realistic method to\ngenerate thick CT images from thin-slice CT images, facilitating the creation\nof training pairs for SR algorithms. The training pairs produced by our method\nclosely resemble real data distributions (PSNR=49.74 vs. 40.66, p$<$0.05). A\nmultivariate Cox regression analysis involving thick slice CT images with lung\nfibrosis revealed that only the radiomics features extracted using our method\ndemonstrated a significant correlation with mortality (HR=1.19 and HR=1.14,\np$<$0.005). This paper represents the first to identify and address the\nchallenge of generating appropriate paired training data for Deep\nLearning-based CT SR models, which enhances the efficacy and applicability of\nSR models in real-world scenarios.\n","authors":["Zeyu Tang","Xiaodan Xing","Guang Yang"],"pdf_url":"https://arxiv.org/pdf/2307.10182v3.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2403.00231v3","updated":"2024-06-02T15:47:16Z","published":"2024-03-01T02:21:30Z","title":"Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of\n Large Vision-Language Models","summary":" Large vision-language models (LVLMs) excel across diverse tasks involving\nconcrete images from natural scenes. However, their ability to interpret\nabstract figures, such as geometry shapes and scientific plots, remains limited\ndue to a scarcity of training datasets in scientific domains. To fill this gap,\nwe introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for\nenhancing LVLMs scientific comprehension. ArXivCap is a figure-caption dataset\ncomprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers\nspanning various scientific domains. Drawing from ArXivCap, we introduce\nArXivQA, a question-answering dataset generated by prompting GPT-4V based on\nscientific figures. ArXivQA greatly enhances open-sourced LVLMs' mathematical\nreasoning capabilities, achieving a 10.4\\% absolute accuracy gain on a\nmultimodal mathematical reasoning benchmark. Furthermore, employing ArXivCap,\nwe devise four vision-to-text tasks for benchmarking LVLMs. Evaluation results\nwith state-of-the-art LVLMs underscore their struggle with the nuanced\nsemantics of academic figures, while domain-specific training yields\nsubstantial performance gains. Our error analysis uncovers misinterpretations\nof visual context, recognition errors, and the production of overly simplified\ncaptions by current LVLMs, shedding light on future improvements.\n","authors":["Lei Li","Yuqi Wang","Runxin Xu","Peiyi Wang","Xiachong Feng","Lingpeng Kong","Qi Liu"],"pdf_url":"https://arxiv.org/pdf/2403.00231v3.pdf","comment":"Project page: https://mm-arxiv.github.io, Camera Ready Version of ACL\n 2024"},{"id":"http://arxiv.org/abs/2405.19076v2","updated":"2024-06-02T15:03:24Z","published":"2024-05-29T13:34:32Z","title":"Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials\n Analysis and Design","summary":" We present Cephalo, a series of multimodal vision large language models\n(V-LLMs) designed for materials science applications, integrating visual and\nlinguistic data for enhanced understanding and interaction within human-AI and\nmulti-agent AI frameworks. A key innovation of Cephalo is its advanced dataset\ngeneration method, which employs a sophisticated algorithm to accurately detect\nand separate images and their corresponding textual descriptions from PDF\ndocuments, such as scientific papers. The method includes a careful refinement\nof image-text pairs through integrated vision and language processing, ensuring\nhigh-quality, contextually relevant, and well reasoned training data. Cephalo\nis trained on integrated image and text data extracted from thousands of\nscientific papers and science-focused Wikipedia pages demonstrates can\ninterpret complex visual scenes, generate precise language descriptions, and\nanswer queries about images effectively. The combination of a vision encoder\nwith an autoregressive transformer supports complex natural language\nunderstanding in an integrated model, which can be coupled with other\ngenerative methods to create an image-to-text-to-image or image-to-text-to-3D\npipeline. To explore the development of larger models from smaller ones, we\nreport both mixture-of-expert methods and model merging. These hybrid\napproaches allow us to leverage the domain-specific expertise and general\nconversational capabilities to harness the strengths of multiple models. We\nexamine the models in diverse use cases that incorporate biological materials,\nfracture and engineering analysis, protein biophysics, and bio-inspired design\nbased on insect behavior. Generative applications include bio-inspired designs,\nincluding pollen-inspired architected materials, as well as the synthesis of\nbio-inspired material microstructures from a photograph of a solar eclipse.\n","authors":["Markus J. Buehler"],"pdf_url":"https://arxiv.org/pdf/2405.19076v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.04965v2","updated":"2024-06-02T14:31:09Z","published":"2024-03-08T00:30:25Z","title":"StereoDiffusion: Training-Free Stereo Image Generation Using Latent\n Diffusion Models","summary":" The demand for stereo images increases as manufacturers launch more XR\ndevices. To meet this demand, we introduce StereoDiffusion, a method that,\nunlike traditional inpainting pipelines, is trainning free, remarkably\nstraightforward to use, and it seamlessly integrates into the original Stable\nDiffusion model. Our method modifies the latent variable to provide an\nend-to-end, lightweight capability for fast generation of stereo image pairs,\nwithout the need for fine-tuning model weights or any post-processing of\nimages. Using the original input to generate a left image and estimate a\ndisparity map for it, we generate the latent vector for the right image through\nStereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking\nDenoise and Self-Attention Layers Modification methods to align the right-side\nimage with the left-side image. Moreover, our proposed method maintains a high\nstandard of image quality throughout the stereo generation process, achieving\nstate-of-the-art scores in various quantitative evaluations.\n","authors":["Lezhong Wang","Jeppe Revall Frisvad","Mark Bo Jensen","Siavash Arjomand Bigdeli"],"pdf_url":"https://arxiv.org/pdf/2403.04965v2.pdf","comment":"Updated to CVPR 2024 GCV accepted version"},{"id":"http://arxiv.org/abs/2405.18715v2","updated":"2024-06-02T14:15:27Z","published":"2024-05-29T02:53:40Z","title":"NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the\n Wild","summary":" Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing\nphotorealistic views from multi-view images of static scenes, but face\nchallenges in dynamic, real-world environments with distractors like moving\nobjects, shadows, and lighting changes. Existing methods manage controlled\nenvironments and low occlusion ratios but fall short in render quality,\nespecially under high occlusion scenarios. In this paper, we introduce NeRF\nOn-the-go, a simple yet effective approach that enables the robust synthesis of\nnovel views in complex, in-the-wild scenes from only casually captured image\nsequences. Delving into uncertainty, our method not only efficiently eliminates\ndistractors, even when they are predominant in captures, but also achieves a\nnotably faster convergence speed. Through comprehensive experiments on various\nscenes, our method demonstrates a significant improvement over state-of-the-art\ntechniques. This advancement opens new avenues for NeRF in diverse and dynamic\nreal-world applications.\n","authors":["Weining Ren","Zihan Zhu","Boyang Sun","Jiaqi Chen","Marc Pollefeys","Songyou Peng"],"pdf_url":"https://arxiv.org/pdf/2405.18715v2.pdf","comment":"CVPR 2024, first two authors contributed equally. Project Page:\n https://rwn17.github.io/nerf-on-the-go/"},{"id":"http://arxiv.org/abs/2310.17109v2","updated":"2024-06-02T12:38:47Z","published":"2023-10-26T02:37:08Z","title":"LP-OVOD: Open-Vocabulary Object Detection by Linear Probing","summary":" This paper addresses the challenging problem of open-vocabulary object\ndetection (OVOD) where an object detector must identify both seen and unseen\nclasses in test images without labeled examples of the unseen classes in\ntraining. A typical approach for OVOD is to use joint text-image embeddings of\nCLIP to assign box proposals to their closest text label. However, this method\nhas a critical issue: many low-quality boxes, such as over- and\nunder-covered-object boxes, have the same similarity score as high-quality\nboxes since CLIP is not trained on exact object location information. To\naddress this issue, we propose a novel method, LP-OVOD, that discards\nlow-quality boxes by training a sigmoid linear classifier on pseudo labels\nretrieved from the top relevant region proposals to the novel text.\nExperimental results on COCO affirm the superior performance of our approach\nover the state of the art, achieving $\\textbf{40.5}$ in $\\text{AP}_{novel}$\nusing ResNet50 as the backbone and without external datasets or knowing novel\nclasses during training. Our code will be available at\nhttps://github.com/VinAIResearch/LP-OVOD.\n","authors":["Chau Pham","Truong Vu","Khoi Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.17109v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.04314v2","updated":"2024-06-02T11:32:19Z","published":"2023-12-07T14:11:00Z","title":"GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific\n Narratives","summary":" Training Scene Graph Generation (SGG) models with natural language captions\nhas become increasingly popular due to the abundant, cost-effective, and\nopen-world generalization supervision signals that natural language offers.\nHowever, such unstructured caption data and its processing pose significant\nchallenges in learning accurate and comprehensive scene graphs. The challenges\ncan be summarized as three aspects: 1) traditional scene graph parsers based on\nlinguistic representation often fail to extract meaningful relationship\ntriplets from caption data. 2) grounding unlocalized objects of parsed triplets\nwill meet ambiguity issues in visual-language alignment. 3) caption data\ntypically are sparse and exhibit bias to partial observations of image content.\nAiming to address these problems, we propose a divide-and-conquer strategy with\na novel framework named \\textit{GPT4SGG}, to obtain more accurate and\ncomprehensive scene graph signals. This framework decomposes a complex scene\ninto a bunch of simple regions, resulting in a set of region-specific\nnarratives. With these region-specific narratives (partial observations) and a\nholistic narrative (global observation) for an image, a large language model\n(LLM) performs the relationship reasoning to synthesize an accurate and\ncomprehensive scene graph. Experimental results demonstrate \\textit{GPT4SGG}\nsignificantly improves the performance of SGG models trained on image-caption\ndata, in which the ambiguity issue and long-tail bias have been well-handled\nwith more accurate and comprehensive scene graphs.\n","authors":["Zuyao Chen","Jinlin Wu","Zhen Lei","Zhaoxiang Zhang","Changwen Chen"],"pdf_url":"https://arxiv.org/pdf/2312.04314v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07133v2","updated":"2024-06-02T10:50:54Z","published":"2023-12-12T10:07:37Z","title":"LatentMan: Generating Consistent Animated Characters using Image\n Diffusion Models","summary":" We propose a zero-shot approach for generating consistent videos of animated\ncharacters based on Text-to-Image (T2I) diffusion models. Existing\nText-to-Video (T2V) methods are expensive to train and require large-scale\nvideo datasets to produce diverse characters and motions. At the same time,\ntheir zero-shot alternatives fail to produce temporally consistent videos with\ncontinuous motion. We strive to bridge this gap, and we introduce LatentMan,\nwhich leverages existing text-based motion diffusion models to generate diverse\ncontinuous motions to guide the T2I model. To boost the temporal consistency,\nwe introduce the Spatial Latent Alignment module that exploits cross-frame\ndense correspondences that we compute to align the latents of the video frames.\nFurthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a\ndirection that minimizes visual discrepancies between frames. Our proposed\napproach outperforms existing zero-shot T2V approaches in generating videos of\nanimated characters in terms of pixel-wise consistency and user preference.\nProject page https://abdo-eldesokey.github.io/latentman/.\n","authors":["Abdelrahman Eldesokey","Peter Wonka"],"pdf_url":"https://arxiv.org/pdf/2312.07133v2.pdf","comment":"CVPRW 2024. Project page: https://abdo-eldesokey.github.io/latentman/"},{"id":"http://arxiv.org/abs/2405.09552v2","updated":"2024-06-02T10:49:47Z","published":"2024-04-15T11:49:37Z","title":"ODFormer: Semantic Fundus Image Segmentation Using Transformer for Optic\n Nerve Head Detection","summary":" Optic nerve head (ONH) detection has been a crucial area of study in\nophthalmology for years. However, the significant discrepancy between fundus\nimage datasets, each generated using a single type of fundus camera, poses\nchallenges to the generalizability of ONH detection approaches developed based\non semantic segmentation networks. Despite the numerous recent advancements in\ngeneral-purpose semantic segmentation methods using convolutional neural\nnetworks (CNNs) and Transformers, there is currently a lack of benchmarks for\nthese state-of-the-art (SoTA) networks specifically trained for ONH detection.\nTherefore, in this article, we make contributions from three key aspects:\nnetwork design, the publication of a dataset, and the establishment of a\ncomprehensive benchmark. Our newly developed ONH detection network, referred to\nas ODFormer, is based upon the Swin Transformer architecture and incorporates\ntwo novel components: a multi-scale context aggregator and a lightweight\nbidirectional feature recalibrator. Our published large-scale dataset, known as\nTongjiU-DROD, provides multi-resolution fundus images for each participant,\ncaptured using two distinct types of cameras. Our established benchmark\ninvolves three datasets: DRIONS-DB, DRISHTI-GS1, and TongjiU-DROD, created by\nresearchers from different countries and containing fundus images captured from\nparticipants of diverse races and ages. Extensive experimental results\ndemonstrate that our proposed ODFormer outperforms other state-of-the-art\n(SoTA) networks in terms of performance and generalizability. Our dataset and\nsource code are publicly available at mias.group/ODFormer.\n","authors":["Jiayi Wang","Yi-An Mao","Xiaoyu Ma","Sicen Guo","Yuting Shao","Xiao Lv","Wenting Han","Mark Christopher","Linda M. Zangwill","Yanlong Bi","Rui Fan"],"pdf_url":"https://arxiv.org/pdf/2405.09552v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2105.04014v2","updated":"2024-06-02T10:45:26Z","published":"2021-05-09T20:06:25Z","title":"DiagSet: a dataset for prostate cancer histopathological image\n classification","summary":" Cancer diseases constitute one of the most significant societal challenges.\nIn this paper, we introduce a novel histopathological dataset for prostate\ncancer detection. The proposed dataset, consisting of over 2.6 million tissue\npatches extracted from 430 fully annotated scans, 4675 scans with assigned\nbinary diagnoses, and 46 scans with diagnoses independently provided by a group\nof histopathologists can be found at\nhttps://github.com/michalkoziarski/DiagSet. Furthermore, we propose a machine\nlearning framework for detection of cancerous tissue regions and prediction of\nscan-level diagnosis, utilizing thresholding to abstain from the decision in\nuncertain cases. The proposed approach, composed of ensembles of deep neural\nnetworks operating on the histopathological scans at different scales, achieves\n94.6% accuracy in patch-level recognition and is compared in a scan-level\ndiagnosis with 9 human histopathologists showing high statistical agreement.\n","authors":["Michał Koziarski","Bogusław Cyganek","Przemysław Niedziela","Bogusław Olborski","Zbigniew Antosz","Marcin Żydak","Bogdan Kwolek","Paweł Wąsowicz","Andrzej Bukała","Jakub Swadźba","Piotr Sitkowski"],"pdf_url":"https://arxiv.org/pdf/2105.04014v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.02149v2","updated":"2024-06-02T10:39:45Z","published":"2024-02-03T13:35:39Z","title":"Improving Diffusion Models for Inverse Problems Using Optimal Posterior\n Covariance","summary":" Recent diffusion models provide a promising zero-shot solution to noisy\nlinear inverse problems without retraining for specific inverse problems. In\nthis paper, we reveal that recent methods can be uniformly interpreted as\nemploying a Gaussian approximation with hand-crafted isotropic covariance for\nthe intractable denoising posterior to approximate the conditional posterior\nmean. Inspired by this finding, we propose to improve recent methods by using\nmore principled covariance determined by maximum likelihood estimation. To\nachieve posterior covariance optimization without retraining, we provide\ngeneral plug-and-play solutions based on two approaches specifically designed\nfor leveraging pre-trained models with and without reverse covariance. We\nfurther propose a scalable method for learning posterior covariance prediction\nbased on representation with orthonormal basis. Experimental results\ndemonstrate that the proposed methods significantly enhance reconstruction\nperformance without requiring hyperparameter tuning.\n","authors":["Xinyu Peng","Ziyang Zheng","Wenrui Dai","Nuoqian Xiao","Chenglin Li","Junni Zou","Hongkai Xiong"],"pdf_url":"https://arxiv.org/pdf/2402.02149v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19703v2","updated":"2024-06-02T10:24:49Z","published":"2024-05-30T05:27:46Z","title":"Towards a Better Evaluation of Out-of-Domain Generalization","summary":" The objective of Domain Generalization (DG) is to devise algorithms and\nmodels capable of achieving high performance on previously unseen test\ndistributions. In the pursuit of this objective, average measure has been\nemployed as the prevalent measure for evaluating models and comparing\nalgorithms in the existing DG studies. Despite its significance, a\ncomprehensive exploration of the average measure has been lacking and its\nsuitability in approximating the true domain generalization performance has\nbeen questionable. In this study, we carefully investigate the limitations\ninherent in the average measure and propose worst+gap measure as a robust\nalternative. We establish theoretical grounds of the proposed measure by\nderiving two theorems starting from two different assumptions. We conduct\nextensive experimental investigations to compare the proposed worst+gap measure\nwith the conventional average measure. Given the indispensable need to access\nthe true DG performance for studying measures, we modify five existing datasets\nto come up with SR-CMNIST, C-Cats&Dogs, L-CIFAR10, PACS-corrupted, and\nVLCS-corrupted datasets. The experiment results unveil an inferior performance\nof the average measure in approximating the true DG performance and confirm the\nrobustness of the theoretically supported worst+gap measure.\n","authors":["Duhun Hwang","Suhyun Kang","Moonjung Eo","Jimyeong Kim","Wonjong Rhee"],"pdf_url":"https://arxiv.org/pdf/2405.19703v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20222v2","updated":"2024-06-02T10:14:56Z","published":"2024-05-30T16:22:22Z","title":"MOFA-Video: Controllable Image Animation via Generative Motion Field\n Adaptions in Frozen Image-to-Video Diffusion Model","summary":" We present MOFA-Video, an advanced controllable image animation method that\ngenerates video from the given image using various additional controllable\nsignals (such as human landmarks reference, manual trajectories, and another\neven provided video) or their combinations. This is different from previous\nmethods which only can work on a specific motion domain or show weak control\nabilities with diffusion prior. To achieve our goal, we design several\ndomain-aware motion field adapters (\\ie, MOFA-Adapters) to control the\ngenerated motions in the video generation pipeline. For MOFA-Adapters, we\nconsider the temporal motion consistency of the video and generate the dense\nmotion flow from the given sparse control conditions first, and then, the\nmulti-scale features of the given image are wrapped as a guided feature for\nstable video diffusion generation. We naively train two motion adapters for the\nmanual trajectories and the human landmarks individually since they both\ncontain sparse information about the control. After training, the MOFA-Adapters\nin different domains can also work together for more controllable video\ngeneration. Project Page: https://myniuuu.github.io/MOFA_Video/\n","authors":["Muyao Niu","Xiaodong Cun","Xintao Wang","Yong Zhang","Ying Shan","Yinqiang Zheng"],"pdf_url":"https://arxiv.org/pdf/2405.20222v2.pdf","comment":"Project Page: https://myniuuu.github.io/MOFA_Video/ ; Codes:\n https://github.com/MyNiuuu/MOFA-Video"},{"id":"http://arxiv.org/abs/2405.15223v2","updated":"2024-06-02T09:44:20Z","published":"2024-05-24T05:29:12Z","title":"iVideoGPT: Interactive VideoGPTs are Scalable World Models","summary":" World models empower model-based agents to interactively explore, reason, and\nplan within imagined environments for real-world decision-making. However, the\nhigh demand for interactivity poses challenges in harnessing recent\nadvancements in video generative models for developing world models at scale.\nThis work introduces Interactive VideoGPT (iVideoGPT), a scalable\nautoregressive transformer framework that integrates multimodal signals--visual\nobservations, actions, and rewards--into a sequence of tokens, facilitating an\ninteractive experience of agents via next-token prediction. iVideoGPT features\na novel compressive tokenization technique that efficiently discretizes\nhigh-dimensional visual observations. Leveraging its scalable architecture, we\nare able to pre-train iVideoGPT on millions of human and robotic manipulation\ntrajectories, establishing a versatile foundation that is adaptable to serve as\ninteractive world models for a wide range of downstream tasks. These include\naction-conditioned video prediction, visual planning, and model-based\nreinforcement learning, where iVideoGPT achieves competitive performance\ncompared with state-of-the-art methods. Our work advances the development of\ninteractive general world models, bridging the gap between generative video\nmodels and practical model-based reinforcement learning applications.\n","authors":["Jialong Wu","Shaofeng Yin","Ningya Feng","Xu He","Dong Li","Jianye Hao","Mingsheng Long"],"pdf_url":"https://arxiv.org/pdf/2405.15223v2.pdf","comment":"Project website: https://thuml.github.io/iVideoGPT"},{"id":"http://arxiv.org/abs/2312.04557v2","updated":"2024-06-02T09:30:39Z","published":"2023-12-07T18:59:30Z","title":"GenTron: Diffusion Transformers for Image and Video Generation","summary":" In this study, we explore Transformer-based diffusion models for image and\nvideo generation. Despite the dominance of Transformer architectures in various\nfields due to their flexibility and scalability, the visual generative domain\nprimarily utilizes CNN-based U-Net architectures, particularly in\ndiffusion-based models. We introduce GenTron, a family of Generative models\nemploying Transformer-based diffusion, to address this gap. Our initial step\nwas to adapt Diffusion Transformers (DiTs) from class to text conditioning, a\nprocess involving thorough empirical exploration of the conditioning mechanism.\nWe then scale GenTron from approximately 900M to over 3B parameters, observing\nsignificant improvements in visual quality. Furthermore, we extend GenTron to\ntext-to-video generation, incorporating novel motion-free guidance to enhance\nvideo quality. In human evaluations against SDXL, GenTron achieves a 51.1% win\nrate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text\nalignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench,\nunderscoring its strengths in compositional generation. We believe this work\nwill provide meaningful insights and serve as a valuable reference for future\nresearch.\n","authors":["Shoufa Chen","Mengmeng Xu","Jiawei Ren","Yuren Cong","Sen He","Yanping Xie","Animesh Sinha","Ping Luo","Tao Xiang","Juan-Manuel Perez-Rua"],"pdf_url":"https://arxiv.org/pdf/2312.04557v2.pdf","comment":"CVPR2024 Camera Ready. Website:\n https://www.shoufachen.com/gentron_website/"},{"id":"http://arxiv.org/abs/2312.03031v2","updated":"2024-06-02T09:29:00Z","published":"2023-12-05T11:32:31Z","title":"Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?","summary":" End-to-end autonomous driving recently emerged as a promising research\ndirection to target autonomy from a full-stack perspective. Along this line,\nmany of the latest works follow an open-loop evaluation setting on nuScenes to\nstudy the planning behavior. In this paper, we delve deeper into the problem by\nconducting thorough analyses and demystifying more devils in the details. We\ninitially observed that the nuScenes dataset, characterized by relatively\nsimple driving scenarios, leads to an under-utilization of perception\ninformation in end-to-end models incorporating ego status, such as the ego\nvehicle's velocity. These models tend to rely predominantly on the ego\nvehicle's status for future path planning. Beyond the limitations of the\ndataset, we also note that current metrics do not comprehensively assess the\nplanning quality, leading to potentially biased conclusions drawn from existing\nbenchmarks. To address this issue, we introduce a new metric to evaluate\nwhether the predicted trajectories adhere to the road. We further propose a\nsimple baseline able to achieve competitive results without relying on\nperception annotations. Given the current limitations on the benchmark and\nmetrics, we suggest the community reassess relevant prevailing research and be\ncautious whether the continued pursuit of state-of-the-art would yield\nconvincing and universal conclusions. Code and models are available at\n\\url{https://github.com/NVlabs/BEV-Planner}\n","authors":["Zhiqi Li","Zhiding Yu","Shiyi Lan","Jiahan Li","Jan Kautz","Tong Lu","Jose M. Alvarez"],"pdf_url":"https://arxiv.org/pdf/2312.03031v2.pdf","comment":"Accept to cvpr 2024"},{"id":"http://arxiv.org/abs/2307.08924v4","updated":"2024-06-02T08:52:13Z","published":"2023-07-18T01:53:18Z","title":"Towards Task Sampler Learning for Meta-Learning","summary":" Meta-learning aims to learn general knowledge with diverse training tasks\nconducted from limited data, and then transfer it to new tasks. It is commonly\nbelieved that increasing task diversity will enhance the generalization ability\nof meta-learning models. However, this paper challenges this view through\nempirical and theoretical analysis. We obtain three conclusions: (i) there is\nno universal task sampling strategy that can guarantee the optimal performance\nof meta-learning models; (ii) over-constraining task diversity may incur the\nrisk of under-fitting or over-fitting during training; and (iii) the\ngeneralization performance of meta-learning models are affected by task\ndiversity, task entropy, and task difficulty. Based on this insight, we design\na novel task sampler, called Adaptive Sampler (ASr). ASr is a plug-and-play\nmodule that can be integrated into any meta-learning framework. It dynamically\nadjusts task weights according to task diversity, task entropy, and task\ndifficulty, thereby obtaining the optimal probability distribution for\nmeta-training tasks. Finally, we conduct experiments on a series of benchmark\ndatasets across various scenarios, and the results demonstrate that ASr has\nclear advantages.\n","authors":["Jingyao Wang","Wenwen Qiang","Xingzhe Su","Changwen Zheng","Fuchun Sun","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2307.08924v4.pdf","comment":"accepted by IJCV"},{"id":"http://arxiv.org/abs/2302.13251v2","updated":"2024-06-02T08:42:11Z","published":"2023-02-26T07:10:09Z","title":"Unsupervised Domain Adaptation for Low-dose CT Reconstruction via\n Bayesian Uncertainty Alignment","summary":" Low-dose computed tomography (LDCT) image reconstruction techniques can\nreduce patient radiation exposure while maintaining acceptable imaging quality.\nDeep learning is widely used in this problem, but the performance of testing\ndata (a.k.a. target domain) is often degraded in clinical scenarios due to the\nvariations that were not encountered in training data (a.k.a. source domain).\nUnsupervised domain adaptation (UDA) of LDCT reconstruction has been proposed\nto solve this problem through distribution alignment. However, existing UDA\nmethods fail to explore the usage of uncertainty quantification, which is\ncrucial for reliable intelligent medical systems in clinical scenarios with\nunexpected variations. Moreover, existing direct alignment for different\npatients would lead to content mismatch issues. To address these issues, we\npropose to leverage a probabilistic reconstruction framework to conduct a joint\ndiscrepancy minimization between source and target domains in both the latent\nand image spaces. In the latent space, we devise a Bayesian uncertainty\nalignment to reduce the epistemic gap between the two domains. This approach\nreduces the uncertainty level of target domain data, making it more likely to\nrender well-reconstructed results on target domains. In the image space, we\npropose a sharpness-aware distribution alignment to achieve a match of\nsecond-order information, which can ensure that the reconstructed images from\nthe target domain have similar sharpness to normal-dose CT images from the\nsource domain. Experimental results on two simulated datasets and one clinical\nlow-dose imaging dataset show that our proposed method outperforms other\nmethods in quantitative and visualized performance.\n","authors":["Kecheng Chen","Jie Liu","Renjie Wan","Victor Ho-Fun Lee","Varut Vardhanabhuti","Hong Yan","Haoliang Li"],"pdf_url":"https://arxiv.org/pdf/2302.13251v2.pdf","comment":"Accepted by IEEE Transactions on Neural Networks and Learning Systems"},{"id":"http://arxiv.org/abs/2305.08117v2","updated":"2024-06-02T08:30:21Z","published":"2023-05-14T10:17:09Z","title":"MBQuant: A Novel Multi-Branch Topology Method for Arbitrary Bit-width\n Network Quantization","summary":" Arbitrary bit-width network quantization has received significant attention\ndue to its high adaptability to various bit-width requirements during runtime.\nHowever, in this paper, we investigate existing methods and observe a\nsignificant accumulation of quantization errors caused by switching weight and\nactivations bit-widths, leading to limited performance. To address this issue,\nwe propose MBQuant, a novel method that utilizes a multi-branch topology for\narbitrary bit-width quantization. MBQuant duplicates the network body into\nmultiple independent branches, where the weights of each branch are quantized\nto a fixed 2-bit and the activations remain in the input bit-width. The\ncomputation of a desired bit-width is completed by selecting an appropriate\nnumber of branches that satisfy the original computational constraint. By\nfixing the weight bit-width, this approach substantially reduces quantization\nerrors caused by switching weight bit-widths. Additionally, we introduce an\namortization branch selection strategy to distribute quantization errors caused\nby switching activation bit-widths among branches to improve performance.\nFinally, we adopt an in-place distillation strategy that facilitates guidance\nbetween branches to further enhance MBQuant's performance. Extensive\nexperiments demonstrate that MBQuant achieves significant performance gains\ncompared to existing arbitrary bit-width quantization methods. Code is at\nhttps://github.com/zysxmu/MultiQuant.\n","authors":["Yunshan Zhong","Yuyao Zhou","Fei Chao","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2305.08117v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.11874v3","updated":"2024-06-02T07:44:51Z","published":"2024-02-19T06:32:23Z","title":"Language-guided Image Reflection Separation","summary":" This paper studies the problem of language-guided reflection separation,\nwhich aims at addressing the ill-posed reflection separation problem by\nintroducing language descriptions to provide layer content. We propose a\nunified framework to solve this problem, which leverages the cross-attention\nmechanism with contrastive learning strategies to construct the correspondence\nbetween language descriptions and image layers. A gated network design and a\nrandomized training strategy are employed to tackle the recognizable layer\nambiguity. The effectiveness of the proposed method is validated by the\nsignificant performance advantage over existing reflection separation methods\non both quantitative and qualitative comparisons.\n","authors":["Haofeng Zhong","Yuchen Hong","Shuchen Weng","Jinxiu Liang","Boxin Shi"],"pdf_url":"https://arxiv.org/pdf/2402.11874v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.03703v2","updated":"2024-06-02T07:34:08Z","published":"2023-12-06T18:59:44Z","title":"Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context\n Learning","summary":" In-context learning provides a new perspective for multi-task modeling for\nvision and NLP. Under this setting, the model can perceive tasks from prompts\nand accomplish them without any extra task-specific head predictions or model\nfine-tuning. However, Skeleton sequence modeling via in-context learning\nremains unexplored. Directly applying existing in-context models from other\nareas onto skeleton sequences fails due to the inter-frame and cross-task pose\nsimilarity that makes it outstandingly hard to perceive the task correctly from\na subtle context. To address this challenge, we propose Skeleton-in-Context\n(SiC), an effective framework for in-context skeleton sequence modeling. Our\nSiC is able to handle multiple skeleton-based tasks simultaneously after a\nsingle training process and accomplish each task from context according to the\ngiven prompt. It can further generalize to new, unseen tasks according to\ncustomized prompts. To facilitate context perception, we additionally propose a\ntask-unified prompt, which adaptively learns tasks of different natures, such\nas partial joint-level generation, sequence-level prediction, or 2D-to-3D\nmotion prediction. We conduct extensive experiments to evaluate the\neffectiveness of our SiC on multiple tasks, including motion prediction, pose\nestimation, joint completion, and future pose estimation. We also evaluate its\ngeneralization capability on unseen tasks such as motion-in-between. These\nexperiments show that our model achieves state-of-the-art multi-task\nperformance and even outperforms single-task methods on certain tasks.\n","authors":["Xinshun Wang","Zhongbin Fang","Xia Li","Xiangtai Li","Mengyuan Liu"],"pdf_url":"https://arxiv.org/pdf/2312.03703v2.pdf","comment":"Project page: https://github.com/fanglaosi/Skeleton-in-Context"},{"id":"http://arxiv.org/abs/2303.09975v5","updated":"2024-06-02T07:32:59Z","published":"2023-03-17T13:48:17Z","title":"MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image\n Segmentation","summary":" There has been exploding interest in embracing Transformer-based\narchitectures for medical image segmentation. However, the lack of large-scale\nannotated medical datasets make achieving performances equivalent to those in\nnatural images challenging. Convolutional networks, in contrast, have higher\ninductive biases and consequently, are easily trainable to high performance.\nRecently, the ConvNeXt architecture attempted to modernize the standard ConvNet\nby mirroring Transformer blocks. In this work, we improve upon this to design a\nmodernized and scalable convolutional architecture customized to challenges of\ndata-scarce medical settings. We introduce MedNeXt, a Transformer-inspired\nlarge kernel segmentation network which introduces - 1) A fully ConvNeXt 3D\nEncoder-Decoder Network for medical image segmentation, 2) Residual ConvNeXt up\nand downsampling blocks to preserve semantic richness across scales, 3) A novel\ntechnique to iteratively increase kernel sizes by upsampling small kernel\nnetworks, to prevent performance saturation on limited medical data, 4)\nCompound scaling at multiple levels (depth, width, kernel size) of MedNeXt.\nThis leads to state-of-the-art performance on 4 tasks on CT and MRI modalities\nand varying dataset sizes, representing a modernized deep architecture for\nmedical image segmentation. Our code is made publicly available at:\nhttps://github.com/MIC-DKFZ/MedNeXt.\n","authors":["Saikat Roy","Gregor Koehler","Constantin Ulrich","Michael Baumgartner","Jens Petersen","Fabian Isensee","Paul F. Jaeger","Klaus Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2303.09975v5.pdf","comment":"Accepted at MICCAI 2023"},{"id":"http://arxiv.org/abs/2402.11435v2","updated":"2024-06-02T05:40:18Z","published":"2024-02-18T03:04:38Z","title":"Momentor: Advancing Video Large Language Model with Fine-Grained\n Temporal Reasoning","summary":" Large Language Models (LLMs) demonstrate remarkable proficiency in\ncomprehending and handling text-based tasks. Many efforts are being made to\ntransfer these attributes to video modality, which are termed Video-LLMs.\nHowever, existing Video-LLMs can only capture the coarse-grained semantics and\nare unable to effectively handle tasks related to comprehension or localization\nof specific video segments. In light of these challenges, we propose Momentor,\na Video-LLM capable of accomplishing fine-grained temporal understanding tasks.\nTo support the training of Momentor, we design an automatic data generation\nengine to construct Moment-10M, a large-scale video instruction dataset with\nsegment-level instruction data. We train Momentor on Moment-10M, enabling it to\nperform segment-level reasoning and localization. Zero-shot evaluations on\nseveral tasks demonstrate that Momentor excels in fine-grained temporally\ngrounded comprehension and localization.\n","authors":["Long Qian","Juncheng Li","Yu Wu","Yaobo Ye","Hao Fei","Tat-Seng Chua","Yueting Zhuang","Siliang Tang"],"pdf_url":"https://arxiv.org/pdf/2402.11435v2.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2405.17158v2","updated":"2024-06-02T05:27:34Z","published":"2024-05-27T13:31:46Z","title":"PatchScaler: An Efficient Patch-Independent Diffusion Model for\n Super-Resolution","summary":" Diffusion models significantly improve the quality of super-resolved images\nwith their impressive content generation capabilities. However, the huge\ncomputational costs limit the applications of these methods.Recent efforts have\nexplored reasonable inference acceleration to reduce the number of sampling\nsteps, but the computational cost remains high as each step is performed on the\nentire image.This paper introduces PatchScaler, a patch-independent\ndiffusion-based single image super-resolution (SR) method, designed to enhance\nthe efficiency of the inference process.The proposed method is motivated by the\nobservation that not all the image patches within an image need the same\nsampling steps for reconstructing high-resolution images.Based on this\nobservation, we thus develop a Patch-adaptive Group Sampling (PGS) to divide\nfeature patches into different groups according to the patch-level\nreconstruction difficulty and dynamically assign an appropriate sampling\nconfiguration for each group so that the inference speed can be better\naccelerated.In addition, to improve the denoising ability at each step of the\nsampling, we develop a texture prompt to guide the estimations of the diffusion\nmodel by retrieving high-quality texture priors from a patch-independent\nreference texture memory.Experiments show that our PatchScaler achieves\nfavorable performance in both quantitative and qualitative evaluations with\nfast inference speed.Our code and model are available at\n\\url{https://github.com/yongliuy/PatchScaler}.\n","authors":["Yong Liu","Hang Dong","Jinshan Pan","Qingji Dong","Kai Chen","Rongxiang Zhang","Xing Mei","Lean Fu","Fei Wang"],"pdf_url":"https://arxiv.org/pdf/2405.17158v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17261v2","updated":"2024-06-02T05:20:57Z","published":"2023-10-26T09:25:09Z","title":"Attribute Based Interpretable Evaluation Metrics for Generative Models","summary":" When the training dataset comprises a 1:1 proportion of dogs to cats, a\ngenerative model that produces 1:1 dogs and cats better resembles the training\nspecies distribution than another model with 3:1 dogs and cats. Can we capture\nthis phenomenon using existing metrics? Unfortunately, we cannot, because these\nmetrics do not provide any interpretability beyond \"diversity\". In this\ncontext, we propose a new evaluation protocol that measures the divergence of a\nset of generated images from the training set regarding the distribution of\nattribute strengths as follows. Single-attribute Divergence (SaD) measures the\ndivergence regarding PDFs of a single attribute. Paired-attribute Divergence\n(PaD) measures the divergence regarding joint PDFs of a pair of attributes.\nThey provide which attributes the models struggle. For measuring the attribute\nstrengths of an image, we propose Heterogeneous CLIPScore (HCS) which measures\nthe cosine similarity between image and text vectors with heterogeneous initial\npoints. With SaD and PaD, we reveal the following about existing generative\nmodels. ProjectedGAN generates implausible attribute relationships such as a\nbaby with a beard even though it has competitive scores of existing metrics.\nDiffusion models struggle to capture diverse colors in the datasets. The larger\nsampling timesteps of latent diffusion model generate the more minor objects\nincluding earrings and necklaces. Stable Diffusion v1.5 better captures the\nattributes than v2.1. Our metrics lay a foundation for explainable evaluations\nof generative models.\n","authors":["Dongkyun Kim","Mingi Kwon","Youngjung Uh"],"pdf_url":"https://arxiv.org/pdf/2310.17261v2.pdf","comment":"Accepted by ICML2024"},{"id":"http://arxiv.org/abs/2404.03635v4","updated":"2024-06-02T04:56:32Z","published":"2024-04-04T17:54:33Z","title":"WorDepth: Variational Language Prior for Monocular Depth Estimation","summary":" Three-dimensional (3D) reconstruction from a single image is an ill-posed\nproblem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text\ndescription(s) is similarly ill-posed, i.e. spatial arrangements of objects\ndescribed. We investigate the question of whether two inherently ambiguous\nmodalities can be used in conjunction to produce metric-scaled reconstructions.\nTo test this, we focus on monocular depth estimation, the problem of predicting\na dense depth map from a single image, but with an additional text caption\ndescribing the scene. To this end, we begin by encoding the text caption as a\nmean and standard deviation; using a variational framework, we learn the\ndistribution of the plausible metric reconstructions of 3D scenes corresponding\nto the text captions as a prior. To \"select\" a specific reconstruction or depth\nmap, we encode the given image through a conditional sampler that samples from\nthe latent space of the variational text encoder, which is then decoded to the\noutput depth map. Our approach is trained alternatingly between the text and\nimage branches: in one optimization step, we predict the mean and standard\ndeviation from the text description and sample from a standard Gaussian, and in\nthe other, we sample using a (image) conditional sampler. Once trained, we\ndirectly predict depth from the encoded text using the conditional sampler. We\ndemonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where\nwe show that language can consistently improve performance in both.\n","authors":["Ziyao Zeng","Daniel Wang","Fengyu Yang","Hyoungseob Park","Yangchao Wu","Stefano Soatto","Byung-Woo Hong","Dong Lao","Alex Wong"],"pdf_url":"https://arxiv.org/pdf/2404.03635v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.19385v2","updated":"2024-06-02T04:45:00Z","published":"2024-02-29T17:36:39Z","title":"Towards Safe and Reliable Autonomous Driving: Dynamic Occupancy Set\n Prediction","summary":" In the rapidly evolving field of autonomous driving, reliable prediction is\npivotal for vehicular safety. However, trajectory predictions often deviate\nfrom actual paths, particularly in complex and challenging environments,\nleading to significant errors. To address this issue, our study introduces a\nnovel method for Dynamic Occupancy Set (DOS) prediction, it effectively\ncombines advanced trajectory prediction networks with a DOS prediction module,\novercoming the shortcomings of existing models. It provides a comprehensive and\nadaptable framework for predicting the potential occupancy sets of traffic\nparticipants. The innovative contributions of this study include the\ndevelopment of a novel DOS prediction model specifically tailored for\nnavigating complex scenarios, the introduction of precise DOS mathematical\nrepresentations, and the formulation of optimized loss functions that\ncollectively advance the safety and efficiency of autonomous systems. Through\nrigorous validation, our method demonstrates marked improvements over\ntraditional models, establishing a new benchmark for safety and operational\nefficiency in intelligent transportation systems.\n","authors":["Wenbo Shao","Jiahui Xu","Wenhao Yu","Jun Li","Hong Wang"],"pdf_url":"https://arxiv.org/pdf/2402.19385v2.pdf","comment":"Accepted by IEEE IV 2024"},{"id":"http://arxiv.org/abs/2405.19442v2","updated":"2024-06-02T04:16:01Z","published":"2024-05-29T18:40:11Z","title":"Large-scale DSM registration via motion averaging","summary":" Generating wide-area digital surface models (DSMs) requires registering a\nlarge number of individual, and partially overlapped DSMs. This presents a\nchallenging problem for a typical registration algorithm, since when a large\nnumber of observations from these multiple DSMs are considered, it may easily\ncause memory overflow. Sequential registration algorithms, although can\nsignificantly reduce the computation, are especially vulnerable for small\noverlapped pairs, leading to a large error accumulation. In this work, we\npropose a novel solution that builds the DSM registration task as a motion\naveraging problem: pair-wise DSMs are registered to build a scene graph, with\nedges representing relative poses between DSMs. Specifically, based on the grid\nstructure of the large DSM, the pair-wise registration is performed using a\nnovel nearest neighbor search method. We show that the scene graph can be\noptimized via an extremely fast motion average algorithm with O(N) complexity\n(N refers to the number of images). Evaluation of high-resolution\nsatellite-derived DSM demonstrates significant improvement in computation and\naccuracy.\n","authors":["Ningli Xu","Rongjun Qin"],"pdf_url":"https://arxiv.org/pdf/2405.19442v2.pdf","comment":"9 Figures"},{"id":"http://arxiv.org/abs/2311.17091v2","updated":"2024-06-02T03:00:49Z","published":"2023-11-28T05:17:25Z","title":"Beyond Sole Strength: Customized Ensembles for Generalized\n Vision-Language Models","summary":" Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for the\nopen-world generalization has gained increasing popularity due to its practical\nvalue. However, performance advancements are limited when relying solely on\nintricate algorithmic designs for a single model, even one exhibiting strong\nperformance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the\ncollaborative potential of leveraging much weaker VLMs to enhance the\ngeneralization of a robust single model. The affirmative findings motivate us\nto address the generalization problem from a novel perspective, i.e., ensemble\nof pre-trained VLMs. We introduce three customized ensemble strategies, each\ntailored to one specific scenario. Firstly, we introduce the zero-shot\nensemble, automatically adjusting the logits of different models based on their\nconfidence when only pre-trained VLMs are available. Furthermore, for scenarios\nwith extra few-shot samples, we propose the training-free and tuning ensemble,\noffering flexibility based on the availability of computing resources. The\nproposed ensemble strategies are evaluated on zero-shot, base-to-new, and\ncross-dataset generalization, achieving new state-of-the-art performance.\nNotably, this work represents an initial stride toward enhancing the\ngeneralization performance of VLMs via ensemble. The code is available at\nhttps://github.com/zhiheLu/Ensemble_VLM.git.\n","authors":["Zhihe Lu","Jiawang Bai","Xin Li","Zeyu Xiao","Xinchao Wang"],"pdf_url":"https://arxiv.org/pdf/2311.17091v2.pdf","comment":"Accepted on ICML 2024"},{"id":"http://arxiv.org/abs/2404.14329v2","updated":"2024-06-02T01:58:41Z","published":"2024-04-22T16:40:11Z","title":"X-Ray: A Sequential 3D Representation For Generation","summary":" We introduce X-Ray, a novel 3D sequential representation inspired by the\npenetrability of x-ray scans. X-Ray transforms a 3D object into a series of\nsurface frames at different layers, making it suitable for generating 3D models\nfrom images. Our method utilizes ray casting from the camera center to capture\ngeometric and textured details, including depth, normal, and color, across all\nintersected surfaces. This process efficiently condenses the whole 3D object\ninto a multi-frame video format, motivating the utilize of a network\narchitecture similar to those in video diffusion models. This design ensures an\nefficient 3D representation by focusing solely on surface information. Also, we\npropose a two-stage pipeline to generate 3D objects from X-Ray Diffusion Model\nand Upsampler. We demonstrate the practicality and adaptability of our X-Ray\nrepresentation by synthesizing the complete visible and hidden surfaces of a 3D\nobject from a single input image. Experimental results reveal the\nstate-of-the-art superiority of our representation in enhancing the accuracy of\n3D generation, paving the way for new 3D representation research and practical\napplications.\n","authors":["Tao Hu","Wenhang Ge","Yuyang Zhao","Gim Hee Lee"],"pdf_url":"https://arxiv.org/pdf/2404.14329v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.05963v3","updated":"2024-06-02T01:35:28Z","published":"2024-03-09T17:05:43Z","title":"Robust Emotion Recognition in Context Debiasing","summary":" Context-aware emotion recognition (CAER) has recently boosted the practical\napplications of affective computing techniques in unconstrained environments.\nMainstream CAER methods invariably extract ensemble representations from\ndiverse contexts and subject-centred characteristics to perceive the target\nperson's emotional state. Despite advancements, the biggest challenge remains\ndue to context bias interference. The harmful bias forces the models to rely on\nspurious correlations between background contexts and emotion labels in\nlikelihood estimation, causing severe performance bottlenecks and confounding\nvaluable context priors. In this paper, we propose a counterfactual emotion\ninference (CLEF) framework to address the above issue. Specifically, we first\nformulate a generalized causal graph to decouple the causal relationships among\nthe variables in CAER. Following the causal graph, CLEF introduces a\nnon-invasive context branch to capture the adverse direct effect caused by the\ncontext bias. During the inference, we eliminate the direct context effect from\nthe total causal effect by comparing factual and counterfactual outcomes,\nresulting in bias mitigation and robust prediction. As a model-agnostic\nframework, CLEF can be readily integrated into existing methods, bringing\nconsistent performance gains.\n","authors":["Dingkang Yang","Kun Yang","Mingcheng Li","Shunli Wang","Shuaibing Wang","Lihua Zhang"],"pdf_url":"https://arxiv.org/pdf/2403.05963v3.pdf","comment":"Accepted by CVPR 2024"},{"id":"http://arxiv.org/abs/2311.03356v3","updated":"2024-06-02T00:33:53Z","published":"2023-11-06T18:59:57Z","title":"GLaMM: Pixel Grounding Large Multimodal Model","summary":" Large Multimodal Models (LMMs) extend Large Language Models to the vision\ndomain. Initial LMMs used holistic images and text prompts to generate\nungrounded textual responses. Recently, region-level LMMs have been used to\ngenerate visually grounded responses. However, they are limited to only\nreferring to a single object category at a time, require users to specify the\nregions, or cannot offer dense pixel-wise object grounding. In this work, we\npresent Grounding LMM (GLaMM), the first model that can generate natural\nlanguage responses seamlessly intertwined with corresponding object\nsegmentation masks. GLaMM not only grounds objects appearing in the\nconversations but is flexible enough to accept both textual and optional visual\nprompts (region of interest) as input. This empowers users to interact with the\nmodel at various levels of granularity, both in textual and visual domains. Due\nto the lack of standard benchmarks for the novel setting of visually Grounded\nConversation Generation (GCG), we introduce a comprehensive evaluation protocol\nwith our curated grounded conversations. Our proposed GCG task requires densely\ngrounded concepts in natural scenes at a large-scale. To this end, we propose a\ndensely annotated Grounding-anything Dataset (GranD) using our proposed\nautomated annotation pipeline that encompasses 7.5M unique concepts grounded in\na total of 810M regions available with segmentation masks. Besides GCG, GLaMM\nalso performs effectively on several downstream tasks, e.g., referring\nexpression segmentation, image and region-level captioning and vision-language\nconversations.\n","authors":["Hanoona Rasheed","Muhammad Maaz","Sahal Shaji Mullappilly","Abdelrahman Shaker","Salman Khan","Hisham Cholakkal","Rao M. Anwer","Erix Xing","Ming-Hsuan Yang","Fahad S. Khan"],"pdf_url":"https://arxiv.org/pdf/2311.03356v3.pdf","comment":"CVPR 2024"},{"id":"http://arxiv.org/abs/2405.05477v2","updated":"2024-06-02T00:00:13Z","published":"2024-05-09T00:30:45Z","title":"DynaSeg: A Deep Dynamic Fusion Method for Unsupervised Image\n Segmentation Incorporating Feature Similarity and Spatial Continuity","summary":" Our work tackles the fundamental challenge of image segmentation in computer\nvision, which is crucial for diverse applications. While supervised methods\ndemonstrate proficiency, their reliance on extensive pixel-level annotations\nlimits scalability. We introduce DynaSeg, an innovative unsupervised image\nsegmentation approach that overcomes the challenge of balancing feature\nsimilarity and spatial continuity without relying on extensive hyperparameter\ntuning. Unlike traditional methods, DynaSeg employs a dynamic weighting scheme\nthat automates parameter tuning, adapts flexibly to image characteristics, and\nfacilitates easy integration with other segmentation networks. By incorporating\na Silhouette Score Phase, DynaSeg prevents undersegmentation failures where the\nnumber of predicted clusters might converge to one. DynaSeg uses CNN-based and\npre-trained ResNet feature extraction, making it computationally efficient and\nmore straightforward than other complex models. Experimental results showcase\nstate-of-the-art performance, achieving a 12.2% and 14.12% mIOU improvement\nover current unsupervised segmentation approaches on COCO-All and COCO-Stuff\ndatasets, respectively. We provide qualitative and quantitative results on five\nbenchmark datasets, demonstrating the efficacy of the proposed approach.\n","authors":["Boujemaa Guermazi","Naimul Khan"],"pdf_url":"https://arxiv.org/pdf/2405.05477v2.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2402.01481v4","updated":"2024-06-02T23:17:38Z","published":"2024-02-02T15:07:09Z","title":"Pre-Training Protein Bi-level Representation Through Span Mask Strategy\n On 3D Protein Chains","summary":" In recent years, there has been a surge in the development of 3D\nstructure-based pre-trained protein models, representing a significant\nadvancement over pre-trained protein language models in various downstream\ntasks. However, most existing structure-based pre-trained models primarily\nfocus on the residue level, i.e., alpha carbon atoms, while ignoring other\natoms like side chain atoms. We argue that modeling proteins at both residue\nand atom levels is important since the side chain atoms can also be crucial for\nnumerous downstream tasks, for example, molecular docking. Nevertheless, we\nfind that naively combining residue and atom information during pre-training\ntypically fails. We identify a key reason is the information leakage caused by\nthe inclusion of atom structure in the input, which renders residue-level\npre-training tasks trivial and results in insufficiently expressive residue\nrepresentations. To address this issue, we introduce a span mask pre-training\nstrategy on 3D protein chains to learn meaningful representations of both\nresidues and atoms. This leads to a simple yet effective approach to learning\nprotein representation suitable for diverse downstream tasks. Extensive\nexperimental results on binding site prediction and function prediction tasks\ndemonstrate our proposed pre-training approach significantly outperforms other\nmethods. Our code will be made public.\n","authors":["Jiale Zhao","Wanru Zhuang","Jia Song","Yaqi Li","Shuqi Lu"],"pdf_url":"https://arxiv.org/pdf/2402.01481v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13673v3","updated":"2024-06-02T23:16:50Z","published":"2023-05-23T04:28:16Z","title":"Physics of Language Models: Part 1, Learning Hierarchical Language\n Structures","summary":" Transformer-based language models are effective but complex, and\nunderstanding their inner workings is a significant challenge. Previous\nresearch has primarily explored how these models handle simple tasks like name\ncopying or selection, and we extend this by investigating how these models\ngrasp complex, recursive language structures defined by context-free grammars\n(CFGs). We introduce a family of synthetic CFGs that produce hierarchical\nrules, capable of generating lengthy sentences (e.g., hundreds of tokens) that\nare locally ambiguous and require dynamic programming to parse. Despite this\ncomplexity, we demonstrate that generative models like GPT can accurately learn\nthis CFG language and generate sentences based on it. We explore the model's\ninternals, revealing that its hidden states precisely capture the structure of\nCFGs, and its attention patterns resemble the information passing in a dynamic\nprogramming algorithm.\n This paper also presents several corollaries, including showing why\npositional embedding is inferior to relative attention or rotary embedding;\ndemonstrating that encoder-based models (e.g., BERT, deBERTa) cannot learn very\ndeeply nested CFGs as effectively as generative models (e.g., GPT); and\nhighlighting the necessity of adding structural and syntactic errors to the\npretraining data to make the model more robust to corrupted language prefixes.\n","authors":["Zeyuan Allen-Zhu","Yuanzhi Li"],"pdf_url":"https://arxiv.org/pdf/2305.13673v3.pdf","comment":"V2+V3 polishes writing; V3 includes Figures 6 and 10 for better\n illustrations of our results"},{"id":"http://arxiv.org/abs/2403.03458v2","updated":"2024-06-02T23:04:43Z","published":"2024-03-06T04:49:02Z","title":"Slot Abstractors: Toward Scalable Abstract Visual Reasoning","summary":" Abstract visual reasoning is a characteristically human ability, allowing the\nidentification of relational patterns that are abstracted away from object\nfeatures, and the systematic generalization of those patterns to unseen\nproblems. Recent work has demonstrated strong systematic generalization in\nvisual reasoning tasks involving multi-object inputs, through the integration\nof slot-based methods used for extracting object-centric representations\ncoupled with strong inductive biases for relational abstraction. However, this\napproach was limited to problems containing a single rule, and was not scalable\nto visual reasoning problems containing a large number of objects. Other recent\nwork proposed Abstractors, an extension of Transformers that incorporates\nstrong relational inductive biases, thereby inheriting the Transformer's\nscalability and multi-head architecture, but it has yet to be demonstrated how\nthis approach might be applied to multi-object visual inputs. Here we combine\nthe strengths of the above approaches and propose Slot Abstractors, an approach\nto abstract visual reasoning that can be scaled to problems involving a large\nnumber of objects and multiple relations among them. The approach displays\nstate-of-the-art performance across four abstract visual reasoning tasks, as\nwell as an abstract reasoning task involving real-world images.\n","authors":["Shanka Subhra Mondal","Jonathan D. Cohen","Taylor W. Webb"],"pdf_url":"https://arxiv.org/pdf/2403.03458v2.pdf","comment":"18 pages, 9 figures"},{"id":"http://arxiv.org/abs/2402.01116v4","updated":"2024-06-02T23:01:08Z","published":"2024-02-02T03:19:54Z","title":"Scalable Multi-modal Model Predictive Control via Duality-based\n Interaction Predictions","summary":" We propose a hierarchical architecture designed for scalable real-time Model\nPredictive Control (MPC) in complex, multi-modal traffic scenarios. This\narchitecture comprises two key components: 1) RAID-Net, a novel attention-based\nRecurrent Neural Network that predicts relevant interactions along the MPC\nprediction horizon between the autonomous vehicle and the surrounding vehicles\nusing Lagrangian duality, and 2) a reduced Stochastic MPC problem that\neliminates irrelevant collision avoidance constraints, enhancing computational\nefficiency. Our approach is demonstrated in a simulated traffic intersection\nwith interactive surrounding vehicles, showcasing a 12x speed-up in solving the\nmotion planning problem. A video demonstrating the proposed architecture in\nmultiple complex traffic scenarios can be found here:\nhttps://youtu.be/-pRiOnPb9_c. GitHub:\nhttps://github.com/MPC-Berkeley/hmpc_raidnet\n","authors":["Hansung Kim","Siddharth H. Nair","Francesco Borrelli"],"pdf_url":"https://arxiv.org/pdf/2402.01116v4.pdf","comment":"Accepted at IEEE Intelligent Vehicles Symposium 2024"},{"id":"http://arxiv.org/abs/2402.06122v3","updated":"2024-06-02T22:41:02Z","published":"2024-02-09T01:11:34Z","title":"Peeking with PEAK: Sequential, Nonparametric Composite Hypothesis Tests\n for Means of Multiple Data Streams","summary":" We propose a novel nonparametric sequential test for composite hypotheses for\nmeans of multiple data streams. Our proposed method, \\emph{peeking with\nexpectation-based averaged capital} (PEAK), builds upon the testing-by-betting\nframework and provides a non-asymptotic $\\alpha$-level test across any stopping\ntime. Our contributions are two-fold: (1) we propose a novel betting scheme and\nprovide theoretical guarantees on type-I error control, power, and asymptotic\ngrowth rate/$e$-power in the setting of a single data stream; (2) we introduce\nPEAK, a generalization of this betting scheme to multiple streams, that (i)\navoids using wasteful union bounds via averaging, (ii) is a test of power one\nunder mild regularity conditions on the sampling scheme of the streams, and\n(iii) reduces computational overhead when applying the testing-as-betting\napproaches for pure-exploration bandit problems. We illustrate the practical\nbenefits of PEAK using both synthetic and real-world HeartSteps datasets. Our\nexperiments show that PEAK provides up to an 85\\% reduction in the number of\nsamples before stopping compared to existing stopping rules for\npure-exploration bandit problems, and matches the performance of\nstate-of-the-art sequential tests while improving upon computational\ncomplexity.\n","authors":["Brian Cho","Kyra Gan","Nathan Kallus"],"pdf_url":"https://arxiv.org/pdf/2402.06122v3.pdf","comment":"To appear at the Forty-first International Conference on Machine\n Learning (ICML 2024)"},{"id":"http://arxiv.org/abs/2305.14689v2","updated":"2024-06-02T22:26:44Z","published":"2023-05-24T03:52:48Z","title":"Least Squares Regression Can Exhibit Under-Parameterized Double Descent","summary":" The relationship between the number of training data points, the number of\nparameters, and the generalization capabilities has been widely studied.\nPrevious work has shown that double descent can occur in the over-parameterized\nregime, and believe that the standard bias-variance trade-off holds in the\nunder-parameterized regime. These works provide multiple reasons for the\nexistence of the peak. We postulate that the location of the peak depends on\nthe technical properties of both the spectrum as well as the eigenvectors of\nthe sample covariance. We present two simple examples that provably exhibit\ndouble descent in the under-parameterized regime and do not seem to occur for\nreasons provided in prior work.\n","authors":["Xinyue Li","Rishi Sonthalia"],"pdf_url":"https://arxiv.org/pdf/2305.14689v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09778v3","updated":"2024-06-02T22:15:18Z","published":"2023-12-15T13:30:04Z","title":"Hypergraph-MLP: Learning on Hypergraphs without Message Passing","summary":" Hypergraphs are vital in modelling data with higher-order relations\ncontaining more than two entities, gaining prominence in machine learning and\nsignal processing. Many hypergraph neural networks leverage message passing\nover hypergraph structures to enhance node representation learning, yielding\nimpressive performances in tasks like hypergraph node classification. However,\nthese message-passing-based models face several challenges, including\noversmoothing as well as high latency and sensitivity to structural\nperturbations at inference time. To tackle those challenges, we propose an\nalternative approach where we integrate the information about hypergraph\nstructures into training supervision without explicit message passing, thus\nalso removing the reliance on it at inference. Specifically, we introduce\nHypergraph-MLP, a novel learning framework for hypergraph-structured data,\nwhere the learning model is a straightforward multilayer perceptron (MLP)\nsupervised by a loss function based on a notion of signal smoothness on\nhypergraphs. Experiments on hypergraph node classification tasks demonstrate\nthat Hypergraph-MLP achieves competitive performance compared to existing\nbaselines, and is considerably faster and more robust against structural\nperturbations at inference.\n","authors":["Bohan Tang","Siheng Chen","Xiaowen Dong"],"pdf_url":"https://arxiv.org/pdf/2312.09778v3.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2404.14367v3","updated":"2024-06-02T22:00:42Z","published":"2024-04-22T17:20:18Z","title":"Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy\n Data","summary":" Learning from preference labels plays a crucial role in fine-tuning large\nlanguage models. There are several distinct approaches for preference\nfine-tuning, including supervised learning, on-policy reinforcement learning\n(RL), and contrastive learning. Different methods come with different\nimplementation tradeoffs and performance differences, and existing empirical\nfindings present different conclusions, for instance, some results show that\nonline RL is quite important to attain good fine-tuning results, while others\nfind (offline) contrastive or even purely supervised methods sufficient. This\nraises a natural question: what kind of approaches are important for\nfine-tuning with preference data and why? In this paper, we answer this\nquestion by performing a rigorous analysis of a number of fine-tuning\ntechniques on didactic and full-scale LLM problems. Our main finding is that,\nin general, approaches that use on-policy sampling or attempt to push down the\nlikelihood on certain responses (i.e., employ a \"negative gradient\") outperform\noffline and maximum likelihood objectives. We conceptualize our insights and\nunify methods that use on-policy sampling or negative gradient under a notion\nof mode-seeking objectives for categorical distributions. Mode-seeking\nobjectives are able to alter probability mass on specific bins of a categorical\ndistribution at a fast rate compared to maximum likelihood, allowing them to\nrelocate masses across bins more effectively. Our analysis prescribes\nactionable insights for preference fine-tuning of LLMs and informs how data\nshould be collected for maximal improvement.\n","authors":["Fahim Tajwar","Anikait Singh","Archit Sharma","Rafael Rafailov","Jeff Schneider","Tengyang Xie","Stefano Ermon","Chelsea Finn","Aviral Kumar"],"pdf_url":"https://arxiv.org/pdf/2404.14367v3.pdf","comment":"International Conference on Machine Learning (ICML), 2024"},{"id":"http://arxiv.org/abs/2402.04493v2","updated":"2024-06-02T21:38:47Z","published":"2024-02-07T00:33:11Z","title":"A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning\n with Linear MDPs","summary":" We study offline reinforcement learning (RL) with linear MDPs under the\ninfinite-horizon discounted setting which aims to learn a policy that maximizes\nthe expected discounted cumulative reward using a pre-collected dataset.\nExisting algorithms for this setting either require a uniform data coverage\nassumptions or are computationally inefficient for finding an\n$\\epsilon$-optimal policy with $O(\\epsilon^{-2})$ sample complexity. In this\npaper, we propose a primal dual algorithm for offline RL with linear MDPs in\nthe infinite-horizon discounted setting. Our algorithm is the first\ncomputationally efficient algorithm in this setting that achieves sample\ncomplexity of $O(\\epsilon^{-2})$ with partial data coverage assumption. Our\nwork is an improvement upon a recent work that requires $O(\\epsilon^{-4})$\nsamples. Moreover, we extend our algorithm to work in the offline constrained\nRL setting that enforces constraints on additional reward signals.\n","authors":["Kihyuk Hong","Ambuj Tewari"],"pdf_url":"https://arxiv.org/pdf/2402.04493v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03887v3","updated":"2024-06-02T21:30:13Z","published":"2023-07-08T03:42:54Z","title":"Improving Prototypical Part Networks with Reward Reweighing,\n Reselection, and Retraining","summary":" In recent years, work has gone into developing deep interpretable methods for\nimage classification that clearly attributes a model's output to specific\nfeatures of the data. One such of these methods is the Prototypical Part\nNetwork (ProtoPNet), which attempts to classify images based on meaningful\nparts of the input. While this architecture is able to produce visually\ninterpretable classifications, it often learns to classify based on parts of\nthe image that are not semantically meaningful. To address this problem, we\npropose the Reward Reweighing, Reselecting, and Retraining (R3) post-processing\nframework, which performs three additional corrective updates to a pretrained\nProtoPNet in an offline and efficient manner. The first two steps involve\nlearning a reward model based on collected human feedback and then aligning the\nprototypes with human preferences. The final step is retraining, which realigns\nthe base features and the classifier layer of the original model with the\nupdated prototypes. We find that our R3 framework consistently improves both\nthe interpretability and the predictive accuracy of ProtoPNet and its variants.\n","authors":["Aaron J. Li","Robin Netzorg","Zhihan Cheng","Zhuoqin Zhang","Bin Yu"],"pdf_url":"https://arxiv.org/pdf/2307.03887v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03507v2","updated":"2024-06-02T21:24:12Z","published":"2024-03-06T07:29:57Z","title":"GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection","summary":" Training Large Language Models (LLMs) presents significant memory challenges,\npredominantly due to the growing size of weights and optimizer states. Common\nmemory-reduction approaches, such as low-rank adaptation (LoRA), add a\ntrainable low-rank matrix to the frozen pre-trained weight in each layer,\nreducing trainable parameters and optimizer states. However, such approaches\ntypically underperform training with full-rank weights in both pre-training and\nfine-tuning stages since they limit the parameter search to a low-rank subspace\nand alter the training dynamics, and further, may require full-rank warm start.\nIn this work, we propose Gradient Low-Rank Projection (GaLore), a training\nstrategy that allows full-parameter learning but is more memory-efficient than\ncommon low-rank adaptation methods such as LoRA. Our approach reduces memory\nusage by up to 65.5% in optimizer states while maintaining both efficiency and\nperformance for pre-training on LLaMA 1B and 7B architectures with C4 dataset\nwith up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit\nGaLore further reduces optimizer memory by up to 82.5% and total training\nmemory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the\nfirst time, the feasibility of pre-training a 7B model on consumer GPUs with\n24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or\noffloading strategies.\n","authors":["Jiawei Zhao","Zhenyu Zhang","Beidi Chen","Zhangyang Wang","Anima Anandkumar","Yuandong Tian"],"pdf_url":"https://arxiv.org/pdf/2403.03507v2.pdf","comment":"ICML 2024 (Oral)"}],"Multimedia":[{"id":"http://arxiv.org/abs/2308.00264v4","updated":"2024-06-02T19:12:57Z","published":"2023-08-01T03:54:27Z","title":"Multimodal Multi-loss Fusion Network for Sentiment Analysis","summary":" This paper investigates the optimal selection and fusion of feature encoders\nacross multiple modalities and combines these in one neural network to improve\nsentiment detection. We compare different fusion methods and examine the impact\nof multi-loss training within the multi-modality fusion network, identifying\nsurprisingly important findings relating to subnet performance. We have also\nfound that integrating context significantly enhances model performance. Our\nbest model achieves state-of-the-art performance for three datasets (CMU-MOSI,\nCMU-MOSEI and CH-SIMS). These results suggest a roadmap toward an optimized\nfeature selection and fusion approach for enhancing sentiment detection in\nneural networks.\n","authors":["Zehui Wu","Ziwei Gong","Jaywon Koo","Julia Hirschberg"],"pdf_url":"https://arxiv.org/pdf/2308.00264v4.pdf","comment":"First two authors contributed equally to the paper"},{"id":"http://arxiv.org/abs/2404.03635v4","updated":"2024-06-02T04:56:32Z","published":"2024-04-04T17:54:33Z","title":"WorDepth: Variational Language Prior for Monocular Depth Estimation","summary":" Three-dimensional (3D) reconstruction from a single image is an ill-posed\nproblem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text\ndescription(s) is similarly ill-posed, i.e. spatial arrangements of objects\ndescribed. We investigate the question of whether two inherently ambiguous\nmodalities can be used in conjunction to produce metric-scaled reconstructions.\nTo test this, we focus on monocular depth estimation, the problem of predicting\na dense depth map from a single image, but with an additional text caption\ndescribing the scene. To this end, we begin by encoding the text caption as a\nmean and standard deviation; using a variational framework, we learn the\ndistribution of the plausible metric reconstructions of 3D scenes corresponding\nto the text captions as a prior. To \"select\" a specific reconstruction or depth\nmap, we encode the given image through a conditional sampler that samples from\nthe latent space of the variational text encoder, which is then decoded to the\noutput depth map. Our approach is trained alternatingly between the text and\nimage branches: in one optimization step, we predict the mean and standard\ndeviation from the text description and sample from a standard Gaussian, and in\nthe other, we sample using a (image) conditional sampler. Once trained, we\ndirectly predict depth from the encoded text using the conditional sampler. We\ndemonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where\nwe show that language can consistently improve performance in both.\n","authors":["Ziyao Zeng","Daniel Wang","Fengyu Yang","Hyoungseob Park","Yangchao Wu","Stefano Soatto","Byung-Woo Hong","Dong Lao","Alex Wong"],"pdf_url":"https://arxiv.org/pdf/2404.03635v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00901v1","updated":"2024-06-02T23:51:43Z","published":"2024-06-02T23:51:43Z","title":"Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach","summary":" The process of reconstructing missing parts of speech audio from context is\ncalled speech in-painting. Human perception of speech is inherently\nmulti-modal, involving both audio and visual (AV) cues. In this paper, we\nintroduce and study a sequence-to-sequence (seq2seq) speech in-painting model\nthat incorporates AV features. Our approach extends AV speech in-painting\ntechniques to scenarios where both audio and visual data may be jointly\ncorrupted. To achieve this, we employ a multi-modal training paradigm that\nboosts the robustness of our model across various conditions involving acoustic\nand visual distortions. This makes our distortion-aware model a plausible\nsolution for real-world challenging environments. We compare our method with\nexisting transformer-based and recurrent neural network-based models, which\nattempt to reconstruct missing speech gaps ranging from a few milliseconds to\nover a second. Our experimental results demonstrate that our novel seq2seq\narchitecture outperforms the state-of-the-art transformer solution by 38.8% in\nterms of enhancing speech quality and 7.14% in terms of improving speech\nintelligibility. We exploit a multi-task learning framework that simultaneously\nperforms lip-reading (transcribing video components to text) while\nreconstructing missing parts of the associated speech.\n","authors":["Mahsa Kadkhodaei Elyaderani","Shahram Shirani"],"pdf_url":"https://arxiv.org/pdf/2406.00901v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00791v1","updated":"2024-06-02T16:13:57Z","published":"2024-06-02T16:13:57Z","title":"Towards Point Cloud Compression for Machine Perception: A Simple and\n Strong Baseline by Learning the Octree Depth Level Predictor","summary":" Point cloud compression has garnered significant interest in computer vision.\nHowever, existing algorithms primarily cater to human vision, while most point\ncloud data is utilized for machine vision tasks. To address this, we propose a\npoint cloud compression framework that simultaneously handles both human and\nmachine vision tasks. Our framework learns a scalable bit-stream, using only\nsubsets for different machine vision tasks to save bit-rate, while employing\nthe entire bit-stream for human vision tasks. Building on mainstream\noctree-based frameworks like VoxelContext-Net, OctAttention, and G-PCC, we\nintroduce a new octree depth-level predictor. This predictor adaptively\ndetermines the optimal depth level for each octree constructed from a point\ncloud, controlling the bit-rate for machine vision tasks. For simpler tasks\n(\\textit{e.g.}, classification) or objects/scenarios, we use fewer depth levels\nwith fewer bits, saving bit-rate. Conversely, for more complex tasks\n(\\textit{e.g}., segmentation) or objects/scenarios, we use deeper depth levels\nwith more bits to enhance performance. Experimental results on various datasets\n(\\textit{e.g}., ModelNet10, ModelNet40, ShapeNet, ScanNet, and KITTI) show that\nour point cloud compression approach improves performance for machine vision\ntasks without compromising human vision quality.\n","authors":["Lei Liu","Zhihao Hu","Zhenghao Chen"],"pdf_url":"https://arxiv.org/pdf/2406.00791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00758v1","updated":"2024-06-02T14:22:09Z","published":"2024-06-02T14:22:09Z","title":"Once-for-All: Controllable Generative Image Compression with Dynamic\n Granularity Adaption","summary":" Although recent generative image compression methods have demonstrated\nimpressive potential in optimizing the rate-distortion-perception trade-off,\nthey still face the critical challenge of flexible rate adaption to diverse\ncompression necessities and scenarios. To overcome this challenge, this paper\nproposes a Controllable Generative Image Compression framework, Control-GIC,\nthe first capable of fine-grained bitrate adaption across a broad spectrum\nwhile ensuring high-fidelity and generality compression. We base Control-GIC on\na VQGAN framework representing an image as a sequence of variable-length codes\n(i.e. VQ-indices), which can be losslessly compressed and exhibits a direct\npositive correlation with the bitrates. Therefore, drawing inspiration from the\nclassical coding principle, we naturally correlate the information density of\nlocal image patches with their granular representations, to achieve dynamic\nadjustment of the code quantity following different granularity decisions. This\nimplies we can flexibly determine a proper allocation of granularity for the\npatches to acquire desirable compression rates. We further develop a\nprobabilistic conditional decoder that can trace back to historic encoded\nmulti-granularity representations according to transmitted codes, and then\nreconstruct hierarchical granular features in the formalization of conditional\nprobability, enabling more informative aggregation to improve reconstruction\nrealism. Our experiments show that Control-GIC allows highly flexible and\ncontrollable bitrate adaption and even once compression on an entire dataset to\nfulfill constrained bitrate conditions. Experimental results demonstrate its\nsuperior performance over recent state-of-the-art methods.\n","authors":["Anqi Li","Yuxi Liu","Huihui Bai","Feng Li","Runmin Cong","Meng Wang","Yao Zhao"],"pdf_url":"https://arxiv.org/pdf/2406.00758v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00683v1","updated":"2024-06-02T09:36:37Z","published":"2024-06-02T09:36:37Z","title":"Exploiting Frequency Correlation for Hyperspectral Image Reconstruction","summary":" Deep priors have emerged as potent methods in hyperspectral image (HSI)\nreconstruction. While most methods emphasize space-domain learning using image\nspace priors like non-local similarity, frequency-domain learning using image\nfrequency priors remains neglected, limiting the reconstruction capability of\nnetworks. In this paper, we first propose a Hyperspectral Frequency Correlation\n(HFC) prior rooted in in-depth statistical frequency analyses of existent HSI\ndatasets. Leveraging the HFC prior, we subsequently establish the frequency\ndomain learning composed of a Spectral-wise self-Attention of Frequency (SAF)\nand a Spectral-spatial Interaction of Frequency (SIF) targeting low-frequency\nand high-frequency components, respectively. The outputs of SAF and SIF are\nadaptively merged by a learnable gating filter, thus achieving a thorough\nexploitation of image frequency priors. Integrating the frequency domain\nlearning and the existing space domain learning, we finally develop the\nCorrelation-driven Mixing Domains Transformer (CMDT) for HSI reconstruction.\nExtensive experiments highlight that our method surpasses various\nstate-of-the-art (SOTA) methods in reconstruction quality and computational\nefficiency.\n","authors":["Muge Yan","Lizhi Wang","Lin Zhu","Hua Huang"],"pdf_url":"https://arxiv.org/pdf/2406.00683v1.pdf","comment":"14 pages, 11 figures"},{"id":"http://arxiv.org/abs/2406.00626v1","updated":"2024-06-02T06:08:41Z","published":"2024-06-02T06:08:41Z","title":"Intelligent Text-Conditioned Music Generation","summary":" CLIP (Contrastive Language-Image Pre-Training) is a multimodal neural network\ntrained on (text, image) pairs to predict the most relevant text caption given\nan image. It has been used extensively in image generation by connecting its\noutput with a generative model such as VQGAN, with the most notable example\nbeing OpenAI's DALLE-2. In this project, we apply a similar approach to bridge\nthe gap between natural language and music. Our model is split into two steps:\nfirst, we train a CLIP-like model on pairs of text and music over contrastive\nloss to align a piece of music with its most probable text caption. Then, we\ncombine the alignment model with a music decoder to generate music. To the best\nof our knowledge, this is the first attempt at text-conditioned deep music\ngeneration. Our experiments show that it is possible to train the text-music\nalignment model using contrastive loss and train a decoder to generate music\nfrom text prompts.\n","authors":["Zhouyao Xie","Nikhil Yadala","Xinyi Chen","Jing Xi Liu"],"pdf_url":"https://arxiv.org/pdf/2406.00626v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2406.00725v1","updated":"2024-06-02T12:21:10Z","published":"2024-06-02T12:21:10Z","title":"Maximum-Entropy Regularized Decision Transformer with Reward Relabelling\n for Dynamic Recommendation","summary":" Reinforcement learning-based recommender systems have recently gained\npopularity. However, due to the typical limitations of simulation environments\n(e.g., data inefficiency), most of the work cannot be broadly applied in all\ndomains. To counter these challenges, recent advancements have leveraged\noffline reinforcement learning methods, notable for their data-driven approach\nutilizing offline datasets. A prominent example of this is the Decision\nTransformer. Despite its popularity, the Decision Transformer approach has\ninherent drawbacks, particularly evident in recommendation methods based on it.\nThis paper identifies two key shortcomings in existing Decision\nTransformer-based methods: a lack of stitching capability and limited\neffectiveness in online adoption. In response, we introduce a novel methodology\nnamed Max-Entropy enhanced Decision Transformer with Reward Relabeling for\nOffline RLRS (EDT4Rec). Our approach begins with a max entropy perspective,\nleading to the development of a max entropy enhanced exploration strategy. This\nstrategy is designed to facilitate more effective exploration in online\nenvironments. Additionally, to augment the model's capability to stitch\nsub-optimal trajectories, we incorporate a unique reward relabeling technique.\nTo validate the effectiveness and superiority of EDT4Rec, we have conducted\ncomprehensive experiments across six real-world offline datasets and in an\nonline simulator.\n","authors":["Xiaocong Chen","Siyu Wang","Lina Yao"],"pdf_url":"https://arxiv.org/pdf/2406.00725v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00682v1","updated":"2024-06-02T09:36:33Z","published":"2024-06-02T09:36:33Z","title":"A lexicon obtained and validated by a data-driven approach for organic\n residues valorization in emerging and developing countries","summary":" The text mining method presented in this paper was used for annotation of\nterms related to biological transformation and valorization of organic residues\nin agriculture in low and middle-income country. Specialized lexicon was\nobtained through different steps: corpus and extraction of terms, annotation of\nextracted terms, selection of relevant terms.\n","authors":["Christiane Rakotomalala","Jean-Marie Paillat","Frédéric Feder","Angel Avadí","Laurent Thuriès","Marie-Liesse Vermeire","Jean-Michel Médoc","Tom Wassenaar","Caroline Hottelart","Lilou Kieffer","Elisa Ndjie","Mathieu Picart","Jorel Tchamgoue","Alvin Tulle","Laurine Valade","Annie Boyer","Marie-Christine Duchamp","Mathieu Roche"],"pdf_url":"https://arxiv.org/pdf/2406.00682v1.pdf","comment":"5 pages, 2 tables"},{"id":"http://arxiv.org/abs/2406.00638v1","updated":"2024-06-02T06:48:43Z","published":"2024-06-02T06:48:43Z","title":"COS-Mix: Cosine Similarity and Distance Fusion for Improved Information\n Retrieval","summary":" This study proposes a novel hybrid retrieval strategy for Retrieval-Augmented\nGeneration (RAG) that integrates cosine similarity and cosine distance measures\nto improve retrieval performance, particularly for sparse data. The traditional\ncosine similarity measure is widely used to capture the similarity between\nvectors in high-dimensional spaces. However, it has been shown that this\nmeasure can yield arbitrary results in certain scenarios. To address this\nlimitation, we incorporate cosine distance measures to provide a complementary\nperspective by quantifying the dissimilarity between vectors. Our approach is\nexperimented on proprietary data, unlike recent publications that have used\nopen-source datasets. The proposed method demonstrates enhanced retrieval\nperformance and provides a more comprehensive understanding of the semantic\nrelationships between documents or items. This hybrid strategy offers a\npromising solution for efficiently and accurately retrieving relevant\ninformation in knowledge-intensive applications, leveraging techniques such as\nBM25 (sparse) retrieval , vector (Dense) retrieval, and cosine distance based\nretrieval to facilitate efficient information retrieval.\n","authors":["Kush Juvekar","Anupam Purwar"],"pdf_url":"https://arxiv.org/pdf/2406.00638v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00615v1","updated":"2024-06-02T04:33:52Z","published":"2024-06-02T04:33:52Z","title":"Making Recommender Systems More Knowledgeable: A Framework to\n Incorporate Side Information","summary":" Session-based recommender systems typically focus on using only the triplet\n(user_id, timestamp, item_id) to make predictions of users' next actions. In\nthis paper, we aim to utilize side information to help recommender systems\ncatch patterns and signals otherwise undetectable. Specifically, we propose a\ngeneral framework for incorporating item-specific side information into the\nrecommender system to enhance its performance without much modification on the\noriginal model architecture. Experimental results on several models and\ndatasets prove that with side information, our recommender system outperforms\nstate-of-the-art models by a considerable margin and converges much faster.\nAdditionally, we propose a new type of loss to regularize the attention\nmechanism used by recommender systems and evaluate its influence on model\nperformance. Furthermore, through analysis, we put forward a few insights on\npotential further improvements.\n","authors":["Yukun Jiang","Leo Guo","Xinyi Chen","Jing Xi Liu"],"pdf_url":"https://arxiv.org/pdf/2406.00615v1.pdf","comment":"15 pages, 8 figures"},{"id":"http://arxiv.org/abs/2406.02606v1","updated":"2024-06-02T18:26:50Z","published":"2024-06-02T18:26:50Z","title":"Know Your Neighborhood: General and Zero-Shot Capable Binary Function\n Search Powered by Call Graphlets","summary":" Binary code similarity detection is an important problem with applications in\nareas like malware analysis, vulnerability research and plagiarism detection.\nThis paper proposes a novel graph neural network architecture combined with a\nnovel graph data representation called call graphlets. A call graphlet encodes\nthe neighborhood around each function in a binary executable, capturing the\nlocal and global context through a series of statistical features. A\nspecialized graph neural network model is then designed to operate on this\ngraph representation, learning to map it to a feature vector that encodes\nsemantic code similarities using deep metric learning. The proposed approach is\nevaluated across four distinct datasets covering different architectures,\ncompiler toolchains, and optimization levels. Experimental results demonstrate\nthat the combination of call graphlets and the novel graph neural network\narchitecture achieves state-of-the-art performance compared to baseline\ntechniques across cross-architecture, mono-architecture and zero shot tasks. In\naddition, our proposed approach also performs well when evaluated against an\nout-of-domain function inlining task. Overall, the work provides a general and\neffective graph neural network-based solution for conducting binary code\nsimilarity detection.\n","authors":["Joshua Collyer","Tim Watson","Iain Phillips"],"pdf_url":"https://arxiv.org/pdf/2406.02606v1.pdf","comment":null}]},"2024-06-01T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2401.06853v4","updated":"2024-06-01T22:38:32Z","published":"2024-01-12T19:00:26Z","title":"Large Language Models Can Learn Temporal Reasoning","summary":" While large language models (LLMs) have demonstrated remarkable reasoning\ncapabilities, they are not without their flaws and inaccuracies. Recent studies\nhave introduced various methods to mitigate these limitations. Temporal\nreasoning (TR), in particular, presents a significant challenge for LLMs due to\nits reliance on diverse temporal concepts and intricate temporal logic. In this\npaper, we propose TG-LLM, a novel framework towards language-based TR. Instead\nof reasoning over the original context, we adopt a latent representation,\ntemporal graph (TG) that enhances the learning of TR. A synthetic dataset\n(TGQA), which is fully controllable and requires minimal supervision, is\nconstructed for fine-tuning LLMs on this text-to-TG translation task. We\nconfirmed in experiments that the capability of TG translation learned on our\ndataset can be transferred to other TR tasks and benchmarks. On top of that, we\nteach LLM to perform deliberate reasoning over the TGs via Chain-of-Thought\n(CoT) bootstrapping and graph data augmentation. We observed that those\nstrategies, which maintain a balance between usefulness and diversity, bring\nmore reliable CoTs and final results than the vanilla CoT distillation.\n","authors":["Siheng Xiong","Ali Payani","Ramana Kompella","Faramarz Fekri"],"pdf_url":"https://arxiv.org/pdf/2401.06853v4.pdf","comment":"ACL24 (main)"},{"id":"http://arxiv.org/abs/2402.08017v2","updated":"2024-06-01T21:46:50Z","published":"2024-02-12T19:27:26Z","title":"Lumos : Empowering Multimodal LLMs with Scene Text Recognition","summary":" We introduce Lumos, the first end-to-end multimodal question-answering system\nwith text understanding capabilities. At the core of Lumos is a Scene Text\nRecognition (STR) component that extracts text from first person point-of-view\nimages, the output of which is used to augment input to a Multimodal Large\nLanguage Model (MM-LLM). While building Lumos, we encountered numerous\nchallenges related to STR quality, overall latency, and model inference. In\nthis paper, we delve into those challenges, and discuss the system\narchitecture, design choices, and modeling techniques employed to overcome\nthese obstacles. We also provide a comprehensive evaluation for each component,\nshowcasing high quality and efficiency.\n","authors":["Ashish Shenoy","Yichao Lu","Srihari Jayakumar","Debojeet Chatterjee","Mohsen Moslehpour","Pierce Chuang","Abhay Harpale","Vikas Bhardwaj","Di Xu","Shicong Zhao","Longfang Zhao","Ankit Ramchandani","Xin Luna Dong","Anuj Kumar"],"pdf_url":"https://arxiv.org/pdf/2402.08017v2.pdf","comment":"Accepted to KDD 2024 (ADS Track)"},{"id":"http://arxiv.org/abs/2306.10193v2","updated":"2024-06-01T21:40:33Z","published":"2023-06-16T21:55:08Z","title":"Conformal Language Modeling","summary":" We propose a novel approach to conformal prediction for generative language\nmodels (LMs). Standard conformal prediction produces prediction sets -- in\nplace of single predictions -- that have rigorous, statistical performance\nguarantees. LM responses are typically sampled from the model's predicted\ndistribution over the large, combinatorial output space of natural language.\nTranslating this process to conformal prediction, we calibrate a stopping rule\nfor sampling different outputs from the LM that get added to a growing set of\ncandidates until we are confident that the output set is sufficient. Since some\nsamples may be low-quality, we also simultaneously calibrate and apply a\nrejection rule for removing candidates from the output set to reduce noise.\nSimilar to conformal prediction, we prove that the sampled set returned by our\nprocedure contains at least one acceptable answer with high probability, while\nstill being empirically precise (i.e., small) on average. Furthermore, within\nthis set of candidate responses, we show that we can also accurately identify\nsubsets of individual components -- such as phrases or sentences -- that are\neach independently correct (e.g., that are not \"hallucinations\"), again with\nstatistical guarantees. We demonstrate the promise of our approach on multiple\ntasks in open-domain question answering, text summarization, and radiology\nreport generation using different LM variants.\n","authors":["Victor Quach","Adam Fisch","Tal Schuster","Adam Yala","Jae Ho Sohn","Tommi S. Jaakkola","Regina Barzilay"],"pdf_url":"https://arxiv.org/pdf/2306.10193v2.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2310.12815v3","updated":"2024-06-01T21:21:07Z","published":"2023-10-19T15:12:09Z","title":"Formalizing and Benchmarking Prompt Injection Attacks and Defenses","summary":" A prompt injection attack aims to inject malicious instruction/data into the\ninput of an LLM-Integrated Application such that it produces results as an\nattacker desires. Existing works are limited to case studies. As a result, the\nliterature lacks a systematic understanding of prompt injection attacks and\ntheir defenses. We aim to bridge the gap in this work. In particular, we\npropose a framework to formalize prompt injection attacks. Existing attacks are\nspecial cases in our framework. Moreover, based on our framework, we design a\nnew attack by combining existing ones. Using our framework, we conduct a\nsystematic evaluation on 5 prompt injection attacks and 10 defenses with 10\nLLMs and 7 tasks. Our work provides a common benchmark for quantitatively\nevaluating future prompt injection attacks and defenses. To facilitate research\non this topic, we make our platform public at\nhttps://github.com/liu00222/Open-Prompt-Injection.\n","authors":["Yupei Liu","Yuqi Jia","Runpeng Geng","Jinyuan Jia","Neil Zhenqiang Gong"],"pdf_url":"https://arxiv.org/pdf/2310.12815v3.pdf","comment":"To appear in USENIX Security Symposium 2024"},{"id":"http://arxiv.org/abs/2401.04518v2","updated":"2024-06-01T17:52:14Z","published":"2024-01-09T12:20:41Z","title":"The Critique of Critique","summary":" Critique, as a natural language description for assessing the quality of\nmodel-generated content, has played a vital role in the training, evaluation,\nand refinement of LLMs. However, a systematic method to evaluate the quality of\ncritique is lacking. In this paper, we pioneer the critique of critique, termed\nMetaCritique, which builds specific quantification criteria. To achieve a\nreliable evaluation outcome, we propose Atomic Information Units (AIUs), which\ndescribe the critique in a more fine-grained manner. MetaCritique aggregates\neach AIU's judgment for the overall score. Moreover, MetaCritique delivers a\nnatural language rationale for the intricate reasoning within each judgment.\nLastly, we construct a meta-evaluation dataset covering 4 tasks across 16\npublic datasets involving human-written and LLM-generated critiques.\nExperiments demonstrate that MetaCritique can achieve near-human performance.\nOur study can facilitate future research in LLM critiques based on our\nfollowing observations and released resources: (1) superior critiques judged by\nMetaCritique can lead to better refinements, indicating that it can potentially\nenhance the alignment of existing LLMs; (2) the leaderboard of critique models\nreveals that open-source critique models commonly suffer from factuality\nissues; (3) relevant code and data are publicly available at\nhttps://github.com/GAIR-NLP/MetaCritique to support deeper exploration; (4) an\nAPI at PyPI with the usage documentation in Appendix C allows users to assess\nthe critique conveniently.\n","authors":["Shichao Sun","Junlong Li","Weizhe Yuan","Ruifeng Yuan","Wenjie Li","Pengfei Liu"],"pdf_url":"https://arxiv.org/pdf/2401.04518v2.pdf","comment":"Accepted to Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2402.02456v2","updated":"2024-06-01T15:54:54Z","published":"2024-02-04T12:06:13Z","title":"tnGPS: Discovering Unknown Tensor Network Structure Search Algorithms\n via Large Language Models (LLMs)","summary":" Tensor networks are efficient for extremely high-dimensional representation,\nbut their model selection, known as tensor network structure search (TN-SS), is\na challenging problem. Although several works have targeted TN-SS, most\nexisting algorithms are manually crafted heuristics with poor performance,\nsuffering from the curse of dimensionality and local convergence. In this work,\nwe jump out of the box, studying how to harness large language models (LLMs) to\nautomatically discover new TN-SS algorithms, replacing the involvement of human\nexperts. By observing how human experts innovate in research, we model their\ncommon workflow and propose an automatic algorithm discovery framework called\ntnGPS. The proposed framework is an elaborate prompting pipeline that instruct\nLLMs to generate new TN-SS algorithms through iterative refinement and\nenhancement. The experimental results demonstrate that the algorithms\ndiscovered by tnGPS exhibit superior performance in benchmarks compared to the\ncurrent state-of-the-art methods.\n","authors":["Junhua Zeng","Chao Li","Zhun Sun","Qibin Zhao","Guoxu Zhou"],"pdf_url":"https://arxiv.org/pdf/2402.02456v2.pdf","comment":"Accepted by ICML2024, pre-printed version"},{"id":"http://arxiv.org/abs/2405.20314v2","updated":"2024-06-01T15:24:10Z","published":"2024-05-30T17:54:35Z","title":"S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for\n Low-Memory GPUs","summary":" Speculative decoding (SD) has attracted a significant amount of research\nattention due to the substantial speedup it can achieve for LLM inference.\nHowever, despite the high speedups they offer, speculative decoding methods\noften achieve optimal performance on high-end devices or with a substantial GPU\nmemory overhead. Given limited memory and the necessity of quantization, a\nhigh-performing model on a high-end GPU can slow down by up to 7 times. To this\nend, we propose Skippy Simultaneous Speculative Decoding (or S3D), a\ncost-effective self-speculative SD method based on simultaneous multi-token\ndecoding and mid-layer skipping. When compared against recent effective\nopen-source SD systems, our method has achieved one of the top\nperformance-memory ratios while requiring minimal architecture changes and\ntraining data. Leveraging our memory efficiency, we created a smaller yet more\neffective SD model based on Phi-3. It is 1.4 to 2 times faster than the\nquantized EAGLE model and operates in half-precision while using less VRAM.\n","authors":["Wei Zhong","Manasa Bharadwaj"],"pdf_url":"https://arxiv.org/pdf/2405.20314v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.07120v2","updated":"2024-06-01T15:20:25Z","published":"2023-08-14T13:00:53Z","title":"Position: Key Claims in LLM Research Have a Long Tail of Footnotes","summary":" Much of the recent discourse within the ML community has been centered around\nLarge Language Models (LLMs), their functionality and potential -- yet not only\ndo we not have a working definition of LLMs, but much of this discourse relies\non claims and assumptions that are worth re-examining. We contribute a\ndefinition of LLMs, critically examine five common claims regarding their\nproperties (including 'emergent properties'), and conclude with suggestions for\nfuture research directions and their framing.\n","authors":["Anna Rogers","Alexandra Sasha Luccioni"],"pdf_url":"https://arxiv.org/pdf/2308.07120v2.pdf","comment":"ICML 2024 camera-ready (https://openreview.net/forum?id=M2cwkGleRL)"},{"id":"http://arxiv.org/abs/2405.19426v2","updated":"2024-06-01T14:16:42Z","published":"2024-05-29T18:09:35Z","title":"Deep Learning for Assessment of Oral Reading Fluency","summary":" Reading fluency assessment is a critical component of literacy programmes,\nserving to guide and monitor early education interventions. Given the resource\nintensive nature of the exercise when conducted by teachers, the development of\nautomatic tools that can operate on audio recordings of oral reading is\nattractive as an objective and highly scalable solution. Multiple complex\naspects such as accuracy, rate and expressiveness underlie human judgements of\nreading fluency. In this work, we investigate end-to-end modeling on a training\ndataset of children's audio recordings of story texts labeled by human experts.\nThe pre-trained wav2vec2.0 model is adopted due its potential to alleviate the\nchallenges from the limited amount of labeled data. We report the performance\nof a number of system variations on the relevant measures, and also probe the\nlearned embeddings for lexical and acoustic-prosodic features known to be\nimportant to the perception of reading fluency.\n","authors":["Mithilesh Vaidya","Binaya Kumar Sahoo","Preeti Rao"],"pdf_url":"https://arxiv.org/pdf/2405.19426v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.04044v3","updated":"2024-06-01T14:04:21Z","published":"2023-11-07T14:55:52Z","title":"PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language\n Models","summary":" The rapid development of language models (LMs) brings unprecedented\naccessibility and usage for both models and users. On the one hand, powerful\nLMs achieve state-of-the-art performance over numerous downstream NLP tasks. On\nthe other hand, more and more attention is paid to unrestricted model accesses\nthat may bring malicious privacy risks of data leakage. To address these\nissues, many recent works propose privacy-preserving language models (PPLMs)\nwith differential privacy (DP). Unfortunately, different DP implementations\nmake it challenging for a fair comparison among existing PPLMs. In this paper,\nwe present PrivLM-Bench, a multi-perspective privacy evaluation benchmark to\nempirically and intuitively quantify the privacy leakage of LMs. Instead of\nonly reporting DP parameters, PrivLM-Bench sheds light on the neglected\ninference data privacy during actual usage. PrivLM-Bench first clearly defines\nmulti-faceted privacy objectives. Then, PrivLM-Bench constructs a unified\npipeline to perform private fine-tuning. Lastly, PrivLM-Bench performs existing\nprivacy attacks on LMs with pre-defined privacy objectives as the empirical\nevaluation results. The empirical attack results are used to fairly and\nintuitively evaluate the privacy leakage of various PPLMs. We conduct extensive\nexperiments on three datasets of GLUE for mainstream LMs.\n","authors":["Haoran Li","Dadi Guo","Donghao Li","Wei Fan","Qi Hu","Xin Liu","Chunkit Chan","Duanyi Yao","Yuan Yao","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2311.04044v3.pdf","comment":"To appear at ACL 2024"},{"id":"http://arxiv.org/abs/2402.09025v3","updated":"2024-06-01T12:10:48Z","published":"2024-02-14T09:01:13Z","title":"SLEB: Streamlining LLMs through Redundancy Verification and Elimination\n of Transformer Blocks","summary":" Large language models (LLMs) have proven to be highly effective across\nvarious natural language processing tasks. However, their large number of\nparameters poses significant challenges for practical deployment. Pruning, a\ntechnique aimed at reducing the size and complexity of LLMs, offers a potential\nsolution by removing redundant components from the network. Despite the promise\nof pruning, existing methods often struggle to achieve substantial end-to-end\nLLM inference speedup. In this paper, we introduce SLEB, a novel approach\ndesigned to streamline LLMs by eliminating redundant transformer blocks. We\nchoose the transformer block as the fundamental unit for pruning, because LLMs\nexhibit block-level redundancy with high similarity between the outputs of\nneighboring blocks. This choice allows us to effectively enhance the processing\nspeed of LLMs. Our experimental results demonstrate that SLEB outperforms\nprevious LLM pruning methods in accelerating LLM inference while also\nmaintaining superior perplexity and accuracy, making SLEB as a promising\ntechnique for enhancing the efficiency of LLMs. The code is available at:\nhttps://github.com/jiwonsong-dev/SLEB.\n","authors":["Jiwon Song","Kyungseok Oh","Taesu Kim","Hyungjun Kim","Yulhwa Kim","Jae-Joon Kim"],"pdf_url":"https://arxiv.org/pdf/2402.09025v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19778v2","updated":"2024-06-01T11:27:53Z","published":"2024-05-30T07:44:16Z","title":"Enhancing Consistency and Role-Specific Knowledge Capturing by\n Rebuilding Fictional Character's Persona","summary":" With the recent introduction of Assistants API, it is expected that\ndocument-based language models will be actively used in various domains,\nespecially Role-playing. However, a key challenge lies in utilizing\nprotagonist's persona: Assistants API often fails to achieve with its search\nbecause the information extraction part is different each time and it often\nomits important information such as protagonist's backstory or relationships.\nIt is hard to maintain a consistent persona simply by using the persona\ndocument as input to the Assistants API. To address the challenge of achieving\nstable persona consistency, we propose CharacterGPT, a novel persona\nreconstruction framework to alleviate the shortcomings of the Assistants API.\nOur method involves Character Persona Training (CPT), an effective persona\nrebuilding process that updates the character persona by extracting the\ncharacter's traits from given summary of the novel for each character as if the\nstory in a novel progresses. In our experiments, we ask each character to take\nthe Big Five Inventory personality test in various settings and analyze the\nresults. To assess whether it can think outside the box, we let each character\ngenerate short novels. Extensive experiments and human evaluation demonstrate\nthat CharacterGPT presents new possibilities for role-playing agent research.\n","authors":["Jeiyoon Park","Chanjun Park","Heuiseok Lim"],"pdf_url":"https://arxiv.org/pdf/2405.19778v2.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2405.11282v2","updated":"2024-06-01T10:28:41Z","published":"2024-05-18T12:58:02Z","title":"Estimating the Level of Dialectness Predicts Interannotator Agreement in\n Multi-dialect Arabic Datasets","summary":" On annotating multi-dialect Arabic datasets, it is common to randomly assign\nthe samples across a pool of native Arabic speakers. Recent analyses\nrecommended routing dialectal samples to native speakers of their respective\ndialects to build higher-quality datasets. However, automatically identifying\nthe dialect of samples is hard. Moreover, the pool of annotators who are native\nspeakers of specific Arabic dialects might be scarce. Arabic Level of\nDialectness (ALDi) was recently introduced as a quantitative variable that\nmeasures how sentences diverge from Standard Arabic. On randomly assigning\nsamples to annotators, we hypothesize that samples of higher ALDi scores are\nharder to label especially if they are written in dialects that the annotators\ndo not speak. We test this by analyzing the relation between ALDi scores and\nthe annotators' agreement, on 15 public datasets having raw individual sample\nannotations for various sentence-classification tasks. We find strong evidence\nsupporting our hypothesis for 11 of them. Consequently, we recommend\nprioritizing routing samples of high ALDi scores to native speakers of each\nsample's dialect, for which the dialect could be automatically identified at\nhigher accuracies.\n","authors":["Amr Keleg","Walid Magdy","Sharon Goldwater"],"pdf_url":"https://arxiv.org/pdf/2405.11282v2.pdf","comment":"Accepted to ACL 2024 - Main (camera-ready version)"},{"id":"http://arxiv.org/abs/2403.00226v3","updated":"2024-06-01T09:23:22Z","published":"2024-03-01T02:09:25Z","title":"A Semantic Distance Metric Learning approach for Lexical Semantic Change\n Detection","summary":" Detecting temporal semantic changes of words is an important task for various\nNLP applications that must make time-sensitive predictions. Lexical Semantic\nChange Detection (SCD) task involves predicting whether a given target word,\n$w$, changes its meaning between two different text corpora, $C_1$ and $C_2$.\nFor this purpose, we propose a supervised two-staged SCD method that uses\nexisting Word-in-Context (WiC) datasets. In the first stage, for a target word\n$w$, we learn two sense-aware encoders that represent the meaning of $w$ in a\ngiven sentence selected from a corpus. Next, in the second stage, we learn a\nsense-aware distance metric that compares the semantic representations of a\ntarget word across all of its occurrences in $C_1$ and $C_2$. Experimental\nresults on multiple benchmark datasets for SCD show that our proposed method\nachieves strong performance in multiple languages. Additionally, our method\nachieves significant improvements on WiC benchmarks compared to a sense-aware\nencoder with conventional distance functions. Source code is available at\nhttps://github.com/LivNLP/svp-sdml .\n","authors":["Taichi Aida","Danushka Bollegala"],"pdf_url":"https://arxiv.org/pdf/2403.00226v3.pdf","comment":"Findings of ACL2024"},{"id":"http://arxiv.org/abs/2311.07466v3","updated":"2024-06-01T07:57:52Z","published":"2023-11-13T16:53:51Z","title":"On Measuring Faithfulness or Self-consistency of Natural Language\n Explanations","summary":" Large language models (LLMs) can explain their predictions through post-hoc\nor Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably\nsounding explanations that are unfaithful to its underlying reasoning. Recent\nwork has designed tests that aim to judge the faithfulness of post-hoc or CoT\nexplanations. In this work we argue that these faithfulness tests do not\nmeasure faithfulness to the models' inner workings -- but rather their\nself-consistency at output level. Our contributions are three-fold: i) We\nclarify the status of faithfulness tests in view of model explainability,\ncharacterising them as self-consistency tests instead. This assessment we\nunderline by ii) constructing a Comparative Consistency Bank for\nself-consistency tests that for the first time compares existing tests on a\ncommon suite of 11 open LLMs and 5 tasks -- including iii) our new\nself-consistency measure CC-SHAP. CC-SHAP is a fine-grained measure (not a\ntest) of LLM self-consistency. It compares how a model's input contributes to\nthe predicted answer and to generating the explanation. Our fine-grained\nCC-SHAP metric allows us iii) to compare LLM behaviour when making predictions\nand to analyse the effect of other consistency tests at a deeper level, which\ntakes us one step further towards measuring faithfulness by bringing us closer\nto the internals of the model than strictly surface output-oriented tests. Our\ncode is available at \\url{https://github.com/Heidelberg-NLP/CC-SHAP}\n","authors":["Letitia Parcalabescu","Anette Frank"],"pdf_url":"https://arxiv.org/pdf/2311.07466v3.pdf","comment":"Paper accepted for publication at ACL 2024 Main (Bangkok, Thailand);\n 10 main paper pages, 30 appendix pages"},{"id":"http://arxiv.org/abs/2402.14809v4","updated":"2024-06-01T07:46:28Z","published":"2024-02-22T18:59:02Z","title":"CriticBench: Benchmarking LLMs for Critique-Correct Reasoning","summary":" The ability of Large Language Models (LLMs) to critique and refine their\nreasoning is crucial for their application in evaluation, feedback provision,\nand self-improvement. This paper introduces CriticBench, a comprehensive\nbenchmark designed to assess LLMs' abilities to critique and rectify their\nreasoning across a variety of tasks. CriticBench encompasses five reasoning\ndomains: mathematical, commonsense, symbolic, coding, and algorithmic. It\ncompiles 15 datasets and incorporates responses from three LLM families.\nUtilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in\ngeneration, critique, and correction reasoning, i.e., GQC reasoning. Our\nfindings reveal: (1) a linear relationship in GQC capabilities, with\ncritique-focused training markedly enhancing performance; (2) a task-dependent\nvariation in correction effectiveness, with logic-oriented tasks being more\namenable to correction; (3) GQC knowledge inconsistencies that decrease as\nmodel size increases; and (4) an intriguing inter-model critiquing dynamic,\nwhere stronger models are better at critiquing weaker ones, while weaker models\ncan surprisingly surpass stronger ones in their self-critique. We hope these\ninsights into the nuanced critique-correct reasoning of LLMs will foster\nfurther research in LLM critique and self-improvement.\n","authors":["Zicheng Lin","Zhibin Gou","Tian Liang","Ruilin Luo","Haowei Liu","Yujiu Yang"],"pdf_url":"https://arxiv.org/pdf/2402.14809v4.pdf","comment":"ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2405.12059v2","updated":"2024-06-01T07:38:37Z","published":"2024-05-20T14:28:25Z","title":"STYLE: Improving Domain Transferability of Asking Clarification\n Questions in Large Language Model Powered Conversational Agents","summary":" Equipping a conversational search engine with strategies regarding when to\nask clarification questions is becoming increasingly important across various\ndomains. Attributing to the context understanding capability of LLMs and their\naccess to domain-specific sources of knowledge, LLM-based clarification\nstrategies feature rapid transfer to various domains in a post-hoc manner.\nHowever, they still struggle to deliver promising performance on unseen\ndomains, struggling to achieve effective domain transferability. We take the\nfirst step to investigate this issue and existing methods tend to produce\none-size-fits-all strategies across diverse domains, limiting their search\neffectiveness. In response, we introduce a novel method, called Style, to\nachieve effective domain transferability. Our experimental results indicate\nthat Style bears strong domain transferability, resulting in an average search\nperformance improvement of ~10% on four unseen domains.\n","authors":["Yue Chen","Chen Huang","Yang Deng","Wenqiang Lei","Dingnan Jin","Jia Liu","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2405.12059v2.pdf","comment":"Accepted to Findings of ACL 2024. Camera Ready"},{"id":"http://arxiv.org/abs/2405.12063v2","updated":"2024-06-01T07:35:26Z","published":"2024-05-20T14:34:01Z","title":"CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information\n Needs in Large Language Models","summary":" Large language models (LLMs) are increasingly used to meet user information\nneeds, but their effectiveness in dealing with user queries that contain\nvarious types of ambiguity remains unknown, ultimately risking user trust and\nsatisfaction. To this end, we introduce CLAMBER, a benchmark for evaluating\nLLMs using a well-organized taxonomy. Building upon the taxonomy, we construct\n~12K high-quality data to assess the strengths, weaknesses, and potential risks\nof various off-the-shelf LLMs. Our findings indicate the limited practical\nutility of current LLMs in identifying and clarifying ambiguous user queries,\neven enhanced by chain-of-thought (CoT) and few-shot prompting. These\ntechniques may result in overconfidence in LLMs and yield only marginal\nenhancements in identifying ambiguity. Furthermore, current LLMs fall short in\ngenerating high-quality clarifying questions due to a lack of conflict\nresolution and inaccurate utilization of inherent knowledge. In this paper,\nCLAMBER presents a guidance and promotes further research on proactive and\ntrustworthy LLMs. Our dataset is available at\nhttps://github.com/zt991211/CLAMBER\n","authors":["Tong Zhang","Peixin Qin","Yang Deng","Chen Huang","Wenqiang Lei","Junhong Liu","Dingnan Jin","Hongru Liang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2405.12063v2.pdf","comment":"Accepted to ACL 2024. Camera Ready. Our dataset is available at\n https://github.com/zt991211/CLAMBER"},{"id":"http://arxiv.org/abs/2405.11912v2","updated":"2024-06-01T07:30:51Z","published":"2024-05-20T09:48:15Z","title":"ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation","summary":" Human annotation is a time-consuming task that requires a significant amount\nof effort. To address this issue, interactive data annotation utilizes an\nannotation model to provide suggestions for humans to approve or correct.\nHowever, annotation models trained with limited labeled data are prone to\ngenerating incorrect suggestions, leading to extra human correction effort. To\ntackle this challenge, we propose Araida, an analogical reasoning-based\napproach that enhances automatic annotation accuracy in the interactive data\nannotation setting and reduces the need for human corrections. Araida involves\nan error-aware integration strategy that dynamically coordinates an annotation\nmodel and a k-nearest neighbors (KNN) model, giving more importance to KNN's\npredictions when predictions from the annotation model are deemed inaccurate.\nEmpirical studies demonstrate that Araida is adaptable to different annotation\ntasks and models. On average, it reduces human correction labor by 11.02%\ncompared to vanilla interactive data annotation methods.\n","authors":["Chen Huang","Yiping Jin","Ilija Ilievski","Wenqiang Lei","Jiancheng Lv"],"pdf_url":"https://arxiv.org/pdf/2405.11912v2.pdf","comment":"Accepted to ACL 2024. Camera Ready"},{"id":"http://arxiv.org/abs/2402.11896v2","updated":"2024-06-01T07:13:15Z","published":"2024-02-19T07:22:29Z","title":"SIBO: A Simple Booster for Parameter-Efficient Fine-Tuning","summary":" Fine-tuning all parameters of large language models (LLMs) necessitates\nsubstantial computational power and extended time. Latest advancements in\nparameter-efficient fine-tuning (PEFT) techniques, such as Adapter tuning and\nLoRA, allow for adjustments to only a minor fraction of the parameters of these\nLLMs. Concurrently, it has been noted that the issue of over-smoothing\ndiminishes the effectiveness of these Transformer-based LLMs, resulting in\nsuboptimal performances in downstream tasks. In this paper, we present SIBO,\nwhich is a SImple BOoster to enhance PEFT, by injecting an initial residual.\nSIBO is straightforward and readily extensible to a range of state-of-the-art\nPEFT techniques to alleviate over-smoothing and enhance performance. Extensive\nexperiments on 22 benchmark datasets demonstrate that SIBO significantly\nenhances the performance of various strong baselines, achieving up to 15.7% and\n23.5% improvement over existing PEFT methods on the arithmetic and commonsense\nreasoning tasks, respectively.\n","authors":["Zhihao Wen","Jie Zhang","Yuan Fang"],"pdf_url":"https://arxiv.org/pdf/2402.11896v2.pdf","comment":"Accepted by ACL 2024, 17 pages"},{"id":"http://arxiv.org/abs/2404.13611v2","updated":"2024-06-01T06:56:16Z","published":"2024-04-21T10:41:04Z","title":"Video sentence grounding with temporally global textual knowledge","summary":" Temporal sentence grounding involves the retrieval of a video moment with a\nnatural language query. Many existing works directly incorporate the given\nvideo and temporally localized query for temporal grounding, overlooking the\ninherent domain gap between different modalities. In this paper, we utilize\npseudo-query features containing extensive temporally global textual knowledge\nsourced from the same video-query pair, to enhance the bridging of domain gaps\nand attain a heightened level of similarity between multi-modal features.\nSpecifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve\nan improved alignment of visual and comprehensive pseudo-query features within\nthe feature space through contrastive learning. Subsequently, we utilize\nlearnable prompts to encapsulate the knowledge of pseudo-queries, propagating\nthem into the textual encoder and multi-modal fusion module, further enhancing\nthe feature alignment between visual and language for better temporal\ngrounding. Extensive experiments conducted on the Charades-STA and\nActivityNet-Captions datasets demonstrate the effectiveness of our method.\n","authors":["Cai Chen","Runzhong Zhang","Jianjun Gao","Kejun Wu","Kim-Hui Yap","Yi Wang"],"pdf_url":"https://arxiv.org/pdf/2404.13611v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.07529v3","updated":"2024-06-01T06:14:37Z","published":"2024-01-15T08:19:22Z","title":"MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of\n Multimodal Large Language Models in Perception","summary":" Recent advancements in Multimodal Large Language Models (MLLMs) have\ndemonstrated exceptional capabilities in visual perception and understanding.\nHowever, these models also suffer from hallucinations, which limit their\nreliability as AI systems. We believe that these hallucinations are partially\ndue to the models' struggle with understanding what they can and cannot\nperceive from images, a capability we refer to as self-awareness in perception.\nDespite its importance, this aspect of MLLMs has been overlooked in prior\nstudies. In this paper, we aim to define and evaluate the self-awareness of\nMLLMs in perception. To do this, we first introduce the knowledge quadrant in\nperception, which helps define what MLLMs know and do not know about images.\nUsing this framework, we propose a novel benchmark, the Self-Awareness in\nPerception for MLLMs (MM-SAP), specifically designed to assess this capability.\nWe apply MM-SAP to a variety of popular MLLMs, offering a comprehensive\nanalysis of their self-awareness and providing detailed insights. The\nexperiment results reveal that current MLLMs possess limited self-awareness\ncapabilities, pointing to a crucial area for future advancement in the\ndevelopment of trustworthy MLLMs. Code and data are available at\nhttps://github.com/YHWmz/MM-SAP.\n","authors":["Yuhao Wang","Yusheng Liao","Heyang Liu","Hongcheng Liu","Yu Wang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2401.07529v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.07616v3","updated":"2024-06-01T04:52:17Z","published":"2024-02-12T12:48:02Z","title":"Anchor-based Large Language Models","summary":" Large language models (LLMs) predominantly employ decoder-only transformer\narchitectures, necessitating the retention of keys/values information for\nhistorical tokens to provide contextual information and avoid redundant\ncomputation. However, the substantial size and parameter volume of these LLMs\nrequire massive GPU memory. This memory demand increases with the length of the\ninput text, leading to an urgent need for more efficient methods of information\nstorage and processing. This study introduces Anchor-based LLMs (AnLLMs), which\nutilize an innovative anchor-based self-attention network (AnSAN) and also an\nanchor-based inference strategy. This approach enables LLMs to compress\nsequence information into an anchor token, reducing the keys/values cache and\nenhancing inference efficiency. Experiments on question-answering benchmarks\nreveal that AnLLMs maintain similar accuracy levels while achieving up to 99%\nkeys/values cache reduction and up to 3.5 times faster inference. Despite a\nminor compromise in accuracy, the substantial enhancements of AnLLMs employing\nthe AnSAN technique in resource utilization and computational efficiency\nunderscore their potential for practical LLM applications.\n","authors":["Jianhui Pang","Fanghua Ye","Derek Fai Wong","Xin He","Wanshun Chen","Longyue Wang"],"pdf_url":"https://arxiv.org/pdf/2402.07616v3.pdf","comment":"The paper has been accepted by the ACL2024 conference. Work was done\n when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab"},{"id":"http://arxiv.org/abs/2402.11192v2","updated":"2024-06-01T03:36:23Z","published":"2024-02-17T05:05:31Z","title":"I Learn Better If You Speak My Language: Understanding the Superior\n Performance of Fine-Tuning Large Language Models with LLM-Generated Responses","summary":" This paper explores an intriguing observation: fine-tuning a large language\nmodel (LLM) with responses generated by a LLM often yields better results than\nusing responses generated by humans. We conduct an in-depth investigation to\nunderstand why this occurs. Contrary to the common belief that these instances\nis simply due to the more detailed nature of LLM-generated content, our study\nidentifies another contributing factor: an LLM is inherently more \"familiar\"\nwith LLM generated responses. This familiarity is evidenced by lower perplexity\nbefore fine-tuning. We design a series of experiments to understand the impact\nof the \"familiarity\" and our conclusion reveals that this \"familiarity\"\nsignificantly impacts learning performance. Training with LLM-generated\nresponses not only enhances performance but also helps maintain the model's\ncapabilities in other tasks after fine-tuning on a specific task.\n","authors":["Xuan Ren","Biao Wu","Lingqiao Liu"],"pdf_url":"https://arxiv.org/pdf/2402.11192v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19616v2","updated":"2024-06-01T03:00:37Z","published":"2024-05-30T02:09:51Z","title":"Easy Problems That LLMs Get Wrong","summary":" We introduce a comprehensive Linguistic Benchmark designed to evaluate the\nlimitations of Large Language Models (LLMs) in domains such as logical\nreasoning, spatial intelligence, and linguistic understanding, among others.\nThrough a series of straightforward questions, it uncovers the significant\nlimitations of well-regarded models to perform tasks that humans manage with\nease. It also highlights the potential of prompt engineering to mitigate some\nerrors and underscores the necessity for better training methodologies. Our\nfindings stress the importance of grounding LLMs with human reasoning and\ncommon sense, emphasising the need for human-in-the-loop for enterprise\napplications. We hope this work paves the way for future research to enhance\nthe usefulness and reliability of new models.\n","authors":["Sean Williams","James Huckle"],"pdf_url":"https://arxiv.org/pdf/2405.19616v2.pdf","comment":"AutogenAI Ltd. GitHub Repo:\n https://github.com/autogenai/easy-problems-that-llms-get-wrong"},{"id":"http://arxiv.org/abs/2405.18952v2","updated":"2024-06-01T02:18:06Z","published":"2024-05-29T10:08:31Z","title":"Are You Sure? Rank Them Again: Repeated Ranking For Better Preference\n Datasets","summary":" Training Large Language Models (LLMs) with Reinforcement Learning from AI\nFeedback (RLAIF) aligns model outputs more closely with human preferences. This\ninvolves an evaluator model ranking multiple candidate responses to user\nprompts. However, the rankings from popular evaluator models such as GPT-4 can\nbe inconsistent. We propose the Repeat Ranking method - where we evaluate the\nsame responses multiple times and train only on those responses which are\nconsistently ranked. Using 2,714 prompts in 62 languages, we generated\nresponses from 7 top multilingual LLMs and had GPT-4 rank them five times each.\nEvaluating on MT-Bench chat benchmarks in six languages, our method\noutperformed the standard practice of training on all available prompts. Our\nwork highlights the quality versus quantity trade-off in RLAIF dataset\ngeneration and offers a stackable strategy for enhancing dataset and thus model\nquality.\n","authors":["Peter Devine"],"pdf_url":"https://arxiv.org/pdf/2405.18952v2.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2404.13445v2","updated":"2024-06-01T23:28:55Z","published":"2024-04-20T18:52:51Z","title":"DMesh: A Differentiable Mesh Representation","summary":" We present a differentiable representation, DMesh, for general 3D triangular\nmeshes. DMesh considers both the geometry and connectivity information of a\nmesh. In our design, we first get a set of convex tetrahedra that compactly\ntessellates the domain based on Weighted Delaunay Triangulation (WDT), and\nselect triangular faces on the tetrahedra to define the final mesh. We\nformulate probability of faces to exist on the actual surface in a\ndifferentiable manner based on the WDT. This enables DMesh to represent meshes\nof various topology in a differentiable way, and allows us to reconstruct the\nmesh under various observations, such as point cloud and multi-view images\nusing gradient-based optimization. The source code and full paper is available\nat: https://sonsang.github.io/dmesh-project.\n","authors":["Sanghyun Son","Matheus Gadelha","Yang Zhou","Zexiang Xu","Ming C. Lin","Yi Zhou"],"pdf_url":"https://arxiv.org/pdf/2404.13445v2.pdf","comment":"35 pages, 22 figures. Updated with more analysis and experimental\n results"},{"id":"http://arxiv.org/abs/2405.11133v2","updated":"2024-06-01T22:48:27Z","published":"2024-05-18T01:09:02Z","title":"XCAT-3.0: A Comprehensive Library of Personalized Digital Twins Derived\n from CT Scans","summary":" Virtual Imaging Trials (VIT) offer a cost-effective and scalable approach for\nevaluating medical imaging technologies. Computational phantoms, which mimic\nreal patient anatomy and physiology, play a central role in VIT. However, the\ncurrent libraries of computational phantoms face limitations, particularly in\nterms of sample size and diversity. Insufficient representation of the\npopulation hampers accurate assessment of imaging technologies across different\npatient groups. Traditionally, phantoms were created by manual segmentation,\nwhich is a laborious and time-consuming task, impeding the expansion of phantom\nlibraries. This study presents a framework for realistic computational phantom\nmodeling using a suite of four deep learning segmentation models, followed by\nthree forms of automated organ segmentation quality control. Over 2500\ncomputational phantoms with up to 140 structures illustrating a sophisticated\napproach to detailed anatomical modeling are released. Phantoms are available\nin both voxelized and surface mesh formats. The framework is aggregated with an\nin-house CT scanner simulator to produce realistic CT images. The framework can\npotentially advance virtual imaging trials, facilitating comprehensive and\nreliable evaluations of medical imaging technologies. Phantoms may be requested\nat https://cvit.duke.edu/resources/, code, model weights, and sample CT images\nare available at https://xcat-3.github.io.\n","authors":["Lavsen Dahal","Mobina Ghojoghnejad","Dhrubajyoti Ghosh","Yubraj Bhandari","David Kim","Fong Chi Ho","Fakrul Islam Tushar","Sheng Luoa","Kyle J. Lafata","Ehsan Abadi","Ehsan Samei","Joseph Y. Lo","W. Paul Segars"],"pdf_url":"https://arxiv.org/pdf/2405.11133v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.17166v2","updated":"2024-06-01T22:06:36Z","published":"2023-03-30T05:57:59Z","title":"Deep Single Image Camera Calibration by Heatmap Regression to Recover\n Fisheye Images Under Manhattan World Assumption","summary":" A Manhattan world lying along cuboid buildings is useful for camera angle\nestimation. However, accurate and robust angle estimation from fisheye images\nin the Manhattan world has remained an open challenge because general scene\nimages tend to lack constraints such as lines, arcs, and vanishing points. To\nachieve higher accuracy and robustness, we propose a learning-based calibration\nmethod that uses heatmap regression, which is similar to pose estimation using\nkeypoints, to detect the directions of labeled image coordinates.\nSimultaneously, our two estimators recover the rotation and remove fisheye\ndistortion by remapping from a general scene image. Without considering\nvanishing-point constraints, we find that additional points for learning-based\nmethods can be defined. To compensate for the lack of vanishing points in\nimages, we introduce auxiliary diagonal points that have the optimal 3D\narrangement of spatial uniformity. Extensive experiments demonstrated that our\nmethod outperforms conventional methods on large-scale datasets and with\noff-the-shelf cameras.\n","authors":["Nobuhiko Wakai","Satoshi Sato","Yasunori Ishii","Takayoshi Yamashita"],"pdf_url":"https://arxiv.org/pdf/2303.17166v2.pdf","comment":"Accepted by CVPR2024"},{"id":"http://arxiv.org/abs/2402.08017v2","updated":"2024-06-01T21:46:50Z","published":"2024-02-12T19:27:26Z","title":"Lumos : Empowering Multimodal LLMs with Scene Text Recognition","summary":" We introduce Lumos, the first end-to-end multimodal question-answering system\nwith text understanding capabilities. At the core of Lumos is a Scene Text\nRecognition (STR) component that extracts text from first person point-of-view\nimages, the output of which is used to augment input to a Multimodal Large\nLanguage Model (MM-LLM). While building Lumos, we encountered numerous\nchallenges related to STR quality, overall latency, and model inference. In\nthis paper, we delve into those challenges, and discuss the system\narchitecture, design choices, and modeling techniques employed to overcome\nthese obstacles. We also provide a comprehensive evaluation for each component,\nshowcasing high quality and efficiency.\n","authors":["Ashish Shenoy","Yichao Lu","Srihari Jayakumar","Debojeet Chatterjee","Mohsen Moslehpour","Pierce Chuang","Abhay Harpale","Vikas Bhardwaj","Di Xu","Shicong Zhao","Longfang Zhao","Ankit Ramchandani","Xin Luna Dong","Anuj Kumar"],"pdf_url":"https://arxiv.org/pdf/2402.08017v2.pdf","comment":"Accepted to KDD 2024 (ADS Track)"},{"id":"http://arxiv.org/abs/2405.18541v2","updated":"2024-06-01T20:53:23Z","published":"2024-05-28T19:16:59Z","title":"Low-Rank Few-Shot Adaptation of Vision-Language Models","summary":" Recent progress in the few-shot adaptation of Vision-Language Models (VLMs)\nhas further pushed their generalization capabilities, at the expense of just a\nfew labeled samples within the target downstream task. However, this promising,\nalready quite abundant few-shot literature has focused principally on prompt\nlearning and, to a lesser extent, on adapters, overlooking the recent advances\nin Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot\nlearning methods for VLMs often rely on heavy training procedures and/or\ncarefully chosen, task-specific hyper-parameters, which might impede their\napplicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot\nlearning for VLMs, and show its potential on 11 datasets, in comparison to\ncurrent state-of-the-art prompt- and adapter-based approaches. Surprisingly,\nour simple CLIP-LoRA method exhibits substantial improvements, while reducing\nthe training times and keeping the same hyper-parameters in all the target\ntasks, i.e., across all the datasets and numbers of shots. Certainly, our\nsurprising results do not dismiss the potential of prompt-learning and\nadapter-based research. However, we believe that our strong baseline could be\nused to evaluate progress in these emergent subjects in few-shot VLMs.\n","authors":["Maxime Zanella","Ismail Ben Ayed"],"pdf_url":"https://arxiv.org/pdf/2405.18541v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.11021v2","updated":"2024-06-01T20:07:53Z","published":"2024-05-17T18:00:07Z","title":"Enhanced 3D Urban Scene Reconstruction and Point Cloud Densification\n using Gaussian Splatting and Google Earth Imagery","summary":" 3D urban scene reconstruction and modelling is a crucial research area in\nremote sensing with numerous applications in academia, commerce, industry, and\nadministration. Recent advancements in view synthesis models have facilitated\nphotorealistic 3D reconstruction solely from 2D images. Leveraging Google Earth\nimagery, we construct a 3D Gaussian Splatting model of the Waterloo region\ncentered on the University of Waterloo and are able to achieve view-synthesis\nresults far exceeding previous 3D view-synthesis results based on neural\nradiance fields which we demonstrate in our benchmark. Additionally, we\nretrieved the 3D geometry of the scene using the 3D point cloud extracted from\nthe 3D Gaussian Splatting model which we benchmarked against our Multi-\nView-Stereo dense reconstruction of the scene, thereby reconstructing both the\n3D geometry and photorealistic lighting of the large-scale urban scene through\n3D Gaussian Splatting\n","authors":["Kyle Gao","Dening Lu","Hongjie He","Linlin Xu","Jonathan Li"],"pdf_url":"https://arxiv.org/pdf/2405.11021v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14874v2","updated":"2024-06-01T20:05:26Z","published":"2024-04-01T14:18:15Z","title":"Investigating Robustness of Open-Vocabulary Foundation Object Detectors\n under Distribution Shifts","summary":" The challenge of Out-Of-Distribution (OOD) robustness remains a critical\nhurdle towards deploying deep vision models. Open-vocabulary object detection\nextends the capabilities of traditional object detection frameworks to\nrecognize and classify objects beyond predefined categories. Investigating OOD\nrobustness in open-vocabulary object detection is essential to increase the\ntrustworthiness of these models. This study presents a comprehensive robustness\nevaluation of zero-shot capabilities of three recent open-vocabulary foundation\nobject detection models, namely OWL-ViT, YOLO World, and Grounding DINO.\nExperiments carried out on the COCO-O and COCO-C benchmarks encompassing\ndistribution shifts highlight the challenges of the models' robustness. Source\ncode shall be made available to the research community on GitHub.\n","authors":["Prakash Chandra Chhipa","Kanjar De","Meenakshi Subhash Chippa","Rajkumar Saini","Marcus Liwicki"],"pdf_url":"https://arxiv.org/pdf/2405.14874v2.pdf","comment":"13 + 3 single column pages"},{"id":"http://arxiv.org/abs/2403.00174v3","updated":"2024-06-01T16:35:45Z","published":"2024-02-29T22:58:13Z","title":"A citizen science toolkit to collect human perceptions of urban\n environments using open street view images","summary":" Street View Imagery (SVI) is a valuable data source for studies (e.g.,\nenvironmental assessments, green space identification or land cover\nclassification). While commercial SVI is available, such providers commonly\nrestrict copying or reuse in ways necessary for research. Open SVI datasets are\nreadily available from less restrictive sources, such as Mapillary, but due to\nthe heterogeneity of the images, these require substantial preprocessing,\nfiltering, and careful quality checks. We present an efficient method for\nautomated downloading, processing, cropping, and filtering open SVI, to be used\nin a survey of human perceptions of the streets portrayed in these images. We\ndemonstrate our open-source reusable SVI preparation and smartphone-friendly\nperception-survey software with Amsterdam (Netherlands) as the case study.\nUsing a citizen science approach, we collected from 331 people 22,637 ratings\nabout their perceptions for various criteria. We have published our software in\na public repository for future re-use and reproducibility.\n","authors":["Matthew Danish","SM Labib","Britta Ricker","Marco Helbich"],"pdf_url":"https://arxiv.org/pdf/2403.00174v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.15264v2","updated":"2024-06-01T16:28:16Z","published":"2023-11-26T10:38:47Z","title":"ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning\n of Heterogeneous Microscopy Images","summary":" Unlike color photography images, which are consistently encoded into RGB\nchannels, biological images encompass various modalities, where the type of\nmicroscopy and the meaning of each channel varies with each experiment.\nImportantly, the number of channels can range from one to a dozen and their\ncorrelation is often comparatively much lower than RGB, as each of them brings\nspecific information content. This aspect is largely overlooked by methods\ndesigned out of the bioimage field, and current solutions mostly focus on\nintra-channel spatial attention, often ignoring the relationship between\nchannels, yet crucial in most biological applications. Importantly, the\nvariable channel type and count prevent the projection of several experiments\nto a unified representation for large scale pre-training. In this study, we\npropose ChAda-ViT, a novel Channel Adaptive Vision Transformer architecture\nemploying an Inter-Channel Attention mechanism on images with an arbitrary\nnumber, order and type of channels. We also introduce IDRCell100k, a bioimage\ndataset with a rich set of 79 experiments covering 7 microscope modalities,\nwith a multitude of channel types, and counts varying from 1 to 10 per\nexperiment. Our architecture, trained in a self-supervised manner, outperforms\nexisting approaches in several biologically relevant downstream tasks.\nAdditionally, it can be used to bridge the gap for the first time between\nassays with different microscopes, channel numbers or types by embedding\nvarious image and experimental modalities into a unified biological image\nrepresentation. The latter should facilitate interdisciplinary studies and pave\nthe way for better adoption of deep learning in biological image-based\nanalyses. Code and Data available at https://github.com/nicoboou/chadavit.\n","authors":["Nicolas Bourriez","Ihab Bendidi","Ethan Cohen","Gabriel Watkinson","Maxime Sanchez","Guillaume Bollot","Auguste Genovesio"],"pdf_url":"https://arxiv.org/pdf/2311.15264v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.00521v4","updated":"2024-06-01T16:22:54Z","published":"2024-03-31T01:41:36Z","title":"CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz\n continuity constrAIned Normalization","summary":" Generative Adversarial Networks (GANs) significantly advanced image\ngeneration but their performance heavily depends on abundant training data. In\nscenarios with limited data, GANs often struggle with discriminator overfitting\nand unstable training. Batch Normalization (BN), despite being known for\nenhancing generalization and training stability, has rarely been used in the\ndiscriminator of Data-Efficient GANs. Our work addresses this gap by\nidentifying a critical flaw in BN: the tendency for gradient explosion during\nthe centering and scaling steps. To tackle this issue, we present CHAIN\n(lipsCHitz continuity constrAIned Normalization), which replaces the\nconventional centering step with zero-mean regularization and integrates a\nLipschitz continuity constraint in the scaling step. CHAIN further enhances GAN\ntraining by adaptively interpolating the normalized and unnormalized features,\neffectively avoiding discriminator overfitting. Our theoretical analyses firmly\nestablishes CHAIN's effectiveness in reducing gradients in latent features and\nweights, improving stability and generalization in GAN training. Empirical\nevidence supports our theory. CHAIN achieves state-of-the-art results in\ndata-limited scenarios on CIFAR-10/100, ImageNet, five low-shot and seven\nhigh-resolution few-shot image datasets. Code:\nhttps://github.com/MaxwellYaoNi/CHAIN\n","authors":["Yao Ni","Piotr Koniusz"],"pdf_url":"https://arxiv.org/pdf/2404.00521v4.pdf","comment":"Accepted by CVPR 2024. 26 pages. Improve Lemma 3.1 - Prop. 3.1 logic\n flow. Code: https://github.com/MaxwellYaoNi/CHAIN"},{"id":"http://arxiv.org/abs/2405.14832v2","updated":"2024-06-01T16:18:53Z","published":"2024-05-23T17:49:37Z","title":"Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion\n Transformer","summary":" Generating high-quality 3D assets from text and images has long been\nchallenging, primarily due to the absence of scalable 3D representations\ncapable of capturing intricate geometry distributions. In this work, we\nintroduce Direct3D, a native 3D generative model scalable to in-the-wild input\nimages, without requiring a multiview diffusion model or SDS optimization. Our\napproach comprises two primary components: a Direct 3D Variational Auto-Encoder\n(D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently\nencodes high-resolution 3D shapes into a compact and continuous latent triplane\nspace. Notably, our method directly supervises the decoded geometry using a\nsemi-continuous surface sampling strategy, diverging from previous methods\nrelying on rendered images as supervision signals. D3D-DiT models the\ndistribution of encoded 3D latents and is specifically designed to fuse\npositional information from the three feature maps of the triplane latent,\nenabling a native 3D generative model scalable to large-scale 3D datasets.\nAdditionally, we introduce an innovative image-to-3D generation pipeline\nincorporating semantic and pixel-level image conditions, allowing the model to\nproduce 3D shapes consistent with the provided conditional image input.\nExtensive experiments demonstrate the superiority of our large-scale\npre-trained Direct3D over previous image-to-3D approaches, achieving\nsignificantly better generation quality and generalization ability, thus\nestablishing a new state-of-the-art for 3D content creation. Project page:\nhttps://nju-3dv.github.io/projects/Direct3D/.\n","authors":["Shuang Wu","Youtian Lin","Feihu Zhang","Yifei Zeng","Jingxi Xu","Philip Torr","Xun Cao","Yao Yao"],"pdf_url":"https://arxiv.org/pdf/2405.14832v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.05474v2","updated":"2024-06-01T16:17:52Z","published":"2023-08-10T10:01:56Z","title":"Spatio-Temporal Encoding of Brain Dynamics with Surface Masked\n Autoencoders","summary":" The development of robust and generalisable models for encoding the\nspatio-temporal dynamics of human brain activity is crucial for advancing\nneuroscientific discoveries. However, significant individual variation in the\norganisation of the human cerebral cortex makes it difficult to identify\npopulation-level trends in these signals. Recently, Surface Vision Transformers\n(SiTs) have emerged as a promising approach for modelling cortical signals, yet\nthey face some limitations in low-data scenarios due to the lack of inductive\nbiases in their architecture. To address these challenges, this paper proposes\nthe surface Masked AutoEncoder (sMAE) and video surface Masked AutoEncoder\n(vsMAE) - for multivariate and spatio-temporal pre-training of cortical signals\nover regular icosahedral grids. These models are trained to reconstruct\ncortical feature maps from masked versions of the input by learning strong\nlatent representations of cortical structure and function. Such representations\ntranslate into better modelling of individual phenotypes and enhanced\nperformance in downstream tasks. The proposed approach was evaluated on\ncortical phenotype regression using data from the young adult Human Connectome\nProject (HCP) and developing HCP (dHCP). Results show that (v)sMAE pre-trained\nmodels improve phenotyping prediction performance on multiple tasks by $\\ge\n26\\%$, and offer faster convergence relative to models trained from scratch.\nFinally, we show that pre-training vision transformers on large datasets, such\nas the UK Biobank (UKB), supports transfer learning to low-data regimes. Our\ncode and pre-trained models are publicly available at\nhttps://github.com/metrics-lab/surface-masked-autoencoders .\n","authors":["Simon Dahan","Logan Z. J. Williams","Yourong Guo","Daniel Rueckert","Emma C. Robinson"],"pdf_url":"https://arxiv.org/pdf/2308.05474v2.pdf","comment":"Accepted for publications for MIDL 2024; 20 figures; 7 figures"},{"id":"http://arxiv.org/abs/2403.18791v2","updated":"2024-06-01T15:25:47Z","published":"2024-03-27T17:35:24Z","title":"Object Pose Estimation via the Aggregation of Diffusion Features","summary":" Estimating the pose of objects from images is a crucial task of 3D scene\nunderstanding, and recent approaches have shown promising results on very large\nbenchmarks. However, these methods experience a significant performance drop\nwhen dealing with unseen objects. We believe that it results from the limited\ngeneralizability of image features. To address this problem, we have an\nin-depth analysis on the features of diffusion models, e.g. Stable Diffusion,\nwhich hold substantial potential for modeling unseen objects. Based on this\nanalysis, we then innovatively introduce these diffusion features for object\npose estimation. To achieve this, we propose three distinct architectures that\ncan effectively capture and aggregate diffusion features of different\ngranularity, greatly improving the generalizability of object pose estimation.\nOur approach outperforms the state-of-the-art methods by a considerable margin\non three popular benchmark datasets, LM, O-LM, and T-LESS. In particular, our\nmethod achieves higher accuracy than the previous best arts on unseen objects:\n98.2% vs. 93.5% on Unseen LM, 85.9% vs. 76.3% on Unseen O-LM, showing the\nstrong generalizability of our method. Our code is released at\nhttps://github.com/Tianfu18/diff-feats-pose.\n","authors":["Tianfu Wang","Guosheng Hu","Hongguang Wang"],"pdf_url":"https://arxiv.org/pdf/2403.18791v2.pdf","comment":"Accepted to CVPR2024"},{"id":"http://arxiv.org/abs/2405.15475v2","updated":"2024-06-01T14:39:30Z","published":"2024-05-24T11:53:27Z","title":"Efficient Degradation-aware Any Image Restoration","summary":" Reconstructing missing details from degraded low-quality inputs poses a\nsignificant challenge. Recent progress in image restoration has demonstrated\nthe efficacy of learning large models capable of addressing various\ndegradations simultaneously. Nonetheless, these approaches introduce\nconsiderable computational overhead and complex learning paradigms, limiting\ntheir practical utility. In response, we propose \\textit{DaAIR}, an efficient\nAll-in-One image restorer employing a Degradation-aware Learner (DaLe) in the\nlow-rank regime to collaboratively mine shared aspects and subtle nuances\nacross diverse degradations, generating a degradation-aware embedding. By\ndynamically allocating model capacity to input degradations, we realize an\nefficient restorer integrating holistic and specific learning within a unified\nmodel. Furthermore, DaAIR introduces a cost-efficient parameter update\nmechanism that enhances degradation awareness while maintaining computational\nefficiency. Extensive comparisons across five image degradations demonstrate\nthat our DaAIR outperforms both state-of-the-art All-in-One models and\ndegradation-specific counterparts, affirming our efficacy and practicality. The\nsource will be publicly made available at https://eduardzamfir.github.io/daair/\n","authors":["Eduard Zamfir","Zongwei Wu","Nancy Mehta","Danda Pani Paudel","Yulun Zhang","Radu Timofte"],"pdf_url":"https://arxiv.org/pdf/2405.15475v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11909v2","updated":"2024-06-01T14:34:15Z","published":"2023-03-21T15:00:17Z","title":"The Multiscale Surface Vision Transformer","summary":" Surface meshes are a favoured domain for representing structural and\nfunctional information on the human cortex, but their complex topology and\ngeometry pose significant challenges for deep learning analysis. While\nTransformers have excelled as domain-agnostic architectures for\nsequence-to-sequence learning, the quadratic cost of the self-attention\noperation remains an obstacle for many dense prediction tasks. Inspired by some\nof the latest advances in hierarchical modelling with vision transformers, we\nintroduce the Multiscale Surface Vision Transformer (MS-SiT) as a backbone\narchitecture for surface deep learning. The self-attention mechanism is applied\nwithin local-mesh-windows to allow for high-resolution sampling of the\nunderlying data, while a shifted-window strategy improves the sharing of\ninformation between windows. Neighbouring patches are successively merged,\nallowing the MS-SiT to learn hierarchical representations suitable for any\nprediction task. Results demonstrate that the MS-SiT outperforms existing\nsurface deep learning methods for neonatal phenotyping prediction tasks using\nthe Developing Human Connectome Project (dHCP) dataset. Furthermore, building\nthe MS-SiT backbone into a U-shaped architecture for surface segmentation\ndemonstrates competitive results on cortical parcellation using the UK Biobank\n(UKB) and manually-annotated MindBoggle datasets. Code and trained models are\npublicly available at\nhttps://github.com/metrics-lab/surface-vision-transformers.\n","authors":["Simon Dahan","Logan Z. J. Williams","Daniel Rueckert","Emma C. Robinson"],"pdf_url":"https://arxiv.org/pdf/2303.11909v2.pdf","comment":"Accepted for publication at MIDL 2024, 17 pages, 6 figures"},{"id":"http://arxiv.org/abs/2402.08671v3","updated":"2024-06-01T11:34:13Z","published":"2024-02-13T18:53:13Z","title":"Are Semi-Dense Detector-Free Methods Good at Matching Local Features?","summary":" Semi-dense detector-free approaches (SDF), such as LoFTR, are currently among\nthe most popular image matching methods. While SDF methods are trained to\nestablish correspondences between two images, their performances are almost\nexclusively evaluated using relative pose estimation metrics. Thus, the link\nbetween their ability to establish correspondences and the quality of the\nresulting estimated pose has thus far received little attention. This paper is\na first attempt to study this link. We start with proposing a novel structured\nattention-based image matching architecture (SAM). It allows us to show a\ncounter-intuitive result on two datasets (MegaDepth and HPatches): on the one\nhand SAM either outperforms or is on par with SDF methods in terms of\npose/homography estimation metrics, but on the other hand SDF approaches are\nsignificantly better than SAM in terms of matching accuracy. We then propose to\nlimit the computation of the matching accuracy to textured regions, and show\nthat in this case SAM often surpasses SDF methods. Our findings highlight a\nstrong correlation between the ability to establish accurate correspondences in\ntextured regions and the accuracy of the resulting estimated pose/homography.\nOur code will be made available.\n","authors":["Matthieu Vilain","Rémi Giraud","Hugo Germain","Guillaume Bourmaud"],"pdf_url":"https://arxiv.org/pdf/2402.08671v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.12063v2","updated":"2024-06-01T10:54:50Z","published":"2024-02-09T02:23:47Z","title":"Consistency Model is an Effective Posterior Sample Approximation for\n Diffusion Inverse Solvers","summary":" Diffusion Inverse Solvers (DIS) are designed to sample from the conditional\ndistribution $p_{\\theta}(X_0|y)$, with a predefined diffusion model\n$p_{\\theta}(X_0)$, an operator $f(\\cdot)$, and a measurement $y=f(x'_0)$\nderived from an unknown image $x'_0$. Existing DIS estimate the conditional\nscore function by evaluating $f(\\cdot)$ with an approximated posterior sample\ndrawn from $p_{\\theta}(X_0|X_t)$. However, most prior approximations rely on\nthe posterior means, which may not lie in the support of the image\ndistribution, thereby potentially diverge from the appearance of genuine\nimages. Such out-of-support samples may significantly degrade the performance\nof the operator $f(\\cdot)$, particularly when it is a neural network. In this\npaper, we introduces a novel approach for posterior approximation that\nguarantees to generate valid samples within the support of the image\ndistribution, and also enhances the compatibility with neural network-based\noperators $f(\\cdot)$. We first demonstrate that the solution of the Probability\nFlow Ordinary Differential Equation (PF-ODE) with an initial value $x_t$ yields\nan effective posterior sample $p_{\\theta}(X_0|X_t=x_t)$. Based on this\nobservation, we adopt the Consistency Model (CM), which is distilled from\nPF-ODE, for posterior sampling. Furthermore, we design a novel family of DIS\nusing only CM. Through extensive experiments, we show that our proposed method\nfor posterior sample approximation substantially enhance the effectiveness of\nDIS for neural network operators $f(\\cdot)$ (e.g., in semantic segmentation).\nAdditionally, our experiments demonstrate the effectiveness of the new CM-based\ninversion techniques. The source code is provided in the supplementary\nmaterial.\n","authors":["Tongda Xu","Ziran Zhu","Jian Li","Dailan He","Yuanyuan Wang","Ming Sun","Ling Li","Hongwei Qin","Yan Wang","Jingjing Liu","Ya-Qin Zhang"],"pdf_url":"https://arxiv.org/pdf/2403.12063v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.12940v2","updated":"2024-06-01T10:25:54Z","published":"2024-04-19T15:10:54Z","title":"Neural Flow Diffusion Models: Learnable Forward Process for Improved\n Diffusion Modelling","summary":" Conventional diffusion models typically relies on a fixed forward process,\nwhich implicitly defines complex marginal distributions over latent variables.\nThis can often complicate the reverse process' task in learning generative\ntrajectories, and results in costly inference for diffusion models. To address\nthese limitations, we introduce Neural Flow Diffusion Models (NFDM), a novel\nframework that enhances diffusion models by supporting a broader range of\nforward processes beyond the standard Gaussian. We also propose a novel\nparameterization technique for learning the forward process. Our framework\nprovides an end-to-end, simulation-free optimization objective, effectively\nminimizing a variational upper bound on the negative log-likelihood.\nExperimental results demonstrate NFDM's strong performance, evidenced by\nstate-of-the-art likelihood estimation. Furthermore, we investigate NFDM's\ncapacity for learning generative dynamics with specific characteristics, such\nas deterministic straight lines trajectories, and demonstrate how the framework\nmay be adopted for learning bridges between two distributions. The results\nunderscores NFDM's versatility and its potential for a wide range of\napplications.\n","authors":["Grigory Bartosh","Dmitry Vetrov","Christian A. Naesseth"],"pdf_url":"https://arxiv.org/pdf/2404.12940v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10019v3","updated":"2024-06-01T09:59:01Z","published":"2023-09-18T17:50:56Z","title":"Long-Tail Learning with Foundation Model: Heavy Fine-Tuning Hurts","summary":" The fine-tuning paradigm in addressing long-tail learning tasks has sparked\nsignificant interest since the emergence of foundation models. Nonetheless, how\nfine-tuning impacts performance in long-tail learning was not explicitly\nquantified. In this paper, we disclose that heavy fine-tuning may even lead to\nnon-negligible performance deterioration on tail classes, and lightweight\nfine-tuning is more effective. The reason is attributed to inconsistent class\nconditions caused by heavy fine-tuning. With the observation above, we develop\na low-complexity and accurate long-tail learning algorithms LIFT with the goal\nof facilitating fast prediction and compact models by adaptive lightweight\nfine-tuning. Experiments clearly verify that both the training time and the\nlearned parameters are significantly reduced with more accurate predictive\nperformance compared with state-of-the-art approaches. The implementation code\nis available at https://github.com/shijxcs/LIFT.\n","authors":["Jiang-Xin Shi","Tong Wei","Zhi Zhou","Jie-Jing Shao","Xin-Yan Han","Yu-Feng Li"],"pdf_url":"https://arxiv.org/pdf/2309.10019v3.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2402.13505v3","updated":"2024-06-01T09:53:00Z","published":"2024-02-21T03:39:04Z","title":"SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed\n Semi-Supervised Learning","summary":" Recent advancements in semi-supervised learning have focused on a more\nrealistic yet challenging task: addressing imbalances in labeled data while the\nclass distribution of unlabeled data remains both unknown and potentially\nmismatched. Current approaches in this sphere often presuppose rigid\nassumptions regarding the class distribution of unlabeled data, thereby\nlimiting the adaptability of models to only certain distribution ranges. In\nthis study, we propose a novel approach, introducing a highly adaptable\nframework, designated as SimPro, which does not rely on any predefined\nassumptions about the distribution of unlabeled data. Our framework, grounded\nin a probabilistic model, innovatively refines the expectation-maximization\n(EM) algorithm by explicitly decoupling the modeling of conditional and\nmarginal class distributions. This separation facilitates a closed-form\nsolution for class distribution estimation during the maximization phase,\nleading to the formulation of a Bayes classifier. The Bayes classifier, in\nturn, enhances the quality of pseudo-labels in the expectation phase.\nRemarkably, the SimPro framework not only comes with theoretical guarantees but\nalso is straightforward to implement. Moreover, we introduce two novel class\ndistributions broadening the scope of the evaluation. Our method showcases\nconsistent state-of-the-art performance across diverse benchmarks and data\ndistribution scenarios. Our code is available at\nhttps://github.com/LeapLabTHU/SimPro.\n","authors":["Chaoqun Du","Yizeng Han","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2402.13505v3.pdf","comment":"ICML2024 camera-ready version"},{"id":"http://arxiv.org/abs/2404.06207v3","updated":"2024-06-01T09:31:08Z","published":"2024-04-09T10:56:46Z","title":"Leveraging edge detection and neural networks for better UAV\n localization","summary":" We propose a novel method for geolocalizing Unmanned Aerial Vehicles (UAVs)\nin environments lacking Global Navigation Satellite Systems (GNSS). Current\nstate-of-the-art techniques employ an offline-trained encoder to generate a\nvector representation (embedding) of the UAV's current view, which is then\ncompared with pre-computed embeddings of geo-referenced images to determine the\nUAV's position. Here, we demonstrate that the performance of these methods can\nbe significantly enhanced by preprocessing the images to extract their edges,\nwhich exhibit robustness to seasonal and illumination variations. Furthermore,\nwe establish that utilizing edges enhances resilience to orientation and\naltitude inaccuracies. Additionally, we introduce a confidence criterion for\nlocalization. Our findings are substantiated through synthetic experiments.\n","authors":["Theo Di Piazza","Enric Meinhardt-Llopis","Gabriele Facciolo","Benedicte Bascle","Corentin Abgrall","Jean-Clement Devaux"],"pdf_url":"https://arxiv.org/pdf/2404.06207v3.pdf","comment":"Accepted for publication in IGARSS2024. 4 pages, 3 figures, 3 tables"},{"id":"http://arxiv.org/abs/2311.03967v2","updated":"2024-06-01T09:14:56Z","published":"2023-11-07T13:06:50Z","title":"CeCNN: Copula-enhanced convolutional neural networks in joint prediction\n of refraction error and axial length based on ultra-widefield fundus images","summary":" Ultra-widefield (UWF) fundus images are replacing traditional fundus images\nin screening, detection, prediction, and treatment of complications related to\nmyopia because their much broader visual range is advantageous for highly\nmyopic eyes. Spherical equivalent (SE) is extensively used as the main myopia\noutcome measure, and axial length (AL) has drawn increasing interest as an\nimportant ocular component for assessing myopia. Cutting-edge studies show that\nSE and AL are strongly correlated. Using the joint information from SE and AL\nis potentially better than using either separately. In the deep learning\ncommunity, though there is research on multiple-response tasks with a 3D image\nbiomarker, dependence among responses is only sporadically taken into\nconsideration. Inspired by the spirit that information extracted from the data\nby statistical methods can improve the prediction accuracy of deep learning\nmodels, we formulate a class of multivariate response regression models with a\nhigher-order tensor biomarker, for the bivariate tasks of\nregression-classification and regression-regression. Specifically, we propose a\ncopula-enhanced convolutional neural network (CeCNN) framework that\nincorporates the dependence between responses through a Gaussian copula (with\nparameters estimated from a warm-up CNN) and uses the induced copula-likelihood\nloss with the backbone CNNs. We establish the statistical framework and\nalgorithms for the aforementioned two bivariate tasks. We show that the CeCNN\nhas better prediction accuracy after adding the dependency information to the\nbackbone models. The modeling and the proposed CeCNN algorithm are applicable\nbeyond the UWF scenario and can be effective with other backbones beyond ResNet\nand LeNet.\n","authors":["Chong Zhong","Yang Li","Danjuan Yang","Meiyan Li","Xingyao Zhou","Bo Fu","Catherine C. Liu","A. H. Welsh"],"pdf_url":"https://arxiv.org/pdf/2311.03967v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03601v3","updated":"2024-06-01T08:50:14Z","published":"2023-07-07T13:43:44Z","title":"GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest","summary":" Visual instruction tuning large language model(LLM) on image-text pairs has\nachieved general-purpose vision-language abilities. However, the lack of\nregion-text pairs limits their advancements to fine-grained multimodal\nunderstanding. In this paper, we propose spatial instruction tuning, which\nintroduces the reference to the region-of-interest(RoI) in the instruction.\nBefore sending to LLM, the reference is replaced by RoI features and\ninterleaved with language embeddings as a sequence. Our model GPT4RoI, trained\non 7 region-text pair datasets, brings an unprecedented interactive and\nconversational experience compared to previous image-level models. (1)\nInteraction beyond language: Users can interact with our model by both language\nand drawing bounding boxes to flexibly adjust the referring granularity. (2)\nVersatile multimodal abilities: A variety of attribute information within each\nRoI can be mined by GPT4RoI, e.g., color, shape, material, action, etc.\nFurthermore, it can reason about multiple RoIs based on common sense. On the\nVisual Commonsense Reasoning(VCR) dataset, GPT4RoI achieves a remarkable\naccuracy of 81.6%, surpassing all existing models by a significant margin (the\nsecond place is 75.6%) and almost reaching human-level performance of 85.0%.\nThe code, dataset, and demo can be found at\nhttps://github.com/jshilong/GPT4RoI.\n","authors":["Shilong Zhang","Peize Sun","Shoufa Chen","Min Xiao","Wenqi Shao","Wenwei Zhang","Yu Liu","Kai Chen","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2307.03601v3.pdf","comment":"Code has been released at https://github.com/jshilong/GPT4RoI"},{"id":"http://arxiv.org/abs/2403.14781v2","updated":"2024-06-01T08:27:23Z","published":"2024-03-21T18:52:58Z","title":"Champ: Controllable and Consistent Human Image Animation with 3D\n Parametric Guidance","summary":" In this study, we introduce a methodology for human image animation by\nleveraging a 3D human parametric model within a latent diffusion framework to\nenhance shape alignment and motion guidance in curernt human generative\ntechniques. The methodology utilizes the SMPL(Skinned Multi-Person Linear)\nmodel as the 3D human parametric model to establish a unified representation of\nbody shape and pose. This facilitates the accurate capture of intricate human\ngeometry and motion characteristics from source videos. Specifically, we\nincorporate rendered depth images, normal maps, and semantic maps obtained from\nSMPL sequences, alongside skeleton-based motion guidance, to enrich the\nconditions to the latent diffusion model with comprehensive 3D shape and\ndetailed pose attributes. A multi-layer motion fusion module, integrating\nself-attention mechanisms, is employed to fuse the shape and motion latent\nrepresentations in the spatial domain. By representing the 3D human parametric\nmodel as the motion guidance, we can perform parametric shape alignment of the\nhuman body between the reference image and the source video motion.\nExperimental evaluations conducted on benchmark datasets demonstrate the\nmethodology's superior ability to generate high-quality human animations that\naccurately capture both pose and shape variations. Furthermore, our approach\nalso exhibits superior generalization capabilities on the proposed in-the-wild\ndataset. Project page: https://fudan-generative-vision.github.io/champ.\n","authors":["Shenhao Zhu","Junming Leo Chen","Zuozhuo Dai","Qingkun Su","Yinghui Xu","Xun Cao","Yao Yao","Hao Zhu","Siyu Zhu"],"pdf_url":"https://arxiv.org/pdf/2403.14781v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2104.00947v3","updated":"2024-06-01T07:55:27Z","published":"2021-04-02T08:55:04Z","title":"A Detector-oblivious Multi-arm Network for Keypoint Matching","summary":" This paper presents a matching network to establish point correspondence\nbetween images. We propose a Multi-Arm Network (MAN) to learn region overlap\nand depth, which can greatly improve the keypoint matching robustness while\nbringing little computational cost during the inference stage. Another design\nthat makes this framework different from many existing learning based pipelines\nthat require re-training when a different keypoint detector is adopted, our\nnetwork can directly work with different keypoint detectors without such a\ntime-consuming re-training process. Comprehensive experiments conducted on\noutdoor and indoor datasets demonstrated that our proposed MAN outperforms\nstate-of-the-art methods.\n","authors":["Xuelun Shen","Qian Hu","Xin Li","Cheng Wang"],"pdf_url":"https://arxiv.org/pdf/2104.00947v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09486v4","updated":"2024-06-01T07:28:05Z","published":"2024-03-14T15:29:09Z","title":"SpikeReveal: Unlocking Temporal Sequences from Real Blurry Inputs with\n Spike Streams","summary":" Reconstructing a sequence of sharp images from the blurry input is crucial\nfor enhancing our insights into the captured scene and poses a significant\nchallenge due to the limited temporal features embedded in the image. Spike\ncameras, sampling at rates up to 40,000 Hz, have proven effective in capturing\nmotion features and beneficial for solving this ill-posed problem. Nonetheless,\nexisting methods fall into the supervised learning paradigm, which suffers from\nnotable performance degradation when applied to real-world scenarios that\ndiverge from the synthetic training data domain. Moreover, the quality of\nreconstructed images is capped by the generated images based on motion analysis\ninterpolation, which inherently differs from the actual scene, affecting the\ngeneralization ability of these methods in real high-speed scenarios. To\naddress these challenges, we propose the first self-supervised framework for\nthe task of spike-guided motion deblurring. Our approach begins with the\nformulation of a spike-guided deblurring model that explores the theoretical\nrelationships among spike streams, blurry images, and their corresponding sharp\nsequences. We subsequently develop a self-supervised cascaded framework to\nalleviate the issues of spike noise and spatial-resolution mismatching\nencountered in the deblurring model. With knowledge distillation and\nre-blurring loss, we further design a lightweight deblur network to generate\nhigh-quality sequences with brightness and texture consistency with the\noriginal input. Quantitative and qualitative experiments conducted on our\nreal-world and synthetic datasets with spikes validate the superior\ngeneralization of the proposed framework. Our code, data and trained models\nwill be available at \\url{https://github.com/chenkang455/S-SDM}.\n","authors":["Kang Chen","Shiyan Chen","Jiyuan Zhang","Baoyue Zhang","Yajing Zheng","Tiejun Huang","Zhaofei Yu"],"pdf_url":"https://arxiv.org/pdf/2403.09486v4.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2312.17161v2","updated":"2024-06-01T07:03:52Z","published":"2023-12-28T17:50:54Z","title":"Restoration by Generation with Constrained Priors","summary":" The inherent generative power of denoising diffusion models makes them\nwell-suited for image restoration tasks where the objective is to find the\noptimal high-quality image within the generative space that closely resembles\nthe input image. We propose a method to adapt a pretrained diffusion model for\nimage restoration by simply adding noise to the input image to be restored and\nthen denoise. Our method is based on the observation that the space of a\ngenerative model needs to be constrained. We impose this constraint by\nfinetuning the generative model with a set of anchor images that capture the\ncharacteristics of the input image. With the constrained space, we can then\nleverage the sampling strategy used for generation to do image restoration. We\nevaluate against previous methods and show superior performances on multiple\nreal-world restoration datasets in preserving identity and image quality. We\nalso demonstrate an important and practical application on personalized\nrestoration, where we use a personal album as the anchor images to constrain\nthe generative space. This approach allows us to produce results that\naccurately preserve high-frequency details, which previous works are unable to\ndo. Project webpage: https://gen2res.github.io.\n","authors":["Zheng Ding","Xuaner Zhang","Zhuowen Tu","Zhihao Xia"],"pdf_url":"https://arxiv.org/pdf/2312.17161v2.pdf","comment":"CVPR 2024 (Highlight)"},{"id":"http://arxiv.org/abs/2404.13611v2","updated":"2024-06-01T06:56:16Z","published":"2024-04-21T10:41:04Z","title":"Video sentence grounding with temporally global textual knowledge","summary":" Temporal sentence grounding involves the retrieval of a video moment with a\nnatural language query. Many existing works directly incorporate the given\nvideo and temporally localized query for temporal grounding, overlooking the\ninherent domain gap between different modalities. In this paper, we utilize\npseudo-query features containing extensive temporally global textual knowledge\nsourced from the same video-query pair, to enhance the bridging of domain gaps\nand attain a heightened level of similarity between multi-modal features.\nSpecifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve\nan improved alignment of visual and comprehensive pseudo-query features within\nthe feature space through contrastive learning. Subsequently, we utilize\nlearnable prompts to encapsulate the knowledge of pseudo-queries, propagating\nthem into the textual encoder and multi-modal fusion module, further enhancing\nthe feature alignment between visual and language for better temporal\ngrounding. Extensive experiments conducted on the Charades-STA and\nActivityNet-Captions datasets demonstrate the effectiveness of our method.\n","authors":["Cai Chen","Runzhong Zhang","Jianjun Gao","Kejun Wu","Kim-Hui Yap","Yi Wang"],"pdf_url":"https://arxiv.org/pdf/2404.13611v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.16307v2","updated":"2024-06-01T06:20:54Z","published":"2024-04-25T03:22:48Z","title":"Boosting Model Resilience via Implicit Adversarial Data Augmentation","summary":" Data augmentation plays a pivotal role in enhancing and diversifying training\ndata. Nonetheless, consistently improving model performance in varied learning\nscenarios, especially those with inherent data biases, remains challenging. To\naddress this, we propose to augment the deep features of samples by\nincorporating their adversarial and anti-adversarial perturbation\ndistributions, enabling adaptive adjustment in the learning difficulty tailored\nto each sample's specific characteristics. We then theoretically reveal that\nour augmentation process approximates the optimization of a surrogate loss\nfunction as the number of augmented copies increases indefinitely. This insight\nleads us to develop a meta-learning-based framework for optimizing classifiers\nwith this novel loss, introducing the effects of augmentation while bypassing\nthe explicit augmentation process. We conduct extensive experiments across four\ncommon biased learning scenarios: long-tail learning, generalized long-tail\nlearning, noisy label learning, and subpopulation shift learning. The empirical\nresults demonstrate that our method consistently achieves state-of-the-art\nperformance, highlighting its broad adaptability.\n","authors":["Xiaoling Zhou","Wei Ye","Zhemg Lee","Rui Xie","Shikun Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.16307v2.pdf","comment":"9 pages, 6 figures, accepted by IJCAI 2024"},{"id":"http://arxiv.org/abs/2401.07529v3","updated":"2024-06-01T06:14:37Z","published":"2024-01-15T08:19:22Z","title":"MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of\n Multimodal Large Language Models in Perception","summary":" Recent advancements in Multimodal Large Language Models (MLLMs) have\ndemonstrated exceptional capabilities in visual perception and understanding.\nHowever, these models also suffer from hallucinations, which limit their\nreliability as AI systems. We believe that these hallucinations are partially\ndue to the models' struggle with understanding what they can and cannot\nperceive from images, a capability we refer to as self-awareness in perception.\nDespite its importance, this aspect of MLLMs has been overlooked in prior\nstudies. In this paper, we aim to define and evaluate the self-awareness of\nMLLMs in perception. To do this, we first introduce the knowledge quadrant in\nperception, which helps define what MLLMs know and do not know about images.\nUsing this framework, we propose a novel benchmark, the Self-Awareness in\nPerception for MLLMs (MM-SAP), specifically designed to assess this capability.\nWe apply MM-SAP to a variety of popular MLLMs, offering a comprehensive\nanalysis of their self-awareness and providing detailed insights. The\nexperiment results reveal that current MLLMs possess limited self-awareness\ncapabilities, pointing to a crucial area for future advancement in the\ndevelopment of trustworthy MLLMs. Code and data are available at\nhttps://github.com/YHWmz/MM-SAP.\n","authors":["Yuhao Wang","Yusheng Liao","Heyang Liu","Hongcheng Liu","Yu Wang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2401.07529v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.05634v3","updated":"2024-06-01T05:52:18Z","published":"2023-12-09T18:43:05Z","title":"PGDS: Pose-Guidance Deep Supervision for Mitigating Clothes-Changing in\n Person Re-Identification","summary":" Person Re-Identification (Re-ID) task seeks to enhance the tracking of\nmultiple individuals by surveillance cameras. It supports multimodal tasks,\nincluding text-based person retrieval and human matching. One of the most\nsignificant challenges faced in Re-ID is clothes-changing, where the same\nperson may appear in different outfits. While previous methods have made\nnotable progress in maintaining clothing data consistency and handling clothing\nchange data, they still rely excessively on clothing information, which can\nlimit performance due to the dynamic nature of human appearances. To mitigate\nthis challenge, we propose the Pose-Guidance Deep Supervision (PGDS), an\neffective framework for learning pose guidance within the Re-ID task. It\nconsists of three modules: a human encoder, a pose encoder, and a Pose-to-Human\nProjection module (PHP). Our framework guides the human encoder, i.e., the main\nre-identification model, with pose information from the pose encoder through\nmultiple layers via the knowledge transfer mechanism from the PHP module,\nhelping the human encoder learn body parts information without increasing\ncomputation resources in the inference stage. Through extensive experiments,\nour method surpasses the performance of current state-of-the-art methods,\ndemonstrating its robustness and effectiveness for real-world applications. Our\ncode is available at https://github.com/huyquoctrinh/PGDS.\n","authors":["Quoc-Huy Trinh","Nhat-Tan Bui","Dinh-Hieu Hoang","Phuoc-Thao Vo Thi","Hai-Dang Nguyen","Debesh Jha","Ulas Bagci","Ngan Le","Minh-Triet Tran"],"pdf_url":"https://arxiv.org/pdf/2312.05634v3.pdf","comment":"Accepted at AVSS 2024"},{"id":"http://arxiv.org/abs/2403.06681v3","updated":"2024-06-01T05:19:24Z","published":"2024-03-11T12:56:36Z","title":"Out-of-distribution Partial Label Learning","summary":" Partial Label Learning (PLL) tackles model learning from the data with\ninexact labels under the assumption that training and test objects are in the\nsame distribution, i.e., closed-set scenario. Nevertheless, this assumption\ndoes not hold in real-world open-set scenarios where test data may come from\nOut-Of-Distribution (OOD), resulting in object detection failure and hence\nsignificantly compromising the PLL model's security and trustworthiness. This\nis a previously unexplored problem called Out-Of-Distribution Partial Label\nLearning (OODPLL) that our newly proposed PLOOD framework can effectively\nresolve. During the training phase, our framework leverages self-supervised\nlearning strategy to generate positive and negative samples for each object,\nemulating in and out-of-distributions respectively. Under these distributions,\nPLL methods can learn discriminative features for OOD objects. In the inference\nphase, a novel Partial Energy (PE) scoring technique is proposed which\nleverages the label confidence established during the above training phase to\nmine the actual labels. In this way, the issue of inexact labeling in PLL can\nbe effectively addressed for significantly better performance in OOD object\ndetection. PLOOD is compared with SOTA PLL models and OOD scores on CIFAR-10\nand CIFAR-100 datasets against various OOD datasets. The results demonstrate\nthe effectiveness of our PLOOD framework, significantly outperforming SOTA PLL\nmodels and marking a substantial advancement in addressing PLL problems in\nreal-world OOD scenarios.\n","authors":["Jintao Huang","Yiu-Ming Cheung","Chi-Man Vong"],"pdf_url":"https://arxiv.org/pdf/2403.06681v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13215v2","updated":"2024-06-01T04:49:43Z","published":"2023-10-20T01:44:49Z","title":"Zone Evaluation: Revealing Spatial Bias in Object Detection","summary":" A fundamental limitation of object detectors is that they suffer from\n\"spatial bias\", and in particular perform less satisfactorily when detecting\nobjects near image borders. For a long time, there has been a lack of effective\nways to measure and identify spatial bias, and little is known about where it\ncomes from and what degree it is. To this end, we present a new zone evaluation\nprotocol, extending from the traditional evaluation to a more generalized one,\nwhich measures the detection performance over zones, yielding a series of Zone\nPrecisions (ZPs). For the first time, we provide numerical results, showing\nthat the object detectors perform quite unevenly across the zones.\nSurprisingly, the detector's performance in the 96% border zone of the image\ndoes not reach the AP value (Average Precision, commonly regarded as the\naverage detection performance in the entire image zone). To better understand\nspatial bias, a series of heuristic experiments are conducted. Our\ninvestigation excludes two intuitive conjectures about spatial bias that the\nobject scale and the absolute positions of objects barely influence the spatial\nbias. We find that the key lies in the human-imperceptible divergence in data\npatterns between objects in different zones, thus eventually forming a visible\nperformance gap between the zones. With these findings, we finally discuss a\nfuture direction for object detection, namely, spatial disequilibrium problem,\naiming at pursuing a balanced detection ability over the entire image zone. By\nbroadly evaluating 10 popular object detectors and 5 detection datasets, we\nshed light on the spatial bias of object detectors. We hope this work could\nraise a focus on detection robustness. The source codes, evaluation protocols,\nand tutorials are publicly available at https://github.com/Zzh-tju/ZoneEval.\n","authors":["Zhaohui Zheng","Yuming Chen","Qibin Hou","Xiang Li","Ping Wang","Ming-Ming Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.13215v2.pdf","comment":"Accepted by IEEE TPAMI"},{"id":"http://arxiv.org/abs/2405.14455v2","updated":"2024-06-01T04:44:15Z","published":"2024-05-23T11:37:17Z","title":"TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing","summary":" Editing objects within a scene is a critical functionality required across a\nbroad spectrum of applications in computer vision and graphics. As 3D Gaussian\nSplatting (3DGS) emerges as a frontier in scene representation, the effective\nmodification of 3D Gaussian scenes has become increasingly vital. This process\nentails accurately retrieve the target objects and subsequently performing\nmodifications based on instructions. Though available in pieces, existing\ntechniques mainly embed sparse semantics into Gaussians for retrieval, and rely\non an iterative dataset update paradigm for editing, leading to over-smoothing\nor inconsistency issues. To this end, this paper proposes a systematic\napproach, namely TIGER, for coherent text-instructed 3D Gaussian retrieval and\nediting. In contrast to the top-down language grounding approach for 3D\nGaussians, we adopt a bottom-up language aggregation strategy to generate a\ndenser language embedded 3D Gaussians that supports open-vocabulary retrieval.\nTo overcome the over-smoothing and inconsistency issues in editing, we propose\na Coherent Score Distillation (CSD) that aggregates a 2D image editing\ndiffusion model and a multi-view diffusion model for score distillation,\nproducing multi-view consistent editing with much finer details. In various\nexperiments, we demonstrate that our TIGER is able to accomplish more\nconsistent and realistic edits than prior work.\n","authors":["Teng Xu","Jiamin Chen","Peng Chen","Youjia Zhang","Junqing Yu","Wei Yang"],"pdf_url":"https://arxiv.org/pdf/2405.14455v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.10300v2","updated":"2024-06-01T03:35:22Z","published":"2024-05-16T17:54:15Z","title":"Grounding DINO 1.5: Advance the \"Edge\" of Open-Set Object Detection","summary":" This paper introduces Grounding DINO 1.5, a suite of advanced open-set object\ndetection models developed by IDEA Research, which aims to advance the \"Edge\"\nof open-set object detection. The suite encompasses two models: Grounding DINO\n1.5 Pro, a high-performance model designed for stronger generalization\ncapability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an\nefficient model optimized for faster speed demanded in many applications\nrequiring edge deployment. The Grounding DINO 1.5 Pro model advances its\npredecessor by scaling up the model architecture, integrating an enhanced\nvision backbone, and expanding the training dataset to over 20 million images\nwith grounding annotations, thereby achieving a richer semantic understanding.\nThe Grounding DINO 1.5 Edge model, while designed for efficiency with reduced\nfeature scales, maintains robust detection capabilities by being trained on the\nsame comprehensive dataset. Empirical results demonstrate the effectiveness of\nGrounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP\non the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot\ntransfer benchmark, setting new records for open-set object detection.\nFurthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT,\nachieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP\non the LVIS-minival benchmark, making it more suitable for edge computing\nscenarios. Model examples and demos with API will be released at\nhttps://github.com/IDEA-Research/Grounding-DINO-1.5-API\n","authors":["Tianhe Ren","Qing Jiang","Shilong Liu","Zhaoyang Zeng","Wenlong Liu","Han Gao","Hongjie Huang","Zhengyu Ma","Xiaoke Jiang","Yihao Chen","Yuda Xiong","Hao Zhang","Feng Li","Peijun Tang","Kent Yu","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2405.10300v2.pdf","comment":"homepage: https://deepdataspace.com/home"},{"id":"http://arxiv.org/abs/2307.03212v3","updated":"2024-06-01T03:00:16Z","published":"2023-07-06T16:38:43Z","title":"Attentive Graph Enhanced Region Representation Learning","summary":" Representing urban regions accurately and comprehensively is essential for\nvarious urban planning and analysis tasks. Recently, with the expansion of the\ncity, modeling long-range spatial dependencies with multiple data sources plays\nan important role in urban region representation. In this paper, we propose the\nAttentive Graph Enhanced Region Representation Learning (ATGRL) model, which\naims to capture comprehensive dependencies from multiple graphs and learn rich\nsemantic representations of urban regions. Specifically, we propose a\ngraph-enhanced learning module to construct regional graphs by incorporating\nmobility flow patterns, point of interests (POIs) functions, and check-in\nsemantics with noise filtering. Then, we present a multi-graph aggregation\nmodule to capture both local and global spatial dependencies between regions by\nintegrating information from multiple graphs. In addition, we design a\ndual-stage fusion module to facilitate information sharing between different\nviews and efficiently fuse multi-view representations for urban region\nembedding using an improved linear attention mechanism. Finally, extensive\nexperiments on real-world datasets for three downstream tasks demonstrate the\nsuperior performance of our model compared to state-of-the-art methods.\n","authors":["Weiliang Chen","Qianqian Ren","Jinbao Li"],"pdf_url":"https://arxiv.org/pdf/2307.03212v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.12476v2","updated":"2024-06-01T02:21:22Z","published":"2024-05-21T03:36:13Z","title":"Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection --\n Towards Precise Fish Morphological Assessment in Aquaculture Breeding","summary":" Accurate phenotypic analysis in aquaculture breeding necessitates the\nquantification of subtle morphological phenotypes. Existing datasets suffer\nfrom limitations such as small scale, limited species coverage, and inadequate\nannotation of keypoints for measuring refined and complex morphological\nphenotypes of fish body parts. To address this gap, we introduce FishPhenoKey,\na comprehensive dataset comprising 23,331 high-resolution images spanning six\nfish species. Notably, FishPhenoKey includes 22 phenotype-oriented annotations,\nenabling the capture of intricate morphological phenotypes. Motivated by the\nnuanced evaluation of these subtle morphologies, we also propose a new\nevaluation metric, Percentage of Measured Phenotype (PMP). It is designed to\nassess the accuracy of individual keypoint positions and is highly sensitive to\nthe phenotypes measured using the corresponding keypoints. To enhance keypoint\ndetection accuracy, we further propose a novel loss, Anatomically-Calibrated\nRegularization (ACR), that can be integrated into keypoint detection models,\nleveraging biological insights to refine keypoint localization. Our\ncontributions set a new benchmark in fish phenotype analysis, addressing the\nchallenges of precise morphological quantification and opening new avenues for\nresearch in sustainable aquaculture and genetic studies. Our dataset and code\nare available at https://github.com/WeizhenLiuBioinform/Fish-Phenotype-Detect.\n","authors":["Weizhen Liu","Jiayu Tan","Guangyu Lan","Ao Li","Dongye Li","Le Zhao","Xiaohui Yuan","Nanqing Dong"],"pdf_url":"https://arxiv.org/pdf/2405.12476v2.pdf","comment":"Accepted by IJCAI2024, Code:\n https://github.com/WeizhenLiuBioinform/Fish-Phenotype-Detect"},{"id":"http://arxiv.org/abs/2402.10896v2","updated":"2024-06-01T01:06:16Z","published":"2024-02-16T18:54:47Z","title":"PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong\n Vision-language Adapter","summary":" This paper demonstrates that a progressively aligned language model can\neffectively bridge frozen vision encoders and large language models (LLMs).\nWhile the fundamental architecture and pre-training methods of vision encoders\nand LLMs have been extensively studied, the architecture and training strategy\nof vision-language adapters vary significantly across recent works. Our\nresearch undertakes a thorough exploration of the state-of-the-art perceiver\nresampler architecture and builds a strong baseline. However, we observe that\nthe vision-language alignment with perceiver resampler exhibits slow\nconvergence and limited scalability with a lack of direct supervision. To\naddress this issue, we propose PaLM2-VAdapter, employing a progressively\naligned language model as the vision-language adapter. Compared to the strong\nbaseline with perceiver resampler, our method empirically shows faster\nconvergence, higher performance, and stronger scalability. Extensive\nexperiments across various Visual Question Answering (VQA) and captioning tasks\non both images and videos demonstrate that our model exhibits state-of-the-art\nvisual understanding and multi-modal reasoning capabilities. Notably, our\nmethod achieves these advancements with 30~70% fewer parameters than the\nstate-of-the-art large vision-language models, marking a significant efficiency\nimprovement.\n","authors":["Junfei Xiao","Zheng Xu","Alan Yuille","Shen Yan","Boyu Wang"],"pdf_url":"https://arxiv.org/pdf/2402.10896v2.pdf","comment":"Technical report, 15 pages; v2 fix typos, add additional results in\n appendix"},{"id":"http://arxiv.org/abs/2405.12971v2","updated":"2024-06-01T00:28:58Z","published":"2024-05-21T17:54:06Z","title":"BiomedParse: a biomedical foundation model for image parsing of\n everything everywhere all at once","summary":" Biomedical image analysis is fundamental for biomedical discovery in cell\nbiology, pathology, radiology, and many other biomedical domains. Holistic\nimage analysis comprises interdependent subtasks such as segmentation,\ndetection, and recognition of relevant objects. Here, we propose BiomedParse, a\nbiomedical foundation model for imaging parsing that can jointly conduct\nsegmentation, detection, and recognition for 82 object types across 9 imaging\nmodalities. Through joint learning, we can improve accuracy for individual\ntasks and enable novel applications such as segmenting all relevant objects in\nan image through a text prompt, rather than requiring users to laboriously\nspecify the bounding box for each object. We leveraged readily available\nnatural-language labels or descriptions accompanying those datasets and use\nGPT-4 to harmonize the noisy, unstructured text information with established\nbiomedical object ontologies. We created a large dataset comprising over six\nmillion triples of image, segmentation mask, and textual description. On image\nsegmentation, we showed that BiomedParse is broadly applicable, outperforming\nstate-of-the-art methods on 102,855 test image-mask-label triples across 9\nimaging modalities (everything). On object detection, which aims to locate a\nspecific object of interest, BiomedParse again attained state-of-the-art\nperformance, especially on objects with irregular shapes (everywhere). On\nobject recognition, which aims to identify all objects in a given image along\nwith their semantic types, we showed that BiomedParse can simultaneously\nsegment and label all biomedical objects in an image (all at once). In summary,\nBiomedParse is an all-in-one tool for biomedical image analysis by jointly\nsolving segmentation, detection, and recognition for all major biomedical image\nmodalities, paving the path for efficient and accurate image-based biomedical\ndiscovery.\n","authors":["Theodore Zhao","Yu Gu","Jianwei Yang","Naoto Usuyama","Ho Hin Lee","Tristan Naumann","Jianfeng Gao","Angela Crabtree","Brian Piening","Carlo Bifulco","Mu Wei","Hoifung Poon","Sheng Wang"],"pdf_url":"https://arxiv.org/pdf/2405.12971v2.pdf","comment":"Project page: https://aka.ms/biomedparse-project"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2309.13333v2","updated":"2024-06-01T22:27:42Z","published":"2023-09-23T10:35:01Z","title":"mdendro: An R package for extended agglomerative hierarchical clustering","summary":" \"mdendro\" is an R package that provides a comprehensive collection of linkage\nmethods for agglomerative hierarchical clustering on a matrix of proximity data\n(distances or similarities), returning a multifurcated dendrogram or\nmultidendrogram. Multidendrograms can group more than two clusters at the same\ntime, solving the nonuniqueness problem that arises when there are ties in the\ndata. This problem causes that different binary dendrograms are possible\ndepending both on the order of the input data and on the criterion used to\nbreak ties. Weighted and unweighted versions of the most common linkage methods\nare included in the package, which also implements two parametric linkage\nmethods. In addition, package \"mdendro\" provides five descriptive measures to\nanalyze the resulting dendrograms: cophenetic correlation coefficient, space\ndistortion ratio, agglomeration coefficient, chaining coefficient and tree\nbalance.\n","authors":["Alberto Fernández","Sergio Gómez"],"pdf_url":"https://arxiv.org/pdf/2309.13333v2.pdf","comment":"27 pages, 13 figures. Software available at CRAN\n (https://cran.r-project.org/package=mdendro) and Github\n (https://sergio-gomez.github.io/mdendro/)"},{"id":"http://arxiv.org/abs/2310.13848v2","updated":"2024-06-01T15:02:41Z","published":"2023-10-20T22:47:18Z","title":"FABULA: Intelligence Report Generation Using Retrieval-Augmented\n Narrative Construction","summary":" Narrative construction is the process of representing disparate event\ninformation into a logical plot structure that models an end to end story.\nIntelligence analysis is an example of a domain that can benefit tremendously\nfrom narrative construction techniques, particularly in aiding analysts during\nthe largely manual and costly process of synthesizing event information into\ncomprehensive intelligence reports. Manual intelligence report generation is\noften prone to challenges such as integrating dynamic event information,\nwriting fine-grained queries, and closing information gaps. This motivates the\ndevelopment of a system that retrieves and represents critical aspects of\nevents in a form that aids in automatic generation of intelligence reports.\n We introduce a Retrieval Augmented Generation (RAG) approach to augment\nprompting of an autoregressive decoder by retrieving structured information\nasserted in a knowledge graph to generate targeted information based on a\nnarrative plot model. We apply our approach to the problem of neural\nintelligence report generation and introduce FABULA, framework to augment\nintelligence analysis workflows using RAG. An analyst can use FABULA to query\nan Event Plot Graph (EPG) to retrieve relevant event plot points, which can be\nused to augment prompting of a Large Language Model (LLM) during intelligence\nreport generation. Our evaluation studies show that the plot points included in\nthe generated intelligence reports have high semantic relevance, high\ncoherency, and low data redundancy.\n","authors":["Priyanka Ranade","Anupam Joshi"],"pdf_url":"https://arxiv.org/pdf/2310.13848v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.12774v2","updated":"2024-06-01T09:21:02Z","published":"2024-02-20T07:25:34Z","title":"Interpreting Conversational Dense Retrieval by Rewriting-Enhanced\n Inversion of Session Embedding","summary":" Conversational dense retrieval has shown to be effective in conversational\nsearch. However, a major limitation of conversational dense retrieval is their\nlack of interpretability, hindering intuitive understanding of model behaviors\nfor targeted improvements. This paper presents CONVINV, a simple yet effective\napproach to shed light on interpretable conversational dense retrieval models.\nCONVINV transforms opaque conversational session embeddings into explicitly\ninterpretable text while faithfully maintaining their original retrieval\nperformance as much as possible. Such transformation is achieved by training a\nrecently proposed Vec2Text model based on the ad-hoc query encoder, leveraging\nthe fact that the session and query embeddings share the same space in existing\nconversational dense retrieval. To further enhance interpretability, we propose\nto incorporate external interpretable query rewrites into the transformation\nprocess. Extensive evaluations on three conversational search benchmarks\ndemonstrate that CONVINV can yield more interpretable text and faithfully\npreserve original retrieval performance than baselines. Our work connects\nopaque session embeddings with transparent query rewriting, paving the way\ntoward trustworthy conversational search.\n","authors":["Yiruo Cheng","Kelong Mao","Zhicheng Dou"],"pdf_url":"https://arxiv.org/pdf/2402.12774v2.pdf","comment":"Accepted by ACL 2024. Repo: https://github.com/Ariya12138/ConvInv"},{"id":"http://arxiv.org/abs/2302.02592v3","updated":"2024-06-01T09:10:16Z","published":"2023-02-06T07:00:20Z","title":"RLTP: Reinforcement Learning to Pace for Delayed Impression Modeling in\n Preloaded Ads","summary":" To increase brand awareness, many advertisers conclude contracts with\nadvertising platforms to purchase traffic and then deliver advertisements to\ntarget audiences. In a whole delivery period, advertisers usually desire a\ncertain impression count for the ads, and they also expect that the delivery\nperformance is as good as possible (e.g., obtaining high click-through rate).\nAdvertising platforms employ pacing algorithms to satisfy the demands via\nadjusting the selection probabilities to traffic requests in real-time.\nHowever, the delivery procedure is also affected by the strategies from\npublishers, which cannot be controlled by advertising platforms. Preloading is\na widely used strategy for many types of ads (e.g., video ads) to make sure\nthat the response time for displaying after a traffic request is legitimate,\nwhich results in delayed impression phenomenon. Traditional pacing algorithms\ncannot handle the preloading nature well because they rely on immediate\nfeedback signals, and may fail to guarantee the demands from advertisers.\n In this paper, we focus on a new research problem of impression pacing for\npreloaded ads, and propose a Reinforcement Learning To Pace framework RLTP. It\nlearns a pacing agent that sequentially produces selection probabilities in the\nwhole delivery period. To jointly optimize the two objectives of impression\ncount and delivery performance, RLTP employs tailored reward estimator to\nsatisfy the guaranteed impression count, penalize the over-delivery and\nmaximize the traffic value. Experiments on large-scale industrial datasets\nverify that RLTP outperforms baseline pacing algorithms by a large margin. We\nhave deployed the RLTP framework online to our advertising platform, and\nresults show that it achieves significant uplift to core metrics including\ndelivery completion rate and click-through rate.\n","authors":["Penghui Wei","Yongqiang Chen","Shaoguo Liu","Liang Wang","Bo Zheng"],"pdf_url":"https://arxiv.org/pdf/2302.02592v3.pdf","comment":"KDD 2023 (Applied Data Science Track). The first two authors\n contributed equally"},{"id":"http://arxiv.org/abs/2311.16720v3","updated":"2024-06-01T08:15:58Z","published":"2023-11-28T12:04:19Z","title":"A Two-Stage Adaptation of Large Language Models for Text Ranking","summary":" Text ranking is a critical task in information retrieval. Recent advances in\npre-trained language models (PLMs), especially large language models (LLMs),\npresent new opportunities for applying them to text ranking. While supervised\nfine-tuning (SFT) with ranking data has been widely explored to better align\nPLMs with text ranking goals, previous studies have focused primarily on\nencoder-only and encoder-decoder PLMs. Research on leveraging decoder-only LLMs\nfor text ranking remains scarce. An exception to this is RankLLaMA, which uses\ndirect SFT to explore LLaMA's potential for text ranking. In this work, we\npropose a two-stage progressive paradigm to better adapt LLMs to text ranking.\nFirst, we conduct continual pre-training (CPT) of LLMs on a large\nweakly-supervised corpus. Second, we perform SFT, and propose an improved\noptimization strategy building upon RankLLaMA. Our experimental results on\nmultiple benchmarks show that our approach outperforms previous methods in both\nin-domain and out-domain scenarios.\n","authors":["Longhui Zhang","Yanzhao Zhang","Dingkun Long","Pengjun Xie","Meishan Zhang","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.16720v3.pdf","comment":"Accepted to Findings of ACL 2024. Code and models available at\n https://github.com/Alibaba-NLP/RankingGPT"},{"id":"http://arxiv.org/abs/2404.11343v2","updated":"2024-06-01T07:08:49Z","published":"2024-04-17T13:03:07Z","title":"Large Language Models meet Collaborative Filtering: An Efficient\n All-round LLM-based Recommender System","summary":" Collaborative filtering recommender systems (CF-RecSys) have shown successive\nresults in enhancing the user experience on social media and e-commerce\nplatforms. However, as CF-RecSys struggles under cold scenarios with sparse\nuser-item interactions, recent strategies have focused on leveraging modality\ninformation of user/items (e.g., text or images) based on pre-trained modality\nencoders and Large Language Models (LLMs). Despite their effectiveness under\ncold scenarios, we observe that they underperform simple traditional\ncollaborative filtering models under warm scenarios due to the lack of\ncollaborative knowledge. In this work, we propose an efficient All-round\nLLM-based Recommender system, called A-LLMRec, that excels not only in the cold\nscenario but also in the warm scenario. Our main idea is to enable an LLM to\ndirectly leverage the collaborative knowledge contained in a pre-trained\nstate-of-the-art CF-RecSys so that the emergent ability of the LLM as well as\nthe high-quality user/item embeddings that are already trained by the\nstate-of-the-art CF-RecSys can be jointly exploited. This approach yields two\nadvantages: (1) model-agnostic, allowing for integration with various existing\nCF-RecSys, and (2) efficiency, eliminating the extensive fine-tuning typically\nrequired for LLM-based recommenders. Our extensive experiments on various\nreal-world datasets demonstrate the superiority of A-LLMRec in various\nscenarios, including cold/warm, few-shot, cold user, and cross-domain\nscenarios. Beyond the recommendation task, we also show the potential of\nA-LLMRec in generating natural language outputs based on the understanding of\nthe collaborative knowledge by performing a favorite genre prediction task. Our\ncode is available at https://github.com/ghdtjr/A-LLMRec .\n","authors":["Sein Kim","Hongseok Kang","Seungyoon Choi","Donghyun Kim","Minchul Yang","Chanyoung Park"],"pdf_url":"https://arxiv.org/pdf/2404.11343v2.pdf","comment":"KDD 2024"},{"id":"http://arxiv.org/abs/2403.02630v3","updated":"2024-06-01T03:57:41Z","published":"2024-03-05T03:40:39Z","title":"FedHCDR: Federated Cross-Domain Recommendation with Hypergraph Signal\n Decoupling","summary":" In recent years, Cross-Domain Recommendation (CDR) has drawn significant\nattention, which utilizes user data from multiple domains to enhance the\nrecommendation performance. However, current CDR methods require sharing user\ndata across domains, thereby violating the General Data Protection Regulation\n(GDPR). Consequently, numerous approaches have been proposed for Federated\nCross-Domain Recommendation (FedCDR). Nevertheless, the data heterogeneity\nacross different domains inevitably influences the overall performance of\nfederated learning. In this study, we propose FedHCDR, a novel Federated\nCross-Domain Recommendation framework with Hypergraph signal decoupling.\nSpecifically, to address the data heterogeneity across domains, we introduce an\napproach called hypergraph signal decoupling (HSD) to decouple the user\nfeatures into domain-exclusive and domain-shared features. The approach employs\nhigh-pass and low-pass hypergraph filters to decouple domain-exclusive and\ndomain-shared user representations, which are trained by the local-global\nbi-directional transfer algorithm. In addition, a hypergraph contrastive\nlearning (HCL) module is devised to enhance the learning of domain-shared user\nrelationship information by perturbing the user hypergraph. Extensive\nexperiments conducted on three real-world scenarios demonstrate that FedHCDR\noutperforms existing baselines significantly.\n","authors":["Hongyu Zhang","Dongyi Zheng","Lin Zhong","Xu Yang","Jiyuan Feng","Yunqing Feng","Qing Liao"],"pdf_url":"https://arxiv.org/pdf/2403.02630v3.pdf","comment":"16 pages, 5 figures"},{"id":"http://arxiv.org/abs/2406.01633v1","updated":"2024-06-01T15:54:45Z","published":"2024-06-01T15:54:45Z","title":"On Overcoming Miscalibrated Conversational Priors in LLM-based Chatbots","summary":" We explore the use of Large Language Model (LLM-based) chatbots to power\nrecommender systems. We observe that the chatbots respond poorly when they\nencounter under-specified requests (e.g., they make incorrect assumptions,\nhedge with a long response, or refuse to answer). We conjecture that such\nmiscalibrated response tendencies (i.e., conversational priors) can be\nattributed to LLM fine-tuning using annotators -- single-turn annotations may\nnot capture multi-turn conversation utility, and the annotators' preferences\nmay not even be representative of users interacting with a recommender system.\n We first analyze public LLM chat logs to conclude that query\nunder-specification is common. Next, we study synthetic recommendation problems\nwith configurable latent item utilities and frame them as Partially Observed\nDecision Processes (PODP). We find that pre-trained LLMs can be sub-optimal for\nPODPs and derive better policies that clarify under-specified queries when\nappropriate. Then, we re-calibrate LLMs by prompting them with learned control\nmessages to approximate the improved policy. Finally, we show empirically that\nour lightweight learning approach effectively uses logged conversation data to\nre-calibrate the response strategies of LLM-based chatbots for recommendation\ntasks.\n","authors":["Christine Herlihy","Jennifer Neville","Tobias Schnabel","Adith Swaminathan"],"pdf_url":"https://arxiv.org/pdf/2406.01633v1.pdf","comment":"Preprint of UAI'24 conference publication"},{"id":"http://arxiv.org/abs/2406.01631v1","updated":"2024-06-01T11:56:08Z","published":"2024-06-01T11:56:08Z","title":"An LLM-based Recommender System Environment","summary":" Reinforcement learning (RL) has gained popularity in the realm of recommender\nsystems due to its ability to optimize long-term rewards and guide users in\ndiscovering relevant content. However, the successful implementation of RL in\nrecommender systems is challenging because of several factors, including the\nlimited availability of online data for training on-policy methods. This\nscarcity requires expensive human interaction for online model training.\nFurthermore, the development of effective evaluation frameworks that accurately\nreflect the quality of models remains a fundamental challenge in recommender\nsystems. To address these challenges, we propose a comprehensive framework for\nsynthetic environments that simulate human behavior by harnessing the\ncapabilities of large language models (LLMs). We complement our framework with\nin-depth ablation studies and demonstrate its effectiveness with experiments on\nmovie and book recommendations. By utilizing LLMs as synthetic users, this work\nintroduces a modular and novel framework for training RL-based recommender\nsystems. The software, including the RL environment, is publicly available.\n","authors":["Nathan Corecco","Giorgio Piatti","Luca A. Lanzendörfer","Flint Xiaofeng Fan","Roger Wattenhofer"],"pdf_url":"https://arxiv.org/pdf/2406.01631v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01629v1","updated":"2024-06-01T10:20:52Z","published":"2024-06-01T10:20:52Z","title":"RecDiff: Diffusion Model for Social Recommendation","summary":" Social recommendation has emerged as a powerful approach to enhance\npersonalized recommendations by leveraging the social connections among users,\nsuch as following and friend relations observed in online social platforms. The\nfundamental assumption of social recommendation is that socially-connected\nusers exhibit homophily in their preference patterns. This means that users\nconnected by social ties tend to have similar tastes in user-item activities,\nsuch as rating and purchasing. However, this assumption is not always valid due\nto the presence of irrelevant and false social ties, which can contaminate user\nembeddings and adversely affect recommendation accuracy. To address this\nchallenge, we propose a novel diffusion-based social denoising framework for\nrecommendation (RecDiff). Our approach utilizes a simple yet effective\nhidden-space diffusion paradigm to alleivate the noisy effect in the compressed\nand dense representation space. By performing multi-step noise diffusion and\nremoval, RecDiff possesses a robust ability to identify and eliminate noise\nfrom the encoded user representations, even when the noise levels vary. The\ndiffusion module is optimized in a downstream task-aware manner, thereby\nmaximizing its ability to enhance the recommendation process. We conducted\nextensive experiments to evaluate the efficacy of our framework, and the\nresults demonstrate its superiority in terms of recommendation accuracy,\ntraining efficiency, and denoising effectiveness. The source code for the model\nimplementation is publicly available at: https://github.com/HKUDS/RecDiff.\n","authors":["Zongwei Li","Lianghao Xia","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2406.01629v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00333v1","updated":"2024-06-01T07:18:56Z","published":"2024-06-01T07:18:56Z","title":"A Practice-Friendly Two-Stage LLM-Enhanced Paradigm in Sequential\n Recommendation","summary":" The training paradigm integrating large language models (LLM) is gradually\nreshaping sequential recommender systems (SRS) and has shown promising results.\nHowever, most existing LLM-enhanced methods rely on rich textual information on\nthe item side and instance-level supervised fine-tuning (SFT) to inject\ncollaborative information into LLM, which is inefficient and limited in many\napplications. To alleviate these problems, this paper proposes a novel\npractice-friendly two-stage LLM-enhanced paradigm (TSLRec) for SRS.\nSpecifically, in the information reconstruction stage, we design a new\nuser-level SFT task for collaborative information injection with the assistance\nof a pre-trained SRS model, which is more efficient and compatible with limited\ntext information. We aim to let LLM try to infer the latent category of each\nitem and reconstruct the corresponding user's preference distribution for all\ncategories from the user's interaction sequence. In the information\naugmentation stage, we feed each item into LLM to obtain a set of enhanced\nembeddings that combine collaborative information and LLM inference\ncapabilities. These embeddings can then be used to help train various future\nSRS models. Finally, we verify the effectiveness and efficiency of our TSLRec\non three SRS benchmark datasets.\n","authors":["Dugang Liu","Shenxian Xian","Xiaolin Lin","Xiaolian Zhang","Hong Zhu","Yuan Fang","Zhen Chen","Zhong Ming"],"pdf_url":"https://arxiv.org/pdf/2406.00333v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00323v1","updated":"2024-06-01T06:53:03Z","published":"2024-06-01T06:53:03Z","title":"BeFA: A General Behavior-driven Feature Adapter for Multimedia\n Recommendation","summary":" Multimedia recommender systems focus on utilizing behavioral information and\ncontent information to model user preferences. Typically, it employs\npre-trained feature encoders to extract content features, then fuses them with\nbehavioral features. However, pre-trained feature encoders often extract\nfeatures from the entire content simultaneously, including excessive\npreference-irrelevant details. We speculate that it may result in the extracted\nfeatures not containing sufficient features to accurately reflect user\npreferences. To verify our hypothesis, we introduce an attribution analysis\nmethod for visually and intuitively analyzing the content features. The results\nindicate that certain products' content features exhibit the issues of\ninformation drift}and information omission,reducing the expressive ability of\nfeatures. Building upon this finding, we propose an effective and efficient\ngeneral Behavior-driven Feature Adapter (BeFA) to tackle these issues. This\nadapter reconstructs the content feature with the guidance of behavioral\ninformation, enabling content features accurately reflecting user preferences.\nExtensive experiments demonstrate the effectiveness of the adapter across all\nmultimedia recommendation methods. The code will be publicly available upon the\npaper's acceptance.\n","authors":["Qile Fan","Penghang Yu","Zhiyi Tan","Bing-Kun Bao","Guanming Lu"],"pdf_url":"https://arxiv.org/pdf/2406.00323v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00318v1","updated":"2024-06-01T06:28:41Z","published":"2024-06-01T06:28:41Z","title":"KGLink: A column type annotation method that combines knowledge graph\n and pre-trained language model","summary":" The semantic annotation of tabular data plays a crucial role in various\ndownstream tasks. Previous research has proposed knowledge graph (KG)-based and\ndeep learning-based methods, each with its inherent limitations. KG-based\nmethods encounter difficulties annotating columns when there is no match for\ncolumn cells in the KG. Moreover, KG-based methods can provide multiple\npredictions for one column, making it challenging to determine the semantic\ntype with the most suitable granularity for the dataset. This type granularity\nissue limits their scalability.\n On the other hand, deep learning-based methods face challenges related to the\nvaluable context missing issue. This occurs when the information within the\ntable is insufficient for determining the correct column type.\n This paper presents KGLink, a method that combines WikiData KG information\nwith a pre-trained deep learning language model for table column annotation,\neffectively addressing both type granularity and valuable context missing\nissues. Through comprehensive experiments on widely used tabular datasets\nencompassing numeric and string columns with varying type granularity, we\nshowcase the effectiveness and efficiency of KGLink. By leveraging the\nstrengths of KGLink, we successfully surmount challenges related to type\ngranularity and valuable context issues, establishing it as a robust solution\nfor the semantic annotation of tabular data.\n","authors":["Yubo Wang","Hao Xin","Lei Chen"],"pdf_url":"https://arxiv.org/pdf/2406.00318v1.pdf","comment":"To be published in ICDE 2024"},{"id":"http://arxiv.org/abs/2406.00247v1","updated":"2024-06-01T00:52:41Z","published":"2024-06-01T00:52:41Z","title":"Large Language Models for Relevance Judgment in Product Search","summary":" High relevance of retrieved and re-ranked items to the search query is the\ncornerstone of successful product search, yet measuring relevance of items to\nqueries is one of the most challenging tasks in product information retrieval,\nand quality of product search is highly influenced by the precision and scale\nof available relevance-labelled data. In this paper, we present an array of\ntechniques for leveraging Large Language Models (LLMs) for automating the\nrelevance judgment of query-item pairs (QIPs) at scale. Using a unique dataset\nof multi-million QIPs, annotated by human evaluators, we test and optimize\nhyper parameters for finetuning billion-parameter LLMs with and without Low\nRank Adaption (LoRA), as well as various modes of item attribute concatenation\nand prompting in LLM finetuning, and consider trade offs in item attribute\ninclusion for quality of relevance predictions. We demonstrate considerable\nimprovement over baselines of prior generations of LLMs, as well as\noff-the-shelf models, towards relevance annotations on par with the human\nrelevance evaluators. Our findings have immediate implications for the growing\nfield of relevance judgment automation in product search.\n","authors":["Navid Mehrdad","Hrushikesh Mohapatra","Mossaab Bagdouri","Prijith Chandran","Alessandro Magnani","Xunfan Cai","Ajit Puthenputhussery","Sachin Yadav","Tony Lee","ChengXiang Zhai","Ciya Liao"],"pdf_url":"https://arxiv.org/pdf/2406.00247v1.pdf","comment":"10 pages, 1 figure, 11 tables - SIGIR 2024, LLM4Eval"}],"Multimedia":[{"id":"http://arxiv.org/abs/2312.15583v3","updated":"2024-06-01T09:31:15Z","published":"2023-12-25T01:57:22Z","title":"ITEACH-Net: Inverted Teacher-studEnt seArCH Network for Emotion\n Recognition in Conversation","summary":" There remain two critical challenges that hinder the development of ERC.\nFirstly, there is a lack of exploration into mining deeper insights from the\ndata itself for conversational emotion tasks. Secondly, the systems exhibit\nvulnerability to random modality feature missing, which is a common occurrence\nin realistic settings. Focusing on these two key challenges, we propose a novel\nframework for incomplete multimodal learning in ERC, called \"Inverted\nTeacher-studEnt seArCH Network (ITEACH-Net).\" ITEACH-Net comprises two novel\ncomponents: the Emotion Context Changing Encoder (ECCE) and the Inverted\nTeacher-Student (ITS) framework. Specifically, leveraging the tendency for\nemotional states to exhibit local stability within conversational contexts,\nECCE captures these patterns and further perceives their evolution over time.\nRecognizing the varying challenges of handling incomplete versus complete data,\nITS employs a teacher-student framework to decouple the respective\ncomputations. Subsequently, through Neural Architecture Search, the student\nmodel develops enhanced computational capabilities for handling incomplete data\ncompared to the teacher model. During testing, we design a novel evaluation\nmethod, testing the model's performance under different missing rate conditions\nwithout altering the model weights. We conduct experiments on three benchmark\nERC datasets, and the results demonstrate that our ITEACH-Net outperforms\nexisting methods in incomplete multimodal ERC. We believe ITEACH-Net can\ninspire relevant research on the intrinsic nature of emotions within\nconversation scenarios and pave a more robust route for incomplete learning\ntechniques. Codes will be made available.\n","authors":["Haiyang Sun","Zheng Lian","Chenglong Wang","Kang Chen","Licai Sun","Bin Liu","Jianhua Tao"],"pdf_url":"https://arxiv.org/pdf/2312.15583v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00409v1","updated":"2024-06-01T11:43:00Z","published":"2024-06-01T11:43:00Z","title":"Arabic Handwritten Text for Person Biometric Identification: A Deep\n Learning Approach","summary":" This study thoroughly investigates how well deep learning models can\nrecognize Arabic handwritten text for person biometric identification. It\ncompares three advanced architectures -- ResNet50, MobileNetV2, and\nEfficientNetB7 -- using three widely recognized datasets: AHAWP, Khatt, and\nLAMIS-MSHD. Results show that EfficientNetB7 outperforms the others, achieving\ntest accuracies of 98.57\\%, 99.15\\%, and 99.79\\% on AHAWP, Khatt, and\nLAMIS-MSHD datasets, respectively. EfficientNetB7's exceptional performance is\ncredited to its innovative techniques, including compound scaling, depth-wise\nseparable convolutions, and squeeze-and-excitation blocks. These features allow\nthe model to extract more abstract and distinctive features from handwritten\ntext images. The study's findings hold significant implications for enhancing\nidentity verification and authentication systems, highlighting the potential of\ndeep learning in Arabic handwritten text recognition for person biometric\nidentification.\n","authors":["Mazen Balat","Youssef Mohamed","Ahmed Heakl","Ahmed Zaky"],"pdf_url":"https://arxiv.org/pdf/2406.00409v1.pdf","comment":"6 pages, 11 figures, 4 tables, International IEEE Conference on the\n Intelligent Methods, Systems, and Applications (IMSA)"},{"id":"http://arxiv.org/abs/2406.00323v1","updated":"2024-06-01T06:53:03Z","published":"2024-06-01T06:53:03Z","title":"BeFA: A General Behavior-driven Feature Adapter for Multimedia\n Recommendation","summary":" Multimedia recommender systems focus on utilizing behavioral information and\ncontent information to model user preferences. Typically, it employs\npre-trained feature encoders to extract content features, then fuses them with\nbehavioral features. However, pre-trained feature encoders often extract\nfeatures from the entire content simultaneously, including excessive\npreference-irrelevant details. We speculate that it may result in the extracted\nfeatures not containing sufficient features to accurately reflect user\npreferences. To verify our hypothesis, we introduce an attribution analysis\nmethod for visually and intuitively analyzing the content features. The results\nindicate that certain products' content features exhibit the issues of\ninformation drift}and information omission,reducing the expressive ability of\nfeatures. Building upon this finding, we propose an effective and efficient\ngeneral Behavior-driven Feature Adapter (BeFA) to tackle these issues. This\nadapter reconstructs the content feature with the guidance of behavioral\ninformation, enabling content features accurately reflecting user preferences.\nExtensive experiments demonstrate the effectiveness of the adapter across all\nmultimedia recommendation methods. The code will be publicly available upon the\npaper's acceptance.\n","authors":["Qile Fan","Penghang Yu","Zhiyi Tan","Bing-Kun Bao","Guanming Lu"],"pdf_url":"https://arxiv.org/pdf/2406.00323v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00320v1","updated":"2024-06-01T06:40:22Z","published":"2024-06-01T06:40:22Z","title":"Frieren: Efficient Video-to-Audio Generation with Rectified Flow\n Matching","summary":" Video-to-audio (V2A) generation aims to synthesize content-matching audio\nfrom silent video, and it remains challenging to build V2A models with high\ngeneration quality, efficiency, and visual-audio temporal synchrony. We propose\nFrieren, a V2A model based on rectified flow matching. Frieren regresses the\nconditional transport vector field from noise to spectrogram latent with\nstraight paths and conducts sampling by solving ODE, outperforming\nautoregressive and score-based models in terms of audio quality. By employing a\nnon-autoregressive vector field estimator based on a feed-forward transformer\nand channel-level cross-modal feature fusion with strong temporal alignment,\nour model generates audio that is highly synchronized with the input video.\nFurthermore, through reflow and one-step distillation with guided vector field,\nour model can generate decent audio in a few, or even only one sampling step.\nExperiments indicate that Frieren achieves state-of-the-art performance in both\ngeneration quality and temporal alignment on VGGSound, with alignment accuracy\nreaching 97.22%, and 6.2% improvement in inception score over the strong\ndiffusion-based baseline. Audio samples are available at\nhttp://frieren-v2a.github.io .\n","authors":["Yongqi Wang","Wenxiang Guo","Rongjie Huang","Jiawei Huang","Zehan Wang","Fuming You","Ruiqi Li","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2406.00320v1.pdf","comment":null}]},"2024-06-04T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2406.02543v1","updated":"2024-06-04T17:58:18Z","published":"2024-06-04T17:58:18Z","title":"To Believe or Not to Believe Your LLM","summary":" We explore uncertainty quantification in large language models (LLMs), with\nthe goal to identify when uncertainty in responses given a query is large. We\nsimultaneously consider both epistemic and aleatoric uncertainties, where the\nformer comes from the lack of knowledge about the ground truth (such as about\nfacts or the language), and the latter comes from irreducible randomness (such\nas multiple possible answers). In particular, we derive an\ninformation-theoretic metric that allows to reliably detect when only epistemic\nuncertainty is large, in which case the output of the model is unreliable. This\ncondition can be computed based solely on the output of the model obtained\nsimply by some special iterative prompting based on the previous responses.\nSuch quantification, for instance, allows to detect hallucinations (cases when\nepistemic uncertainty is high) in both single- and multi-answer responses. This\nis in contrast to many standard uncertainty quantification strategies (such as\nthresholding the log-likelihood of a response) where hallucinations in the\nmulti-answer case cannot be detected. We conduct a series of experiments which\ndemonstrate the advantage of our formulation. Further, our investigations shed\nsome light on how the probabilities assigned to a given output by an LLM can be\namplified by iterative prompting, which might be of independent interest.\n","authors":["Yasin Abbasi Yadkori","Ilja Kuzborskij","András György","Csaba Szepesvári"],"pdf_url":"https://arxiv.org/pdf/2406.02543v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02539v1","updated":"2024-06-04T17:56:28Z","published":"2024-06-04T17:56:28Z","title":"Parrot: Multilingual Visual Instruction Tuning","summary":" The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V\nhas marked a significant step towards artificial general intelligence. Existing\nmethods mainly focus on aligning vision encoders with LLMs through supervised\nfine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs'\ninherent ability to react to multiple languages progressively deteriorate as\nthe training process evolves. We empirically find that the imbalanced SFT\ndatasets, primarily composed of English-centric image-text pairs, lead to\nsignificantly reduced performance in non-English languages. This is due to the\nfailure of aligning the vision encoder and LLM with multilingual tokens during\nthe SFT process. In this paper, we introduce Parrot, a novel method that\nutilizes textual guidance to drive visual token alignment at the language\nlevel. Parrot makes the visual tokens condition on diverse language inputs and\nuses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens.\nSpecifically, to enhance non-English visual tokens alignment, we compute the\ncross-attention using the initial visual features and textual embeddings, the\nresult of which is then fed into the MoE router to select the most relevant\nexperts. The selected experts subsequently convert the initial visual tokens\ninto language-specific visual tokens. Moreover, considering the current lack of\nbenchmarks for evaluating multilingual capabilities within the field, we\ncollect and make available a Massive Multilingual Multimodal Benchmark which\nincludes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our\nmethod not only demonstrates state-of-the-art performance on multilingual\nMMBench and MMMB, but also excels across a broad range of multimodal tasks.\nBoth the source code and the training dataset of Parrot will be made publicly\navailable.\n","authors":["Hai-Long Sun","Da-Wei Zhou","Yang Li","Shiyin Lu","Chao Yi","Qing-Guo Chen","Zhao Xu","Weihua Luo","Kaifu Zhang","De-Chuan Zhan","Han-Jia Ye"],"pdf_url":"https://arxiv.org/pdf/2406.02539v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02537v1","updated":"2024-06-04T17:55:43Z","published":"2024-06-04T17:55:43Z","title":"TopViewRS: Vision-Language Models as Top-View Spatial Reasoners","summary":" Top-view perspective denotes a typical way in which humans read and reason\nover different types of maps, and it is vital for localization and navigation\nof humans as well as of `non-human' agents, such as the ones backed by large\nVision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of\nmodern VLMs remain unattested and underexplored. In this work, we thus study\ntheir capability to understand and reason over spatial relations from the top\nview. The focus on top view also enables controlled evaluations at different\ngranularity of spatial reasoning; we clearly disentangle different abilities\n(e.g., recognizing particular objects versus understanding their relative\npositions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset,\nconsisting of 11,384 multiple-choice questions with either realistic or\nsemantic top-view map as visual input. We then use it to study and evaluate\nVLMs across 4 perception and reasoning tasks with different levels of\ncomplexity. Evaluation of 10 representative open- and closed-source VLMs\nreveals the gap of more than 50% compared to average human performance, and it\nis even lower than the random baseline in some cases. Although additional\nexperiments show that Chain-of-Thought reasoning can boost model capabilities\nby 5.82% on average, the overall performance of VLMs remains limited. Our\nfindings underscore the critical need for enhanced model capability in top-view\nspatial reasoning and set a foundation for further research towards human-level\nproficiency of VLMs in real-world multimodal tasks.\n","authors":["Chengzu Li","Caiqi Zhang","Han Zhou","Nigel Collier","Anna Korhonen","Ivan Vulić"],"pdf_url":"https://arxiv.org/pdf/2406.02537v1.pdf","comment":"9 pages, 3 figures, 3 tables (21 pages, 4 figures, 15 tables\n including references and appendices)"},{"id":"http://arxiv.org/abs/2406.02536v1","updated":"2024-06-04T17:55:38Z","published":"2024-06-04T17:55:38Z","title":"Mitigate Position Bias in Large Language Models via Scaling a Single\n Dimension","summary":" Large Language Models (LLMs) are increasingly applied in various real-world\nscenarios due to their excellent generalization capabilities and robust\ngenerative abilities. However, they exhibit position bias, also known as \"lost\nin the middle\", a phenomenon that is especially pronounced in long-context\nscenarios, which indicates the placement of the key information in different\npositions of a prompt can significantly affect accuracy. This paper first\nexplores the micro-level manifestations of position bias, concluding that\nattention weights are a micro-level expression of position bias. It further\nidentifies that, in addition to position embeddings, causal attention mask also\ncontributes to position bias by creating position-specific hidden states. Based\non these insights, we propose a method to mitigate position bias by scaling\nthis positional hidden states. Experiments on the NaturalQuestions\nMulti-document QA, KV retrieval, LongBench and timeline reorder tasks, using\nvarious models including RoPE models, context windowextended models, and Alibi\nmodels, demonstrate the effectiveness and generalizability of our approach. Our\nmethod can improve performance by up to 15.2% by modifying just one dimension\nof hidden states. Our code is available at https://aka.ms/PositionalHidden.\n","authors":["Yijiong Yu","Huiqiang Jiang","Xufang Luo","Qianhui Wu","Chin-Yew Lin","Dongsheng Li","Yuqing Yang","Yongfeng Huang","Lili Qiu"],"pdf_url":"https://arxiv.org/pdf/2406.02536v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02532v1","updated":"2024-06-04T17:53:36Z","published":"2024-06-04T17:53:36Z","title":"SpecExec: Massively Parallel Speculative Decoding for Interactive LLM\n Inference on Consumer Devices","summary":" As large language models gain widespread adoption, running them efficiently\nbecomes crucial. Recent works on LLM inference use speculative decoding to\nachieve extreme speedups. However, most of these works implicitly design their\nalgorithms for high-end datacenter hardware. In this work, we ask the opposite\nquestion: how fast can we run LLMs on consumer machines? Consumer GPUs can no\nlonger fit the largest available models (50B+ parameters) and must offload them\nto RAM or SSD. When running with offloaded parameters, the inference engine can\nprocess batches of hundreds or thousands of tokens at the same time as just one\ntoken, making it a natural fit for speculative decoding. We propose SpecExec\n(Speculative Execution), a simple parallel decoding method that can generate up\nto 20 tokens per target model iteration for popular LLM families. It utilizes\nthe high spikiness of the token probabilities distribution in modern LLMs and a\nhigh degree of alignment between model output probabilities. SpecExec takes the\nmost probable tokens continuation from the draft model to build a \"cache\" tree\nfor the target model, which then gets validated in a single pass. Using\nSpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with\nRAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens\nper second with 16-bit weights.\n","authors":["Ruslan Svirschevski","Avner May","Zhuoming Chen","Beidi Chen","Zhihao Jia","Max Ryabinin"],"pdf_url":"https://arxiv.org/pdf/2406.02532v1.pdf","comment":"preprint. arXiv admin note: text overlap with arXiv:2312.17238 by\n other authors"},{"id":"http://arxiv.org/abs/2406.02528v1","updated":"2024-06-04T17:50:34Z","published":"2024-06-04T17:50:34Z","title":"Scalable MatMul-free Language Modeling","summary":" Matrix multiplication (MatMul) typically dominates the overall computational\ncost of large language models (LLMs). This cost only grows as LLMs scale to\nlarger embedding dimensions and context lengths. In this work, we show that\nMatMul operations can be completely eliminated from LLMs while maintaining\nstrong performance at billion-parameter scales. Our experiments show that our\nproposed MatMul-free models achieve performance on-par with state-of-the-art\nTransformers that require far more memory during inference at a scale up to at\nleast 2.7B parameters. We investigate the scaling laws and find that the\nperformance gap between our MatMul-free models and full precision Transformers\nnarrows as the model size increases. We also provide a GPU-efficient\nimplementation of this model which reduces memory usage by up to 61% over an\nunoptimized baseline during training. By utilizing an optimized kernel during\ninference, our model's memory consumption can be reduced by more than 10x\ncompared to unoptimized models. To properly quantify the efficiency of our\narchitecture, we build a custom hardware solution on an FPGA which exploits\nlightweight operations beyond what GPUs are capable of. We processed\nbillion-parameter scale models at 13W beyond human readable throughput, moving\nLLMs closer to brain-like efficiency. This work not only shows how far LLMs can\nbe stripped back while still performing effectively, but also points at the\ntypes of operations future accelerators should be optimized for in processing\nthe next generation of lightweight LLMs. Our code implementation is available\nat \\url{https://github.com/ridgerchu/matmulfreellm}.\n","authors":["Rui-Jie Zhu","Yu Zhang","Ethan Sifferman","Tyler Sheaves","Yiqiao Wang","Dustin Richmond","Peng Zhou","Jason K. Eshraghian"],"pdf_url":"https://arxiv.org/pdf/2406.02528v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02524v1","updated":"2024-06-04T17:42:21Z","published":"2024-06-04T17:42:21Z","title":"CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks","summary":" Large Language Models (LLMs) are revolutionizing various domains, yet\nverifying their answers remains a significant challenge, especially for\nintricate open-ended tasks such as consolidation, summarization, and extraction\nof knowledge. In this work, we propose CheckEmbed: an accurate, scalable, and\nsimple LLM verification approach. CheckEmbed is driven by a straightforward yet\npowerful idea: in order to compare LLM solutions to one another or to the\nground-truth, compare their corresponding answer-level embeddings obtained with\na model such as GPT Text Embedding Large. This reduces a complex textual answer\nto a single embedding, facilitating straightforward, fast, and meaningful\nverification. We develop a comprehensive verification pipeline implementing the\nCheckEmbed methodology. The CheckEmbed pipeline also comes with metrics for\nassessing the truthfulness of the LLM answers, such as embedding heatmaps and\ntheir summaries. We show how to use these metrics for deploying practical\nengines that decide whether an LLM answer is satisfactory or not. We apply the\npipeline to real-world document analysis tasks, including term extraction and\ndocument summarization, showcasing significant improvements in accuracy,\ncost-effectiveness, and runtime performance compared to existing token-,\nsentence-, and fact-level schemes such as BERTScore or SelfCheckGPT.\n","authors":["Maciej Besta","Lorenzo Paleari","Ales Kubicek","Piotr Nyczyk","Robert Gerstenberger","Patrick Iff","Tomasz Lehmann","Hubert Niewiadomski","Torsten Hoefler"],"pdf_url":"https://arxiv.org/pdf/2406.02524v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02517v1","updated":"2024-06-04T17:39:23Z","published":"2024-06-04T17:39:23Z","title":"Deterministic Reversible Data Augmentation for Neural Machine\n Translation","summary":" Data augmentation is an effective way to diversify corpora in machine\ntranslation, but previous methods may introduce semantic inconsistency between\noriginal and augmented data because of irreversible operations and random\nsubword sampling procedures. To generate both symbolically diverse and\nsemantically consistent augmentation data, we propose Deterministic Reversible\nData Augmentation (DRDA), a simple but effective data augmentation method for\nneural machine translation. DRDA adopts deterministic segmentations and\nreversible operations to generate multi-granularity subword representations and\npulls them closer together with multi-view techniques. With no extra corpora or\nmodel changes required, DRDA outperforms strong baselines on several\ntranslation tasks with a clear margin (up to 4.3 BLEU gain over Transformer)\nand exhibits good robustness in noisy, low-resource, and cross-domain datasets.\n","authors":["Jiashu Yao","Heyan Huang","Zeming Liu","Yuhang Guo"],"pdf_url":"https://arxiv.org/pdf/2406.02517v1.pdf","comment":"Findings of ACL 2024"},{"id":"http://arxiv.org/abs/2404.15485v2","updated":"2024-06-04T17:37:08Z","published":"2024-04-23T19:55:18Z","title":"Large Language Models Spot Phishing Emails with Surprising Accuracy: A\n Comparative Analysis of Performance","summary":" Phishing, a prevalent cybercrime tactic for decades, remains a significant\nthreat in today's digital world. By leveraging clever social engineering\nelements and modern technology, cybercrime targets many individuals,\nbusinesses, and organizations to exploit trust and security. These\ncyber-attackers are often disguised in many trustworthy forms to appear as\nlegitimate sources. By cleverly using psychological elements like urgency,\nfear, social proof, and other manipulative strategies, phishers can lure\nindividuals into revealing sensitive and personalized information. Building on\nthis pervasive issue within modern technology, this paper aims to analyze the\neffectiveness of 15 Large Language Models (LLMs) in detecting phishing\nattempts, specifically focusing on a randomized set of \"419 Scam\" emails. The\nobjective is to determine which LLMs can accurately detect phishing emails by\nanalyzing a text file containing email metadata based on predefined criteria.\nThe experiment concluded that the following models, ChatGPT 3.5,\nGPT-3.5-Turbo-Instruct, and ChatGPT, were the most effective in detecting\nphishing emails.\n","authors":["Het Patel","Umair Rehman","Farkhund Iqbal"],"pdf_url":"https://arxiv.org/pdf/2404.15485v2.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2402.04833v2","updated":"2024-06-04T17:20:01Z","published":"2024-02-07T13:32:11Z","title":"Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for\n Instruction Fine-Tuning","summary":" There is a consensus that instruction fine-tuning of LLMs requires\nhigh-quality data, but what are they? LIMA (NeurIPS 2023) and AlpaGasus (ICLR\n2024) are state-of-the-art methods for selecting such high-quality examples,\neither via manual curation or using GPT-3.5-Turbo as a quality scorer. We show\nthat the extremely simple baseline of selecting the 1,000 instructions with\nlongest responses -- that intuitively contain more learnable information and\nare harder to overfit -- from standard datasets can consistently outperform\nthese sophisticated methods according to GPT-4 and PaLM-2 as judges, while\nremaining competitive on the Open LLM benchmarks that test factual knowledge.\nWe demonstrate this for several LLMs (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1)\nand datasets (Alpaca-52k, Evol-Instruct-70k). In addition, a lightweight\nrefinement of such long instructions can further improve the abilities of the\nfine-tuned LLMs, and allows us to obtain competitive results on MT-Bench and\nthe 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0, while training\non only 1,000 examples and no extra preference data. We also conduct a thorough\nanalysis of our models to ensure that their enhanced performance is not simply\ndue to GPT-4's preference for longer responses. Overall, our findings suggest\nthat fine-tuning on the longest responses should be the default baseline for\nany work on instruction fine-tuning. We provide our code at\nhttps://github.com/tml-epfl/long-is-more-for-alignment.\n","authors":["Hao Zhao","Maksym Andriushchenko","Francesco Croce","Nicolas Flammarion"],"pdf_url":"https://arxiv.org/pdf/2402.04833v2.pdf","comment":"Accepted at ICML 2024. This camera-ready version adds MT-Bench\n evaluations, a human study, more thorough analysis of length bias. Code at\n https://github.com/tml-epfl/long-is-more-for-alignment"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2406.02552v1","updated":"2024-06-04T17:59:57Z","published":"2024-06-04T17:59:57Z","title":"VHS: High-Resolution Iterative Stereo Matching with Visual Hull Priors","summary":" We present a stereo-matching method for depth estimation from high-resolution\nimages using visual hulls as priors, and a memory-efficient technique for the\ncorrelation computation. Our method uses object masks extracted from\nsupplementary views of the scene to guide the disparity estimation, effectively\nreducing the search space for matches. This approach is specifically tailored\nto stereo rigs in volumetric capture systems, where an accurate depth plays a\nkey role in the downstream reconstruction task. To enable training and\nregression at high resolutions targeted by recent systems, our approach extends\na sparse correlation computation into a hybrid sparse-dense scheme suitable for\napplication in leading recurrent network architectures. We evaluate the\nperformance-efficiency trade-off of our method compared to state-of-the-art\nmethods, and demonstrate the efficacy of the visual hull guidance. In addition,\nwe propose a training scheme for a further reduction of memory requirements\nduring optimization, facilitating training on high-resolution data.\n","authors":["Markus Plack","Hannah Dröge","Leif Van Holland","Matthias B. Hullin"],"pdf_url":"https://arxiv.org/pdf/2406.02552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02549v1","updated":"2024-06-04T17:59:32Z","published":"2024-06-04T17:59:32Z","title":"Dreamguider: Improved Training free Diffusion-based Conditional\n Generation","summary":" Diffusion models have emerged as a formidable tool for training-free\nconditional generation.However, a key hurdle in inference-time guidance\ntechniques is the need for compute-heavy backpropagation through the diffusion\nnetwork for estimating the guidance direction. Moreover, these techniques often\nrequire handcrafted parameter tuning on a case-by-case basis. Although some\nrecent works have introduced minimal compute methods for linear inverse\nproblems, a generic lightweight guidance solution to both linear and non-linear\nguidance problems is still missing. To this end, we propose Dreamguider, a\nmethod that enables inference-time guidance without compute-heavy\nbackpropagation through the diffusion network. The key idea is to regulate the\ngradient flow through a time-varying factor. Moreover, we propose an empirical\nguidance scale that works for a wide variety of tasks, hence removing the need\nfor handcrafted parameter tuning. We further introduce an effective lightweight\naugmentation strategy that significantly boosts the performance during\ninference-time guidance. We present experiments using Dreamguider on multiple\ntasks across multiple datasets and models to show the effectiveness of the\nproposed modules. To facilitate further research, we will make the code public\nafter the review process.\n","authors":["Nithin Gopalakrishnan Nair","Vishal M Patel"],"pdf_url":"https://arxiv.org/pdf/2406.02549v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02548v1","updated":"2024-06-04T17:59:31Z","published":"2024-06-04T17:59:31Z","title":"Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance\n Segmentation","summary":" Recent works on open-vocabulary 3D instance segmentation show strong promise,\nbut at the cost of slow inference speed and high computation requirements. This\nhigh computation cost is typically due to their heavy reliance on 3D clip\nfeatures, which require computationally expensive 2D foundation models like\nSegment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a\nconsequence, this hampers their applicability in many real-world applications\nthat require both fast and accurate predictions. To this end, we propose a fast\nyet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO\n3D, that effectively leverages only 2D object detection from multi-view RGB\nimages for open-vocabulary 3D instance segmentation. We address this task by\ngenerating class-agnostic 3D masks for objects in the scene and associating\nthem with text prompts. We observe that the projection of class-agnostic 3D\npoint cloud instances already holds instance information; thus, using SAM might\nonly result in redundancy that unnecessarily increases the inference time. We\nempirically find that a better performance of matching text prompts to 3D masks\ncan be achieved in a faster fashion with a 2D object detector. We validate our\nOpen-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios:\n(i) with ground truth masks, where labels are required for given object\nproposals, and (ii) with class-agnostic 3D proposals generated from a 3D\nproposal network. Our Open-YOLO 3D achieves state-of-the-art performance on\nboth datasets while obtaining up to $\\sim$16$\\times$ speedup compared to the\nbest existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D\nachieves mean average precision (mAP) of 24.7\\% while operating at 22 seconds\nper scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.\n","authors":["Mohamed El Amine Boudjoghra","Angela Dai","Jean Lahoud","Hisham Cholakkal","Rao Muhammad Anwer","Salman Khan","Fahad Shahbaz Khan"],"pdf_url":"https://arxiv.org/pdf/2406.02548v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02547v1","updated":"2024-06-04T17:59:25Z","published":"2024-06-04T17:59:25Z","title":"Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal\n Learning","summary":" Training models with longer in-context lengths is a significant challenge for\nmultimodal model due to substantial GPU memory and computational costs. This\nexploratory study does not present state-of-the-art models; rather, it\nintroduces an innovative method designed to increase in-context text length in\nmulti-modality large language models (MLLMs) efficiently. We present Visualized\nIn-Context Text Processing (VisInContext), which processes long in-context text\nusing visual tokens. This technique significantly reduces GPU memory usage and\nfloating point operations (FLOPs) for both training and inferenceing stage. For\ninstance, our method expands the pre-training in-context text length from 256\nto 2048 tokens with nearly same FLOPs for a 56 billion parameter MOE model.\nExperimental results demonstrate that model trained with VisInContext delivers\nsuperior performance on common downstream benchmarks for in-context few-shot\nevaluation. Additionally, VisInContext is complementary to existing methods for\nincreasing in-context text length and enhances document understanding\ncapabilities, showing great potential in document QA tasks and sequential\ndocument retrieval.\n","authors":["Alex Jinpeng Wang","Linjie Li","Yiqi Lin","Min Li","Lijuan Wang","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2406.02547v1.pdf","comment":"12 pages. The website is\n \\url{https://fingerrec.github.io/visincontext}"},{"id":"http://arxiv.org/abs/2406.02541v1","updated":"2024-06-04T17:57:37Z","published":"2024-06-04T17:57:37Z","title":"Enhancing Temporal Consistency in Video Editing by Reconstructing Videos\n with 3D Gaussian Splatting","summary":" Recent advancements in zero-shot video diffusion models have shown promise\nfor text-driven video editing, but challenges remain in achieving high temporal\nconsistency. To address this, we introduce Video-3DGS, a 3D Gaussian Splatting\n(3DGS)-based video refiner designed to enhance temporal consistency in\nzero-shot video editors. Our approach utilizes a two-stage 3D Gaussian\noptimizing process tailored for editing dynamic monocular videos. In the first\nstage, Video-3DGS employs an improved version of COLMAP, referred to as\nMC-COLMAP, which processes original videos using a Masked and Clipped approach.\nFor each video clip, MC-COLMAP generates the point clouds for dynamic\nforeground objects and complex backgrounds. These point clouds are utilized to\ninitialize two sets of 3D Gaussians (Frg-3DGS and Bkg-3DGS) aiming to represent\nforeground and background views. Both foreground and background views are then\nmerged with a 2D learnable parameter map to reconstruct full views. In the\nsecond stage, we leverage the reconstruction ability developed in the first\nstage to impose the temporal constraints on the video diffusion model. To\ndemonstrate the efficacy of Video-3DGS on both stages, we conduct extensive\nexperiments across two related tasks: Video Reconstruction and Video Editing.\nVideo-3DGS trained with 3k iterations significantly improves video\nreconstruction quality (+3 PSNR, +7 PSNR increase) and training efficiency\n(x1.9, x4.5 times faster) over NeRF-based and 3DGS-based state-of-art methods\non DAVIS dataset, respectively. Moreover, it enhances video editing by ensuring\ntemporal consistency across 58 dynamic monocular videos.\n","authors":["Inkyu Shin","Qihang Yu","Xiaohui Shen","In So Kweon","Kuk-Jin Yoon","Liang-Chieh Chen"],"pdf_url":"https://arxiv.org/pdf/2406.02541v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02540v1","updated":"2024-06-04T17:57:10Z","published":"2024-06-04T17:57:10Z","title":"ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers\n for Image and Video Generation","summary":" Diffusion transformers (DiTs) have exhibited remarkable performance in visual\ngeneration tasks, such as generating realistic images or videos based on\ntextual instructions. However, larger model sizes and multi-frame processing\nfor video generation lead to increased computational and memory costs, posing\nchallenges for practical deployment on edge devices. Post-Training Quantization\n(PTQ) is an effective method for reducing memory costs and computational\ncomplexity. When quantizing diffusion transformers, we find that applying\nexisting diffusion quantization methods designed for U-Net faces challenges in\npreserving quality. After analyzing the major challenges for quantizing\ndiffusion transformers, we design an improved quantization scheme: \"ViDiT-Q\":\nVideo and Image Diffusion Transformer Quantization) to address these issues.\nFurthermore, we identify highly sensitive layers and timesteps hinder\nquantization for lower bit-widths. To tackle this, we improve ViDiT-Q with a\nnovel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP). We\nvalidate the effectiveness of ViDiT-Q across a variety of text-to-image and\nvideo models. While baseline quantization methods fail at W8A8 and produce\nunreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization.\nViDiTQ-MP achieves W4A8 with negligible visual quality degradation, resulting\nin a 2.5x memory optimization and a 1.5x latency speedup.\n","authors":["Tianchen Zhao","Tongcheng Fang","Enshu Liu","Wan Rui","Widyadewi Soedarmadji","Shiyao Li","Zinan Lin","Guohao Dai","Shengen Yan","Huazhong Yang","Xuefei Ning","Yu Wang"],"pdf_url":"https://arxiv.org/pdf/2406.02540v1.pdf","comment":"Project Page: https://a-suozhang.xyz/viditq.github.io/"},{"id":"http://arxiv.org/abs/2406.02539v1","updated":"2024-06-04T17:56:28Z","published":"2024-06-04T17:56:28Z","title":"Parrot: Multilingual Visual Instruction Tuning","summary":" The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V\nhas marked a significant step towards artificial general intelligence. Existing\nmethods mainly focus on aligning vision encoders with LLMs through supervised\nfine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs'\ninherent ability to react to multiple languages progressively deteriorate as\nthe training process evolves. We empirically find that the imbalanced SFT\ndatasets, primarily composed of English-centric image-text pairs, lead to\nsignificantly reduced performance in non-English languages. This is due to the\nfailure of aligning the vision encoder and LLM with multilingual tokens during\nthe SFT process. In this paper, we introduce Parrot, a novel method that\nutilizes textual guidance to drive visual token alignment at the language\nlevel. Parrot makes the visual tokens condition on diverse language inputs and\nuses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens.\nSpecifically, to enhance non-English visual tokens alignment, we compute the\ncross-attention using the initial visual features and textual embeddings, the\nresult of which is then fed into the MoE router to select the most relevant\nexperts. The selected experts subsequently convert the initial visual tokens\ninto language-specific visual tokens. Moreover, considering the current lack of\nbenchmarks for evaluating multilingual capabilities within the field, we\ncollect and make available a Massive Multilingual Multimodal Benchmark which\nincludes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our\nmethod not only demonstrates state-of-the-art performance on multilingual\nMMBench and MMMB, but also excels across a broad range of multimodal tasks.\nBoth the source code and the training dataset of Parrot will be made publicly\navailable.\n","authors":["Hai-Long Sun","Da-Wei Zhou","Yang Li","Shiyin Lu","Chao Yi","Qing-Guo Chen","Zhao Xu","Weihua Luo","Kaifu Zhang","De-Chuan Zhan","Han-Jia Ye"],"pdf_url":"https://arxiv.org/pdf/2406.02539v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02537v1","updated":"2024-06-04T17:55:43Z","published":"2024-06-04T17:55:43Z","title":"TopViewRS: Vision-Language Models as Top-View Spatial Reasoners","summary":" Top-view perspective denotes a typical way in which humans read and reason\nover different types of maps, and it is vital for localization and navigation\nof humans as well as of `non-human' agents, such as the ones backed by large\nVision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of\nmodern VLMs remain unattested and underexplored. In this work, we thus study\ntheir capability to understand and reason over spatial relations from the top\nview. The focus on top view also enables controlled evaluations at different\ngranularity of spatial reasoning; we clearly disentangle different abilities\n(e.g., recognizing particular objects versus understanding their relative\npositions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset,\nconsisting of 11,384 multiple-choice questions with either realistic or\nsemantic top-view map as visual input. We then use it to study and evaluate\nVLMs across 4 perception and reasoning tasks with different levels of\ncomplexity. Evaluation of 10 representative open- and closed-source VLMs\nreveals the gap of more than 50% compared to average human performance, and it\nis even lower than the random baseline in some cases. Although additional\nexperiments show that Chain-of-Thought reasoning can boost model capabilities\nby 5.82% on average, the overall performance of VLMs remains limited. Our\nfindings underscore the critical need for enhanced model capability in top-view\nspatial reasoning and set a foundation for further research towards human-level\nproficiency of VLMs in real-world multimodal tasks.\n","authors":["Chengzu Li","Caiqi Zhang","Han Zhou","Nigel Collier","Anna Korhonen","Ivan Vulić"],"pdf_url":"https://arxiv.org/pdf/2406.02537v1.pdf","comment":"9 pages, 3 figures, 3 tables (21 pages, 4 figures, 15 tables\n including references and appendices)"},{"id":"http://arxiv.org/abs/2406.02535v1","updated":"2024-06-04T17:55:22Z","published":"2024-06-04T17:55:22Z","title":"Enhancing 2D Representation Learning with a 3D Prior","summary":" Learning robust and effective representations of visual data is a fundamental\ntask in computer vision. Traditionally, this is achieved by training models\nwith labeled data which can be expensive to obtain. Self-supervised learning\nattempts to circumvent the requirement for labeled data by learning\nrepresentations from raw unlabeled visual data alone. However, unlike humans\nwho obtain rich 3D information from their binocular vision and through motion,\nthe majority of current self-supervised methods are tasked with learning from\nmonocular 2D image collections. This is noteworthy as it has been demonstrated\nthat shape-centric visual processing is more robust compared to texture-biased\nautomated methods. Inspired by this, we propose a new approach for\nstrengthening existing self-supervised methods by explicitly enforcing a strong\n3D structural prior directly into the model during training. Through\nexperiments, across a range of datasets, we demonstrate that our 3D aware\nrepresentations are more robust compared to conventional self-supervised\nbaselines.\n","authors":["Mehmet Aygün","Prithviraj Dhar","Zhicheng Yan","Oisin Mac Aodha","Rakesh Ranjan"],"pdf_url":"https://arxiv.org/pdf/2406.02535v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.01101v2","updated":"2024-06-04T17:55:02Z","published":"2023-04-03T16:01:03Z","title":"Dsfer-Net: A Deep Supervision and Feature Retrieval Network for\n Bitemporal Change Detection Using Modern Hopfield Networks","summary":" Change detection, an essential application for high-resolution remote sensing\nimages, aims to monitor and analyze changes in the land surface over time. Due\nto the rapid increase in the quantity of high-resolution remote sensing data\nand the complexity of texture features, several quantitative deep\nlearning-based methods have been proposed. These methods outperform traditional\nchange detection methods by extracting deep features and combining\nspatial-temporal information. However, reasonable explanations for how deep\nfeatures improve detection performance are still lacking. In our\ninvestigations, we found that modern Hopfield network layers significantly\nenhance semantic understanding. In this paper, we propose a Deep Supervision\nand FEature Retrieval network (Dsfer-Net) for bitemporal change detection.\nSpecifically, the highly representative deep features of bitemporal images are\njointly extracted through a fully convolutional Siamese network. Based on the\nsequential geographical information of the bitemporal images, we designed a\nfeature retrieval module to extract difference features and leverage\ndiscriminative information in a deeply supervised manner. Additionally, we\nobserved that the deeply supervised feature retrieval module provides\nexplainable evidence of the semantic understanding of the proposed network in\nits deep layers. Finally, our end-to-end network establishes a novel framework\nby aggregating retrieved features and feature pairs from different layers.\nExperiments conducted on three public datasets (LEVIR-CD, WHU-CD, and CDD)\nconfirm the superiority of the proposed Dsfer-Net over other state-of-the-art\nmethods.\n","authors":["Shizhen Chang","Michael Kopp","Pedram Ghamisi","Bo Du"],"pdf_url":"https://arxiv.org/pdf/2304.01101v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02534v1","updated":"2024-06-04T17:54:44Z","published":"2024-06-04T17:54:44Z","title":"Enhancing predictive imaging biomarker discovery through treatment\n effect analysis","summary":" Identifying predictive biomarkers, which forecast individual treatment\neffectiveness, is crucial for personalized medicine and informs decision-making\nacross diverse disciplines. These biomarkers are extracted from pre-treatment\ndata, often within randomized controlled trials, and have to be distinguished\nfrom prognostic biomarkers, which are independent of treatment assignment. Our\nstudy focuses on the discovery of predictive imaging biomarkers, aiming to\nleverage pre-treatment images to unveil new causal relationships. Previous\napproaches relied on labor-intensive handcrafted or manually derived features,\nwhich may introduce biases. In response, we present a new task of discovering\npredictive imaging biomarkers directly from the pre-treatment images to learn\nrelevant image features. We propose an evaluation protocol for this task to\nassess a model's ability to identify predictive imaging biomarkers and\ndifferentiate them from prognostic ones. It employs statistical testing and a\ncomprehensive analysis of image feature attribution. We explore the suitability\nof deep learning models originally designed for estimating the conditional\naverage treatment effect (CATE) for this task, which previously have been\nprimarily assessed for the precision of CATE estimation, overlooking the\nevaluation of imaging biomarker discovery. Our proof-of-concept analysis\ndemonstrates promising results in discovering and validating predictive imaging\nbiomarkers from synthetic outcomes and real-world image datasets.\n","authors":["Shuhan Xiao","Lukas Klein","Jens Petersen","Philipp Vollmuth","Paul F. Jaeger","Klaus H. Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2406.02534v1.pdf","comment":"19 pages, 12 figures"},{"id":"http://arxiv.org/abs/2406.02533v1","updated":"2024-06-04T17:54:20Z","published":"2024-06-04T17:54:20Z","title":"SatSplatYOLO: 3D Gaussian Splatting-based Virtual Object Detection\n Ensembles for Satellite Feature Recognition","summary":" On-orbit servicing (OOS), inspection of spacecraft, and active debris removal\n(ADR). Such missions require precise rendezvous and proximity operations in the\nvicinity of non-cooperative, possibly unknown, resident space objects. Safety\nconcerns with manned missions and lag times with ground-based control\nnecessitate complete autonomy. In this article, we present an approach for\nmapping geometries and high-confidence detection of components of unknown,\nnon-cooperative satellites on orbit. We implement accelerated 3D Gaussian\nsplatting to learn a 3D representation of the satellite, render virtual views\nof the target, and ensemble the YOLOv5 object detector over the virtual views,\nresulting in reliable, accurate, and precise satellite component detections.\nThe full pipeline capable of running on-board and stand to enable downstream\nmachine intelligence tasks necessary for autonomous guidance, navigation, and\ncontrol tasks.\n","authors":["Van Minh Nguyen","Emma Sandidge","Trupti Mahendrakar","Ryan T. White"],"pdf_url":"https://arxiv.org/pdf/2406.02533v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02529v1","updated":"2024-06-04T17:51:08Z","published":"2024-06-04T17:51:08Z","title":"ReLUs Are Sufficient for Learning Implicit Neural Representations","summary":" Motivated by the growing theoretical understanding of neural networks that\nemploy the Rectified Linear Unit (ReLU) as their activation function, we\nrevisit the use of ReLU activation functions for learning implicit neural\nrepresentations (INRs). Inspired by second order B-spline wavelets, we\nincorporate a set of simple constraints to the ReLU neurons in each layer of a\ndeep neural network (DNN) to remedy the spectral bias. This in turn enables its\nuse for various INR tasks. Empirically, we demonstrate that, contrary to\npopular belief, one can learn state-of-the-art INRs based on a DNN composed of\nonly ReLU neurons. Next, by leveraging recent theoretical works which\ncharacterize the kinds of functions ReLU neural networks learn, we provide a\nway to quantify the regularity of the learned function. This offers a\nprincipled approach to selecting the hyperparameters in INR architectures. We\nsubstantiate our claims through experiments in signal representation, super\nresolution, and computed tomography, demonstrating the versatility and\neffectiveness of our method. The code for all experiments can be found at\nhttps://github.com/joeshenouda/relu-inrs.\n","authors":["Joseph Shenouda","Yamin Zhou","Robert D. Nowak"],"pdf_url":"https://arxiv.org/pdf/2406.02529v1.pdf","comment":"Accepted to ICML 2024"},{"id":"http://arxiv.org/abs/2406.02518v1","updated":"2024-06-04T17:39:31Z","published":"2024-06-04T17:39:31Z","title":"DDGS-CT: Direction-Disentangled Gaussian Splatting for Realistic Volume\n Rendering","summary":" Digitally reconstructed radiographs (DRRs) are simulated 2D X-ray images\ngenerated from 3D CT volumes, widely used in preoperative settings but limited\nin intraoperative applications due to computational bottlenecks, especially for\naccurate but heavy physics-based Monte Carlo methods. While analytical DRR\nrenderers offer greater efficiency, they overlook anisotropic X-ray image\nformation phenomena, such as Compton scattering. We present a novel approach\nthat marries realistic physics-inspired X-ray simulation with efficient,\ndifferentiable DRR generation using 3D Gaussian splatting (3DGS). Our\ndirection-disentangled 3DGS (DDGS) method separates the radiosity contribution\ninto isotropic and direction-dependent components, approximating complex\nanisotropic interactions without intricate runtime simulations. Additionally,\nwe adapt the 3DGS initialization to account for tomography data properties,\nenhancing accuracy and efficiency. Our method outperforms state-of-the-art\ntechniques in image accuracy. Furthermore, our DDGS shows promise for\nintraoperative applications and inverse problems such as pose registration,\ndelivering superior registration accuracy and runtime performance compared to\nanalytical DRR methods.\n","authors":["Zhongpai Gao","Benjamin Planche","Meng Zheng","Xiao Chen","Terrence Chen","Ziyan Wu"],"pdf_url":"https://arxiv.org/pdf/2406.02518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02511v1","updated":"2024-06-04T17:32:52Z","published":"2024-06-04T17:32:52Z","title":"V-Express: Conditional Dropout for Progressive Training of Portrait\n Video Generation","summary":" In the field of portrait video generation, the use of single images to\ngenerate portrait videos has become increasingly prevalent. A common approach\ninvolves leveraging generative models to enhance adapters for controlled\ngeneration. However, control signals (e.g., text, audio, reference image, pose,\ndepth map, etc.) can vary in strength. Among these, weaker conditions often\nstruggle to be effective due to interference from stronger conditions, posing a\nchallenge in balancing these conditions. In our work on portrait video\ngeneration, we identified audio signals as particularly weak, often\novershadowed by stronger signals such as facial pose and reference image.\nHowever, direct training with weak signals often leads to difficulties in\nconvergence. To address this, we propose V-Express, a simple method that\nbalances different control signals through the progressive training and the\nconditional dropout operation. Our method gradually enables effective control\nby weak conditions, thereby achieving generation capabilities that\nsimultaneously take into account the facial pose, reference image, and audio.\nThe experimental results demonstrate that our method can effectively generate\nportrait videos controlled by audio. Furthermore, a potential solution is\nprovided for the simultaneous and effective use of conditions of varying\nstrengths.\n","authors":["Cong Wang","Kuan Tian","Jun Zhang","Yonghang Guan","Feng Luo","Fei Shen","Zhiwei Jiang","Qing Gu","Xiao Han","Wei Yang"],"pdf_url":"https://arxiv.org/pdf/2406.02511v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02509v1","updated":"2024-06-04T17:27:19Z","published":"2024-06-04T17:27:19Z","title":"CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation","summary":" Recently video diffusion models have emerged as expressive generative tools\nfor high-quality video content creation readily available to general users.\nHowever, these models often do not offer precise control over camera poses for\nvideo generation, limiting the expression of cinematic language and user\ncontrol. To address this issue, we introduce CamCo, which allows fine-grained\nCamera pose Control for image-to-video generation. We equip a pre-trained\nimage-to-video generator with accurately parameterized camera pose input using\nPl\\\"ucker coordinates. To enhance 3D consistency in the videos produced, we\nintegrate an epipolar attention module in each attention block that enforces\nepipolar constraints to the feature maps. Additionally, we fine-tune CamCo on\nreal-world videos with camera poses estimated through structure-from-motion\nalgorithms to better synthesize object motion. Our experiments show that CamCo\nsignificantly improves 3D consistency and camera control capabilities compared\nto previous models while effectively generating plausible object motion.\nProject page: https://ir1d.github.io/CamCo/\n","authors":["Dejia Xu","Weili Nie","Chao Liu","Sifei Liu","Jan Kautz","Zhangyang Wang","Arash Vahdat"],"pdf_url":"https://arxiv.org/pdf/2406.02509v1.pdf","comment":"Project page: https://ir1d.github.io/CamCo/"},{"id":"http://arxiv.org/abs/2406.02507v1","updated":"2024-06-04T17:25:59Z","published":"2024-06-04T17:25:59Z","title":"Guiding a Diffusion Model with a Bad Version of Itself","summary":" The primary axes of interest in image-generating diffusion models are image\nquality, the amount of variation in the results, and how well the results align\nwith a given condition, e.g., a class label or a text prompt. The popular\nclassifier-free guidance approach uses an unconditional model to guide a\nconditional model, leading to simultaneously better prompt alignment and\nhigher-quality images at the cost of reduced variation. These effects seem\ninherently entangled, and thus hard to control. We make the surprising\nobservation that it is possible to obtain disentangled control over image\nquality without compromising the amount of variation by guiding generation\nusing a smaller, less-trained version of the model itself rather than an\nunconditional model. This leads to significant improvements in ImageNet\ngeneration, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using\npublicly available networks. Furthermore, the method is also applicable to\nunconditional diffusion models, drastically improving their quality.\n","authors":["Tero Karras","Miika Aittala","Tuomas Kynkäänniemi","Jaakko Lehtinen","Timo Aila","Samuli Laine"],"pdf_url":"https://arxiv.org/pdf/2406.02507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14125v4","updated":"2024-06-04T17:25:20Z","published":"2023-12-21T18:46:41Z","title":"VideoPoet: A Large Language Model for Zero-Shot Video Generation","summary":" We present VideoPoet, a language model capable of synthesizing high-quality\nvideo, with matching audio, from a large variety of conditioning signals.\nVideoPoet employs a decoder-only transformer architecture that processes\nmultimodal inputs -- including images, videos, text, and audio. The training\nprotocol follows that of Large Language Models (LLMs), consisting of two\nstages: pretraining and task-specific adaptation. During pretraining, VideoPoet\nincorporates a mixture of multimodal generative objectives within an\nautoregressive Transformer framework. The pretrained LLM serves as a foundation\nthat can be adapted for a range of video generation tasks. We present empirical\nresults demonstrating the model's state-of-the-art capabilities in zero-shot\nvideo generation, specifically highlighting VideoPoet's ability to generate\nhigh-fidelity motions. Project page: http://sites.research.google/videopoet/\n","authors":["Dan Kondratyuk","Lijun Yu","Xiuye Gu","José Lezama","Jonathan Huang","Grant Schindler","Rachel Hornung","Vighnesh Birodkar","Jimmy Yan","Ming-Chang Chiu","Krishna Somandepalli","Hassan Akbari","Yair Alon","Yong Cheng","Josh Dillon","Agrim Gupta","Meera Hahn","Anja Hauth","David Hendon","Alonso Martinez","David Minnen","Mikhail Sirotenko","Kihyuk Sohn","Xuan Yang","Hartwig Adam","Ming-Hsuan Yang","Irfan Essa","Huisheng Wang","David A. Ross","Bryan Seybold","Lu Jiang"],"pdf_url":"https://arxiv.org/pdf/2312.14125v4.pdf","comment":"To appear at ICML 2024; Project page:\n http://sites.research.google/videopoet/"},{"id":"http://arxiv.org/abs/2406.02506v1","updated":"2024-06-04T17:24:19Z","published":"2024-06-04T17:24:19Z","title":"An Open-Source Tool for Mapping War Destruction at Scale in Ukraine\n using Sentinel-1 Time Series","summary":" Access to detailed war impact assessments is crucial for humanitarian\norganizations to effectively assist populations most affected by armed\nconflicts. However, maintaining a comprehensive understanding of the situation\non the ground is challenging, especially in conflicts that cover vast\nterritories and extend over long periods. This study presents a scalable and\ntransferable method for estimating war-induced damage to buildings. We first\ntrain a machine learning model to output pixel-wise probability of destruction\nfrom Synthetic Aperture Radar (SAR) satellite image time series, leveraging\nexisting, manual damage assessments as ground truth and cloud-based geospatial\nanalysis tools for large-scale inference. We further post-process these\nassessments using open building footprints to obtain a final damage estimate\nper building. We introduce an accessible, open-source tool that allows users to\nadjust the confidence interval based on their specific requirements and use\ncases. Our approach enables humanitarian organizations and other actors to\nrapidly screen large geographic regions for war impacts. We provide two\npublicly accessible dashboards: a Ukraine Damage Explorer to dynamically view\nour pre-computed estimates, and a Rapid Damage Mapping Tool to easily run our\nmethod and produce custom maps.\n","authors":["Olivier Dietrich","Torben Peters","Vivien Sainte Fare Garnot","Valerie Sticher","Thao Ton-That Whelan","Konrad Schindler","Jan Dirk Wegner"],"pdf_url":"https://arxiv.org/pdf/2406.02506v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02495v1","updated":"2024-06-04T17:13:10Z","published":"2024-06-04T17:13:10Z","title":"GenS: Generalizable Neural Surface Reconstruction from Multi-View Images","summary":" Combining the signed distance function (SDF) and differentiable volume\nrendering has emerged as a powerful paradigm for surface reconstruction from\nmulti-view images without 3D supervision. However, current methods are impeded\nby requiring long-time per-scene optimizations and cannot generalize to new\nscenes. In this paper, we present GenS, an end-to-end generalizable neural\nsurface reconstruction model. Unlike coordinate-based methods that train a\nseparate network for each scene, we construct a generalized multi-scale volume\nto directly encode all scenes. Compared with existing solutions, our\nrepresentation is more powerful, which can recover high-frequency details while\nmaintaining global smoothness. Meanwhile, we introduce a multi-scale\nfeature-metric consistency to impose the multi-view consistency in a more\ndiscriminative multi-scale feature space, which is robust to the failures of\nthe photometric consistency. And the learnable feature can be self-enhanced to\ncontinuously improve the matching accuracy and mitigate aggregation ambiguity.\nFurthermore, we design a view contrast loss to force the model to be robust to\nthose regions covered by few viewpoints through distilling the geometric prior\nfrom dense input to sparse input. Extensive experiments on popular benchmarks\nshow that our model can generalize well to new scenes and outperform existing\nstate-of-the-art methods even those employing ground-truth depth supervision.\nCode is available at https://github.com/prstrive/GenS.\n","authors":["Rui Peng","Xiaodong Gu","Luyang Tang","Shihe Shen","Fanqi Yu","Ronggang Wang"],"pdf_url":"https://arxiv.org/pdf/2406.02495v1.pdf","comment":"NeurIPS 2023 Accepted"},{"id":"http://arxiv.org/abs/2403.07134v2","updated":"2024-06-04T16:57:16Z","published":"2024-03-11T20:04:03Z","title":"COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization","summary":" Post-training quantization (PTQ) has emerged as a practical approach to\ncompress large neural networks, making them highly efficient for deployment.\nHowever, effectively reducing these models to their low-bit counterparts\nwithout compromising the original accuracy remains a key challenge. In this\npaper, we propose an innovative PTQ algorithm termed COMQ, which sequentially\nconducts coordinate-wise minimization of the layer-wise reconstruction errors.\nWe consider the widely used integer quantization, where every quantized weight\ncan be decomposed into a shared floating-point scalar and an integer bit-code.\nWithin a fixed layer, COMQ treats all the scaling factor(s) and bit-codes as\nthe variables of the reconstruction error. Every iteration improves this error\nalong a single coordinate while keeping all other variables constant. COMQ is\neasy to use and requires no hyper-parameter tuning. It instead involves only\ndot products and rounding operations. We update these variables in a carefully\ndesigned greedy order, significantly enhancing the accuracy. COMQ achieves\nremarkable results in quantizing 4-bit Vision Transformers, with a negligible\nloss of less than 1% in Top-1 accuracy. In 4-bit INT quantization of\nconvolutional neural networks, COMQ maintains near-lossless accuracy with a\nminimal drop of merely 0.3% in Top-1 accuracy.\n","authors":["Aozhong Zhang","Zi Yang","Naigang Wang","Yingyong Qin","Jack Xin","Xin Li","Penghang Yin"],"pdf_url":"https://arxiv.org/pdf/2403.07134v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02485v1","updated":"2024-06-04T16:54:28Z","published":"2024-06-04T16:54:28Z","title":"Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image\n Generation","summary":" Controllable text-to-image (T2I) diffusion models have shown impressive\nperformance in generating high-quality visual content through the incorporation\nof various conditions. Current methods, however, exhibit limited performance\nwhen guided by skeleton human poses, especially in complex pose conditions such\nas side or rear perspectives of human figures. To address this issue, we\npresent Stable-Pose, a novel adapter model that introduces a coarse-to-fine\nattention masking strategy into a vision Transformer (ViT) to gain accurate\npose guidance for T2I models. Stable-Pose is designed to adeptly handle pose\nconditions within pre-trained Stable Diffusion, providing a refined and\nefficient way of aligning pose representation during image synthesis. We\nleverage the query-key self-attention mechanism of ViTs to explore the\ninterconnections among different anatomical parts in human pose skeletons.\nMasked pose images are used to smoothly refine the attention maps based on\ntarget pose-related features in a hierarchical manner, transitioning from\ncoarse to fine levels. Additionally, our loss function is formulated to\nallocate increased emphasis to the pose region, thereby augmenting the model's\nprecision in capturing intricate pose details. We assessed the performance of\nStable-Pose across five public datasets under a wide range of indoor and\noutdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the\nLAION-Human dataset, marking around 13% improvement over the established\ntechnique ControlNet. The project link and code is available at\nhttps://github.com/ai-med/StablePose.\n","authors":["Jiajun Wang","Morteza Ghahremani","Yitong Li","Björn Ommer","Christian Wachinger"],"pdf_url":"https://arxiv.org/pdf/2406.02485v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02477v1","updated":"2024-06-04T16:47:47Z","published":"2024-06-04T16:47:47Z","title":"Inpainting Pathology in Lumbar Spine MRI with Latent Diffusion","summary":" Data driven models for automated diagnosis in radiology suffer from\ninsufficient and imbalanced datasets due to low representation of pathology in\na population and the cost of expert annotations. Datasets can be bolstered\nthrough data augmentation. However, even when utilizing a full suite of\ntransformations during model training, typical data augmentations do not\naddress variations in human anatomy. An alternative direction is to synthesize\ndata using generative models, which can potentially craft datasets with\nspecific attributes. While this holds promise, commonly used generative models\nsuch as Generative Adversarial Networks may inadvertently produce anatomically\ninaccurate features. On the other hand, diffusion models, which offer greater\nstability, tend to memorize training data, raising concerns about privacy and\ngenerative diversity. Alternatively, inpainting has the potential to augment\ndata through directly inserting pathology in medical images. However, this\napproach introduces a new challenge: accurately merging the generated\npathological features with the surrounding anatomical context. While inpainting\nis a well established method for addressing simple lesions, its application to\npathologies that involve complex structural changes remains relatively\nunexplored. We propose an efficient method for inpainting pathological features\nonto healthy anatomy in MRI through voxelwise noise scheduling in a latent\ndiffusion model. We evaluate the method's ability to insert disc herniation and\ncentral canal stenosis in lumbar spine sagittal T2 MRI, and it achieves\nsuperior Frechet Inception Distance compared to state-of-the-art methods.\n","authors":["Colin Hansen","Simas Glinskis","Ashwin Raju","Micha Kornreich","JinHyeong Park","Jayashri Pawar","Richard Herzog","Li Zhang","Benjamin Odry"],"pdf_url":"https://arxiv.org/pdf/2406.02477v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02468v1","updated":"2024-06-04T16:38:06Z","published":"2024-06-04T16:38:06Z","title":"DL-KDD: Dual-Light Knowledge Distillation for Action Recognition in the\n Dark","summary":" Human action recognition in dark videos is a challenging task for computer\nvision. Recent research focuses on applying dark enhancement methods to improve\nthe visibility of the video. However, such video processing results in the loss\nof critical information in the original (un-enhanced) video. Conversely,\ntraditional two-stream methods are capable of learning information from both\noriginal and processed videos, but it can lead to a significant increase in the\ncomputational cost during the inference phase in the task of video\nclassification. To address these challenges, we propose a novel teacher-student\nvideo classification framework, named Dual-Light KnowleDge Distillation for\nAction Recognition in the Dark (DL-KDD). This framework enables the model to\nlearn from both original and enhanced video without introducing additional\ncomputational cost during inference. Specifically, DL-KDD utilizes the strategy\nof knowledge distillation during training. The teacher model is trained with\nenhanced video, and the student model is trained with both the original video\nand the soft target generated by the teacher model. This teacher-student\nframework allows the student model to predict action using only the original\ninput video during inference. In our experiments, the proposed DL-KDD framework\noutperforms state-of-the-art methods on the ARID, ARID V1.5, and Dark-48\ndatasets. We achieve the best performance on each dataset and up to a 4.18%\nimprovement on Dark-48, using only original video inputs, thus avoiding the use\nof two-stream framework or enhancement modules for inference. We further\nvalidate the effectiveness of the distillation strategy in ablative\nexperiments. The results highlight the advantages of our knowledge distillation\nframework in dark human action recognition.\n","authors":["Chi-Jui Chang","Oscar Tai-Yuan Chen","Vincent S. Tseng"],"pdf_url":"https://arxiv.org/pdf/2406.02468v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02465v1","updated":"2024-06-04T16:34:17Z","published":"2024-06-04T16:34:17Z","title":"An Empirical Study into Clustering of Unseen Datasets with\n Self-Supervised Encoders","summary":" Can pretrained models generalize to new datasets without any retraining? We\ndeploy pretrained image models on datasets they were not trained for, and\ninvestigate whether their embeddings form meaningful clusters. Our suite of\nbenchmarking experiments use encoders pretrained solely on ImageNet-1k with\neither supervised or self-supervised training techniques, deployed on image\ndatasets that were not seen during training, and clustered with conventional\nclustering algorithms. This evaluation provides new insights into the\nembeddings of self-supervised models, which prioritize different features to\nsupervised models. Supervised encoders typically offer more utility than SSL\nencoders within the training domain, and vice-versa far outside of it, however,\nfine-tuned encoders demonstrate the opposite trend. Clustering provides a way\nto evaluate the utility of self-supervised learned representations orthogonal\nto existing methods such as kNN. Additionally, we find the silhouette score\nwhen measured in a UMAP-reduced space is highly correlated with clustering\nperformance, and can therefore be used as a proxy for clustering performance on\ndata with no ground truth labels. Our code implementation is available at\n\\url{https://github.com/scottclowe/zs-ssl-clustering/}.\n","authors":["Scott C. Lowe","Joakim Bruslund Haurum","Sageev Oore","Thomas B. Moeslund","Graham W. Taylor"],"pdf_url":"https://arxiv.org/pdf/2406.02465v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02462v1","updated":"2024-06-04T16:30:37Z","published":"2024-06-04T16:30:37Z","title":"Learning Image Priors through Patch-based Diffusion Models for Solving\n Inverse Problems","summary":" Diffusion models can learn strong image priors from underlying data\ndistribution and use them to solve inverse problems, but the training process\nis computationally expensive and requires lots of data. Such bottlenecks\nprevent most existing works from being feasible for high-dimensional and\nhigh-resolution data such as 3D images. This paper proposes a method to learn\nan efficient data prior for the entire image by training diffusion models only\non patches of images. Specifically, we propose a patch-based position-aware\ndiffusion inverse solver, called PaDIS, where we obtain the score function of\nthe whole image through scores of patches and their positional encoding and\nutilize this as the prior for solving inverse problems. First of all, we show\nthat this diffusion model achieves an improved memory efficiency and data\nefficiency while still maintaining the capability to generate entire images via\npositional encoding. Additionally, the proposed PaDIS model is highly flexible\nand can be plugged in with different diffusion inverse solvers (DIS). We\ndemonstrate that the proposed PaDIS approach enables solving various inverse\nproblems in both natural and medical image domains, including CT\nreconstruction, deblurring, and superresolution, given only patch-based priors.\nNotably, PaDIS outperforms previous DIS methods trained on entire image priors\nin the case of limited training data, demonstrating the data efficiency of our\nproposed approach by learning patch-based prior.\n","authors":["Jason Hu","Bowen Song","Xiaojian Xu","Liyue Shen","Jeffrey A. Fessler"],"pdf_url":"https://arxiv.org/pdf/2406.02462v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02461v1","updated":"2024-06-04T16:27:09Z","published":"2024-06-04T16:27:09Z","title":"RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting","summary":" The advancement of diffusion models has pushed the boundary of text-to-3D\nobject generation. While it is straightforward to composite objects into a\nscene with reasonable geometry, it is nontrivial to texture such a scene\nperfectly due to style inconsistency and occlusions between objects. To tackle\nthese problems, we propose a coarse-to-fine 3D scene texturing framework,\nreferred to as RoomTex, to generate high-fidelity and style-consistent textures\nfor untextured compositional scene meshes. In the coarse stage, RoomTex first\nunwraps the scene mesh to a panoramic depth map and leverages ControlNet to\ngenerate a room panorama, which is regarded as the coarse reference to ensure\nthe global texture consistency. In the fine stage, based on the panoramic image\nand perspective depth maps, RoomTex will refine and texture every single object\nin the room iteratively along a series of selected camera views, until this\nobject is completely painted. Moreover, we propose to maintain superior\nalignment between RGB and depth spaces via subtle edge detection methods.\nExtensive experiments show our method is capable of generating high-quality and\ndiverse room textures, and more importantly, supporting interactive\nfine-grained texture control and flexible scene editing thanks to our\ninpainting-based framework and compositional mesh input. Our project page is\navailable at https://qwang666.github.io/RoomTex/.\n","authors":["Qi Wang","Ruijie Lu","Xudong Xu","Jingbo Wang","Michael Yu Wang","Bo Dai","Gang Zeng","Dan Xu"],"pdf_url":"https://arxiv.org/pdf/2406.02461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00687v2","updated":"2024-06-04T16:19:47Z","published":"2024-06-02T09:48:19Z","title":"Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image\n Priors","summary":" Generating 3D visual scenes is at the forefront of visual generative AI, but\ncurrent 3D generation techniques struggle with generating scenes with multiple\nhigh-resolution objects. Here we introduce Lay-A-Scene, which solves the task\nof Open-set 3D Object Arrangement, effectively arranging unseen objects. Given\na set of 3D objects, the task is to find a plausible arrangement of these\nobjects in a scene. We address this task by leveraging pre-trained\ntext-to-image models. We personalize the model and explain how to generate\nimages of a scene that contains multiple predefined objects without neglecting\nany of them. Then, we describe how to infer the 3D poses and arrangement of\nobjects from a 2D generated image by finding a consistent projection of objects\nonto the 2D scene. We evaluate the quality of Lay-A-Scene using 3D objects from\nObjaverse and human raters and find that it often generates coherent and\nfeasible 3D object arrangements.\n","authors":["Ohad Rahamim","Hilit Segev","Idan Achituve","Yuval Atzmon","Yoni Kasten","Gal Chechik"],"pdf_url":"https://arxiv.org/pdf/2406.00687v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00783v2","updated":"2024-06-04T16:08:07Z","published":"2024-06-02T15:51:33Z","title":"AI-Face: A Million-Scale Demographically Annotated AI-Generated Face\n Dataset and Fairness Benchmark","summary":" AI-generated faces have enriched human life, such as entertainment,\neducation, and art. However, they also pose misuse risks. Therefore, detecting\nAI-generated faces becomes crucial, yet current detectors show biased\nperformance across different demographic groups. Mitigating biases can be done\nby designing algorithmic fairness methods, which usually require\ndemographically annotated face datasets for model training. However, no\nexisting dataset comprehensively encompasses both demographic attributes and\ndiverse generative methods, which hinders the development of fair detectors for\nAI-generated faces. In this work, we introduce the AI-Face dataset, the first\nmillion-scale demographically annotated AI-generated face image dataset,\nincluding real faces, faces from deepfake videos, and faces generated by\nGenerative Adversarial Networks and Diffusion Models. Based on this dataset, we\nconduct the first comprehensive fairness benchmark to assess various AI face\ndetectors and provide valuable insights and findings to promote the future fair\ndesign of AI face detectors. Our AI-Face dataset and benchmark code are\npublicly available at https://github.com/Purdue-M2/AI-Face-FairnessBench.\n","authors":["Li Lin"," Santosh","Xin Wang","Shu Hu"],"pdf_url":"https://arxiv.org/pdf/2406.00783v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02435v1","updated":"2024-06-04T15:57:43Z","published":"2024-06-04T15:57:43Z","title":"Generative Active Learning for Long-tailed Instance Segmentation","summary":" Recently, large-scale language-image generative models have gained widespread\nattention and many works have utilized generated data from these models to\nfurther enhance the performance of perception tasks. However, not all generated\ndata can positively impact downstream models, and these methods do not\nthoroughly explore how to better select and utilize generated data. On the\nother hand, there is still a lack of research oriented towards active learning\non generated data. In this paper, we explore how to perform active learning\nspecifically for generated data in the long-tailed instance segmentation task.\nSubsequently, we propose BSGAL, a new algorithm that online estimates the\ncontribution of the generated data based on gradient cache. BSGAL can handle\nunlimited generated data and complex downstream segmentation tasks effectively.\nExperiments show that BSGAL outperforms the baseline approach and effectually\nimproves the performance of long-tailed segmentation. Our code can be found at\nhttps://github.com/aim-uofa/DiverGen.\n","authors":["Muzhi Zhu","Chengxiang Fan","Hao Chen","Yang Liu","Weian Mao","Xiaogang Xu","Chunhua Shen"],"pdf_url":"https://arxiv.org/pdf/2406.02435v1.pdf","comment":"Accepted by ICML 2024"},{"id":"http://arxiv.org/abs/2406.02425v1","updated":"2024-06-04T15:44:25Z","published":"2024-06-04T15:44:25Z","title":"CoNav: A Benchmark for Human-Centered Collaborative Navigation","summary":" Human-robot collaboration, in which the robot intelligently assists the human\nwith the upcoming task, is an appealing objective. To achieve this goal, the\nagent needs to be equipped with a fundamental collaborative navigation ability,\nwhere the agent should reason human intention by observing human activities and\nthen navigate to the human's intended destination in advance of the human.\nHowever, this vital ability has not been well studied in previous literature.\nTo fill this gap, we propose a collaborative navigation (CoNav) benchmark. Our\nCoNav tackles the critical challenge of constructing a 3D navigation\nenvironment with realistic and diverse human activities. To achieve this, we\ndesign a novel LLM-based humanoid animation generation framework, which is\nconditioned on both text descriptions and environmental context. The generated\nhumanoid trajectory obeys the environmental context and can be easily\nintegrated into popular simulators. We empirically find that the existing\nnavigation methods struggle in CoNav task since they neglect the perception of\nhuman intention. To solve this problem, we propose an intention-aware agent for\nreasoning both long-term and short-term human intention. The agent predicts\nnavigation action based on the predicted intention and panoramic observation.\nThe emergent agent behavior including observing humans, avoiding human\ncollision, and navigation reveals the efficiency of the proposed datasets and\nagents.\n","authors":["Changhao Li","Xinyu Sun","Peihao Chen","Jugang Fan","Zixu Wang","Yanxia Liu","Jinhui Zhu","Chuang Gan","Mingkui Tan"],"pdf_url":"https://arxiv.org/pdf/2406.02425v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02422v1","updated":"2024-06-04T15:39:49Z","published":"2024-06-04T15:39:49Z","title":"IterMask2: Iterative Unsupervised Anomaly Segmentation via Spatial and\n Frequency Masking for Brain Lesions in MRI","summary":" Unsupervised anomaly segmentation approaches to pathology segmentation train\na model on images of healthy subjects, that they define as the 'normal' data\ndistribution. At inference, they aim to segment any pathologies in new images\nas 'anomalies', as they exhibit patterns that deviate from those in 'normal'\ntraining data. Prevailing methods follow the 'corrupt-and-reconstruct'\nparadigm. They intentionally corrupt an input image, reconstruct it to follow\nthe learned 'normal' distribution, and subsequently segment anomalies based on\nreconstruction error. Corrupting an input image, however, inevitably leads to\nsuboptimal reconstruction even of normal regions, causing false positives. To\nalleviate this, we propose a novel iterative spatial mask-refining strategy\nIterMask2. We iteratively mask areas of the image, reconstruct them, and update\nthe mask based on reconstruction error. This iterative process progressively\nadds information about areas that are confidently normal as per the model. The\nincreasing content guides reconstruction of nearby masked areas, improving\nreconstruction of normal tissue under these areas, reducing false positives. We\nalso use high-frequency image content as an auxiliary input to provide\nadditional structural information for masked areas. This further improves\nreconstruction error of normal in comparison to anomalous areas, facilitating\nsegmentation of the latter. We conduct experiments on several brain lesion\ndatasets and demonstrate effectiveness of our method. Code is available at:\nhttps://github.com/ZiyunLiang/IterMasks2\n","authors":["Ziyun Liang","Xiaoqing Guo","J. Alison Noble","Konstantinos Kamnitsas"],"pdf_url":"https://arxiv.org/pdf/2406.02422v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.13534v2","updated":"2024-06-04T15:23:51Z","published":"2024-04-21T05:09:56Z","title":"Motion-aware Latent Diffusion Models for Video Frame Interpolation","summary":" With the advancement of AIGC, video frame interpolation (VFI) has become a\ncrucial component in existing video generation frameworks, attracting\nwidespread research interest. For the VFI task, the motion estimation between\nneighboring frames plays a crucial role in avoiding motion ambiguity. However,\nexisting VFI methods always struggle to accurately predict the motion\ninformation between consecutive frames, and this imprecise estimation leads to\nblurred and visually incoherent interpolated frames. In this paper, we propose\na novel diffusion framework, motion-aware latent diffusion models (MADiff),\nwhich is specifically designed for the VFI task. By incorporating motion priors\nbetween the conditional neighboring frames with the target interpolated frame\npredicted throughout the diffusion sampling procedure, MADiff progressively\nrefines the intermediate outcomes, culminating in generating both visually\nsmooth and realistic results. Extensive experiments conducted on benchmark\ndatasets demonstrate that our method achieves state-of-the-art performance\nsignificantly outperforming existing approaches, especially under challenging\nscenarios involving dynamic textures with complex motion.\n","authors":["Zhilin Huang","Yijie Yu","Ling Yang","Chujun Qin","Bing Zheng","Xiawu Zheng","Zikun Zhou","Yaowei Wang","Wenming Yang"],"pdf_url":"https://arxiv.org/pdf/2404.13534v2.pdf","comment":"17 pages, 4 figures. arXiv admin note: substantial text overlap with\n arXiv:2303.09508 by other authors"},{"id":"http://arxiv.org/abs/2406.02411v1","updated":"2024-06-04T15:21:37Z","published":"2024-06-04T15:21:37Z","title":"Decoupling of neural network calibration measures","summary":" A lot of effort is currently invested in safeguarding autonomous driving\nsystems, which heavily rely on deep neural networks for computer vision. We\ninvestigate the coupling of different neural network calibration measures with\na special focus on the Area Under the Sparsification Error curve (AUSE) metric.\nWe elaborate on the well-known inconsistency in determining optimal calibration\nusing the Expected Calibration Error (ECE) and we demonstrate similar issues\nfor the AUSE, the Uncertainty Calibration Score (UCS), as well as the\nUncertainty Calibration Error (UCE). We conclude that the current methodologies\nleave a degree of freedom, which prevents a unique model calibration for the\nhomologation of safety-critical functionalities. Furthermore, we propose the\nAUSE as an indirect measure for the residual uncertainty, which is irreducible\nfor a fixed network architecture and is driven by the stochasticity in the\nunderlying data generation process (aleatoric contribution) as well as the\nlimitation in the hypothesis space (epistemic contribution).\n","authors":["Dominik Werner Wolf","Prasannavenkatesh Balaji","Alexander Braun","Markus Ulrich"],"pdf_url":"https://arxiv.org/pdf/2406.02411v1.pdf","comment":"Submitted to the German Conference on Pattern Recognition (GCPR) 2024"},{"id":"http://arxiv.org/abs/2406.02407v1","updated":"2024-06-04T15:17:37Z","published":"2024-06-04T15:17:37Z","title":"WE-GS: An In-the-wild Efficient 3D Gaussian Representation for\n Unconstrained Photo Collections","summary":" Novel View Synthesis (NVS) from unconstrained photo collections is\nchallenging in computer graphics. Recently, 3D Gaussian Splatting (3DGS) has\nshown promise for photorealistic and real-time NVS of static scenes. Building\non 3DGS, we propose an efficient point-based differentiable rendering framework\nfor scene reconstruction from photo collections. Our key innovation is a\nresidual-based spherical harmonic coefficients transfer module that adapts 3DGS\nto varying lighting conditions and photometric post-processing. This\nlightweight module can be pre-computed and ensures efficient gradient\npropagation from rendered images to 3D Gaussian attributes. Additionally, we\nobserve that the appearance encoder and the transient mask predictor, the two\nmost critical parts of NVS from unconstrained photo collections, can be\nmutually beneficial. We introduce a plug-and-play lightweight spatial attention\nmodule to simultaneously predict transient occluders and latent appearance\nrepresentation for each image. After training and preprocessing, our method\naligns with the standard 3DGS format and rendering pipeline, facilitating\nseamlessly integration into various 3DGS applications. Extensive experiments on\ndiverse datasets show our approach outperforms existing approaches on the\nrendering quality of novel view and appearance synthesis with high converge and\nrendering speed.\n","authors":["Yuze Wang","Junyi Wang","Yue Qi"],"pdf_url":"https://arxiv.org/pdf/2406.02407v1.pdf","comment":"Our project page is available at\n https://yuzewang1998.github.io/we-gs.github.io/"},{"id":"http://arxiv.org/abs/2406.02395v1","updated":"2024-06-04T15:09:29Z","published":"2024-06-04T15:09:29Z","title":"GrootVL: Tree Topology is All You Need in State Space Model","summary":" The state space models, employing recursively propagated features,\ndemonstrate strong representation capabilities comparable to Transformer models\nand superior efficiency. However, constrained by the inherent geometric\nconstraints of sequences, it still falls short in modeling long-range\ndependencies. To address this issue, we propose the GrootVL network, which\nfirst dynamically generates a tree topology based on spatial relationships and\ninput features. Then, feature propagation is performed based on this graph,\nthereby breaking the original sequence constraints to achieve stronger\nrepresentation capabilities. Additionally, we introduce a linear complexity\ndynamic programming algorithm to enhance long-range interactions without\nincreasing computational cost. GrootVL is a versatile multimodal framework that\ncan be applied to both visual and textual tasks. Extensive experiments\ndemonstrate that our method significantly outperforms existing structured state\nspace models on image classification, object detection and segmentation.\nBesides, by fine-tuning large language models, our approach achieves consistent\nimprovements in multiple textual tasks at minor training cost.\n","authors":["Yicheng Xiao","Lin Song","Shaoli Huang","Jiangshan Wang","Siyu Song","Yixiao Ge","Xiu Li","Ying Shan"],"pdf_url":"https://arxiv.org/pdf/2406.02395v1.pdf","comment":"The code is available at https://github.com/EasonXiao-888/GrootVL"},{"id":"http://arxiv.org/abs/2406.02385v1","updated":"2024-06-04T15:00:49Z","published":"2024-06-04T15:00:49Z","title":"Low-Rank Adaption on Transformer-based Oriented Object Detector for\n Satellite Onboard Processing of Remote Sensing Images","summary":" Deep learning models in satellite onboard enable real-time interpretation of\nremote sensing images, reducing the need for data transmission to the ground\nand conserving communication resources. As satellite numbers and observation\nfrequencies increase, the demand for satellite onboard real-time image\ninterpretation grows, highlighting the expanding importance and development of\nthis technology. However, updating the extensive parameters of models deployed\non the satellites for spaceborne object detection model is challenging due to\nthe limitations of uplink bandwidth in wireless satellite communications. To\naddress this issue, this paper proposes a method based on parameter-efficient\nfine-tuning technology with low-rank adaptation (LoRA) module. It involves\ntraining low-rank matrix parameters and integrating them with the original\nmodel's weight matrix through multiplication and summation, thereby fine-tuning\nthe model parameters to adapt to new data distributions with minimal weight\nupdates. The proposed method combines parameter-efficient fine-tuning with full\nfine-tuning in the parameter update strategy of the oriented object detection\nalgorithm architecture. This strategy enables model performance improvements\nclose to full fine-tuning effects with minimal parameter updates. In addition,\nlow rank approximation is conducted to pick an optimal rank value for LoRA\nmatrices. Extensive experiments verify the effectiveness of the proposed\nmethod. By fine-tuning and updating only 12.4$\\%$ of the model's total\nparameters, it is able to achieve 97$\\%$ to 100$\\%$ of the performance of full\nfine-tuning models. Additionally, the reduced number of trainable parameters\naccelerates model training iterations and enhances the generalization and\nrobustness of the oriented object detection model. The source code is available\nat: \\url{https://github.com/fudanxu/LoRA-Det}.\n","authors":["Xinyang Pu","Feng Xu"],"pdf_url":"https://arxiv.org/pdf/2406.02385v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02383v1","updated":"2024-06-04T14:59:38Z","published":"2024-06-04T14:59:38Z","title":"Learning to Edit Visual Programs with Self-Supervision","summary":" We design a system that learns how to edit visual programs. Our edit network\nconsumes a complete input program and a visual target. From this input, we task\nour network with predicting a local edit operation that could be applied to the\ninput program to improve its similarity to the target. In order to apply this\nscheme for domains that lack program annotations, we develop a self-supervised\nlearning approach that integrates this edit network into a bootstrapped\nfinetuning loop along with a network that predicts entire programs in one-shot.\nOur joint finetuning scheme, when coupled with an inference procedure that\ninitializes a population from the one-shot model and evolves members of this\npopulation with the edit network, helps to infer more accurate visual programs.\nOver multiple domains, we experimentally compare our method against the\nalternative of using only the one-shot model, and find that even under equal\nsearch-time budgets, our editing-based paradigm provides significant\nadvantages.\n","authors":["R. Kenny Jones","Renhao Zhang","Aditya Ganeshan","Daniel Ritchie"],"pdf_url":"https://arxiv.org/pdf/2406.02383v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02380v1","updated":"2024-06-04T14:57:56Z","published":"2024-06-04T14:57:56Z","title":"EUFCC-340K: A Faceted Hierarchical Dataset for Metadata Annotation in\n GLAM Collections","summary":" In this paper, we address the challenges of automatic metadata annotation in\nthe domain of Galleries, Libraries, Archives, and Museums (GLAMs) by\nintroducing a novel dataset, EUFCC340K, collected from the Europeana portal.\nComprising over 340,000 images, the EUFCC340K dataset is organized across\nmultiple facets: Materials, Object Types, Disciplines, and Subjects, following\na hierarchical structure based on the Art & Architecture Thesaurus (AAT). We\ndeveloped several baseline models, incorporating multiple heads on a ConvNeXT\nbackbone for multi-label image tagging on these facets, and fine-tuning a CLIP\nmodel with our image text pairs. Our experiments to evaluate model robustness\nand generalization capabilities in two different test scenarios demonstrate the\nutility of the dataset in improving multi-label classification tools that have\nthe potential to alleviate cataloging tasks in the cultural heritage sector.\n","authors":["Francesc Net","Marc Folia","Pep Casals","Andrew D. Bagdanov","Lluis Gomez"],"pdf_url":"https://arxiv.org/pdf/2406.02380v1.pdf","comment":"23 pages, 13 figures"},{"id":"http://arxiv.org/abs/2406.00609v2","updated":"2024-06-04T14:47:45Z","published":"2024-06-02T03:44:50Z","title":"SuperGaussian: Repurposing Video Models for 3D Super Resolution","summary":" We present a simple, modular, and generic method that upsamples coarse 3D\nmodels by adding geometric and appearance details. While generative 3D models\nnow exist, they do not yet match the quality of their counterparts in image and\nvideo domains. We demonstrate that it is possible to directly repurpose\nexisting (pretrained) video models for 3D super-resolution and thus sidestep\nthe problem of the shortage of large repositories of high-quality 3D training\nmodels. We describe how to repurpose video upsampling models, which are not 3D\nconsistent, and combine them with 3D consolidation to produce 3D-consistent\nresults. As output, we produce high quality Gaussian Splat models, which are\nobject centric and effective. Our method is category agnostic and can be easily\nincorporated into existing 3D workflows. We evaluate our proposed SuperGaussian\non a variety of 3D inputs, which are diverse both in terms of complexity and\nrepresentation (e.g., Gaussian Splats or NeRFs), and demonstrate that our\nsimple method significantly improves the fidelity of the final 3D models. Check\nour project website for details: supergaussian.github.io\n","authors":["Yuan Shen","Duygu Ceylan","Paul Guerrero","Zexiang Xu","Niloy J. Mitra","Shenlong Wang","Anna Frühstück"],"pdf_url":"https://arxiv.org/pdf/2406.00609v2.pdf","comment":"Check our project website for details:\n https://supergaussian.github.io"},{"id":"http://arxiv.org/abs/2406.02355v1","updated":"2024-06-04T14:34:13Z","published":"2024-06-04T14:34:13Z","title":"FedDr+: Stabilizing Dot-regression with Global Feature Distillation for\n Federated Learning","summary":" Federated Learning (FL) has emerged as a pivotal framework for the\ndevelopment of effective global models (global FL) or personalized models\n(personalized FL) across clients with heterogeneous, non-iid data distribution.\nA key challenge in FL is client drift, where data heterogeneity impedes the\naggregation of scattered knowledge. Recent studies have tackled the client\ndrift issue by identifying significant divergence in the last classifier layer.\nTo mitigate this divergence, strategies such as freezing the classifier weights\nand aligning the feature extractor accordingly have proven effective. Although\nthe local alignment between classifier and feature extractor has been studied\nas a crucial factor in FL, we observe that it may lead the model to\noveremphasize the observed classes within each client. Thus, our objectives are\ntwofold: (1) enhancing local alignment while (2) preserving the representation\nof unseen class samples. This approach aims to effectively integrate knowledge\nfrom individual clients, thereby improving performance for both global and\npersonalized FL. To achieve this, we introduce a novel algorithm named FedDr+,\nwhich empowers local model alignment using dot-regression loss. FedDr+ freezes\nthe classifier as a simplex ETF to align the features and improves aggregated\nglobal models by employing a feature distillation mechanism to retain\ninformation about unseen/missing classes. Consequently, we provide empirical\nevidence demonstrating that our algorithm surpasses existing methods that use a\nfrozen classifier to boost alignment across the diverse distribution.\n","authors":["Seongyoon Kim","Minchan Jeong","Sungnyun Kim","Sungwoo Cho","Sumyeong Ahn","Se-Young Yun"],"pdf_url":"https://arxiv.org/pdf/2406.02355v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00210v2","updated":"2024-06-04T14:26:06Z","published":"2024-05-31T21:47:05Z","title":"A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature\n Inheritance Strategies","summary":" The Stable Diffusion Model (SDM) is a prevalent and effective model for\ntext-to-image (T2I) and image-to-image (I2I) generation. Despite various\nattempts at sampler optimization, model distillation, and network\nquantification, these approaches typically maintain the original network\narchitecture. The extensive parameter scale and substantial computational\ndemands have limited research into adjusting the model architecture. This study\nfocuses on reducing redundant computation in SDM and optimizes the model\nthrough both tuning and tuning-free methods. 1) For the tuning method, we\ndesign a model assembly strategy to reconstruct a lightweight model while\npreserving performance through distillation. Second, to mitigate performance\nloss due to pruning, we incorporate multi-expert conditional convolution\n(ME-CondConv) into compressed UNets to enhance network performance by\nincreasing capacity without sacrificing speed. Third, we validate the\neffectiveness of the multi-UNet switching method for improving network speed.\n2) For the tuning-free method, we propose a feature inheritance strategy to\naccelerate inference by skipping local computations at the block, layer, or\nunit level within the network structure. We also examine multiple sampling\nmodes for feature inheritance at the time-step level. Experiments demonstrate\nthat both the proposed tuning and the tuning-free methods can improve the speed\nand performance of the SDM. The lightweight model reconstructed by the model\nassembly strategy increases generation speed by $22.4%$, while the feature\ninheritance strategy enhances the SDM generation speed by $40.0%$.\n","authors":["Jinchao Zhu","Yuxuan Wang","Siyuan Pan","Pengfei Wan","Di Zhang","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2406.00210v2.pdf","comment":"19 pages, 16 figures, submitted to IEEE Transactions on Neural\n Networks and Learning Systems"},{"id":"http://arxiv.org/abs/2406.02349v1","updated":"2024-06-04T14:24:35Z","published":"2024-06-04T14:24:35Z","title":"CADE: Cosine Annealing Differential Evolution for Spiking Neural Network","summary":" Spiking neural networks (SNNs) have gained prominence for their potential in\nneuromorphic computing and energy-efficient artificial intelligence, yet\noptimizing them remains a formidable challenge for gradient-based methods due\nto their discrete, spike-based computation. This paper attempts to tackle the\nchallenges by introducing Cosine Annealing Differential Evolution (CADE),\ndesigned to modulate the mutation factor (F) and crossover rate (CR) of\ndifferential evolution (DE) for the SNN model, i.e., Spiking Element Wise (SEW)\nResNet. Extensive empirical evaluations were conducted to analyze CADE. CADE\nshowed a balance in exploring and exploiting the search space, resulting in\naccelerated convergence and improved accuracy compared to existing\ngradient-based and DE-based methods. Moreover, an initialization method based\non a transfer learning setting was developed, pretraining on a source dataset\n(i.e., CIFAR-10) and fine-tuning the target dataset (i.e., CIFAR-100), to\nimprove population diversity. It was found to further enhance CADE for SNN.\nRemarkably, CADE elevates the performance of the highest accuracy SEW model by\nan additional 0.52 percentage points, underscoring its effectiveness in\nfine-tuning and enhancing SNNs. These findings emphasize the pivotal role of a\nscheduler for F and CR adjustment, especially for DE-based SNN. Source Code on\nGithub: https://github.com/Tank-Jiang/CADE4SNN.\n","authors":["Runhua Jiang","Guodong Du","Shuyang Yu","Yifei Guo","Sim Kuan Goh","Ho-Kin Tang"],"pdf_url":"https://arxiv.org/pdf/2406.02349v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02347v1","updated":"2024-06-04T14:23:27Z","published":"2024-06-04T14:23:27Z","title":"Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few\n Steps Image Generation","summary":" In this paper, we propose an efficient, fast, and versatile distillation\nmethod to accelerate the generation of pre-trained diffusion models: Flash\nDiffusion. The method reaches state-of-the-art performances in terms of FID and\nCLIP-Score for few steps image generation on the COCO2014 and COCO2017\ndatasets, while requiring only several GPU hours of training and fewer\ntrainable parameters than existing methods. In addition to its efficiency, the\nversatility of the method is also exposed across several tasks such as\ntext-to-image, inpainting, face-swapping, super-resolution and using different\nbackbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-$\\alpha$),\nas well as adapters. In all cases, the method allowed to reduce drastically the\nnumber of sampling steps while maintaining very high-quality image generation.\nThe official implementation is available at\nhttps://github.com/gojasper/flash-diffusion.\n","authors":["Clement Chadebec","Onur Tasar","Eyal Benaroche","Benjamin Aubin"],"pdf_url":"https://arxiv.org/pdf/2406.02347v1.pdf","comment":"16 pages + 16 pages appendices"},{"id":"http://arxiv.org/abs/2406.02345v1","updated":"2024-06-04T14:21:41Z","published":"2024-06-04T14:21:41Z","title":"Progressive Confident Masking Attention Network for Audio-Visual\n Segmentation","summary":" Audio and visual signals typically occur simultaneously, and humans possess\nan innate ability to correlate and synchronize information from these two\nmodalities. Recently, a challenging problem known as Audio-Visual Segmentation\n(AVS) has emerged, intending to produce segmentation maps for sounding objects\nwithin a scene. However, the methods proposed so far have not sufficiently\nintegrated audio and visual information, and the computational costs have been\nextremely high. Additionally, the outputs of different stages have not been\nfully utilized. To facilitate this research, we introduce a novel Progressive\nConfident Masking Attention Network (PMCANet). It leverages attention\nmechanisms to uncover the intrinsic correlations between audio signals and\nvisual frames. Furthermore, we design an efficient and effective\ncross-attention module to enhance semantic perception by selecting query\ntokens. This selection is determined through confidence-driven units based on\nthe network's multi-stage predictive outputs. Experiments demonstrate that our\nnetwork outperforms other AVS methods while requiring less computational\nresources.\n","authors":["Yuxuan Wang","Feng Dong","Jinchao Zhu"],"pdf_url":"https://arxiv.org/pdf/2406.02345v1.pdf","comment":"10 pages, 9 figures, submitted to IEEE TRANSACTIONS ON CIRCUITS AND\n SYSTEMS FOR VIDEO TECHNOLOGY"},{"id":"http://arxiv.org/abs/2406.02343v1","updated":"2024-06-04T14:19:50Z","published":"2024-06-04T14:19:50Z","title":"Cluster-Aware Similarity Diffusion for Instance Retrieval","summary":" Diffusion-based re-ranking is a common method used for retrieving instances\nby performing similarity propagation in a nearest neighbor graph. However,\nexisting techniques that construct the affinity graph based on pairwise\ninstances can lead to the propagation of misinformation from outliers and other\nmanifolds, resulting in inaccurate results. To overcome this issue, we propose\na novel Cluster-Aware Similarity (CAS) diffusion for instance retrieval. The\nprimary concept of CAS is to conduct similarity diffusion within local\nclusters, which can reduce the influence from other manifolds explicitly. To\nobtain a symmetrical and smooth similarity matrix, our Bidirectional Similarity\nDiffusion strategy introduces an inverse constraint term to the optimization\nobjective of local cluster diffusion. Additionally, we have optimized a\nNeighbor-guided Similarity Smoothing approach to ensure similarity consistency\namong the local neighbors of each instance. Evaluations in instance retrieval\nand object re-identification validate the effectiveness of the proposed CAS,\nour code is publicly available.\n","authors":["Jifei Luo","Hantao Yao","Changsheng Xu"],"pdf_url":"https://arxiv.org/pdf/2406.02343v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02327v1","updated":"2024-06-04T13:57:34Z","published":"2024-06-04T13:57:34Z","title":"Continual Unsupervised Out-of-Distribution Detection","summary":" Deep learning models excel when the data distribution during training aligns\nwith testing data. Yet, their performance diminishes when faced with\nout-of-distribution (OOD) samples, leading to great interest in the field of\nOOD detection. Current approaches typically assume that OOD samples originate\nfrom an unconcentrated distribution complementary to the training distribution.\nWhile this assumption is appropriate in the traditional unsupervised OOD\n(U-OOD) setting, it proves inadequate when considering the place of deployment\nof the underlying deep learning model. To better reflect this real-world\nscenario, we introduce the novel setting of continual U-OOD detection. To\ntackle this new setting, we propose a method that starts from a U-OOD detector,\nwhich is agnostic to the OOD distribution, and slowly updates during deployment\nto account for the actual OOD distribution. Our method uses a new U-OOD scoring\nfunction that combines the Mahalanobis distance with a nearest-neighbor\napproach. Furthermore, we design a confidence-scaled few-shot OOD detector that\noutperforms previous methods. We show our method greatly improves upon strong\nbaselines from related fields.\n","authors":["Lars Doorenbos","Raphael Sznitman","Pablo Márquez-Neila"],"pdf_url":"https://arxiv.org/pdf/2406.02327v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.10820v2","updated":"2024-06-04T13:15:16Z","published":"2024-03-16T06:10:22Z","title":"Active Label Correction for Semantic Segmentation with Foundation Models","summary":" Training and validating models for semantic segmentation require datasets\nwith pixel-wise annotations, which are notoriously labor-intensive. Although\nuseful priors such as foundation models or crowdsourced datasets are available,\nthey are error-prone. We hence propose an effective framework of active label\ncorrection (ALC) based on a design of correction query to rectify pseudo labels\nof pixels, which in turn is more annotator-friendly than the standard one\ninquiring to classify a pixel directly according to our theoretical analysis\nand user study. Specifically, leveraging foundation models providing useful\nzero-shot predictions on pseudo labels and superpixels, our method comprises\ntwo key techniques: (i) an annotator-friendly design of correction query with\nthe pseudo labels, and (ii) an acquisition function looking ahead label\nexpansions based on the superpixels. Experimental results on PASCAL,\nCityscapes, and Kvasir-SEG datasets demonstrate the effectiveness of our ALC\nframework, outperforming prior methods for active semantic segmentation and\nlabel correction. Notably, utilizing our method, we obtained a revised dataset\nof PASCAL by rectifying errors in 2.6 million pixels in PASCAL dataset.\n","authors":["Hoyoung Kim","Sehyun Hwang","Suha Kwak","Jungseul Ok"],"pdf_url":"https://arxiv.org/pdf/2403.10820v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02287v1","updated":"2024-06-04T13:00:22Z","published":"2024-06-04T13:00:22Z","title":"Optimised ProPainter for Video Diminished Reality Inpainting","summary":" In this paper, part of the DREAMING Challenge - Diminished Reality for\nEmerging Applications in Medicine through Inpainting, we introduce a refined\nvideo inpainting technique optimised from the ProPainter method to meet the\nspecialised demands of medical imaging, specifically in the context of oral and\nmaxillofacial surgery. Our enhanced algorithm employs the zero-shot ProPainter,\nfeaturing optimized parameters and pre-processing, to adeptly manage the\ncomplex task of inpainting surgical video sequences, without requiring any\ntraining process. It aims to produce temporally coherent and detail-rich\nreconstructions of occluded regions, facilitating clearer views of operative\nfields. The efficacy of our approach is evaluated using comprehensive metrics,\npositioning it as a significant advancement in the application of diminished\nreality for medical purposes.\n","authors":["Pengze Li","Lihao Liu","Carola-Bibiane Schönlieb","Angelica I Aviles-Rivero"],"pdf_url":"https://arxiv.org/pdf/2406.02287v1.pdf","comment":"Accepted to ISBI 2024"},{"id":"http://arxiv.org/abs/2406.02265v1","updated":"2024-06-04T12:41:54Z","published":"2024-06-04T12:41:54Z","title":"Understanding Retrieval Robustness for Retrieval-Augmented Image\n Captioning","summary":" Recent advancements in retrieval-augmented models for image captioning\nhighlight the significance of retrieving related captions for efficient,\nlightweight models with strong domain-transfer capabilities. While these models\ndemonstrate the success of retrieval augmentation, retrieval models are still\nfar from perfect in practice. Retrieved information can sometimes mislead the\nmodel generation, negatively impacting performance. In this paper, we analyze\nthe robustness of the SmallCap retrieval-augmented captioning model. Our\nanalysis shows that SmallCap is sensitive to tokens that appear in the majority\nof the retrieved captions, and integrated gradients attribution shows that\nthose tokens are likely copied into the final caption. Given these findings, we\npropose to train the model by sampling retrieved captions from more diverse\nsets. This reduces the probability that the model learns to copy majority\ntokens and improves both in-domain and cross-domain performance effectively.\n","authors":["Wenyan Li","Jiaang Li","Rita Ramos","Raphael Tang","Desmond Elliott"],"pdf_url":"https://arxiv.org/pdf/2406.02265v1.pdf","comment":"9 pages, long paper at ACL 2024"},{"id":"http://arxiv.org/abs/2406.02264v1","updated":"2024-06-04T12:37:11Z","published":"2024-06-04T12:37:11Z","title":"Image contrast enhancement based on the Schrödinger operator spectrum","summary":" This study proposes a novel image contrast enhancement method based on image\nprojection onto the squared eigenfunctions of the two dimensional Schr\\\"odinger\noperator. This projection depends on a design parameter\n\\texorpdfstring{\\(\\gamma\\)}{gamma} which is proposed to control the pixel\nintensity during image reconstruction. The performance of the proposed method\nis investigated through its application to color images. The selection of\n\\texorpdfstring{\\(\\gamma\\)}{gamma} values is performed using k-means, which\nhelps preserve the image spatial adjacency information. Furthermore,\nmulti-objective optimization using the Non dominated Sorting Genetic Algorithm\nII (NSAG2) algorithm is proposed to select the optimal values of\n\\texorpdfstring{\\(\\gamma\\)}{gamma} and the semi-classical parameter h from the\n2DSCSA. The results demonstrate the effectiveness of the proposed method for\nenhancing image contrast while preserving the inherent characteristics of the\noriginal image, producing the desired enhancement with almost no artifacts.\n","authors":["Juan M. Vargas","Taous-Meriem Laleg-Kirati"],"pdf_url":"https://arxiv.org/pdf/2406.02264v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02263v1","updated":"2024-06-04T12:33:02Z","published":"2024-06-04T12:33:02Z","title":"M3DM-NR: RGB-3D Noisy-Resistant Industrial Anomaly Detection via\n Multimodal Denoising","summary":" Existing industrial anomaly detection methods primarily concentrate on\nunsupervised learning with pristine RGB images. Yet, both RGB and 3D data are\ncrucial for anomaly detection, and the datasets are seldom completely clean in\npractical scenarios. To address above challenges, this paper initially delves\ninto the RGB-3D multi-modal noisy anomaly detection, proposing a novel\nnoise-resistant M3DM-NR framework to leveraging strong multi-modal\ndiscriminative capabilities of CLIP. M3DM-NR consists of three stages: Stage-I\nintroduces the Suspected References Selection module to filter a few normal\nsamples from the training dataset, using the multimodal features extracted by\nthe Initial Feature Extraction, and a Suspected Anomaly Map Computation module\nto generate a suspected anomaly map to focus on abnormal regions as reference.\nStage-II uses the suspected anomaly maps of the reference samples as reference,\nand inputs image, point cloud, and text information to achieve denoising of the\ntraining samples through intra-modal comparison and multi-scale aggregation\noperations. Finally, Stage-III proposes the Point Feature Alignment,\nUnsupervised Feature Fusion, Noise Discriminative Coreset Selection, and\nDecision Layer Fusion modules to learn the pattern of the training dataset,\nenabling anomaly detection and segmentation while filtering out noise.\nExtensive experiments show that M3DM-NR outperforms state-of-the-art methods in\n3D-RGB multi-modal noisy anomaly detection.\n","authors":["Chengjie Wang","Haokun Zhu","Jinlong Peng","Yue Wang","Ran Yi","Yunsheng Wu","Lizhuang Ma","Jiangning Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.02263v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.08078v5","updated":"2024-06-04T12:27:38Z","published":"2023-12-13T11:47:28Z","title":"Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable\n Cyclic Image-Report Generation","summary":" To address these issues, we propose a novel Adaptive patch-word Matching\n(AdaMatch) model to correlate chest X-ray (CXR) image regions with words in\nmedical reports and apply it to CXR-report generation to provide explainability\nfor the generation process. AdaMatch exploits the fine-grained relation between\nadaptive patches and words to provide explanations of specific image regions\nwith corresponding words. To capture the abnormal regions of varying sizes and\npositions, we introduce the Adaptive Patch extraction (AdaPatch) module to\nacquire the adaptive patches for these regions adaptively. In order to provide\nexplicit explainability for CXR-report generation task, we propose an\nAdaMatch-based bidirectional large language model for Cyclic CXR-report\ngeneration (AdaMatch-Cyclic). It employs the AdaMatch to obtain the keywords\nfor CXR images and `keypatches' for medical reports as hints to guide\nCXR-report generation. Extensive experiments on two publicly available CXR\ndatasets prove the effectiveness of our method and its superior performance to\nexisting methods.\n","authors":["Wenting Chen","Linlin Shen","Jingyang Lin","Jiebo Luo","Xiang Li","Yixuan Yuan"],"pdf_url":"https://arxiv.org/pdf/2312.08078v5.pdf","comment":"Accepted by ACL 2024"},{"id":"http://arxiv.org/abs/2406.02253v1","updated":"2024-06-04T12:19:09Z","published":"2024-06-04T12:19:09Z","title":"PuFace: Defending against Facial Cloaking Attacks for Facial Recognition\n Models","summary":" The recently proposed facial cloaking attacks add invisible perturbation\n(cloaks) to facial images to protect users from being recognized by\nunauthorized facial recognition models. However, we show that the \"cloaks\" are\nnot robust enough and can be removed from images.\n This paper introduces PuFace, an image purification system leveraging the\ngeneralization ability of neural networks to diminish the impact of cloaks by\npushing the cloaked images towards the manifold of natural (uncloaked) images\nbefore the training process of facial recognition models. Specifically, we\ndevise a purifier that takes all the training images including both cloaked and\nnatural images as input and generates the purified facial images close to the\nmanifold where natural images lie. To meet the defense goal, we propose to\ntrain the purifier on particularly amplified cloaked images with a loss\nfunction that combines image loss and feature loss. Our empirical experiment\nshows PuFace can effectively defend against two state-of-the-art facial\ncloaking attacks and reduces the attack success rate from 69.84\\% to 7.61\\% on\naverage without degrading the normal accuracy for various facial recognition\nmodels. Moreover, PuFace is a model-agnostic defense mechanism that can be\napplied to any facial recognition model without modifying the model structure.\n","authors":["Jing Wen"],"pdf_url":"https://arxiv.org/pdf/2406.02253v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.18454v2","updated":"2024-06-04T11:59:54Z","published":"2024-04-29T06:24:32Z","title":"3D Gaussian Splatting with Deferred Reflection","summary":" The advent of neural and Gaussian-based radiance field methods have achieved\ngreat success in the field of novel view synthesis. However, specular\nreflection remains non-trivial, as the high frequency radiance field is\nnotoriously difficult to fit stably and accurately. We present a deferred\nshading method to effectively render specular reflection with Gaussian\nsplatting. The key challenge comes from the environment map reflection model,\nwhich requires accurate surface normal while simultaneously bottlenecks normal\nestimation with discontinuous gradients. We leverage the per-pixel reflection\ngradients generated by deferred shading to bridge the optimization process of\nneighboring Gaussians, allowing nearly correct normal estimations to gradually\npropagate and eventually spread over all reflective objects. Our method\nsignificantly outperforms state-of-the-art techniques and concurrent work in\nsynthesizing high-quality specular reflection effects, demonstrating a\nconsistent improvement of peak signal-to-noise ratio (PSNR) for both synthetic\nand real-world scenes, while running at a frame rate almost identical to\nvanilla Gaussian splatting.\n","authors":["Keyang Ye","Qiming Hou","Kun Zhou"],"pdf_url":"https://arxiv.org/pdf/2404.18454v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16475v2","updated":"2024-06-04T11:53:44Z","published":"2024-05-26T07:58:51Z","title":"Looks Too Good To Be True: An Information-Theoretic Analysis of\n Hallucinations in Generative Restoration Models","summary":" The pursuit of high perceptual quality in image restoration has driven the\ndevelopment of revolutionary generative models, capable of producing results\noften visually indistinguishable from real data. However, as their perceptual\nquality continues to improve, these models also exhibit a growing tendency to\ngenerate hallucinations - realistic-looking details that do not exist in the\nground truth images. The presence of hallucinations introduces uncertainty\nregarding the reliability of the models' predictions, raising major concerns\nabout their practical application. In this paper, we employ information-theory\ntools to investigate this phenomenon, revealing a fundamental tradeoff between\nuncertainty and perception. We rigorously analyze the relationship between\nthese two factors, proving that the global minimal uncertainty in generative\nmodels grows in tandem with perception. In particular, we define the inherent\nuncertainty of the restoration problem and show that attaining perfect\nperceptual quality entails at least twice this uncertainty. Additionally, we\nestablish a relation between mean squared-error distortion, uncertainty and\nperception, through which we prove the aforementioned uncertainly-perception\ntradeoff induces the well-known perception-distortion tradeoff. This work\nuncovers fundamental limitations of generative models in achieving both high\nperceptual quality and reliable predictions for image restoration. We\ndemonstrate our theoretical findings through an analysis of single image\nsuper-resolution algorithms. Our work aims to raise awareness among\npractitioners about this inherent tradeoff, empowering them to make informed\ndecisions and potentially prioritize safety over perceptual performance.\n","authors":["Regev Cohen","Idan Kligvasser","Ehud Rivlin","Daniel Freedman"],"pdf_url":"https://arxiv.org/pdf/2405.16475v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02230v1","updated":"2024-06-04T11:48:44Z","published":"2024-06-04T11:48:44Z","title":"I4VGen: Image as Stepping Stone for Text-to-Video Generation","summary":" Text-to-video generation has lagged behind text-to-image synthesis in quality\nand diversity due to the complexity of spatio-temporal modeling and limited\nvideo-text datasets. This paper presents I4VGen, a training-free and\nplug-and-play video diffusion inference framework, which enhances text-to-video\ngeneration by leveraging robust image techniques. Specifically, following\ntext-to-image-to-video, I4VGen decomposes the text-to-video generation into two\nstages: anchor image synthesis and anchor image-guided video synthesis.\nCorrespondingly, a well-designed generation-selection pipeline is employed to\nachieve visually-realistic and semantically-faithful anchor image, and an\ninnovative Noise-Invariant Video Score Distillation Sampling is incorporated to\nanimate the image to a dynamic video, followed by a video regeneration process\nto refine the video. This inference strategy effectively mitigates the\nprevalent issue of non-zero terminal signal-to-noise ratio. Extensive\nevaluations show that I4VGen not only produces videos with higher visual\nrealism and textual fidelity but also integrates seamlessly into existing\nimage-to-video diffusion models, thereby improving overall video quality.\n","authors":["Xiefan Guo","Jinlin Liu","Miaomiao Cui","Di Huang"],"pdf_url":"https://arxiv.org/pdf/2406.02230v1.pdf","comment":"Project page: https://xiefan-guo.github.io/i4vgen"},{"id":"http://arxiv.org/abs/2401.15578v2","updated":"2024-06-04T11:47:15Z","published":"2024-01-28T06:23:55Z","title":"ASCNet: Asymmetric Sampling Correction Network for Infrared Image\n Destriping","summary":" In a real-world infrared imaging system, effectively learning a consistent\nstripe noise removal model is essential. Most existing destriping methods\ncannot precisely reconstruct images due to cross-level semantic gaps and\ninsufficient characterization of the global column features. To tackle this\nproblem, we propose a novel infrared image destriping method, called Asymmetric\nSampling Correction Network (ASCNet), that can effectively capture global\ncolumn relationships and embed them into a U-shaped framework, providing\ncomprehensive discriminative representation and seamless semantic connectivity.\nOur ASCNet consists of three core elements: Residual Haar Discrete Wavelet\nTransform (RHDWT), Pixel Shuffle (PS), and Column Non-uniformity Correction\nModule (CNCM). Specifically, RHDWT is a novel downsampler that employs\ndouble-branch modeling to effectively integrate stripe-directional prior\nknowledge and data-driven semantic interaction to enrich the feature\nrepresentation. Observing the semantic patterns crosstalk of stripe noise, PS\nis introduced as an upsampler to prevent excessive apriori decoding and\nperforming semantic-bias-free image reconstruction. After each sampling, CNCM\ncaptures the column relationships in long-range dependencies. By incorporating\ncolumn, spatial, and self-dependence information, CNCM well establishes a\nglobal context to distinguish stripes from the scene's vertical structures.\nExtensive experiments on synthetic data, real data, and infrared small target\ndetection tasks demonstrate that the proposed method outperforms\nstate-of-the-art single-image destriping methods both visually and\nquantitatively. Our code will be made publicly available at\nhttps://github.com/xdFai/ASCNet.\n","authors":["Shuai Yuan","Hanlin Qin","Xiang Yan","Shiqi Yang","Shuowen Yang","Naveed Akhtar"],"pdf_url":"https://arxiv.org/pdf/2401.15578v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.18576v3","updated":"2024-06-04T11:39:34Z","published":"2023-11-30T14:15:39Z","title":"Fingerprint Matching with Localized Deep Representation","summary":" Compared to minutia-based fingerprint representations, fixed-length\nrepresentations are attractive due to simple and efficient matching. However,\nfixed-length fingerprint representations are limited in accuracy when matching\nfingerprints with different visible areas, which can occur due to different\nfinger poses or acquisition methods. To address this issue, we propose a\nlocalized deep representation of fingerprint, named LDRF. By focusing on the\ndiscriminative characteristics within local regions, LDRF provides a more\nrobust and accurate fixed-length representation for fingerprints with variable\nvisible areas. LDRF can be adapted to retain information within any valid area,\nmaking it highly flexible. The matching scores produced by LDRF also exhibit\nintuitive statistical characteristics, which led us to propose a matching score\nnormalization technique to mitigate the uncertainty in the cases of very small\noverlapping area. With this new technique, we can maintain a high level of\naccuracy and reliability in our fingerprint matching, even as the size of the\ndatabase grows rapidly. Our experimental results on 21 datasets containing over\n140K fingerprints of various finger poses and impression types show that LDRF\noutperforms other fixed-length representations and is robust to sensing\ntechnologies and impression types. Besides, the proposed matching score\nnormalization effectively reduces the false match rate (FMR) in large-scale\nidentification experiments comprising over 5.11 million fingerprints.\nSpecifically, this technique results in a reduction of two orders of magnitude\ncompared to matching without matching score normalization and five orders of\nmagnitude compared to prior works.\n","authors":["Yongjie Duan","Zhiyu Pan","Jianjiang Feng","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2311.18576v3.pdf","comment":"The paper requires major revision"},{"id":"http://arxiv.org/abs/2406.02223v1","updated":"2024-06-04T11:33:40Z","published":"2024-06-04T11:33:40Z","title":"SMCL: Saliency Masked Contrastive Learning for Long-tailed Recognition","summary":" Real-world data often follow a long-tailed distribution with a high imbalance\nin the number of samples between classes. The problem with training from\nimbalanced data is that some background features, common to all classes, can be\nunobserved in classes with scarce samples. As a result, this background\ncorrelates to biased predictions into ``major\" classes. In this paper, we\npropose saliency masked contrastive learning, a new method that uses saliency\nmasking and contrastive learning to mitigate the problem and improve the\ngeneralizability of a model. Our key idea is to mask the important part of an\nimage using saliency detection and use contrastive learning to move the masked\nimage towards minor classes in the feature space, so that background features\npresent in the masked image are no longer correlated with the original class.\nExperiment results show that our method achieves state-of-the-art level\nperformance on benchmark long-tailed datasets.\n","authors":["Sanglee Park","Seung-won Hwang","Jungmin So"],"pdf_url":"https://arxiv.org/pdf/2406.02223v1.pdf","comment":"accepted at ICASSP 2023"},{"id":"http://arxiv.org/abs/2405.09550v3","updated":"2024-06-04T11:28:42Z","published":"2024-03-20T12:27:30Z","title":"Mask-based Invisible Backdoor Attacks on Object Detection","summary":" Deep learning models have achieved unprecedented performance in the domain of\nobject detection, resulting in breakthroughs in areas such as autonomous\ndriving and security. However, deep learning models are vulnerable to backdoor\nattacks. These attacks prompt models to behave similarly to standard models\nwithout a trigger; however, they act maliciously upon detecting a predefined\ntrigger. Despite extensive research on backdoor attacks in image\nclassification, their application to object detection remains relatively\nunderexplored. Given the widespread application of object detection in critical\nreal-world scenarios, the sensitivity and potential impact of these\nvulnerabilities cannot be overstated. In this study, we propose an effective\ninvisible backdoor attack on object detection utilizing a mask-based approach.\nThree distinct attack scenarios were explored for object detection: object\ndisappearance, object misclassification, and object generation attack. Through\nextensive experiments, we comprehensively examined the effectiveness of these\nattacks and tested certain defense methods to determine effective\ncountermeasures. Code will be available at\nhttps://github.com/jeongjin0/invisible-backdoor-object-detection\n","authors":["Jeongjin Shin"],"pdf_url":"https://arxiv.org/pdf/2405.09550v3.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2403.01306v2","updated":"2024-06-04T11:08:42Z","published":"2024-03-02T20:36:10Z","title":"ICC: Quantifying Image Caption Concreteness for Multimodal Dataset\n Curation","summary":" Web-scale training on paired text-image data is becoming increasingly central\nto multimodal learning, but is challenged by the highly noisy nature of\ndatasets in the wild. Standard data filtering approaches succeed in removing\nmismatched text-image pairs, but permit semantically related but highly\nabstract or subjective text. These approaches lack the fine-grained ability to\nisolate the most concrete samples that provide the strongest signal for\nlearning in a noisy dataset. In this work, we propose a new metric, image\ncaption concreteness, that evaluates caption text without an image reference to\nmeasure its concreteness and relevancy for use in multimodal learning. Our\napproach leverages strong foundation models for measuring visual-semantic\ninformation loss in multimodal representations. We demonstrate that this\nstrongly correlates with human evaluation of concreteness in both single-word\nand sentence-level texts. Moreover, we show that curation using ICC complements\nexisting approaches: It succeeds in selecting the highest quality samples from\nmultimodal web-scale datasets to allow for efficient training in\nresource-constrained settings.\n","authors":["Moran Yanuka","Morris Alper","Hadar Averbuch-Elor","Raja Giryes"],"pdf_url":"https://arxiv.org/pdf/2403.01306v2.pdf","comment":"Accepted to ACL 2024 (Finding). For Project webpage, see\n https://moranyanuka.github.io/icc/"},{"id":"http://arxiv.org/abs/2312.04465v2","updated":"2024-06-04T11:08:25Z","published":"2023-12-07T17:35:49Z","title":"FitDiff: Robust monocular 3D facial shape and reflectance estimation\n using Diffusion Models","summary":" The remarkable progress in 3D face reconstruction has resulted in high-detail\nand photorealistic facial representations. Recently, Diffusion Models have\nrevolutionized the capabilities of generative methods by surpassing the\nperformance of GANs. In this work, we present FitDiff, a diffusion-based 3D\nfacial avatar generative model. Leveraging diffusion principles, our model\naccurately generates relightable facial avatars, utilizing an identity\nembedding extracted from an \"in-the-wild\" 2D facial image. The introduced\nmulti-modal diffusion model is the first to concurrently output facial\nreflectance maps (diffuse and specular albedo and normals) and shapes,\nshowcasing great generalization capabilities. It is solely trained on an\nannotated subset of a public facial dataset, paired with 3D reconstructions. We\nrevisit the typical 3D facial fitting approach by guiding a reverse diffusion\nprocess using perceptual and face recognition losses. Being the first 3D LDM\nconditioned on face recognition embeddings, FitDiff reconstructs relightable\nhuman avatars, that can be used as-is in common rendering engines, starting\nonly from an unconstrained facial image, and achieving state-of-the-art\nperformance.\n","authors":["Stathis Galanakis","Alexandros Lattas","Stylianos Moschoglou","Stefanos Zafeiriou"],"pdf_url":"https://arxiv.org/pdf/2312.04465v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02208v1","updated":"2024-06-04T11:06:13Z","published":"2024-06-04T11:06:13Z","title":"Why Only Text: Empowering Vision-and-Language Navigation with\n Multi-modal Prompts","summary":" Current Vision-and-Language Navigation (VLN) tasks mainly employ textual\ninstructions to guide agents. However, being inherently abstract, the same\ntextual instruction can be associated with different visual signals, causing\nsevere ambiguity and limiting the transfer of prior knowledge in the vision\ndomain from the user to the agent. To fill this gap, we propose\nVision-and-Language Navigation with Multi-modal Prompts (VLN-MP), a novel task\naugmenting traditional VLN by integrating both natural language and images in\ninstructions. VLN-MP not only maintains backward compatibility by effectively\nhandling text-only prompts but also consistently shows advantages with\ndifferent quantities and relevance of visual prompts. Possible forms of visual\nprompts include both exact and similar object images, providing adaptability\nand versatility in diverse navigation scenarios. To evaluate VLN-MP under a\nunified framework, we implement a new benchmark that offers: (1) a\ntraining-free pipeline to transform textual instructions into multi-modal forms\nwith landmark images; (2) diverse datasets with multi-modal instructions for\ndifferent downstream tasks; (3) a novel module designed to process various\nimage prompts for seamless integration with state-of-the-art VLN models.\nExtensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, CVDN) show\nthat incorporating visual prompts significantly boosts navigation performance.\nWhile maintaining efficiency with text-only prompts, VLN-MP enables agents to\nnavigate in the pre-explore setting and outperform text-based models, showing\nits broader applicability.\n","authors":["Haodong Hong","Sen Wang","Zi Huang","Qi Wu","Jiajun Liu"],"pdf_url":"https://arxiv.org/pdf/2406.02208v1.pdf","comment":"IJCAI 2024"},{"id":"http://arxiv.org/abs/2406.02202v1","updated":"2024-06-04T10:57:59Z","published":"2024-06-04T10:57:59Z","title":"Can CLIP help CLIP in learning 3D?","summary":" In this study, we explore an alternative approach to enhance contrastive\ntext-image-3D alignment in the absence of textual descriptions for 3D objects.\nWe introduce two unsupervised methods, $I2I$ and $(I2L)^2$, which leverage CLIP\nknowledge about textual and 2D data to compute the neural perceived similarity\nbetween two 3D samples. We employ the proposed methods to mine 3D hard\nnegatives, establishing a multimodal contrastive pipeline with hard negative\nweighting via a custom loss function. We train on different configurations of\nthe proposed hard negative mining approach, and we evaluate the accuracy of our\nmodels in 3D classification and on the cross-modal retrieval benchmark, testing\nimage-to-shape and shape-to-image retrieval. Results demonstrate that our\napproach, even without explicit text alignment, achieves comparable or superior\nperformance on zero-shot and standard 3D classification, while significantly\nimproving both image-to-shape and shape-to-image retrieval compared to previous\nmethods.\n","authors":["Cristian Sbrolli","Matteo Matteucci"],"pdf_url":"https://arxiv.org/pdf/2406.02202v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10223v4","updated":"2024-06-04T10:52:16Z","published":"2023-05-17T13:56:48Z","title":"Advancing Unsupervised Low-light Image Enhancement: Noise Estimation,\n Illumination Interpolation, and Self-Regulation","summary":" Contemporary Low-Light Image Enhancement (LLIE) techniques have made notable\nadvancements in preserving image details and enhancing contrast, achieving\ncommendable results on specific datasets. Nevertheless, these approaches\nencounter persistent challenges in efficiently mitigating dynamic noise and\naccommodating diverse low-light scenarios. Insufficient constraints on complex\npixel-wise mapping learning lead to overfitting to specific types of noise and\nartifacts associated with low-light conditions, reducing effectiveness in\nvariable lighting scenarios. To this end, we first propose a method for\nestimating the noise level in low light images in a quick and accurate way.\nThis facilitates precise denoising, prevents over-smoothing, and adapts to\ndynamic noise patterns. Subsequently, we devise a Learnable Illumination\nInterpolator (LII), which employs learnlable interpolation operations between\nthe input and unit vector to satisfy general constraints between illumination\nand input. Finally, we introduce a self-regularization loss that incorporates\nintrinsic image properties and essential visual attributes to guide the output\ntowards meeting human visual expectations. Comprehensive experiments validate\nthe competitiveness of our proposed algorithm in both qualitative and\nquantitative assessments. Notably, our noise estimation method, with linear\ntime complexity and suitable for various denoisers, significantly improves both\ndenoising and enhancement performance. Benefiting from this, our approach\nachieves a 0.675dB PSNR improvement on the LOL dataset and 0.818dB on the MIT\ndataset on LLIE task, even compared to supervised methods. The source code is\navailable at \\href{https://doi.org/10.5281/zenodo.11463142}{this DOI\nrepository} and the specific code for noise estimation can be found at\n\\href{https://github.com/GoogolplexGoodenough/noise_estimate}{this separate\nGitHub link}.\n","authors":["Xiaofeng Liu","Jiaxin Gao","Xin Fan","Risheng Liu"],"pdf_url":"https://arxiv.org/pdf/2305.10223v4.pdf","comment":"Image processing, low-light image enhancement, noise estimation,\n illumination learning"},{"id":"http://arxiv.org/abs/2403.05196v2","updated":"2024-06-04T10:47:02Z","published":"2024-03-08T10:19:00Z","title":"Denoising Autoregressive Representation Learning","summary":" In this paper, we explore a new generative approach for learning visual\nrepresentations. Our method, DARL, employs a decoder-only Transformer to\npredict image patches autoregressively. We find that training with Mean Squared\nError (MSE) alone leads to strong representations. To enhance the image\ngeneration ability, we replace the MSE loss with the diffusion objective by\nusing a denoising patch decoder. We show that the learned representation can be\nimproved by using tailored noise schedules and longer training in larger\nmodels. Notably, the optimal schedule differs significantly from the typical\nones used in standard image diffusion models. Overall, despite its simple\narchitecture, DARL delivers performance remarkably close to state-of-the-art\nmasked prediction models under the fine-tuning protocol. This marks an\nimportant step towards a unified model capable of both visual perception and\ngeneration, effectively combining the strengths of autoregressive and denoising\ndiffusion models.\n","authors":["Yazhe Li","Jorg Bornschein","Ting Chen"],"pdf_url":"https://arxiv.org/pdf/2403.05196v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.16518v2","updated":"2024-06-04T10:30:58Z","published":"2023-11-27T18:11:19Z","title":"SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution","summary":" Owe to the powerful generative priors, the pre-trained text-to-image (T2I)\ndiffusion models have become increasingly popular in solving the real-world\nimage super-resolution problem. However, as a consequence of the heavy quality\ndegradation of input low-resolution (LR) images, the destruction of local\nstructures can lead to ambiguous image semantics. As a result, the content of\nreproduced high-resolution image may have semantic errors, deteriorating the\nsuper-resolution performance. To address this issue, we present a\nsemantics-aware approach to better preserve the semantic fidelity of generative\nreal-world image super-resolution. First, we train a degradation-aware prompt\nextractor, which can generate accurate soft and hard semantic prompts even\nunder strong degradation. The hard semantic prompts refer to the image tags,\naiming to enhance the local perception ability of the T2I model, while the soft\nsemantic prompts compensate for the hard ones to provide additional\nrepresentation information. These semantic prompts encourage the T2I model to\ngenerate detailed and semantically accurate results. Furthermore, during the\ninference process, we integrate the LR images into the initial sampling noise\nto mitigate the diffusion model's tendency to generate excessive random\ndetails. The experiments show that our method can reproduce more realistic\nimage details and hold better the semantics. The source code of our method can\nbe found at https://github.com/cswry/SeeSR.\n","authors":["Rongyuan Wu","Tao Yang","Lingchen Sun","Zhengqiang Zhang","Shuai Li","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.16518v2.pdf","comment":"Accepted by CVPR2024"},{"id":"http://arxiv.org/abs/2406.02184v1","updated":"2024-06-04T10:29:18Z","published":"2024-06-04T10:29:18Z","title":"GraVITON: Graph based garment warping with attention guided inversion\n for Virtual-tryon","summary":" Virtual try-on, a rapidly evolving field in computer vision, is transforming\ne-commerce by improving customer experiences through precise garment warping\nand seamless integration onto the human body. While existing methods such as\nTPS and flow address the garment warping but overlook the finer contextual\ndetails. In this paper, we introduce a novel graph based warping technique\nwhich emphasizes the value of context in garment flow. Our graph based warping\nmodule generates warped garment as well as a coarse person image, which is\nutilised by a simple refinement network to give a coarse virtual tryon image.\nThe proposed work exploits latent diffusion model to generate the final tryon,\ntreating garment transfer as an inpainting task. The diffusion model is\nconditioned with decoupled cross attention based inversion of visual and\ntextual information. We introduce an occlusion aware warping constraint that\ngenerates dense warped garment, without any holes and occlusion. Our method,\nvalidated on VITON-HD and Dresscode datasets, showcases substantial\nstate-of-the-art qualitative and quantitative results showing considerable\nimprovement in garment warping, texture preservation, and overall realism.\n","authors":["Sanhita Pathak","Vinay Kaushik","Brejesh Lall"],"pdf_url":"https://arxiv.org/pdf/2406.02184v1.pdf","comment":"18 pages, 7 Figures and 6 Tables"},{"id":"http://arxiv.org/abs/2204.09389v2","updated":"2024-06-04T10:24:11Z","published":"2022-04-20T11:01:51Z","title":"Epistemic Uncertainty-Weighted Loss for Visual Bias Mitigation","summary":" Deep neural networks are highly susceptible to learning biases in visual\ndata. While various methods have been proposed to mitigate such bias, the\nmajority require explicit knowledge of the biases present in the training data\nin order to mitigate. We argue the relevance of exploring methods which are\ncompletely ignorant of the presence of any bias, but are capable of identifying\nand mitigating them. Furthermore, we propose using Bayesian neural networks\nwith a predictive uncertainty-weighted loss function to dynamically identify\npotential bias in individual training samples and to weight them during\ntraining. We find a positive correlation between samples subject to bias and\nhigher epistemic uncertainties. Finally, we show the method has potential to\nmitigate visual bias on a bias benchmark dataset and on a real-world face\ndetection problem, and we consider the merits and weaknesses of our approach.\n","authors":["Rebecca S Stone","Nishant Ravikumar","Andrew J Bulpitt","David C Hogg"],"pdf_url":"https://arxiv.org/pdf/2204.09389v2.pdf","comment":"Published in 2022 IEEE CVPR Workshop on Fair, Data Efficient and\n Trusted Computer Vision"},{"id":"http://arxiv.org/abs/2312.15271v2","updated":"2024-06-04T09:59:48Z","published":"2023-12-23T14:43:52Z","title":"SSFlowNet: Semi-supervised Scene Flow Estimation On Point Clouds With\n Pseudo Label","summary":" In the domain of supervised scene flow estimation, the process of manual\nlabeling is both time-intensive and financially demanding. This paper\nintroduces SSFlowNet, a semi-supervised approach for scene flow estimation,\nthat utilizes a blend of labeled and unlabeled data, optimizing the balance\nbetween the cost of labeling and the precision of model training. SSFlowNet\nstands out through its innovative use of pseudo-labels, mainly reducing the\ndependency on extensively labeled datasets while maintaining high model\naccuracy. The core of our model is its emphasis on the intricate geometric\nstructures of point clouds, both locally and globally, coupled with a novel\nspatial memory feature. This feature is adept at learning the geometric\nrelationships between points over sequential time frames. By identifying\nsimilarities between labeled and unlabeled points, SSFlowNet dynamically\nconstructs a correlation matrix to evaluate scene flow dependencies at\nindividual point level. Furthermore, the integration of a flow consistency\nmodule within SSFlowNet enhances its capability to consistently estimate flow,\nan essential aspect for analyzing dynamic scenes. Empirical results demonstrate\nthat SSFlowNet surpasses existing methods in pseudo-label generation and shows\nadaptability across varying data volumes. Moreover, our semi-supervised\ntraining technique yields promising outcomes even with different smaller ratio\nlabeled data, marking a substantial advancement in the field of scene flow\nestimation.\n","authors":["Jingze Chen","Junfeng Yao","Qiqin Lin","Rongzhou Zhou","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2312.15271v2.pdf","comment":"Accepted by 33rd International Conference on Artificial Neural\n Networks (ICANN 2024)"},{"id":"http://arxiv.org/abs/2406.02158v1","updated":"2024-06-04T09:45:04Z","published":"2024-06-04T09:45:04Z","title":"Radar Spectra-Language Model for Automotive Scene Parsing","summary":" Radar sensors are low cost, long-range, and weather-resilient. Therefore,\nthey are widely used for driver assistance functions, and are expected to be\ncrucial for the success of autonomous driving in the future. In many perception\ntasks only pre-processed radar point clouds are considered. In contrast, radar\nspectra are a raw form of radar measurements and contain more information than\nradar point clouds. However, radar spectra are rather difficult to interpret.\nIn this work, we aim to explore the semantic information contained in spectra\nin the context of automated driving, thereby moving towards better\ninterpretability of radar spectra. To this end, we create a radar\nspectra-language model, allowing us to query radar spectra measurements for the\npresence of scene elements using free text. We overcome the scarcity of radar\nspectra data by matching the embedding space of an existing vision-language\nmodel (VLM). Finally, we explore the benefit of the learned representation for\nscene parsing, and obtain improvements in free space segmentation and object\ndetection merely by injecting the spectra embedding into a baseline model.\n","authors":["Mariia Pushkareva","Yuri Feldman","Csaba Domokos","Kilian Rambach","Dotan Di Castro"],"pdf_url":"https://arxiv.org/pdf/2406.02158v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02153v1","updated":"2024-06-04T09:41:40Z","published":"2024-06-04T09:41:40Z","title":"Analyzing the Feature Extractor Networks for Face Image Synthesis","summary":" Advancements like Generative Adversarial Networks have attracted the\nattention of researchers toward face image synthesis to generate ever more\nrealistic images. Thereby, the need for the evaluation criteria to assess the\nrealism of the generated images has become apparent. While FID utilized with\nInceptionV3 is one of the primary choices for benchmarking, concerns about\nInceptionV3's limitations for face images have emerged. This study investigates\nthe behavior of diverse feature extractors -- InceptionV3, CLIP, DINOv2, and\nArcFace -- considering a variety of metrics -- FID, KID, Precision\\&Recall.\nWhile the FFHQ dataset is used as the target domain, as the source domains, the\nCelebA-HQ dataset and the synthetic datasets generated using StyleGAN2 and\nProjected FastGAN are used. Experiments include deep-down analysis of the\nfeatures: $L_2$ normalization, model attention during extraction, and domain\ndistributions in the feature space. We aim to give valuable insights into the\nbehavior of feature extractors for evaluating face image synthesis\nmethodologies. The code is publicly available at\nhttps://github.com/ThEnded32/AnalyzingFeatureExtractors.\n","authors":["Erdi Sarıtaş","Hazım Kemal Ekenel"],"pdf_url":"https://arxiv.org/pdf/2406.02153v1.pdf","comment":"Accepted at 18th International Conference on Automatic Face and\n Gesture Recognition (FG) on 1st SD-FGA Workshop 2024"},{"id":"http://arxiv.org/abs/2406.02147v1","updated":"2024-06-04T09:34:46Z","published":"2024-06-04T09:34:46Z","title":"UA-Track: Uncertainty-Aware End-to-End 3D Multi-Object Tracking","summary":" 3D multiple object tracking (MOT) plays a crucial role in autonomous driving\nperception. Recent end-to-end query-based trackers simultaneously detect and\ntrack objects, which have shown promising potential for the 3D MOT task.\nHowever, existing methods overlook the uncertainty issue, which refers to the\nlack of precise confidence about the state and location of tracked objects.\nUncertainty arises owing to various factors during motion observation by\ncameras, especially occlusions and the small size of target objects, resulting\nin an inaccurate estimation of the object's position, label, and identity. To\nthis end, we propose an Uncertainty-Aware 3D MOT framework, UA-Track, which\ntackles the uncertainty problem from multiple aspects. Specifically, we first\nintroduce an Uncertainty-aware Probabilistic Decoder to capture the uncertainty\nin object prediction with probabilistic attention. Secondly, we propose an\nUncertainty-guided Query Denoising strategy to further enhance the training\nprocess. We also utilize Uncertainty-reduced Query Initialization, which\nleverages predicted 2D object location and depth information to reduce query\nuncertainty. As a result, our UA-Track achieves state-of-the-art performance on\nthe nuScenes benchmark, i.e., 66.3% AMOTA on the test split, surpassing the\nprevious best end-to-end solution by a significant margin of 8.9% AMOTA.\n","authors":["Lijun Zhou","Tao Tang","Pengkun Hao","Zihang He","Kalok Ho","Shuo Gu","Wenbo Hou","Zhihui Hao","Haiyang Sun","Kun Zhan","Peng Jia","Xianpeng Lang","Xiaodan Liang"],"pdf_url":"https://arxiv.org/pdf/2406.02147v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02142v1","updated":"2024-06-04T09:29:59Z","published":"2024-06-04T09:29:59Z","title":"Analyzing the Effect of Combined Degradations on Face Recognition","summary":" A face recognition model is typically trained on large datasets of images\nthat may be collected from controlled environments. This results in performance\ndiscrepancies when applied to real-world scenarios due to the domain gap\nbetween clean and in-the-wild images. Therefore, some researchers have\ninvestigated the robustness of these models by analyzing synthetic\ndegradations. Yet, existing studies have mostly focused on single degradation\nfactors, which may not fully capture the complexity of real-world degradations.\nThis work addresses this problem by analyzing the impact of both single and\ncombined degradations using a real-world degradation pipeline extended with\nunder/over-exposure conditions. We use the LFW dataset for our experiments and\nassess the model's performance based on verification accuracy. Results reveal\nthat single and combined degradations show dissimilar model behavior. The\ncombined effect of degradation significantly lowers performance even if its\nsingle effect is negligible. This work emphasizes the importance of accounting\nfor real-world complexity to assess the robustness of face recognition models\nin real-world settings. The code is publicly available at\nhttps://github.com/ThEnded32/AnalyzingCombinedDegradations.\n","authors":["Erdi Sarıtaş","Hazım Kemal Ekenel"],"pdf_url":"https://arxiv.org/pdf/2406.02142v1.pdf","comment":"Accepted at 18th International Conference on Automatic Face and\n Gesture Recognition (FG) on 2nd PrivAAL Workshop 2024"},{"id":"http://arxiv.org/abs/2405.21013v3","updated":"2024-06-04T09:14:39Z","published":"2024-05-31T16:55:04Z","title":"StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image\n Perception, Comprehension, and Beyond","summary":" Text-rich images have significant and extensive value, deeply integrated into\nvarious aspects of human life. Notably, both visual cues and linguistic symbols\nin text-rich images play crucial roles in information transmission but are\naccompanied by diverse challenges. Therefore, the efficient and effective\nunderstanding of text-rich images is a crucial litmus test for the capability\nof Vision-Language Models. We have crafted an efficient vision-language model,\nStrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.\nThe significant design of StrucTexTv3 is presented in the following aspects:\nFirstly, we adopt a combination of an effective multi-scale reduced visual\ntransformer and a multi-granularity token sampler (MG-Sampler) as a visual\ntoken generator, successfully solving the challenges of high-resolution input\nand complex representation learning for text-rich images. Secondly, we enhance\nthe perception and comprehension abilities of StrucTexTv3 through instruction\nlearning, seamlessly integrating various text-oriented tasks into a unified\nframework. Thirdly, we have curated a comprehensive collection of high-quality\ntext-rich images, abbreviated as TIM-30M, encompassing diverse scenarios like\nincidental scenes, office documents, web pages, and screenshots, thereby\nimproving the robustness of our model. Our method achieved SOTA results in\ntext-rich image perception tasks, and significantly improved performance in\ncomprehension tasks. Among multimodal models with LLM decoder of approximately\n1.8B parameters, it stands out as a leader, which also makes the deployment of\nedge devices feasible. In summary, the StrucTexTv3 model, featuring efficient\nstructural design, outstanding performance, and broad adaptability, offers\nrobust support for diverse intelligent application tasks involving text-rich\nimages, thus exhibiting immense potential for widespread application.\n","authors":["Pengyuan Lyu","Yulin Li","Hao Zhou","Weihong Ma","Xingyu Wan","Qunyi Xie","Liang Wu","Chengquan Zhang","Kun Yao","Errui Ding","Jingdong Wang"],"pdf_url":"https://arxiv.org/pdf/2405.21013v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.11180v5","updated":"2024-06-04T09:11:21Z","published":"2023-06-19T22:07:20Z","title":"Hyperbolic Active Learning for Semantic Segmentation under Domain Shift","summary":" We introduce a hyperbolic neural network approach to pixel-level active\nlearning for semantic segmentation. Analysis of the data statistics leads to a\nnovel interpretation of the hyperbolic radius as an indicator of data scarcity.\nIn HALO (Hyperbolic Active Learning Optimization), for the first time, we\npropose the use of epistemic uncertainty as a data acquisition strategy,\nfollowing the intuition of selecting data points that are the least known. The\nhyperbolic radius, complemented by the widely-adopted prediction entropy,\neffectively approximates epistemic uncertainty. We perform extensive\nexperimental analysis based on two established synthetic-to-real benchmarks,\ni.e. GTAV $\\rightarrow$ Cityscapes and SYNTHIA $\\rightarrow$ Cityscapes.\nAdditionally, we test HALO on Cityscape $\\rightarrow$ ACDC for domain\nadaptation under adverse weather conditions, and we benchmark both\nconvolutional and attention-based backbones. HALO sets a new state-of-the-art\nin active learning for semantic segmentation under domain shift and it is the\nfirst active learning approach that surpasses the performance of supervised\ndomain adaptation while using only a small portion of labels (i.e., 1%).\n","authors":["Luca Franco","Paolo Mandica","Konstantinos Kallidromitis","Devin Guillory","Yu-Teng Li","Trevor Darrell","Fabio Galasso"],"pdf_url":"https://arxiv.org/pdf/2306.11180v5.pdf","comment":"ICML 2024. Project repository: https://github.com/paolomandica/HALO"},{"id":"http://arxiv.org/abs/2406.02125v1","updated":"2024-06-04T09:10:02Z","published":"2024-06-04T09:10:02Z","title":"Domain Game: Disentangle Anatomical Feature for Single Domain\n Generalized Segmentation","summary":" Single domain generalization aims to address the challenge of\nout-of-distribution generalization problem with only one source domain\navailable. Feature distanglement is a classic solution to this purpose, where\nthe extracted task-related feature is presumed to be resilient to domain shift.\nHowever, the absence of references from other domains in a single-domain\nscenario poses significant uncertainty in feature disentanglement\n(ill-posedness). In this paper, we propose a new framework, named\n\\textit{Domain Game}, to perform better feature distangling for medical image\nsegmentation, based on the observation that diagnostic relevant features are\nmore sensitive to geometric transformations, whilist domain-specific features\nprobably will remain invariant to such operations. In domain game, a set of\nrandomly transformed images derived from a singular source image is\nstrategically encoded into two separate feature sets to represent diagnostic\nfeatures and domain-specific features, respectively, and we apply forces to\npull or repel them in the feature space, accordingly. Results from cross-site\ntest domain evaluation showcase approximately an ~11.8% performance boost in\nprostate segmentation and around ~10.5% in brain tumor segmentation compared to\nthe second-best method.\n","authors":["Hao Chen","Hongrun Zhang","U Wang Chan","Rui Yin","Xiaofei Wang","Chao Li"],"pdf_url":"https://arxiv.org/pdf/2406.02125v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01136v2","updated":"2024-06-04T09:02:14Z","published":"2024-06-03T09:27:57Z","title":"Towards Practical Single-shot Motion Synthesis","summary":" Despite the recent advances in the so-called \"cold start\" generation from\ntext prompts, their needs in data and computing resources, as well as the\nambiguities around intellectual property and privacy concerns pose certain\ncounterarguments for their utility. An interesting and relatively unexplored\nalternative has been the introduction of unconditional synthesis from a single\nsample, which has led to interesting generative applications. In this paper we\nfocus on single-shot motion generation and more specifically on accelerating\nthe training time of a Generative Adversarial Network (GAN). In particular, we\ntackle the challenge of GAN's equilibrium collapse when using mini-batch\ntraining by carefully annealing the weights of the loss functions that prevent\nmode collapse. Additionally, we perform statistical analysis in the generator\nand discriminator models to identify correlations between training stages and\nenable transfer learning. Our improved GAN achieves competitive quality and\ndiversity on the Mixamo benchmark when compared to the original GAN\narchitecture and a single-shot diffusion model, while being up to x6.8 faster\nin training time from the former and x1.75 from the latter. Finally, we\ndemonstrate the ability of our improved GAN to mix and compose motion with a\nsingle forward pass. Project page available at\nhttps://moverseai.github.io/single-shot.\n","authors":["Konstantinos Roditakis","Spyridon Thermos","Nikolaos Zioulis"],"pdf_url":"https://arxiv.org/pdf/2406.01136v2.pdf","comment":"CVPR 2024, AI for 3D Generation Workshop, Project page:\n https://moverseai.github.io/single-shot"},{"id":"http://arxiv.org/abs/2303.10559v2","updated":"2024-06-04T08:57:38Z","published":"2023-03-19T04:00:05Z","title":"Deep Learning for Camera Calibration and Beyond: A Survey","summary":" Camera calibration involves estimating camera parameters to infer geometric\nfeatures from captured sequences, which is crucial for computer vision and\nrobotics. However, conventional calibration is laborious and requires dedicated\ncollection. Recent efforts show that learning-based solutions have the\npotential to be used in place of the repeatability works of manual\ncalibrations. Among these solutions, various learning strategies, networks,\ngeometric priors, and datasets have been investigated. In this paper, we\nprovide a comprehensive survey of learning-based camera calibration techniques,\nby analyzing their strengths and limitations. Our main calibration categories\ninclude the standard pinhole camera model, distortion camera model, cross-view\nmodel, and cross-sensor model, following the research trend and extended\napplications. As there is no benchmark in this community, we collect a holistic\ncalibration dataset that can serve as a public platform to evaluate the\ngeneralization of existing methods. It comprises both synthetic and real-world\ndata, with images and videos captured by different cameras in diverse scenes.\nToward the end of this paper, we discuss the challenges and provide further\nresearch directions. To our knowledge, this is the first survey for the\nlearning-based camera calibration (spanned 8 years). The summarized methods,\ndatasets, and benchmarks are available and will be regularly updated at\nhttps://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.\n","authors":["Kang Liao","Lang Nie","Shujuan Huang","Chunyu Lin","Jing Zhang","Yao Zhao","Moncef Gabbouj","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2303.10559v2.pdf","comment":"Github repository:\n https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration"},{"id":"http://arxiv.org/abs/2405.07857v2","updated":"2024-06-04T08:56:57Z","published":"2024-05-13T15:42:46Z","title":"Synergistic Integration of Coordinate Network and Tensorial Feature for\n Improving Neural Radiance Fields from Sparse Inputs","summary":" The multi-plane representation has been highlighted for its fast training and\ninference across static and dynamic neural radiance fields. This approach\nconstructs relevant features via projection onto learnable grids and\ninterpolating adjacent vertices. However, it has limitations in capturing\nlow-frequency details and tends to overuse parameters for low-frequency\nfeatures due to its bias toward fine details, despite its multi-resolution\nconcept. This phenomenon leads to instability and inefficiency when training\nposes are sparse. In this work, we propose a method that synergistically\nintegrates multi-plane representation with a coordinate-based MLP network known\nfor strong bias toward low-frequency signals. The coordinate-based network is\nresponsible for capturing low-frequency details, while the multi-plane\nrepresentation focuses on capturing fine-grained details. We demonstrate that\nusing residual connections between them seamlessly preserves their own inherent\nproperties. Additionally, the proposed progressive training scheme accelerates\nthe disentanglement of these two features. We demonstrate empirically that our\nproposed method outperforms baseline models for both static and dynamic NeRFs\nwith sparse inputs, achieving comparable results with fewer parameters.\n","authors":["Mingyu Kim","Jun-Seong Kim","Se-Young Yun","Jin-Hwa Kim"],"pdf_url":"https://arxiv.org/pdf/2405.07857v2.pdf","comment":"ICML2024 ; Project page is accessible at\n https://mingyukim87.github.io/SynergyNeRF ; Code is available at\n https://github.com/MingyuKim87/SynergyNeRF"},{"id":"http://arxiv.org/abs/2211.13984v2","updated":"2024-06-04T08:54:51Z","published":"2022-11-25T09:47:34Z","title":"Aggregated Text Transformer for Scene Text Detection","summary":" This paper explores the multi-scale aggregation strategy for scene text\ndetection in natural images. We present the Aggregated Text TRansformer(ATTR),\nwhich is designed to represent texts in scene images with a multi-scale\nself-attention mechanism. Starting from the image pyramid with multiple\nresolutions, the features are first extracted at different scales with shared\nweight and then fed into an encoder-decoder architecture of Transformer. The\nmulti-scale image representations are robust and contain rich information on\ntext contents of various sizes. The text Transformer aggregates these features\nto learn the interaction across different scales and improve text\nrepresentation. The proposed method detects scene texts by representing each\ntext instance as an individual binary mask, which is tolerant of curve texts\nand regions with dense instances. Extensive experiments on public scene text\ndetection datasets demonstrate the effectiveness of the proposed framework.\n","authors":["Zhao Zhou","Xiangcheng Du","Yingbin Zheng","Cheng Jin"],"pdf_url":"https://arxiv.org/pdf/2211.13984v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00772v2","updated":"2024-06-04T08:53:24Z","published":"2024-06-02T15:19:07Z","title":"Unsupervised Contrastive Analysis for Salient Pattern Detection using\n Conditional Diffusion Models","summary":" Contrastive Analysis (CA) regards the problem of identifying patterns in\nimages that allow distinguishing between a background (BG) dataset (i.e.\nhealthy subjects) and a target (TG) dataset (i.e. unhealthy subjects). Recent\nworks on this topic rely on variational autoencoders (VAE) or contrastive\nlearning strategies to learn the patterns that separate TG samples from BG\nsamples in a supervised manner. However, the dependency on target (unhealthy)\nsamples can be challenging in medical scenarios due to their limited\navailability. Also, the blurred reconstructions of VAEs lack utility and\ninterpretability. In this work, we redefine the CA task by employing a\nself-supervised contrastive encoder to learn a latent representation encoding\nonly common patterns from input images, using samples exclusively from the BG\ndataset during training, and approximating the distribution of the target\npatterns by leveraging data augmentation techniques. Subsequently, we exploit\nstate-of-the-art generative methods, i.e. diffusion models, conditioned on the\nlearned latent representation to produce a realistic (healthy) version of the\ninput image encoding solely the common patterns. Thorough validation on a\nfacial image dataset and experiments across three brain MRI datasets\ndemonstrate that conditioning the generative process of state-of-the-art\ngenerative methods with the latent representation from our self-supervised\ncontrastive encoder yields improvements in the generated image quality and in\nthe accuracy of image classification. The code is available at\nhttps://github.com/CristianoPatricio/unsupervised-contrastive-cond-diff.\n","authors":["Cristiano Patrício","Carlo Alberto Barbano","Attilio Fiandrotti","Riccardo Renzulli","Marco Grangetto","Luis F. Teixeira","João C. Neves"],"pdf_url":"https://arxiv.org/pdf/2406.00772v2.pdf","comment":"18 pages, 11 figures"},{"id":"http://arxiv.org/abs/2405.15477v2","updated":"2024-06-04T08:35:14Z","published":"2024-05-24T11:58:02Z","title":"MagicBathyNet: A Multimodal Remote Sensing Dataset for Bathymetry\n Prediction and Pixel-based Classification in Shallow Waters","summary":" Accurate, detailed, and high-frequent bathymetry, coupled with complex\nsemantic content, is crucial for the undermapped shallow seabed areas facing\nintense climatological and anthropogenic pressures. Current methods exploiting\nremote sensing images to derive bathymetry or seabed classes mainly exploit\nnon-open data. This lack of openly accessible benchmark archives prevents the\nwider use of deep learning methods in such applications. To address this issue,\nin this paper we present the MagicBathyNet, which is a benchmark dataset made\nup of image patches of Sentinel2, SPOT-6 and aerial imagery, bathymetry in\nraster format and annotations of seabed classes. MagicBathyNet is then\nexploited to benchmark state-of-the-art methods in learning-based bathymetry\nand pixel-based classification. Dataset, pre-trained weights, and code are\npublicly available at www.magicbathy.eu/magicbathynet.html.\n","authors":["Panagiotis Agrafiotis","Łukasz Janowski","Dimitrios Skarlatos","Begüm Demir"],"pdf_url":"https://arxiv.org/pdf/2405.15477v2.pdf","comment":"5 pages, 3 figures, 5 tables. Accepted at IEEE International\n Geoscience and Remote Sensing Symposium (IGARSS) 2024"},{"id":"http://arxiv.org/abs/2406.01425v2","updated":"2024-06-04T08:20:27Z","published":"2024-06-03T15:25:45Z","title":"Sensitivity-Informed Augmentation for Robust Segmentation","summary":" Segmentation is an integral module in many visual computing applications such\nas virtual try-on, medical imaging, autonomous driving, and agricultural\nautomation. These applications often involve either widespread consumer use or\nhighly variable environments, both of which can degrade the quality of visual\nsensor data, whether from a common mobile phone or an expensive satellite\nimaging camera. In addition to external noises like user difference or weather\nconditions, internal noises such as variations in camera quality or lens\ndistortion can affect the performance of segmentation models during both\ndevelopment and deployment. In this work, we present an efficient, adaptable,\nand gradient-free method to enhance the robustness of learning-based\nsegmentation models across training. First, we introduce a novel adaptive\nsensitivity analysis (ASA) using Kernel Inception Distance (KID) on basis\nperturbations to benchmark perturbation sensitivity of pre-trained segmentation\nmodels. Then, we model the sensitivity curve using the adaptive SA and sample\nperturbation hyperparameter values accordingly. Finally, we conduct adversarial\ntraining with the selected perturbation values and dynamically re-evaluate\nrobustness during online training. Our method, implemented end-to-end with\nminimal fine-tuning required, consistently outperforms state-of-the-art data\naugmentation techniques for segmentation. It shows significant improvement in\nboth clean data evaluation and real-world adverse scenario evaluation across\nvarious segmentation datasets used in visual computing and computer graphics\napplications.\n","authors":["Laura Zheng","Wenjie Wei","Tony Wu","Jacob Clements","Shreelekha Revankar","Andre Harrison","Yu Shen","Ming C. Lin"],"pdf_url":"https://arxiv.org/pdf/2406.01425v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2406.02077v1","updated":"2024-06-04T07:57:34Z","published":"2024-06-04T07:57:34Z","title":"Multi-target stain normalization for histology slides","summary":" Traditional staining normalization approaches, e.g. Macenko, typically rely\non the choice of a single representative reference image, which may not\nadequately account for the diverse staining patterns of datasets collected in\npractical scenarios. In this study, we introduce a novel approach that\nleverages multiple reference images to enhance robustness against stain\nvariation. Our method is parameter-free and can be adopted in existing\ncomputational pathology pipelines with no significant changes. We evaluate the\neffectiveness of our method through experiments using a deep-learning pipeline\nfor automatic nuclei segmentation on colorectal images. Our results show that\nby leveraging multiple reference images, better results can be achieved when\ngeneralizing to external data, where the staining can widely differ from the\ntraining set.\n","authors":["Desislav Ivanov","Carlo Alberto Barbano","Marco Grangetto"],"pdf_url":"https://arxiv.org/pdf/2406.02077v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02074v1","updated":"2024-06-04T07:54:10Z","published":"2024-06-04T07:54:10Z","title":"FaceCom: Towards High-fidelity 3D Facial Shape Completion via\n Optimization and Inpainting Guidance","summary":" We propose FaceCom, a method for 3D facial shape completion, which delivers\nhigh-fidelity results for incomplete facial inputs of arbitrary forms. Unlike\nend-to-end shape completion methods based on point clouds or voxels, our\napproach relies on a mesh-based generative network that is easy to optimize,\nenabling it to handle shape completion for irregular facial scans. We first\ntrain a shape generator on a mixed 3D facial dataset containing 2405\nidentities. Based on the incomplete facial input, we fit complete faces using\nan optimization approach under image inpainting guidance. The completion\nresults are refined through a post-processing step. FaceCom demonstrates the\nability to effectively and naturally complete facial scan data with varying\nmissing regions and degrees of missing areas. Our method can be used in medical\nprosthetic fabrication and the registration of deficient scanning data. Our\nexperimental results demonstrate that FaceCom achieves exceptional performance\nin fitting and shape completion tasks. The code is available at\nhttps://github.com/dragonylee/FaceCom.git.\n","authors":["Yinglong Li","Hongyu Wu","Xiaogang Wang","Qingzhao Qin","Yijiao Zhao","Yong wang","Aimin Hao"],"pdf_url":"https://arxiv.org/pdf/2406.02074v1.pdf","comment":"accepted to CVPR2024"},{"id":"http://arxiv.org/abs/2406.02064v1","updated":"2024-06-04T07:45:27Z","published":"2024-06-04T07:45:27Z","title":"Advancing Generalized Transfer Attack with Initialization Derived\n Bilevel Optimization and Dynamic Sequence Truncation","summary":" Transfer attacks generate significant interest for real-world black-box\napplications by crafting transferable adversarial examples through surrogate\nmodels. Whereas, existing works essentially directly optimize the single-level\nobjective w.r.t. the surrogate model, which always leads to poor\ninterpretability of attack mechanism and limited generalization performance\nover unknown victim models. In this work, we propose the\n\\textbf{B}il\\textbf{E}vel \\textbf{T}ransfer \\textbf{A}ttac\\textbf{K} (BETAK)\nframework by establishing an initialization derived bilevel optimization\nparadigm, which explicitly reformulates the nested constraint relationship\nbetween the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL)\nsurrogate attacker. Algorithmically, we introduce the Hyper Gradient Response\n(HGR) estimation as an effective feedback for the transferability over\npseudo-victim attackers, and propose the Dynamic Sequence Truncation (DST)\ntechnique to dynamically adjust the back-propagation path for HGR and reduce\ncomputational overhead simultaneously. Meanwhile, we conduct detailed\nalgorithmic analysis and provide convergence guarantee to support non-convexity\nof the LL surrogate attacker. Extensive evaluations demonstrate substantial\nimprovement of BETAK (e.g., $\\mathbf{53.41}$\\% increase of attack success rates\nagainst IncRes-v$2_{ens}$) against different victims and defense methods in\ntargeted and untargeted attack scenarios. The source code is available at\nhttps://github.com/callous-youth/BETAK.\n","authors":["Yaohua Liu","Jiaxin Gao","Xuan Liu","Xianghao Jiao","Xin Fan","Risheng Liu"],"pdf_url":"https://arxiv.org/pdf/2406.02064v1.pdf","comment":"Accepted by IJCAI 2024. 10 pages"},{"id":"http://arxiv.org/abs/2406.02058v1","updated":"2024-06-04T07:42:33Z","published":"2024-06-04T07:42:33Z","title":"OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary\n Understanding","summary":" This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting\n(3DGS) capable of 3D point-level open vocabulary understanding. Our primary\nmotivation stems from observing that existing 3DGS-based open vocabulary\nmethods mainly focus on 2D pixel-level parsing. These methods struggle with 3D\npoint-level tasks due to weak feature expressiveness and inaccurate 2D-3D\nfeature associations. To ensure robust feature presentation and 3D point-level\nunderstanding, we first employ SAM masks without cross-frame associations to\ntrain instance features with 3D consistency. These features exhibit both\nintra-object consistency and inter-object distinction. Then, we propose a\ntwo-stage codebook to discretize these features from coarse to fine levels. At\nthe coarse level, we consider the positional information of 3D points to\nachieve location-based clustering, which is then refined at the fine level.\nFinally, we introduce an instance-level 3D-2D feature association method that\nlinks 3D points to 2D masks, which are further associated with 2D CLIP\nfeatures. Extensive experiments, including open vocabulary-based 3D object\nselection, 3D point cloud understanding, click-based 3D object selection, and\nablation studies, demonstrate the effectiveness of our proposed method. Project\npage: https://3d-aigc.github.io/OpenGaussian\n","authors":["Yanmin Wu","Jiarui Meng","Haijie Li","Chenming Wu","Yahao Shi","Xinhua Cheng","Chen Zhao","Haocheng Feng","Errui Ding","Jingdong Wang","Jian Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.02058v1.pdf","comment":"technical report, 15 pages"},{"id":"http://arxiv.org/abs/2406.01489v2","updated":"2024-06-04T07:39:20Z","published":"2024-06-03T16:13:33Z","title":"DA-HFNet: Progressive Fine-Grained Forgery Image Detection and\n Localization Based on Dual Attention","summary":" The increasing difficulty in accurately detecting forged images generated by\nAIGC(Artificial Intelligence Generative Content) poses many risks,\nnecessitating the development of effective methods to identify and further\nlocate forged areas. In this paper, to facilitate research efforts, we\nconstruct a DA-HFNet forged image dataset guided by text or image-assisted GAN\nand Diffusion model. Our goal is to utilize a hierarchical progressive network\nto capture forged artifacts at different scales for detection and localization.\nSpecifically, it relies on a dual-attention mechanism to adaptively fuse\nmulti-modal image features in depth, followed by a multi-branch interaction\nnetwork to thoroughly interact image features at different scales and improve\ndetector performance by leveraging dependencies between layers. Additionally,\nwe extract more sensitive noise fingerprints to obtain more prominent forged\nartifact features in the forged areas. Extensive experiments validate the\neffectiveness of our approach, demonstrating significant performance\nimprovements compared to state-of-the-art methods for forged image detection\nand localization.The code and dataset will be released in the future.\n","authors":["Yang Liu","Xiaofei Li","Jun Zhang","Shengze Hu","Jun Lei"],"pdf_url":"https://arxiv.org/pdf/2406.01489v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01460v2","updated":"2024-06-04T07:36:57Z","published":"2024-06-03T15:49:11Z","title":"MLIP: Efficient Multi-Perspective Language-Image Pretraining with\n Exhaustive Data Utilization","summary":" Contrastive Language-Image Pretraining (CLIP) has achieved remarkable\nsuccess, leading to rapid advancements in multimodal studies. However, CLIP\nfaces a notable challenge in terms of inefficient data utilization. It relies\non a single contrastive supervision for each image-text pair during\nrepresentation learning, disregarding a substantial amount of valuable\ninformation that could offer richer supervision. Additionally, the retention of\nnon-informative tokens leads to increased computational demands and time costs,\nparticularly in CLIP's ViT image encoder. To address these issues, we propose\nMulti-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the\nfrequency transform's sensitivity to both high and low-frequency variations,\nwhich complements the spatial domain's sensitivity limited to low-frequency\nvariations only. By incorporating frequency transforms and token-level\nalignment, we expand CILP's single supervision into multi-domain and\nmulti-level supervision, enabling a more thorough exploration of informative\nimage features. Additionally, we introduce a token merging method guided by\ncomprehensive semantics from the frequency and spatial domains. This allows us\nto merge tokens to multi-granularity tokens with a controllable compression\nrate to accelerate CLIP. Extensive experiments validate the effectiveness of\nour design.\n","authors":["Yu Zhang","Qi Zhang","Zixuan Gong","Yiwei Shi","Yepeng Liu","Duoqian Miao","Yang Liu","Ke Liu","Kun Yi","Wei Fan","Liang Hu","Changwei Wang"],"pdf_url":"https://arxiv.org/pdf/2406.01460v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2312.00851v2","updated":"2024-06-04T07:34:05Z","published":"2023-12-01T13:25:16Z","title":"Physics Inspired Criterion for Pruning-Quantization Joint Learning","summary":" Pruning-quantization joint learning always facilitates the deployment of deep\nneural networks (DNNs) on resource-constrained edge devices. However, most\nexisting methods do not jointly learn a global criterion for pruning and\nquantization in an interpretable way. In this paper, we propose a novel physics\ninspired criterion for pruning-quantization joint learning (PIC-PQ), which is\nexplored from an analogy we first draw between elasticity dynamics (ED) and\nmodel compression (MC). Specifically, derived from Hooke's law in ED, we\nestablish a linear relationship between the filters' importance distribution\nand the filter property (FP) by a learnable deformation scale in the physics\ninspired criterion (PIC). Furthermore, we extend PIC with a relative shift\nvariable for a global view. To ensure feasibility and flexibility, available\nmaximum bitwidth and penalty factor are introduced in quantization bitwidth\nassignment. Experiments on benchmarks of image classification demonstrate that\nPIC-PQ yields a good trade-off between accuracy and bit-operations (BOPs)\ncompression ratio e.g., 54.96X BOPs compression ratio in ResNet56 on CIFAR10\nwith 0.10% accuracy drop and 53.24X in ResNet18 on ImageNet with 0.61% accuracy\ndrop). The code will be available at https://github.com/fanxxxxyi/PIC-PQ.\n","authors":["Weiying Xie","Xiaoyi Fan","Xin Zhang","Yunsong Li","Jie Lei","Leyuan Fang"],"pdf_url":"https://arxiv.org/pdf/2312.00851v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02038v1","updated":"2024-06-04T07:23:41Z","published":"2024-06-04T07:23:41Z","title":"Leveraging Predicate and Triplet Learning for Scene Graph Generation","summary":" Scene Graph Generation (SGG) aims to identify entities and predict the\nrelationship triplets \\textit{\\textless subject, predicate, object\\textgreater\n} in visual scenes. Given the prevalence of large visual variations of\nsubject-object pairs even in the same predicate, it can be quite challenging to\nmodel and refine predicate representations directly across such pairs, which is\nhowever a common strategy adopted by most existing SGG methods. We observe that\nvisual variations within the identical triplet are relatively small and certain\nrelation cues are shared in the same type of triplet, which can potentially\nfacilitate the relation learning in SGG. Moreover, for the long-tail problem\nwidely studied in SGG task, it is also crucial to deal with the limited types\nand quantity of triplets in tail predicates. Accordingly, in this paper, we\npropose a Dual-granularity Relation Modeling (DRM) network to leverage\nfine-grained triplet cues besides the coarse-grained predicate ones. DRM\nutilizes contexts and semantics of predicate and triplet with Dual-granularity\nConstraints, generating compact and balanced representations from two\nperspectives to facilitate relation recognition. Furthermore, a\nDual-granularity Knowledge Transfer (DKT) strategy is introduced to transfer\nvariation from head predicates/triplets to tail ones, aiming to enrich the\npattern diversity of tail classes to alleviate the long-tail problem. Extensive\nexperiments demonstrate the effectiveness of our method, which establishes new\nstate-of-the-art performance on Visual Genome, Open Image, and GQA datasets.\nOur code is available at \\url{https://github.com/jkli1998/DRM}\n","authors":["Jiankai Li","Yunhong Wang","Xiefan Guo","Ruijie Yang","Weixin Li"],"pdf_url":"https://arxiv.org/pdf/2406.02038v1.pdf","comment":"CVPR 2024"},{"id":"http://arxiv.org/abs/2406.02037v1","updated":"2024-06-04T07:23:09Z","published":"2024-06-04T07:23:09Z","title":"Multi-Scale Direction-Aware Network for Infrared Small Target Detection","summary":" Infrared small target detection faces the problem that it is difficult to\neffectively separate the background and the target. Existing deep\nlearning-based methods focus on appearance features and ignore high-frequency\ndirectional features. Therefore, we propose a multi-scale direction-aware\nnetwork (MSDA-Net), which is the first attempt to integrate the high-frequency\ndirectional features of infrared small targets as domain prior knowledge into\nneural networks. Specifically, an innovative multi-directional feature\nawareness (MDFA) module is constructed, which fully utilizes the prior\nknowledge of targets and emphasizes the focus on high-frequency directional\nfeatures. On this basis, combined with the multi-scale local relation learning\n(MLRL) module, a multi-scale direction-aware (MSDA) module is further\nconstructed. The MSDA module promotes the full extraction of local relations at\ndifferent scales and the full perception of key features in different\ndirections. Meanwhile, a high-frequency direction injection (HFDI) module\nwithout training parameters is constructed to inject the high-frequency\ndirectional information of the original image into the network. This helps\nguide the network to pay attention to detailed information such as target edges\nand shapes. In addition, we propose a feature aggregation (FA) structure that\naggregates multi-level features to solve the problem of small targets\ndisappearing in deep feature maps. Furthermore, a lightweight feature alignment\nfusion (FAF) module is constructed, which can effectively alleviate the pixel\noffset existing in multi-level feature map fusion. Extensive experimental\nresults show that our MSDA-Net achieves state-of-the-art (SOTA) results on the\npublic NUDT-SIRST, SIRST and IRSTD-1k datasets.\n","authors":["Jinmiao Zhao","Zelin Shi","Chuang Yu","Yunpeng Liu"],"pdf_url":"https://arxiv.org/pdf/2406.02037v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16358v2","updated":"2024-06-04T07:22:48Z","published":"2024-03-25T01:44:34Z","title":"ChebMixer: Efficient Graph Representation Learning with MLP Mixer","summary":" Graph neural networks have achieved remarkable success in learning graph\nrepresentations, especially graph Transformer, which has recently shown\nsuperior performance on various graph mining tasks. However, graph Transformer\ngenerally treats nodes as tokens, which results in quadratic complexity\nregarding the number of nodes during self-attention computation. The graph MLP\nMixer addresses this challenge by using the efficient MLP Mixer technique from\ncomputer vision. However, the time-consuming process of extracting graph tokens\nlimits its performance. In this paper, we present a novel architecture named\nChebMixer, a newly graph MLP Mixer that uses fast Chebyshev polynomials-based\nspectral filtering to extract a sequence of tokens. Firstly, we produce\nmultiscale representations of graph nodes via fast Chebyshev polynomial-based\nspectral filtering. Next, we consider each node's multiscale representations as\na sequence of tokens and refine the node representation with an effective MLP\nMixer. Finally, we aggregate the multiscale representations of nodes through\nChebyshev interpolation. Owing to the powerful representation capabilities and\nfast computational properties of MLP Mixer, we can quickly extract more\ninformative node representations to improve the performance of downstream\ntasks. The experimental results prove our significant improvements in a variety\nof scenarios ranging from graph node classification to medical image\nsegmentation.\n","authors":["Xiaoyan Kui","Haonan Yan","Qinsong Li","Liming Chen","Beiji Zou"],"pdf_url":"https://arxiv.org/pdf/2403.16358v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02027v1","updated":"2024-06-04T07:06:06Z","published":"2024-06-04T07:06:06Z","title":"Inference Attacks in Machine Learning as a Service: A Taxonomy, Review,\n and Promising Directions","summary":" The prosperity of machine learning has also brought people's concerns about\ndata privacy. Among them, inference attacks can implement privacy breaches in\nvarious MLaaS scenarios and model training/prediction phases. Specifically,\ninference attacks can perform privacy inference on undisclosed target training\nsets based on outputs of the target model, including but not limited to\nstatistics, membership, semantics, data representation, etc. For instance,\ninfer whether the target data has the characteristics of AIDS. In addition, the\nrapid development of the machine learning community in recent years, especially\nthe surge of model types and application scenarios, has further stimulated the\ninference attacks' research. Thus, studying inference attacks and analyzing\nthem in depth is urgent and significant. However, there is still a gap in the\nsystematic discussion of inference attacks from taxonomy, global perspective,\nattack, and defense perspectives. This survey provides an in-depth and\ncomprehensive inference of attacks and corresponding countermeasures in\nML-as-a-service based on taxonomy and the latest researches. Without\ncompromising researchers' intuition, we first propose the 3MP taxonomy based on\nthe community research status, trying to normalize the confusing naming system\nof inference attacks. Also, we analyze the pros and cons of each type of\ninference attack, their workflow, countermeasure, and how they interact with\nother attacks. In the end, we point out several promising directions for\nresearchers from a more comprehensive and novel perspective.\n","authors":["Feng Wu","Lei Cui","Shaowen Yao","Shui Yu"],"pdf_url":"https://arxiv.org/pdf/2406.02027v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02021v1","updated":"2024-06-04T07:00:14Z","published":"2024-06-04T07:00:14Z","title":"MetaMixer Is All You Need","summary":" Transformer, composed of self-attention and Feed-Forward Network, has\nrevolutionized the landscape of network design across various vision tasks. FFN\nis a versatile operator seamlessly integrated into nearly all AI models to\neffectively harness rich representations. Recent works also show that FFN\nfunctions like key-value memories. Thus, akin to the query-key-value mechanism\nwithin self-attention, FFN can be viewed as a memory network, where the input\nserves as query and the two projection weights operate as keys and values,\nrespectively. We hypothesize that the importance lies in query-key-value\nframework itself rather than in self-attention. To verify this, we propose\nconverting self-attention into a more FFN-like efficient token mixer with only\nconvolutions while retaining query-key-value framework, namely FFNification.\nSpecifically, FFNification replaces query-key and attention coefficient-value\ninteractions with large kernel convolutions and adopts GELU activation function\ninstead of softmax. The derived token mixer, FFNified attention, serves as\nkey-value memories for detecting locally distributed spatial patterns, and\noperates in the opposite dimension to the ConvNeXt block within each\ncorresponding sub-operation of the query-key-value framework. Building upon the\nabove two modules, we present a family of Fast-Forward Networks. Our FFNet\nachieves remarkable performance improvements over previous state-of-the-art\nmethods across a wide range of tasks. The strong and general performance of our\nproposed method validates our hypothesis and leads us to introduce MetaMixer, a\ngeneral mixer architecture that does not specify sub-operations within the\nquery-key-value framework. We show that using only simple operations like\nconvolution and GELU in the MetaMixer can achieve superior performance.\n","authors":["Seokju Yun","Dongheon Lee","Youngmin Ro"],"pdf_url":"https://arxiv.org/pdf/2406.02021v1.pdf","comment":"Code: https://github.com/ysj9909/FFNet"},{"id":"http://arxiv.org/abs/2402.11874v4","updated":"2024-06-04T06:56:43Z","published":"2024-02-19T06:32:23Z","title":"Language-guided Image Reflection Separation","summary":" This paper studies the problem of language-guided reflection separation,\nwhich aims at addressing the ill-posed reflection separation problem by\nintroducing language descriptions to provide layer content. We propose a\nunified framework to solve this problem, which leverages the cross-attention\nmechanism with contrastive learning strategies to construct the correspondence\nbetween language descriptions and image layers. A gated network design and a\nrandomized training strategy are employed to tackle the recognizable layer\nambiguity. The effectiveness of the proposed method is validated by the\nsignificant performance advantage over existing reflection separation methods\non both quantitative and qualitative comparisons.\n","authors":["Haofeng Zhong","Yuchen Hong","Shuchen Weng","Jinxiu Liang","Boxin Shi"],"pdf_url":"https://arxiv.org/pdf/2402.11874v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19957v2","updated":"2024-06-04T06:56:39Z","published":"2024-05-30T11:23:01Z","title":"PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting","summary":" As text-conditioned diffusion models (DMs) achieve breakthroughs in image,\nvideo, and 3D generation, the research community's focus has shifted to the\nmore challenging task of text-to-4D synthesis, which introduces a temporal\ndimension to generate dynamic 3D objects. In this context, we identify Score\nDistillation Sampling (SDS), a widely used technique for text-to-3D synthesis,\nas a significant hindrance to text-to-4D performance due to its Janus-faced and\ntexture-unrealistic problems coupled with high computational costs. In this\npaper, we propose \\textbf{P}ixel-\\textbf{L}evel \\textbf{A}lignments for\nText-to-\\textbf{4D} Gaussian Splatting (\\textbf{PLA4D}), a novel method that\nutilizes text-to-video frames as explicit pixel alignment targets to generate\nstatic 3D objects and inject motion into them. Specifically, we introduce Focal\nAlignment to calibrate camera poses for rendering and GS-Mesh Contrastive\nLearning to distill geometry priors from rendered image contrasts at the pixel\nlevel. Additionally, we develop Motion Alignment using a deformation network to\ndrive changes in Gaussians and implement Reference Refinement for smooth 4D\nobject surfaces. These techniques enable 4D Gaussian Splatting to align\ngeometry, texture, and motion with generated videos at the pixel level.\nCompared to previous methods, PLA4D produces synthesized outputs with better\ntexture details in less time and effectively mitigates the Janus-faced problem.\nPLA4D is fully implemented using open-source models, offering an accessible,\nuser-friendly, and promising direction for 4D digital content creation. Our\nproject page: https://github.com/MiaoQiaowei/PLA4D.github.io.\n","authors":["Qiaowei Miao","Yawei Luo","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2405.19957v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01996v1","updated":"2024-06-04T06:27:48Z","published":"2024-06-04T06:27:48Z","title":"Bayesian Mesh Optimization for Graph Neural Networks to Enhance\n Engineering Performance Prediction","summary":" In engineering design, surrogate models are widely employed to replace\ncomputationally expensive simulations by leveraging design variables and\ngeometric parameters from computer-aided design (CAD) models. However, these\nmodels often lose critical information when simplified to lower dimensions and\nface challenges in parameter definition, especially with the complex 3D shapes\ncommonly found in industrial datasets. To address these limitations, we propose\na Bayesian graph neural network (GNN) framework for a 3D deep-learning-based\nsurrogate model that predicts engineering performance by directly learning\ngeometric features from CAD using mesh representation. Our framework determines\nthe optimal size of mesh elements through Bayesian optimization, resulting in a\nhigh-accuracy surrogate model. Additionally, it effectively handles the\nirregular and complex structures of 3D CADs, which differ significantly from\nthe regular and uniform pixel structures of 2D images typically used in deep\nlearning. Experimental results demonstrate that the quality of the mesh\nsignificantly impacts the prediction accuracy of the surrogate model, with an\noptimally sized mesh achieving superior performance. We compare the performance\nof models based on various 3D representations such as voxel, point cloud, and\ngraph, and evaluate the computational costs of Monte Carlo simulation and\nBayesian optimization methods to find the optimal mesh size. We anticipate that\nour proposed framework has the potential to be applied to mesh-based\nsimulations across various engineering fields, leveraging physics-based\ninformation commonly used in computer-aided engineering.\n","authors":["Jangseop Park","Namwoo Kang"],"pdf_url":"https://arxiv.org/pdf/2406.01996v1.pdf","comment":"17 pages, 8 figures, 3 tables"},{"id":"http://arxiv.org/abs/2406.01994v1","updated":"2024-06-04T06:24:07Z","published":"2024-06-04T06:24:07Z","title":"3D Imaging of Complex Specular Surfaces by Fusing Polarimetric and\n Deflectometric Information","summary":" Accurate and fast 3D imaging of specular surfaces still poses major\nchallenges for state-of-the-art optical measurement principles. Frequently used\nmethods, such as phase-measuring deflectometry (PMD) or shape-from-polarization\n(SfP), rely on strong assumptions about the measured objects, limiting their\ngeneralizability in broader application areas like medical imaging, industrial\ninspection, virtual reality, or cultural heritage analysis. In this paper, we\nintroduce a measurement principle that utilizes a novel technique to\neffectively encode and decode the information contained in a light field\nreflected off a specular surface. We combine polarization cues from SfP with\ngeometric information obtained from PMD to resolve all arising ambiguities in\nthe 3D measurement. Moreover, our approach removes the unrealistic orthographic\nimaging assumption for SfP, which significantly improves the respective\nresults. We showcase our new technique by demonstrating single-shot and\nmulti-shot measurements on complex-shaped specular surfaces, displaying an\nevaluated accuracy of surface normals below $0.6^\\circ$.\n","authors":["Jiazhang Wang","Oliver Cossairt","Florian Willomitzer"],"pdf_url":"https://arxiv.org/pdf/2406.01994v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01993v1","updated":"2024-06-04T06:23:27Z","published":"2024-06-04T06:23:27Z","title":"Choroidal Vessel Segmentation on Indocyanine Green Angiography Images\n via Human-in-the-Loop Labeling","summary":" Human-in-the-loop (HITL) strategy has been recently introduced into the field\nof medical image processing. Indocyanine green angiography (ICGA) stands as a\nwell-established examination for visualizing choroidal vasculature and\ndetecting chorioretinal diseases. However, the intricate nature of choroidal\nvascular networks makes large-scale manual segmentation of ICGA images\nchallenging. Thus, the study aims to develop a high-precision choroidal vessel\nsegmentation model with limited labor using HITL framework. We utilized a\nmulti-source ICGA dataset, including 55 degree view and ultra-widefield ICGA\n(UWF-ICGA) images for model development. The choroidal vessel network was\npre-segmented by a pre-trained vessel segmentation model, and then manually\nmodified by two ophthalmologists. Choroidal vascular diameter, density,\ncomplexity, tortuosity, and branching angle were automatically quantified based\non the segmentation. We finally conducted four cycles of HITL. One hundred and\nfifty 55 degree view ICGA images were used for the first three cycles (50\nimages per cycle), and twenty UWF-ICGA images for the last cycle. The average\ntime needed to manually correct a pre-segmented ICGA image per cycle reduced\nfrom 20 minutes to 1 minute. High segmentation accuracy has been achieved on\nboth 55 degree view ICGA and UWF-ICGA images. Additionally, the\nmulti-dimensional choroidal vascular parameters were significantly associated\nwith various chorioretinal diseases. Our study not only demonstrated the\nfeasibility of the HITL strategy in improving segmentation performance with\nreduced manual labeling, but also innovatively introduced several risk\npredictors for choroidal abnormalities.\n","authors":["Ruoyu Chen","Ziwei Zhao","Mayinuer Yusufu","Xianwen Shang","Danli Shi","Mingguang He"],"pdf_url":"https://arxiv.org/pdf/2406.01993v1.pdf","comment":"25 pages,4 figures"},{"id":"http://arxiv.org/abs/2406.01987v1","updated":"2024-06-04T06:07:24Z","published":"2024-06-04T06:07:24Z","title":"Dealing with All-stage Missing Modality: Towards A Universal Model with\n Robust Reconstruction and Personalization","summary":" Addressing missing modalities presents a critical challenge in multimodal\nlearning. Current approaches focus on developing models that can handle\nmodality-incomplete inputs during inference, assuming that the full set of\nmodalities are available for all the data during training. This reliance on\nfull-modality data for training limits the use of abundant modality-incomplete\nsamples that are often encountered in practical settings. In this paper, we\npropose a robust universal model with modality reconstruction and model\npersonalization, which can effectively tackle the missing modality at both\ntraining and testing stages. Our method leverages a multimodal masked\nautoencoder to reconstruct the missing modality and masked patches\nsimultaneously, incorporating an innovative distribution approximation\nmechanism to fully utilize both modality-complete and modality-incomplete data.\nThe reconstructed modalities then contributes to our designed data-model\nco-distillation scheme to guide the model learning in the presence of missing\nmodalities. Moreover, we propose a CLIP-driven hyper-network to personalize\npartial model parameters, enabling the model to adapt to each distinct missing\nmodality scenario. Our method has been extensively validated on two brain tumor\nsegmentation benchmarks. Experimental results demonstrate the promising\nperformance of our method, which consistently exceeds previous state-of-the-art\napproaches under the all-stage missing modality settings with different missing\nratios. Code will be available.\n","authors":["Yunpeng Zhao","Cheng Chen","Qing You Pang","Quanzheng Li","Carol Tang","Beng-Ti Ang","Yueming Jin"],"pdf_url":"https://arxiv.org/pdf/2406.01987v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01428v2","updated":"2024-06-04T05:39:15Z","published":"2024-06-03T15:26:06Z","title":"Superhuman performance in urology board questions by an explainable\n large language model enabled for context integration of the European\n Association of Urology guidelines: the UroBot study","summary":" Large Language Models (LLMs) are revolutionizing medical Question-Answering\n(medQA) through extensive use of medical literature. However, their performance\nis often hampered by outdated training data and a lack of explainability, which\nlimits clinical applicability. This study aimed to create and assess UroBot, a\nurology-specialized chatbot, by comparing it with state-of-the-art models and\nthe performance of urologists on urological board questions, ensuring full\nclinician-verifiability. UroBot was developed using OpenAI's GPT-3.5, GPT-4,\nand GPT-4o models, employing retrieval-augmented generation (RAG) and the\nlatest 2023 guidelines from the European Association of Urology (EAU). The\nevaluation included ten runs of 200 European Board of Urology (EBU) In-Service\nAssessment (ISA) questions, with performance assessed by the mean Rate of\nCorrect Answers (RoCA). UroBot-4o achieved an average RoCA of 88.4%, surpassing\nGPT-4o by 10.8%, with a score of 77.6%. It was also clinician-verifiable and\nexhibited the highest run agreement as indicated by Fleiss' Kappa (k = 0.979).\nBy comparison, the average performance of urologists on board questions, as\nreported in the literature, is 68.7%. UroBot's clinician-verifiable nature and\nsuperior accuracy compared to both existing models and urologists on board\nquestions highlight its potential for clinical integration. The study also\nprovides the necessary code and instructions for further development of UroBot.\n","authors":["Martin J. Hetz","Nicolas Carl","Sarah Haggenmüller","Christoph Wies","Maurice Stephan Michel","Frederik Wessels","Titus J. Brinker"],"pdf_url":"https://arxiv.org/pdf/2406.01428v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01975v1","updated":"2024-06-04T05:19:32Z","published":"2024-06-04T05:19:32Z","title":"Can Dense Connectivity Benefit Outlier Detection? An Odyssey with NAS","summary":" Recent advances in Out-of-Distribution (OOD) Detection is the driving force\nbehind safe and reliable deployment of Convolutional Neural Networks (CNNs) in\nreal world applications. However, existing studies focus on OOD detection\nthrough confidence score and deep generative model-based methods, without\nconsidering the impact of DNN structures, especially dense connectivity in\narchitecture fabrications. In addition, existing outlier detection approaches\nexhibit high variance in generalization performance, lacking stability and\nconfidence in evaluating and ranking different outlier detectors. In this work,\nwe propose a novel paradigm, Dense Connectivity Search of Outlier Detector\n(DCSOD), that automatically explore the dense connectivity of CNN architectures\non near-OOD detection task using Neural Architecture Search (NAS). We introduce\na hierarchical search space containing versatile convolution operators and\ndense connectivity, allowing a flexible exploration of CNN architectures with\ndiverse connectivity patterns. To improve the quality of evaluation on OOD\ndetection during search, we propose evolving distillation based on our\nmulti-view feature learning explanation. Evolving distillation stabilizes\ntraining for OOD detection evaluation, thus improves the quality of search. We\nthoroughly examine DCSOD on CIFAR benchmarks under OOD detection protocol.\nExperimental results show that DCSOD achieve remarkable performance over widely\nused architectures and previous NAS baselines. Notably, DCSOD achieves\nstate-of-the-art (SOTA) performance on CIFAR benchmark, with AUROC improvement\nof $\\sim$1.0%.\n","authors":["Hao Fu","Tunhou Zhang","Hai Li","Yiran Chen"],"pdf_url":"https://arxiv.org/pdf/2406.01975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16451v2","updated":"2024-06-04T05:15:50Z","published":"2024-05-26T06:42:06Z","title":"From Macro to Micro: Boosting micro-expression recognition via\n pre-training on macro-expression videos","summary":" Micro-expression recognition (MER) has drawn increasing attention in recent\nyears due to its potential applications in intelligent medical and lie\ndetection. However, the shortage of annotated data has been the major obstacle\nto further improve deep-learning based MER methods. Intuitively, utilizing\nsufficient macro-expression data to promote MER performance seems to be a\nfeasible solution. However, the facial patterns of macro-expressions and\nmicro-expressions are significantly different, which makes naive transfer\nlearning methods difficult to deploy directly. To tacle this issue, we propose\na generalized transfer learning paradigm, called \\textbf{MA}cro-expression\n\\textbf{TO} \\textbf{MI}cro-expression (MA2MI). Under our paradigm, networks can\nlearns the ability to represent subtle facial movement by reconstructing future\nframes. In addition, we also propose a two-branch micro-action network\n(MIACNet) to decouple facial position features and facial action features,\nwhich can help the network more accurately locate facial action locations.\nExtensive experiments on three popular MER benchmarks demonstrate the\nsuperiority of our method.\n","authors":["Hanting Li","Hongjing Niu","Feng Zhao"],"pdf_url":"https://arxiv.org/pdf/2405.16451v2.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2406.01970v1","updated":"2024-06-04T05:06:00Z","published":"2024-06-04T05:06:00Z","title":"The Crystal Ball Hypothesis in diffusion models: Anticipating object\n positions from initial noise","summary":" Diffusion models have achieved remarkable success in text-to-image generation\ntasks; however, the role of initial noise has been rarely explored. In this\nstudy, we identify specific regions within the initial noise image, termed\ntrigger patches, that play a key role for object generation in the resulting\nimages. Notably, these patches are ``universal'' and can be generalized across\nvarious positions, seeds, and prompts. To be specific, extracting these patches\nfrom one noise and injecting them into another noise leads to object generation\nin targeted areas. We identify these patches by analyzing the dispersion of\nobject bounding boxes across generated images, leading to the development of a\nposterior analysis technique. Furthermore, we create a dataset consisting of\nGaussian noises labeled with bounding boxes corresponding to the objects\nappearing in the generated images and train a detector that identifies these\npatches from the initial noise. To explain the formation of these patches, we\nreveal that they are outliers in Gaussian noise, and follow distinct\ndistributions through two-sample tests. Finally, we find the misalignment\nbetween prompts and the trigger patch patterns can result in unsuccessful image\ngenerations. The study proposes a reject-sampling strategy to obtain optimal\nnoise, aiming to improve prompt adherence and positional diversity in image\ngeneration.\n","authors":["Yuanhao Ban","Ruochen Wang","Tianyi Zhou","Boqing Gong","Cho-Jui Hsieh","Minhao Cheng"],"pdf_url":"https://arxiv.org/pdf/2406.01970v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01961v1","updated":"2024-06-04T04:43:58Z","published":"2024-06-04T04:43:58Z","title":"Exploring Real World Map Change Generalization of Prior-Informed HD Map\n Prediction Models","summary":" Building and maintaining High-Definition (HD) maps represents a large barrier\nto autonomous vehicle deployment. This, along with advances in modern online\nmap detection models, has sparked renewed interest in the online mapping\nproblem. However, effectively predicting online maps at a high enough quality\nto enable safe, driverless deployments remains a significant challenge. Recent\nwork on these models proposes training robust online mapping systems using low\nquality map priors with synthetic perturbations in an attempt to simulate\nout-of-date HD map priors. In this paper, we investigate how models trained on\nthese synthetically perturbed map priors generalize to performance on\ndeployment-scale, real world map changes. We present a large-scale experimental\nstudy to determine which synthetic perturbations are most useful in\ngeneralizing to real world HD map changes, evaluated using multiple years of\nreal-world autonomous driving data. We show there is still a substantial\nsim2real gap between synthetic prior perturbations and observed real-world\nchanges, which limits the utility of current prior-informed HD map prediction\nmodels.\n","authors":["Samuel M. Bateman","Ning Xu","H. Charles Zhao","Yael Ben Shalom","Vince Gong","Greg Long","Will Maddern"],"pdf_url":"https://arxiv.org/pdf/2406.01961v1.pdf","comment":"Accepted to CVPR 2024, Workshop on Autonomous Driving"},{"id":"http://arxiv.org/abs/2406.00571v2","updated":"2024-06-04T04:36:22Z","published":"2024-06-01T22:58:08Z","title":"An Image Segmentation Model with Transformed Total Variation","summary":" Based on transformed $\\ell_1$ regularization, transformed total variation\n(TTV) has robust image recovery that is competitive with other nonconvex total\nvariation (TV) regularizers, such as TV$^p$, $0