From b58c3478bf2d78edfc8f706e5f9487e2ab35ee4b Mon Sep 17 00:00:00 2001 From: AlongWY Date: Sun, 1 Sep 2024 05:26:10 +0000 Subject: [PATCH] deploy: 72066be21ad467c8ffc76b74c152b38decf3f0ac --- .nojekyll | 0 cache.json | 1 + favicon.ico | Bin 0 -> 15086 bytes index.css | 355 + index.html | 64390 ++++++++++++++++++++++++++++++++++++++++++++++++++ index.js | 39 + 6 files changed, 64785 insertions(+) create mode 100644 .nojekyll create mode 100644 cache.json create mode 100644 favicon.ico create mode 100644 index.css create mode 100644 index.html create mode 100644 index.js diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..9193b647 --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2024-08-26T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2408.14471v1","updated":"2024-08-26T17:59:01Z","published":"2024-08-26T17:59:01Z","title":"A Practitioner's Guide to Continual Multimodal Pretraining","summary":" Multimodal foundation models serve numerous applications at the intersection\nof vision and language. Still, despite being pretrained on extensive data, they\nbecome outdated over time. To keep models updated, research into continual\npretraining mainly explores scenarios with either (1) infrequent,\nindiscriminate updates on large-scale new data, or (2) frequent, sample-level\nupdates. However, practical model deployment often operates in the gap between\nthese two limit cases, as real-world applications often demand adaptation to\nspecific subdomains, tasks or concepts -- spread over the entire, varying life\ncycle of a model. In this work, we complement current perspectives on continual\npretraining through a research test bed as well as provide comprehensive\nguidance for effective continual model updates in such scenarios. We first\nintroduce FoMo-in-Flux, a continual multimodal pretraining benchmark with\nrealistic compute constraints and practical deployment requirements,\nconstructed over 63 datasets with diverse visual and semantic coverage. Using\nFoMo-in-Flux, we explore the complex landscape of practical continual\npretraining through multiple perspectives: (1) A data-centric investigation of\ndata mixtures and stream orderings that emulate real-world deployment\nsituations, (2) a method-centric investigation ranging from simple fine-tuning\nand traditional continual learning strategies to parameter-efficient updates\nand model merging, (3) meta learning rate schedules and mechanistic design\nchoices, and (4) the influence of model and compute scaling. Together, our\ninsights provide a practitioner's guide to continual multimodal pretraining for\nreal-world deployment. Our benchmark and code is here:\nhttps://github.com/ExplainableML/fomo_in_flux.\n","authors":["Karsten Roth","Vishaal Udandarao","Sebastian Dziadzio","Ameya Prabhu","Mehdi Cherti","Oriol Vinyals","Olivier Hénaff","Samuel Albanie","Matthias Bethge","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2408.14471v1.pdf","comment":"Technical Report. 52 pages"},{"id":"http://arxiv.org/abs/2408.14470v1","updated":"2024-08-26T17:58:53Z","published":"2024-08-26T17:58:53Z","title":"Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large\n Language Models","summary":" Fine-tuning large language models (LLMs) on downstream tasks requires\nsubstantial computational resources. 
A class of parameter-efficient fine-tuning\n(PEFT) aims to mitigate these computational challenges by selectively\nfine-tuning only a small fraction of the model parameters. Although\ncomputationally efficient, these techniques often fail to match the performance\nof fully fine-tuned models, primarily due to inherent biases introduced during\nparameter selection. Traditional selective PEFT techniques use a fixed set of\nparameters based on a predefined budget (a process also known as unmasking),\nfailing to capture parameter importance dynamically and often ending up\nexceeding the budget. We introduce $\\text{ID}^3$, a novel selective PEFT method\nthat calculates parameter importance continually and dynamically unmasks\nparameters by balancing exploration and exploitation in parameter selection.\nOur empirical study on 15 tasks spanning natural language understanding and\ngenerative tasks demonstrates the effectiveness of our method compared to\nfixed-masking-based PEFT techniques. We analytically show that $\\text{ID}^3$\nreduces the number of gradient updates by a factor of two, enhancing\ncomputational efficiency. $\\text{ID}^3$ is robust to random initialization of\nneurons and, therefore, can be seamlessly integrated into existing additive and\nreparametrization-based PEFT modules such as adapters and LoRA for dynamic\nsparsification.\n","authors":["Aradhye Agarwal","Suhas K Ramesh","Ayan Sengupta","Tanmoy Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2408.14470v1.pdf","comment":"15 pages, 7 tables, 9 figures"},{"id":"http://arxiv.org/abs/2408.14467v1","updated":"2024-08-26T17:58:17Z","published":"2024-08-26T17:58:17Z","title":"Explicit Inductive Inference using Large Language Models","summary":" Large Language Models (LLMs) are reported to hold undesirable attestation\nbias on inference tasks: when asked to predict if a premise P entails a\nhypothesis H, instead of considering H's conditional truthfulness entailed by\nP, LLMs tend to use the out-of-context truth label of H as a fragile proxy. In\nthis paper, we propose a pipeline that exploits this bias to do explicit\ninductive inference. Our pipeline uses an LLM to transform a premise into a set\nof attested alternatives, and then aggregate answers of the derived new\nentailment inquiries to support the original inference prediction. On a\ndirectional predicate entailment benchmark, we demonstrate that by applying\nthis simple pipeline, we can improve the overall performance of LLMs on\ninference and substantially alleviate the impact of their attestation bias.\n","authors":["Tianyang Liu","Tianyi Li","Liang Cheng","Mark Steedman"],"pdf_url":"https://arxiv.org/pdf/2408.14467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11796v2","updated":"2024-08-26T17:50:46Z","published":"2024-08-21T17:38:48Z","title":"LLM Pruning and Distillation in Practice: The Minitron Approach","summary":" We present a comprehensive report on compressing the Llama 3.1 8B and Mistral\nNeMo 12B models to 4B and 8B parameters, respectively, using pruning and\ndistillation. We explore two distinct pruning strategies: (1) depth pruning and\n(2) joint hidden/attention/MLP (width) pruning, and evaluate the results on\ncommon benchmarks from the LM Evaluation Harness. The models are then aligned\nwith NeMo Aligner and tested in instruct-tuned versions. This approach produces\na compelling 4B model from Llama 3.1 8B and a state-of-the-art\nMistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo\n12B. 
We found that with no access to the original data, it is beneficial to\nslightly fine-tune teacher models on the distillation dataset. We open-source\nour base model weights on Hugging Face with a permissive license.\n","authors":["Sharath Turuvekere Sreenivas","Saurav Muralidharan","Raviraj Joshi","Marcin Chochowski","Mostofa Patwary","Mohammad Shoeybi","Bryan Catanzaro","Jan Kautz","Pavlo Molchanov"],"pdf_url":"https://arxiv.org/pdf/2408.11796v2.pdf","comment":"v2: Added missing references. Cleaned up runtime performance section"},{"id":"http://arxiv.org/abs/2306.13840v3","updated":"2024-08-26T17:34:44Z","published":"2023-06-24T02:25:56Z","title":"Beyond Scale: The Diversity Coefficient as a Data Quality Metric for\n Variability in Natural Language Data","summary":" Current trends in pre-training Large Language Models (LLMs) primarily focus\non the scaling of model and dataset size. While the quality of pre-training\ndata is considered an important factor for training powerful LLMs, it remains a\nnebulous concept that has not been rigorously characterized. To this end, we\npropose a formalization of one key aspect of data quality -- measuring the\nvariability of natural language data -- specifically via a measure we call the\ndiversity coefficient. Our empirical analysis shows that the proposed diversity\ncoefficient aligns with the intuitive properties of diversity and variability,\ne.g., it increases as the number of latent concepts increases. Then, we measure\nthe diversity coefficient of publicly available pre-training datasets and\ndemonstrate that their formal diversity is high compared to theoretical lower\nand upper bounds. Finally, we conduct a comprehensive set of controlled\ninterventional experiments with GPT-2 and LLaMAv2 that demonstrate the\ndiversity coefficient of pre-training data characterizes useful aspects of\ndownstream model evaluation performance -- totaling 44 models of various sizes\n(51M to 7B parameters). We conclude that our formal notion of diversity is an\nimportant aspect of data quality that captures variability and causally leads\nto improved evaluation performance.\n","authors":["Brando Miranda","Alycia Lee","Sudharsan Sundar","Allison Casasola","Sanmi Koyejo"],"pdf_url":"https://arxiv.org/pdf/2306.13840v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.10468v3","updated":"2024-08-26T17:28:23Z","published":"2024-08-20T00:40:49Z","title":"Tracing Privacy Leakage of Language Models to Training Data via Adjusted\n Influence Functions","summary":" The responses generated by Large Language Models (LLMs) can include sensitive\ninformation from individuals and organizations, leading to potential privacy\nleakage. This work implements Influence Functions (IFs) to trace privacy\nleakage back to the training data, thereby mitigating privacy concerns of\nLanguage Models (LMs). However, we notice that current IFs struggle to\naccurately estimate the influence of tokens with large gradient norms,\npotentially overestimating their influence. When tracing the most influential\nsamples, this leads to frequently tracing back to samples with large gradient\nnorm tokens, overshadowing the actual most influential samples even if their\ninfluences are well estimated. To address this issue, we propose Heuristically\nAdjusted IF (HAIF), which reduces the weight of tokens with large gradient\nnorms, thereby significantly improving the accuracy of tracing the most\ninfluential samples. 
To establish easily obtained groundtruth for tracing\nprivacy leakage, we construct two datasets, PII-E and PII-CR, representing two\ndistinct scenarios: one with identical text in the model outputs and\npre-training data, and the other where models leverage their reasoning\nabilities to generate text divergent from pre-training data. HAIF significantly\nimproves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E\ndataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA\nIFs against various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs\non real-world pretraining data CLUECorpus2020, demonstrating strong robustness\nregardless prompt and response lengths.\n","authors":["Jinxin Liu","Zao Yang"],"pdf_url":"https://arxiv.org/pdf/2408.10468v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14438v1","updated":"2024-08-26T17:25:16Z","published":"2024-08-26T17:25:16Z","title":"Evaluating Large Language Models on Spatial Tasks: A Multi-Task\n Benchmarking Study","summary":" The advent of large language models such as ChatGPT, Gemini, and others has\nunderscored the importance of evaluating their diverse capabilities, ranging\nfrom natural language understanding to code generation. However, their\nperformance on spatial tasks has not been comprehensively assessed. This study\naddresses this gap by introducing a novel multi-task spatial evaluation\ndataset, designed to systematically explore and compare the performance of\nseveral advanced models on spatial tasks. The dataset encompasses twelve\ndistinct task types, including spatial understanding and path planning, each\nwith verified, accurate answers. We evaluated multiple models, including\nOpenAI's gpt-3.5-turbo, gpt-4o, and ZhipuAI's glm-4, through a two-phase\ntesting approach. Initially, we conducted zero-shot testing, followed by\ncategorizing the dataset by difficulty and performing prompt tuning tests.\nResults indicate that gpt-4o achieved the highest overall accuracy in the first\nphase, with an average of 71.3%. Although moonshot-v1-8k slightly\nunderperformed overall, it surpassed gpt-4o in place name recognition tasks.\nThe study also highlights the impact of prompt strategies on model performance\nin specific tasks. For example, the Chain-of-Thought (COT) strategy increased\ngpt-4o's accuracy in path planning from 12.4% to 87.5%, while a one-shot\nstrategy enhanced moonshot-v1-8k's accuracy in mapping tasks from 10.1% to\n76.3%.\n","authors":["Liuchang Xu Shuo Zhao","Qingming Lin","Luyao Chen","Qianqian Luo","Sensen Wu","Xinyue Ye","Hailin Feng","Zhenhong Du"],"pdf_url":"https://arxiv.org/pdf/2408.14438v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14419v1","updated":"2024-08-26T17:04:23Z","published":"2024-08-26T17:04:23Z","title":"CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language\n Models","summary":" We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large\nlanguage models. CHARTOM consists of specially designed data visualizing\ncharts. Given a chart, a language model needs to not only correctly comprehend\nthe chart (the FACT question) but also judge if the chart will be misleading to\na human reader (the MIND question). Both questions have significant societal\nbenefits. 
We detail the construction of the CHARTOM benchmark including its\ncalibration on human performance.\n","authors":["Shubham Bharti","Shiyun Cheng","Jihyun Rho","Martina Rao","Xiaojin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.14419v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14418v1","updated":"2024-08-26T17:04:00Z","published":"2024-08-26T17:04:00Z","title":"MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR\n Errors with LLM-generated Synthetic Dialogues","summary":" Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech\ninto text, yet the errors they introduce can significantly degrade the\nperformance of downstream tasks like summarization. This issue is particularly\npronounced in clinical dialogue summarization, a low-resource domain where\nsupervised data for fine-tuning is scarce, necessitating the use of ASR models\nas black-box solutions. Employing conventional data augmentation for enhancing\nthe noise robustness of summarization models is not feasible either due to the\nunavailability of sufficient medical dialogue audio recordings and\ncorresponding ASR transcripts. To address this challenge, we propose MEDSAGE,\nan approach for generating synthetic samples for data augmentation using Large\nLanguage Models (LLMs). Specifically, we leverage the in-context learning\ncapabilities of LLMs and instruct them to generate ASR-like errors based on a\nfew available medical dialogue examples with audio recordings. Experimental\nresults show that LLMs can effectively model ASR noise, and incorporating this\nnoisy data into the training process significantly improves the robustness and\naccuracy of medical dialogue summarization systems. This approach addresses the\nchallenges of noisy ASR outputs in critical applications, offering a robust\nsolution to enhance the reliability of clinical dialogue summarization.\n","authors":["Kuluhan Binici","Abhinav Ramesh Kashyap","Viktor Schlegel","Andy T. Liu","Vijay Prakash Dwivedi","Thanh-Tung Nguyen","Xiaoxue Gao","Nancy F. Chen","Stefan Winkler"],"pdf_url":"https://arxiv.org/pdf/2408.14418v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.05720v2","updated":"2024-08-26T16:48:08Z","published":"2024-03-08T23:17:55Z","title":"A Dataset and Benchmark for Hospital Course Summarization with Adapted\n Large Language Models","summary":" Brief hospital course (BHC) summaries are clinical documents that summarize a\npatient's hospital stay. While large language models (LLMs) depict remarkable\ncapabilities in automating real-world tasks, their capabilities for healthcare\napplications such as synthesizing BHCs from clinical notes have not been shown.\nWe introduce a novel pre-processed dataset, the MIMIC-IV-BHC, encapsulating\nclinical note and brief hospital course (BHC) pairs to adapt LLMs for BHC\nsynthesis. Furthermore, we introduce a benchmark of the summarization\nperformance of two general-purpose LLMs and three healthcare-adapted LLMs.\n Using clinical notes as input, we apply prompting-based (using in-context\nlearning) and fine-tuning-based adaptation strategies to three open-source LLMs\n(Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5,\nGPT-4). We evaluate these LLMs across multiple context-length inputs using\nnatural language similarity metrics. 
We further conduct a clinical study with\nfive clinicians, comparing clinician-written and LLM-generated BHCs across 30\nsamples, focusing on their potential to enhance clinical decision-making\nthrough improved summary quality. We observe that the Llama2-13B fine-tuned LLM\noutperforms other domain-adapted models given quantitative evaluation metrics\nof BLEU and BERT-Score. GPT-4 with in-context learning shows more robustness to\nincreasing context lengths of clinical note inputs than fine-tuned Llama2-13B.\nDespite comparable quantitative metrics, the reader study depicts a significant\npreference for summaries generated by GPT-4 with in-context learning compared\nto both Llama2-13B fine-tuned summaries and the original summaries,\nhighlighting the need for qualitative clinical evaluation.\n","authors":["Asad Aali","Dave Van Veen","Yamin Ishraq Arefeen","Jason Hom","Christian Bluethgen","Eduardo Pontes Reis","Sergios Gatidis","Namuun Clifford","Joseph Daws","Arash S. Tehrani","Jangwon Kim","Akshay S. Chaudhari"],"pdf_url":"https://arxiv.org/pdf/2403.05720v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14398v1","updated":"2024-08-26T16:29:13Z","published":"2024-08-26T16:29:13Z","title":"Language-specific Calibration for Pruning Multilingual Language Models","summary":" Recent advances in large language model (LLM) pruning have shown\nstate-of-the-art compression results in post-training and retraining-free\nsettings while maintaining high predictive performance. However, such research\nmainly considers calibrating pruning using English text, despite the\nmultilingual nature of modern LLMs and their frequent uses in non-English\nlanguages. In this paper, we set out to explore effective strategies for\ncalibrating the pruning of multilingual language models. We present the first\ncomprehensive empirical study, comparing different calibration languages for\npruning multilingual models across diverse tasks, models, and state-of-the-art\npruning techniques. Our results present practical suggestions, for example,\ncalibrating in the target language can efficiently yield lower perplexity, but\ndoes not necessarily benefit downstream tasks. Our further analysis experiments\nunveil that calibration in the target language mainly contributes to preserving\nlanguage-specific features related to fluency and coherence, but might not\ncontribute to capturing language-agnostic features such as language\nunderstanding and reasoning. Last, we provide practical recommendations for\nfuture practitioners.\n","authors":["Simon Kurz","Zhixue Zhao","Jian-Jia Chen","Lucie Flek"],"pdf_url":"https://arxiv.org/pdf/2408.14398v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14397v1","updated":"2024-08-26T16:28:56Z","published":"2024-08-26T16:28:56Z","title":"Uncovering Knowledge Gaps in Radiology Report Generation Models through\n Knowledge Graphs","summary":" Recent advancements in artificial intelligence have significantly improved\nthe automatic generation of radiology reports. However, existing evaluation\nmethods fail to reveal the models' understanding of radiological images and\ntheir capacity to achieve human-level granularity in descriptions. To bridge\nthis gap, we introduce a system, named ReXKG, which extracts structured\ninformation from processed reports to construct a comprehensive radiology\nknowledge graph. 
We then propose three metrics to evaluate the similarity of\nnodes (ReXKG-NSC), distribution of edges (ReXKG-AMS), and coverage of subgraphs\n(ReXKG-SCS) across various knowledge graphs. We conduct an in-depth comparative\nanalysis of AI-generated and human-written radiology reports, assessing the\nperformance of both specialist and generalist models. Our study provides a\ndeeper understanding of the capabilities and limitations of current AI models\nin radiology report generation, offering valuable insights for improving model\nperformance and clinical applicability.\n","authors":["Xiaoman Zhang","Julián N. Acosta","Hong-Yu Zhou","Pranav Rajpurkar"],"pdf_url":"https://arxiv.org/pdf/2408.14397v1.pdf","comment":"Code is available at: https://github.com/rajpurkarlab/ReXKG"},{"id":"http://arxiv.org/abs/2408.14380v1","updated":"2024-08-26T16:00:41Z","published":"2024-08-26T16:00:41Z","title":"Probing Causality Manipulation of Large Language Models","summary":" Large language models (LLMs) have shown various ability on natural language\nprocessing, including problems about causality. It is not intuitive for LLMs to\ncommand causality, since pretrained models usually work on statistical\nassociations, and do not focus on causes and effects in sentences. So that\nprobing internal manipulation of causality is necessary for LLMs. This paper\nproposes a novel approach to probe causality manipulation hierarchically, by\nproviding different shortcuts to models and observe behaviors. We exploit\nretrieval augmented generation (RAG) and in-context learning (ICL) for models\non a designed causality classification task. We conduct experiments on\nmainstream LLMs, including GPT-4 and some smaller and domain-specific models.\nOur results suggest that LLMs can detect entities related to causality and\nrecognize direct causal relationships. However, LLMs lack specialized cognition\nfor causality, merely treating them as part of the global semantic of the\nsentence.\n","authors":["Chenyang Zhang","Haibo Tong","Bin Zhang","Dongyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.14380v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.19178v2","updated":"2024-08-26T15:59:03Z","published":"2024-04-30T01:02:15Z","title":"Revenge of the Fallen? Recurrent Models Match Transformers at Predicting\n Human Language Comprehension Metrics","summary":" Transformers have generally supplanted recurrent neural networks as the\ndominant architecture for both natural language processing tasks and for\nmodelling the effect of predictability on online human language comprehension.\nHowever, two recently developed recurrent model architectures, RWKV and Mamba,\nappear to perform natural language tasks comparably to or better than\ntransformers of equivalent scale. In this paper, we show that contemporary\nrecurrent models are now also able to match - and in some cases, exceed - the\nperformance of comparably sized transformers at modeling online human language\ncomprehension. This suggests that transformer language models are not uniquely\nsuited to this task, and opens up new directions for debates about the extent\nto which architectural features of language models make them better or worse\nmodels of human language comprehension.\n","authors":["James A. Michaelov","Catherine Arnett","Benjamin K. 
Bergen"],"pdf_url":"https://arxiv.org/pdf/2404.19178v2.pdf","comment":"Accepted at COLM 2024"},{"id":"http://arxiv.org/abs/2408.14354v1","updated":"2024-08-26T15:30:05Z","published":"2024-08-26T15:30:05Z","title":"SWE-bench-java: A GitHub Issue Resolving Benchmark for Java","summary":" GitHub issue resolving is a critical task in software engineering, recently\ngaining significant attention in both industry and academia. Within this task,\nSWE-bench has been released to evaluate issue resolving capabilities of large\nlanguage models (LLMs), but has so far only focused on Python version. However,\nsupporting more programming languages is also important, as there is a strong\ndemand in industry. As a first step toward multilingual support, we have\ndeveloped a Java version of SWE-bench, called SWE-bench-java. We have publicly\nreleased the dataset, along with the corresponding Docker-based evaluation\nenvironment and leaderboard, which will be continuously maintained and updated\nin the coming months. To verify the reliability of SWE-bench-java, we implement\na classic method SWE-agent and test several powerful LLMs on it. As is well\nknown, developing a high-quality multi-lingual benchmark is time-consuming and\nlabor-intensive, so we welcome contributions through pull requests or\ncollaboration to accelerate its iteration and refinement, paving the way for\nfully automated programming.\n","authors":["Daoguang Zan","Zhirong Huang","Ailun Yu","Shaoxin Lin","Yifan Shi","Wei Liu","Dong Chen","Zongshuai Qi","Hao Yu","Lei Yu","Dezhi Ran","Muhan Zeng","Bo Shen","Pan Bian","Guangtai Liang","Bei Guan","Pengjie Huang","Tao Xie","Yongji Wang","Qianxiang Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14354v1.pdf","comment":"This work is in progress"},{"id":"http://arxiv.org/abs/2408.14352v1","updated":"2024-08-26T15:29:34Z","published":"2024-08-26T15:29:34Z","title":"Assessing Contamination in Large Language Models: Introducing the\n LogProber method","summary":" In machine learning, contamination refers to situations where testing data\nleak into the training set. The issue is particularly relevant for the\nevaluation of the performance of Large Language Models (LLMs), which are\ngenerally trained on gargantuan, and generally opaque, corpora of text scraped\nfrom the world wide web. Developing tools to detect contamination is therefore\ncrucial to be able to fairly and properly track the evolution of the\nperformance of LLMs. Most recent works in the field are not tailored to\nquantify contamination on short sequences of text like we find in psychology\nquestionnaires. In the present paper we introduce LogProber, a novel,\nefficient, algorithm that we show able to detect contamination using token\nprobability in given sentences. In the second part we investigate the\nlimitations of the method and discuss how different training methods can\ncontaminate models without leaving traces in the token probabilities.\n","authors":["Nicolas Yax","Pierre-Yves Oudeyer","Stefano Palminteri"],"pdf_url":"https://arxiv.org/pdf/2408.14352v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14340v1","updated":"2024-08-26T15:13:14Z","published":"2024-08-26T15:13:14Z","title":"Foundation Models for Music: A Survey","summary":" In recent years, foundation models (FMs) such as large language models (LLMs)\nand latent diffusion models (LDMs) have profoundly impacted diverse sectors,\nincluding music. 
This comprehensive review examines state-of-the-art (SOTA)\npre-trained models and foundation models in music, spanning from representation\nlearning, generative learning and multimodal learning. We first contextualise\nthe significance of music in various industries and trace the evolution of AI\nin music. By delineating the modalities targeted by foundation models, we\ndiscover many of the music representations are underexplored in FM development.\nThen, emphasis is placed on the lack of versatility of previous methods on\ndiverse music applications, along with the potential of FMs in music\nunderstanding, generation and medical application. By comprehensively exploring\nthe details of the model pre-training paradigm, architectural choices,\ntokenisation, finetuning methodologies and controllability, we emphasise the\nimportant topics that should have been well explored, like instruction tuning\nand in-context learning, scaling law and emergent ability, as well as\nlong-sequence modelling etc. A dedicated section presents insights into music\nagents, accompanied by a thorough analysis of datasets and evaluations\nessential for pre-training and downstream tasks. Finally, by underscoring the\nvital importance of ethical considerations, we advocate that following research\non FM for music should focus more on such issues as interpretability,\ntransparency, human responsibility, and copyright issues. The paper offers\ninsights into future challenges and trends on FMs for music, aiming to shape\nthe trajectory of human-AI collaboration in the music realm.\n","authors":["Yinghao Ma","Anders Øland","Anton Ragni","Bleiz MacSen Del Sette","Charalampos Saitis","Chris Donahue","Chenghua Lin","Christos Plachouras","Emmanouil Benetos","Elio Quinton","Elona Shatri","Fabio Morreale","Ge Zhang","György Fazekas","Gus Xia","Huan Zhang","Ilaria Manco","Jiawen Huang","Julien Guinot","Liwei Lin","Luca Marinelli","Max W. Y. Lam","Megha Sharma","Qiuqiang Kong","Roger B. Dannenberg","Ruibin Yuan","Shangda Wu","Shih-Lun Wu","Shuqi Dai","Shun Lei","Shiyin Kang","Simon Dixon","Wenhu Chen","Wehhao Huang","Xingjian Du","Xingwei Qu","Xu Tan","Yizhi Li","Zeyue Tian","Zhiyong Wu","Zhizheng Wu","Ziyang Ma","Ziyu Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14340v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16528v2","updated":"2024-08-26T14:59:53Z","published":"2024-05-26T11:29:57Z","title":"LoQT: Low Rank Adapters for Quantized Training","summary":" Training of large neural networks requires significant computational\nresources. Despite advances using low-rank adapters and quantization,\npretraining of models such as LLMs on consumer hardware has not been possible\nwithout model sharding, offloading during training, or per-layer gradient\nupdates. To address these limitations, we propose LoQT, a method for\nefficiently training quantized models. LoQT uses gradient-based tensor\nfactorization to initialize low-rank trainable weight matrices that are\nperiodically merged into quantized full-rank weight matrices. Our approach is\nsuitable for both pretraining and fine-tuning of models, which we demonstrate\nexperimentally for language modeling and downstream task adaptation. We find\nthat LoQT enables efficient training of models up to 7B parameters on a\nconsumer-grade 24GB GPU. We also demonstrate the feasibility of training a 13B\nparameter model using per-layer gradient updates on the same hardware.\n","authors":["Sebastian Loeschcke","Mads Toftrup","Michael J. 
Kastoryano","Serge Belongie","Vésteinn Snæbjarnarson"],"pdf_url":"https://arxiv.org/pdf/2405.16528v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14317v1","updated":"2024-08-26T14:45:03Z","published":"2024-08-26T14:45:03Z","title":"Claim Verification in the Age of Large Language Models: A Survey","summary":" The large and ever-increasing amount of data available on the Internet\ncoupled with the laborious task of manual claim and fact verification has\nsparked the interest in the development of automated claim verification\nsystems. Several deep learning and transformer-based models have been proposed\nfor this task over the years. With the introduction of Large Language Models\n(LLMs) and their superior performance in several NLP tasks, we have seen a\nsurge of LLM-based approaches to claim verification along with the use of novel\nmethods such as Retrieval Augmented Generation (RAG). In this survey, we\npresent a comprehensive account of recent claim verification frameworks using\nLLMs. We describe the different components of the claim verification pipeline\nused in these frameworks in detail including common approaches to retrieval,\nprompting, and fine-tuning. Finally, we describe publicly available English\ndatasets created for this task.\n","authors":["Alphaeus Dmonte","Roland Oruche","Marcos Zampieri","Prasad Calyam","Isabelle Augenstein"],"pdf_url":"https://arxiv.org/pdf/2408.14317v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14307v1","updated":"2024-08-26T14:38:19Z","published":"2024-08-26T14:38:19Z","title":"LLM-3D Print: Large Language Models To Monitor and Control 3D Printing","summary":" Industry 4.0 has revolutionized manufacturing by driving digitalization and\nshifting the paradigm toward additive manufacturing (AM). Fused Deposition\nModeling (FDM), a key AM technology, enables the creation of highly customized,\ncost-effective products with minimal material waste through layer-by-layer\nextrusion, posing a significant challenge to traditional subtractive methods.\nHowever, the susceptibility of material extrusion techniques to errors often\nrequires expert intervention to detect and mitigate defects that can severely\ncompromise product quality. While automated error detection and machine\nlearning models exist, their generalizability across diverse 3D printer setups,\nfirmware, and sensors is limited, and deep learning methods require extensive\nlabeled datasets, hindering scalability and adaptability. To address these\nchallenges, we present a process monitoring and control framework that\nleverages pre-trained Large Language Models (LLMs) alongside 3D printers to\ndetect and address printing defects. The LLM evaluates print quality by\nanalyzing images captured after each layer or print segment, identifying\nfailure modes and querying the printer for relevant parameters. It then\ngenerates and executes a corrective action plan. We validated the effectiveness\nof the proposed framework in identifying defects by comparing it against a\ncontrol group of engineers with diverse AM expertise. 
Our evaluation\ndemonstrated that LLM-based agents not only accurately identify common 3D\nprinting errors, such as inconsistent extrusion, stringing, warping, and layer\nadhesion, but also effectively determine the parameters causing these failures\nand autonomously correct them without any need for human intervention.\n","authors":["Yayati Jadhav","Peter Pak","Amir Barati Farimani"],"pdf_url":"https://arxiv.org/pdf/2408.14307v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.10594v3","updated":"2024-08-26T14:30:38Z","published":"2024-06-15T11:03:33Z","title":"BlockPruner: Fine-grained Pruning for Large Language Models","summary":" With the rapid growth in the size and complexity of large language models\n(LLMs), the costs associated with their training and inference have escalated\nsignificantly. Research indicates that certain layers in LLMs harbor\nsubstantial redundancy, and pruning these layers has minimal impact on the\noverall performance. While various layer pruning methods have been developed\nbased on this insight, they generally overlook the finer-grained redundancies\nwithin the layers themselves. In this paper, we delve deeper into the\narchitecture of LLMs and demonstrate that finer-grained pruning can be achieved\nby targeting redundancies in multi-head attention (MHA) and multi-layer\nperceptron (MLP) blocks. We propose a novel, training-free structured pruning\napproach called BlockPruner. Unlike existing layer pruning methods, BlockPruner\nsegments each Transformer layer into MHA and MLP blocks. It then assesses the\nimportance of these blocks using perplexity measures and applies a heuristic\nsearch for iterative pruning. We applied BlockPruner to LLMs of various sizes\nand architectures and validated its performance across a wide range of\ndownstream tasks. Experimental results show that BlockPruner achieves more\ngranular and effective pruning compared to state-of-the-art baselines.\n","authors":["Longguang Zhong","Fanqi Wan","Ruijun Chen","Xiaojun Quan","Liangzhi Li"],"pdf_url":"https://arxiv.org/pdf/2406.10594v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14283v1","updated":"2024-08-26T14:09:28Z","published":"2024-08-26T14:09:28Z","title":"Predictability and Causality in Spanish and English Natural Language\n Generation","summary":" In recent years, the field of Natural Language Generation (NLG) has been\nboosted by the recent advances in deep learning technologies. Nonetheless,\nthese new data-intensive methods introduce language-dependent disparities in\nNLG as the main training data sets are in English. Also, most neural NLG\nsystems use decoder-only (causal) transformer language models, which work well\nfor English, but were not designed with other languages in mind. In this work\nwe depart from the hypothesis that they may introduce generation bias in target\nlanguages with less rigid word ordering, subject omission, or different\nattachment preferences for relative clauses, so that for these target languages\nother language generation strategies may be more desirable. This paper first\ncompares causal and non-causal language modeling for English and Spanish, two\nlanguages with different grammatical structures and over 1.5 billion and 0.5\nbillion speakers, respectively. For this purpose, we define a novel metric of\naverage causal and non-causal context-conditioned entropy of the grammatical\ncategory distribution for both languages as an information-theoretic a priori\napproach. 
The evaluation of natural text sources (such as training data) in\nboth languages reveals lower average non-causal conditional entropy in Spanish\nand lower causal conditional entropy in English. According to this experiment,\nSpanish is more predictable than English given a non-causal context. Then, by\napplying a conditional relative entropy metric to text generation experiments,\nwe obtain as insights that the best performance is respectively achieved with\ncausal NLG in English, and with non-causal NLG in Spanish. These insights\nsupport further research in NLG in Spanish using bidirectional transformer\nlanguage models.\n","authors":["Andrea Busto-Castiñeira","Francisco J. González-Castaño","Silvia García-Méndez","Francisco de Arriba-Pérez"],"pdf_url":"https://arxiv.org/pdf/2408.14283v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09869v2","updated":"2024-08-26T13:55:59Z","published":"2024-08-19T10:20:06Z","title":"Docling Technical Report","summary":" This technical report introduces Docling, an easy to use, self-contained,\nMIT-licensed open-source package for PDF document conversion. It is powered by\nstate-of-the-art specialized AI models for layout analysis (DocLayNet) and\ntable structure recognition (TableFormer), and runs efficiently on commodity\nhardware in a small resource budget. The code interface allows for easy\nextensibility and addition of new features and models.\n","authors":["Christoph Auer","Maksym Lysak","Ahmed Nassar","Michele Dolfi","Nikolaos Livathinos","Panos Vagenas","Cesar Berrospi Ramis","Matteo Omenetti","Fabian Lindlbauer","Kasper Dinkla","Valery Weber","Lucas Morin","Ingmar Meijer","Viktor Kuropiatnyk","Peter W. J. Staar"],"pdf_url":"https://arxiv.org/pdf/2408.09869v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14277v1","updated":"2024-08-26T13:53:04Z","published":"2024-08-26T13:53:04Z","title":"Epidemic Information Extraction for Event-Based Surveillance using Large\n Language Models","summary":" This paper presents a novel approach to epidemic surveillance, leveraging the\npower of Artificial Intelligence and Large Language Models (LLMs) for effective\ninterpretation of unstructured big data sources, like the popular ProMED and\nWHO Disease Outbreak News. We explore several LLMs, evaluating their\ncapabilities in extracting valuable epidemic information. We further enhance\nthe capabilities of the LLMs using in-context learning, and test the\nperformance of an ensemble model incorporating multiple open-source LLMs. The\nfindings indicate that LLMs can significantly enhance the accuracy and\ntimeliness of epidemic modelling and forecasting, offering a promising tool for\nmanaging future pandemic events.\n","authors":["Sergio Consoli","Peter Markov","Nikolaos I. Stilianakis","Lorenzo Bertolini","Antonio Puertas Gallardo","Mario Ceresa"],"pdf_url":"https://arxiv.org/pdf/2408.14277v1.pdf","comment":"11 pages, 4 figures, Ninth International Congress on Information and\n Communication Technology (ICICT 2024)"},{"id":"http://arxiv.org/abs/2408.14262v1","updated":"2024-08-26T13:29:25Z","published":"2024-08-26T13:29:25Z","title":"Self-supervised Speech Representations Still Struggle with African\n American Vernacular English","summary":" Underperformance of ASR systems for speakers of African American Vernacular\nEnglish (AAVE) and other marginalized language varieties is a well-documented\nphenomenon, and one that reinforces the stigmatization of these varieties. 
We\ninvestigate whether or not the recent wave of Self-Supervised Learning (SSL)\nspeech models can close the gap in ASR performance between AAVE and Mainstream\nAmerican English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT,\nWavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two\nvarieties and find that these models perpetuate the bias in performance against\nAAVE. Additionally, the models have higher word error rates on utterances with\nmore phonological and morphosyntactic features of AAVE. Despite the success of\nSSL speech models in improving ASR for low resource varieties, SSL pre-training\nalone may not bridge the gap between AAVE and MAE. Our code is publicly\navailable at https://github.com/cmu-llab/s3m-aave.\n","authors":["Kalvin Chang","Yi-Hui Chou","Jiatong Shi","Hsuan-Ming Chen","Nicole Holliday","Odette Scharenborg","David R. Mortensen"],"pdf_url":"https://arxiv.org/pdf/2408.14262v1.pdf","comment":"INTERSPEECH 2024"},{"id":"http://arxiv.org/abs/2407.20584v2","updated":"2024-08-26T13:19:48Z","published":"2024-07-30T06:33:44Z","title":"Pruning Large Language Models with Semi-Structural Adaptive Sparse\n Training","summary":" The tremendous success of Large Language Models (LLMs) across various complex\ntasks relies heavily on their substantial scale, which raises challenges during\nmodel deployment due to their large memory consumption. Recently, numerous\nstudies have attempted to compress LLMs using one-shot pruning methods.\nHowever, these methods often experience considerable performance degradation on\ncomplex language understanding tasks, calling into question the feasibility of\npruning in LLMs. To address this issue, we propose a pruning pipeline for\nsemi-structured sparse models via retraining, termed Adaptive Sparse Trainer\n(AST). Unlike previous one-shot pruning methods, AST incrementally transforms\ndense models into sparse ones by applying decay to masked weights while\nallowing the model to adaptively select masks throughout the training process.\nFurthermore, we observe that using distillation with a dense model as the\nteacher can prevent the sparse model from falling into local optima and\naccelerate convergence. In addition, we incorporate extra well-initialized\nparameters to further enhance model performance with minimal increase in memory\nfootprint. AST can significantly enhance model performance, approaching the\nlevel of dense models. When applied to the LLaMA2-7B model, AST reduces the\nzero-shot accuracy gap between dense and semi-structured sparse models to 1.12%\nacross multiple zero-shot tasks, utilizing less than 0.4% of the pretraining\ntokens. Our work demonstrates the feasibility of deploying semi-structured\nsparse large language models and introduces a novel method for achieving highly\ncompressed models when combined with existing quantization techniques.\n","authors":["Weiyu Huang","Yuezhou Hu","Guohao Jian","Jun Zhu","Jianfei Chen"],"pdf_url":"https://arxiv.org/pdf/2407.20584v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14236v1","updated":"2024-08-26T12:50:27Z","published":"2024-08-26T12:50:27Z","title":"DSTI at LLMs4OL 2024 Task A: Intrinsic versus extrinsic knowledge for\n type classification","summary":" We introduce semantic towers, an extrinsic knowledge representation method,\nand compare it to intrinsic knowledge in large language models for ontology\nlearning. 
Our experiments show a trade-off between performance and semantic\ngrounding for extrinsic knowledge compared to a fine-tuned model intrinsic\nknowledge. We report our findings on the Large Language Models for Ontology\nLearning (LLMs4OL) 2024 challenge.\n","authors":["Hanna Abi Akl"],"pdf_url":"https://arxiv.org/pdf/2408.14236v1.pdf","comment":"8 pages, 4 figures, accepted for the LLMs4OL challenge at the\n International Semantic Web Conference (ISWC) 2024"},{"id":"http://arxiv.org/abs/2406.10265v2","updated":"2024-08-26T10:54:12Z","published":"2024-06-11T07:42:13Z","title":"Improving Language Models for Emotion Analysis: Insights from Cognitive\n Science","summary":" We propose leveraging cognitive science research on emotions and\ncommunication to improve language models for emotion analysis. First, we\npresent the main emotion theories in psychology and cognitive science. Then, we\nintroduce the main methods of emotion annotation in natural language processing\nand their connections to psychological theories. We also present the two main\ntypes of analyses of emotional communication in cognitive pragmatics. Finally,\nbased on the cognitive science research presented, we propose directions for\nimproving language models for emotion analysis. We suggest that these research\nefforts pave the way for constructing new annotation schemes, methods, and a\npossible benchmark for emotional understanding, considering different facets of\nhuman emotion and communication.\n","authors":["Constant Bonard","Gustave Cortal"],"pdf_url":"https://arxiv.org/pdf/2406.10265v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.05141v2","updated":"2024-08-26T10:53:28Z","published":"2024-08-09T15:53:55Z","title":"A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning","summary":" Retrieval-augmented generation (RAG) is a framework enabling large language\nmodels (LLMs) to enhance their accuracy and reduce hallucinations by\nintegrating external knowledge bases. In this paper, we introduce a hybrid RAG\nsystem enhanced through a comprehensive suite of optimizations that\nsignificantly improve retrieval quality, augment reasoning capabilities, and\nrefine numerical computation ability. We refined the text chunks and tables in\nweb pages, added attribute predictors to reduce hallucinations, conducted LLM\nKnowledge Extractor and Knowledge Graph Extractor, and finally built a\nreasoning strategy with all the references. We evaluated our system on the CRAG\ndataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and\nonline evaluations demonstrate that our system significantly enhances complex\nreasoning capabilities. In local evaluations, we have significantly improved\naccuracy and reduced error rates compared to the baseline model, achieving a\nnotable increase in scores. In the meanwhile, we have attained outstanding\nresults in online assessments, demonstrating the performance and generalization\ncapabilities of the proposed system. 
The source code for our system is released\nin \\url{https://gitlab.aicrowd.com/shizueyy/crag-new}.\n","authors":["Ye Yuan","Chengwu Liu","Jingyang Yuan","Gongbo Sun","Siqi Li","Ming Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.05141v2.pdf","comment":"Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024"},{"id":"http://arxiv.org/abs/2312.03731v7","updated":"2024-08-26T10:11:45Z","published":"2023-11-28T02:36:53Z","title":"MultiGPrompt for Multi-Task Pre-Training and Prompting on Graphs","summary":" Graphs can inherently model interconnected objects on the Web, thereby\nfacilitating a series of Web applications, such as web analyzing and content\nrecommendation. Recently, Graph Neural Networks (GNNs) have emerged as a\nmainstream technique for graph representation learning. However, their efficacy\nwithin an end-to-end supervised framework is significantly tied to the\navailabilityof task-specific labels. To mitigate labeling costs and enhance\nrobustness in few-shot settings, pre-training on self-supervised tasks has\nemerged as a promising method, while prompting has been proposed to further\nnarrow the objective gap between pretext and downstream tasks. Although there\nhas been some initial exploration of prompt-based learning on graphs, they\nprimarily leverage a single pretext task, resulting in a limited subset of\ngeneral knowledge that could be learned from the pre-training data. Hence, in\nthis paper, we propose MultiGPrompt, a novel multi-task pre-training and\nprompting framework to exploit multiple pretext tasks for more comprehensive\npre-trained knowledge. First, in pre-training, we design a set of pretext\ntokens to synergize multiple pretext tasks. Second, we propose a dual-prompt\nmechanism consisting of composed and open prompts to leverage task-specific and\nglobal pre-training knowledge, to guide downstream tasks in few-shot settings.\nFinally, we conduct extensive experiments on six public datasets to evaluate\nand analyze MultiGPrompt.\n","authors":["Xingtong Yu","Chang Zhou","Yuan Fang","Xinming Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.03731v7.pdf","comment":"WWW2024 research track"},{"id":"http://arxiv.org/abs/2408.14154v1","updated":"2024-08-26T09:57:19Z","published":"2024-08-26T09:57:19Z","title":"Investigating the effect of Mental Models in User Interaction with an\n Adaptive Dialog Agent","summary":" Mental models play an important role in whether user interaction with\nintelligent systems, such as dialog systems is successful or not. Adaptive\ndialog systems present the opportunity to align a dialog agent's behavior with\nheterogeneous user expectations. However, there has been little research into\nwhat mental models users form when interacting with a task-oriented dialog\nsystem, how these models affect users' interactions, or what role system\nadaptation can play in this process, making it challenging to avoid damage to\nhuman-AI partnership. In this work, we collect a new publicly available dataset\nfor exploring user mental models about information seeking dialog systems. We\ndemonstrate that users have a variety of conflicting mental models about such\nsystems, the validity of which directly impacts the success of their\ninteractions and perceived usability of system. Furthermore, we show that\nadapting a dialog agent's behavior to better align with users' mental models,\neven when done implicitly, can improve perceived usability, dialog efficiency,\nand success. 
To this end, we argue that implicit adaptation can be a valid\nstrategy for task-oriented dialog systems, so long as developers first have a\nsolid understanding of users' mental models.\n","authors":["Lindsey Vanderlyn","Dirk Väth","Ngoc Thang Vu"],"pdf_url":"https://arxiv.org/pdf/2408.14154v1.pdf","comment":"submitted to COLING 2025"},{"id":"http://arxiv.org/abs/2408.14153v1","updated":"2024-08-26T09:55:34Z","published":"2024-08-26T09:55:34Z","title":"Explaining Vision-Language Similarities in Dual Encoders with\n Feature-Pair Attributions","summary":" Dual encoder architectures like CLIP models map two types of inputs into a\nshared embedding space and learn similarities between them. However, it is not\nunderstood how such models compare two inputs. Here, we address this research\ngap with two contributions. First, we derive a method to attribute predictions\nof any differentiable dual encoder onto feature-pair interactions between its\ninputs. Second, we apply our method to CLIP-type models and show that they\nlearn fine-grained correspondences between parts of captions and regions in\nimages. They match objects across input modes and also account for mismatches.\nHowever, this visual-linguistic grounding ability heavily varies between object\nclasses, depends on the training data distribution, and largely improves after\nin-domain training. Using our method we can identify knowledge gaps about\nspecific object classes in individual models and can monitor their improvement\nupon fine-tuning.\n","authors":["Lucas Möller","Pascal Tilli","Ngoc Thang Vu","Sebastian Padó"],"pdf_url":"https://arxiv.org/pdf/2408.14153v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.04660v3","updated":"2024-08-26T09:37:46Z","published":"2024-08-05T20:01:10Z","title":"XMainframe: A Large Language Model for Mainframe Modernization","summary":" Mainframe operating systems, despite their inception in the 1940s, continue\nto support critical sectors like finance and government. However, these systems\nare often viewed as outdated, requiring extensive maintenance and\nmodernization. Addressing this challenge necessitates innovative tools that can\nunderstand and interact with legacy codebases. To this end, we introduce\nXMainframe, a state-of-the-art large language model (LLM) specifically designed\nwith knowledge of mainframe legacy systems and COBOL codebases. Our solution\ninvolves the creation of an extensive data collection pipeline to produce\nhigh-quality training datasets, enhancing XMainframe's performance in this\nspecialized domain. Additionally, we present MainframeBench, a comprehensive\nbenchmark for assessing mainframe knowledge, including multiple-choice\nquestions, question answering, and COBOL code summarization. Our empirical\nevaluations demonstrate that XMainframe consistently outperforms existing\nstate-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30%\nhigher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the\nBLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times\nhigher than GPT-3.5 on COBOL summarization. Our work highlights the potential\nof XMainframe to drive significant advancements in managing and modernizing\nlegacy systems, thereby enhancing productivity and saving time for software\ndevelopers.\n","authors":["Anh T. V. Dau","Hieu Trung Dao","Anh Tuan Nguyen","Hieu Trung Tran","Phong X. Nguyen","Nghi D. Q. 
Bui"],"pdf_url":"https://arxiv.org/pdf/2408.04660v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14141v1","updated":"2024-08-26T09:37:42Z","published":"2024-08-26T09:37:42Z","title":"Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in\n Subjective Tasks?","summary":" Subjective tasks in NLP have been mostly relegated to objective standards,\nwhere the gold label is decided by taking the majority vote. This obfuscates\nannotator disagreement and the inherent uncertainty of the label. We argue that\nsubjectivity should factor into model decisions and play a direct role via\ncalibration under a selective prediction setting. Specifically, instead of\ncalibrating confidence purely from the model's perspective, we calibrate models\nfor subjective tasks based on crowd worker agreement. Our method,\nCrowd-Calibrator, models the distance between the distribution of crowd worker\nlabels and the model's own distribution over labels to inform whether the model\nshould abstain from a decision. On two highly subjective tasks, hate speech\ndetection and natural language inference, our experiments show Crowd-Calibrator\neither outperforms or achieves competitive performance with existing selective\nprediction baselines. Our findings highlight the value of bringing human\ndecision-making into model predictions.\n","authors":["Urja Khurana","Eric Nalisnick","Antske Fokkens","Swabha Swayamdipta"],"pdf_url":"https://arxiv.org/pdf/2408.14141v1.pdf","comment":"Accepted at COLM 2024"},{"id":"http://arxiv.org/abs/2403.08564v2","updated":"2024-08-26T09:35:39Z","published":"2024-03-13T14:19:08Z","title":"Non-discrimination Criteria for Generative Language Models","summary":" Generative AI, such as large language models, has undergone rapid development\nwithin recent years. As these models become increasingly available to the\npublic, concerns arise about perpetuating and amplifying harmful biases in\napplications. Gender stereotypes can be harmful and limiting for the\nindividuals they target, whether they consist of misrepresentation or\ndiscrimination. Recognizing gender bias as a pervasive societal construct, this\npaper studies how to uncover and quantify the presence of gender biases in\ngenerative language models. In particular, we derive generative AI analogues of\nthree well-known non-discrimination criteria from classification, namely\nindependence, separation and sufficiency. To demonstrate these criteria in\naction, we design prompts for each of the criteria with a focus on occupational\ngender stereotype, specifically utilizing the medical test to introduce the\nground truth in the generative AI context. Our results address the presence of\noccupational gender bias within such conversational language models.\n","authors":["Sara Sterlie","Nina Weng","Aasa Feragen"],"pdf_url":"https://arxiv.org/pdf/2403.08564v2.pdf","comment":"14 pages, 3 figures"},{"id":"http://arxiv.org/abs/2408.14137v1","updated":"2024-08-26T09:34:36Z","published":"2024-08-26T09:34:36Z","title":"Multi-Faceted Evaluation of Modeling Languages for Augmented Reality\n Applications -- The Case of ARWFML","summary":" The evaluation of modeling languages for augmented reality applications poses\nparticular challenges due to the three-dimensional environment they target. The\npreviously introduced Augmented Reality Workflow Modeling Language (ARWFML)\nenables the model-based creation of augmented reality scenarios without\nprogramming knowledge. 
Building upon the first design cycle of the language's\nspecification, this paper presents two further design iterations for refining\nthe language based on multi-faceted evaluations. These include a comparative\nevaluation of implementation options and workflow capabilities, the\nintroduction of a 3D notation, and the development of a new 3D modeling\nenvironment. On this basis, a comprehensibility study of the language was\nconducted. Thereby, we show how modeling languages for augmented reality can be\nevolved towards a maturity level suitable for empirical evaluations.\n","authors":["Fabian Muff","Hans-Georg Fill"],"pdf_url":"https://arxiv.org/pdf/2408.14137v1.pdf","comment":"Accepted manuscript for the 43rd International Conference on\n Conceptual Modeling Conceptual Modeling, AI, and Beyond 28-31 October 2024 |\n Pittsburgh, Pennsylvania, USA"},{"id":"http://arxiv.org/abs/2408.14119v1","updated":"2024-08-26T09:08:26Z","published":"2024-08-26T09:08:26Z","title":"Contrastive Learning Subspace for Text Clustering","summary":" Contrastive learning has been frequently investigated to learn effective\nrepresentations for text clustering tasks. While existing contrastive\nlearning-based text clustering methods only focus on modeling instance-wise\nsemantic similarity relationships, they ignore contextual information and\nunderlying relationships among all instances that needs to be clustered. In\nthis paper, we propose a novel text clustering approach called Subspace\nContrastive Learning (SCL) which models cluster-wise relationships among\ninstances. Specifically, the proposed SCL consists of two main modules: (1) a\nself-expressive module that constructs virtual positive samples and (2) a\ncontrastive learning module that further learns a discriminative subspace to\ncapture task-specific cluster-wise relationships among texts. Experimental\nresults show that the proposed SCL method not only has achieved superior\nresults on multiple task clustering datasets but also has less complexity in\npositive sample construction.\n","authors":["Qian Yong","Chen Chen","Xiabing Zhou"],"pdf_url":"https://arxiv.org/pdf/2408.14119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.11534v6","updated":"2024-08-26T08:52:44Z","published":"2023-08-21T06:51:56Z","title":"PlatoLM: Teaching LLMs in Multi-Round Dialogue via a User Simulator","summary":" The unparalleled performance of closed-sourced ChatGPT has sparked efforts\ntowards its democratization, with notable strides made by leveraging real user\nand ChatGPT dialogues, as evidenced by Vicuna. However, due to challenges in\ngathering dialogues involving human participation, current endeavors like Baize\nand UltraChat rely on ChatGPT conducting roleplay to simulate humans based on\ninstructions, resulting in overdependence on seeds, diminished human-likeness,\nlimited topic diversity, and an absence of genuine multi-round conversational\ndynamics. To address the above issues, we propose a paradigm to simulate human\nbehavior better and explore the benefits of incorporating more human-like\nquestions in multi-turn conversations. Specifically, we directly target human\nquestions extracted from genuine human-machine conversations as a learning goal\nand provide a novel user simulator called `Socratic'. The experimental results\nshow our response model, `PlatoLM', achieves SoTA performance among LLaMA-based\n7B models in MT-Bench. 
Our findings further demonstrate that our method\nintroduces highly human-like questioning patterns and rich topic structures,\nwhich can teach the response model better than previous works in multi-round\nconversations.\n","authors":["Chuyi Kong","Yaxin Fan","Xiang Wan","Feng Jiang","Benyou Wang"],"pdf_url":"https://arxiv.org/pdf/2308.11534v6.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2406.10833v2","updated":"2024-08-26T08:47:54Z","published":"2024-06-16T08:03:24Z","title":"A Comprehensive Survey of Scientific Large Language Models and Their\n Applications in Scientific Discovery","summary":" In many scientific fields, large language models (LLMs) have revolutionized\nthe way text and other modalities of data (e.g., molecules and proteins) are\nhandled, achieving superior performance in various applications and augmenting\nthe scientific discovery process. Nevertheless, previous surveys on scientific\nLLMs often concentrate on one or two fields or a single modality. In this\npaper, we aim to provide a more holistic view of the research landscape by\nunveiling cross-field and cross-modal connections between scientific LLMs\nregarding their architectures and pre-training techniques. To this end, we\ncomprehensively survey over 250 scientific LLMs, discuss their commonalities\nand differences, as well as summarize pre-training datasets and evaluation\ntasks for each field and modality. Moreover, we investigate how LLMs have been\ndeployed to benefit scientific discovery. Resources related to this survey are\navailable at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models.\n","authors":["Yu Zhang","Xiusi Chen","Bowen Jin","Sheng Wang","Shuiwang Ji","Wei Wang","Jiawei Han"],"pdf_url":"https://arxiv.org/pdf/2406.10833v2.pdf","comment":"34 pages (GitHub:\n https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models)"},{"id":"http://arxiv.org/abs/2403.11322v4","updated":"2024-08-26T08:25:01Z","published":"2024-03-17T19:54:16Z","title":"StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows","summary":" It is a notable trend to use Large Language Models (LLMs) to tackle complex\ntasks, e.g., tasks that require a sequence of actions and dynamic interaction\nwith tools and external environments. In this paper, we propose StateFlow, a\nnovel LLM-based task-solving paradigm that conceptualizes complex task-solving\nprocesses as state machines. In StateFlow, we distinguish between \"process\ngrounding\" (via state and state transitions) and \"sub-task solving\" (through\nactions within a state), enhancing control and interpretability of the\ntask-solving procedure. A state represents the status of a running process. The\ntransitions between states are controlled by heuristic rules or decisions made\nby the LLM, allowing for a dynamic and adaptive progression. Upon entering a\nstate, a series of actions is executed, involving not only calling LLMs guided\nby different prompts, but also the utilization of external tools as needed. Our\nresults show that StateFlow significantly enhances LLMs' efficiency. 
For\ninstance, StateFlow achieves 13% and 28% higher success rates compared to ReAct\nin InterCode SQL and ALFWorld benchmark, with 5x and 3x less cost respectively.\nWe also show that StateFlow can be combined with iterative refining methods\nlike Reflexion to further improve performance.\n","authors":["Yiran Wu","Tianwei Yue","Shaokun Zhang","Chi Wang","Qingyun Wu"],"pdf_url":"https://arxiv.org/pdf/2403.11322v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.03624v2","updated":"2024-08-26T08:09:39Z","published":"2024-07-04T04:19:50Z","title":"Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks","summary":" Although LLMs have the potential to transform many fields, they still\nunderperform humans in reasoning tasks. Existing methods induce the model to\nproduce step-by-step calculations, but this research explores the question:\nDoes making the LLM analyze the question improve its performance? We propose a\nnovel prompting strategy called Question Analysis Prompting (QAP), in which the\nmodel is prompted to explain the question in $n$ words before solving. The\nvalue of $n$ influences the length of response generated by the model. QAP is\nevaluated on GPT 3.5 Turbo and GPT 4 Turbo on arithmetic datasets GSM8K, AQuA,\nand SAT and commonsense dataset StrategyQA. QAP is compared with other\nstate-of-the-art prompts including Chain-of-Thought (CoT), Plan and Solve\nPrompting (PS+) and Take A Deep Breath (TADB). QAP outperforms all\nstate-of-the-art prompts on AQuA and SAT datasets on both GPT3.5 and GPT4. QAP\nconsistently ranks among the top-2 prompts on 75\\% of the tests. A key factor\nof QAP performance can be attributed to response length, where detailed\nresponses are beneficial when answering harder questions, but can negatively\naffect easy questions.\n","authors":["Dharunish Yugeswardeenoo","Kevin Zhu","Sean O'Brien"],"pdf_url":"https://arxiv.org/pdf/2407.03624v2.pdf","comment":"Accepted in Proceedings of the 62nd Annual Meeting of the Association\n for Computational Linguistics: Student Research Workshop (ACL-SRW 2024) 11\n pages, 8 figures"},{"id":"http://arxiv.org/abs/2407.09893v2","updated":"2024-08-26T07:54:27Z","published":"2024-07-13T13:58:24Z","title":"Synergistic Multi-Agent Framework with Trajectory Learning for\n Knowledge-Intensive Tasks","summary":" Recent advancements in Large Language Models (LLMs) have led to significant\nbreakthroughs in various natural language processing tasks. However, generating\nfactually consistent responses in knowledge-intensive scenarios remains a\nchallenge due to issues such as hallucination, difficulty in acquiring\nlong-tailed knowledge, and limited memory expansion. This paper introduces\nSMART, a novel multi-agent framework that leverages external knowledge to\nenhance the interpretability and factual consistency of LLM-generated\nresponses. SMART comprises four specialized agents, each performing a specific\nsub-trajectory action to navigate complex knowledge-intensive tasks. We propose\na multi-agent co-training paradigm, Long-Short Trajectory Learning, which\nensures synergistic collaboration among agents while maintaining fine-grained\nexecution by each agent. Extensive experiments on five knowledge-intensive\ntasks demonstrate SMART's superior performance compared to widely adopted\nknowledge internalization and knowledge enhancement methods. Our framework can\nextend beyond knowledge-intensive tasks to more complex scenarios. 
Our code is\navailable at https://github.com/yueshengbin/SMART.\n","authors":["Shengbin Yue","Siyuan Wang","Wei Chen","Xuanjing Huang","Zhongyu Wei"],"pdf_url":"https://arxiv.org/pdf/2407.09893v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.04975v4","updated":"2024-08-26T07:48:19Z","published":"2024-08-09T09:56:30Z","title":"reCSE: Portable Reshaping Features for Sentence Embedding in\n Self-supervised Contrastive Learning","summary":" We propose reCSE, a self supervised contrastive learning sentence\nrepresentation framework based on feature reshaping. This framework is\ndifferent from the current advanced models that use discrete data augmentation\nmethods, but instead reshapes the input features of the original sentence,\naggregates the global information of each token in the sentence, and alleviates\nthe common problems of representation polarity and GPU memory consumption\nlinear increase in current advanced models. In addition, our reCSE has achieved\ncompetitive performance in semantic similarity tasks. And the experiment proves\nthat our proposed feature reshaping method has strong universality, which can\nbe transplanted to other self supervised contrastive learning frameworks and\nenhance their representation ability, even achieving state-of-the-art\nperformance. Our code is available at https://github.com/heavenhellchen/reCSE.\n","authors":["Fufangchen Zhao","Jian Gao","Danfeng Yan"],"pdf_url":"https://arxiv.org/pdf/2408.04975v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.10903v3","updated":"2024-08-26T07:37:19Z","published":"2024-08-20T14:47:38Z","title":"BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General\n Role-Playing Language Model","summary":" The rapid advancement of large language models (LLMs) has revolutionized\nrole-playing, enabling the development of general role-playing models. However,\ncurrent role-playing training has two significant issues: (I) Using a\npredefined role profile to prompt dialogue training for specific scenarios\nusually leads to inconsistencies and even conflicts between the dialogue and\nthe profile, resulting in training biases. (II) The model learns to imitate the\nrole based solely on the profile, neglecting profile-dialogue alignment at the\nsentence level. In this work, we propose a simple yet effective framework\ncalled BEYOND DIALOGUE, designed to overcome these hurdles. This framework\ninnovatively introduces \"beyond dialogue\" tasks to align dialogue with profile\ntraits based on each specific scenario, thereby eliminating biases during\ntraining. Furthermore, by adopting an innovative prompting mechanism that\ngenerates reasoning outcomes for training, the framework allows the model to\nachieve fine-grained alignment between profile and dialogue at the sentence\nlevel. The aforementioned methods are fully automated and low-cost.\nAdditionally, the integration of automated dialogue and objective evaluation\nmethods forms a comprehensive framework, paving the way for general\nrole-playing. Experimental results demonstrate that our model excels in\nadhering to and reflecting various dimensions of role profiles, outperforming\nmost proprietary general and specialized role-playing baselines. 
All code and\ndatasets are available at https://github.com/yuyouyu32/BeyondDialogue.\n","authors":["Yeyong Yu","Rusheng Yu","Haojie Wei","Zhanqiu Zhang","Quan Qian"],"pdf_url":"https://arxiv.org/pdf/2408.10903v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14053v1","updated":"2024-08-26T07:19:07Z","published":"2024-08-26T07:19:07Z","title":"Enhancing Depression Diagnosis with Chain-of-Thought Prompting","summary":" When using AI to detect signs of depressive disorder, AI models habitually\ndraw preemptive conclusions. We theorize that using chain-of-thought (CoT)\nprompting to evaluate Patient Health Questionnaire-8 (PHQ-8) scores will\nimprove the accuracy of the scores determined by AI models. In our findings,\nwhen the models reasoned with CoT, the estimated PHQ-8 scores were consistently\ncloser on average to the accepted true scores reported by each participant\ncompared to when not using CoT. Our goal is to expand upon AI models'\nunderstanding of the intricacies of human conversation, allowing them to more\neffectively assess a patient's feelings and tone, therefore being able to more\naccurately discern mental disorder symptoms; ultimately, we hope to augment AI\nmodels' abilities, so that they can be widely accessible and used in the\nmedical field.\n","authors":["Elysia Shi","Adithri Manda","London Chowdhury","Runeema Arun","Kevin Zhu","Michael Lam"],"pdf_url":"https://arxiv.org/pdf/2408.14053v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.16535v3","updated":"2024-08-26T07:14:46Z","published":"2023-09-28T15:47:03Z","title":"KLoB: a Benchmark for Assessing Knowledge Locating Methods in Language\n Models","summary":" Recently, Locate-Then-Edit paradigm has emerged as one of the main approaches\nin changing factual knowledge stored in the Language models. However, there is\na lack of research on whether present locating methods can pinpoint the exact\nparameters embedding the desired knowledge. Moreover, although many researchers\nhave questioned the validity of locality hypothesis of factual knowledge, no\nmethod is provided to test the a hypothesis for more in-depth discussion and\nresearch. Therefore, we introduce KLoB, a benchmark examining three essential\nproperties that a reliable knowledge locating method should satisfy. KLoB can\nserve as a benchmark for evaluating existing locating methods in language\nmodels, and can contributes a method to reassessing the validity of locality\nhypothesis of factual knowledge. KLoB is publicly available at an anonymous\nGitHub: \\url{https://github.com/anon6662/KLoB}.\n","authors":["Yiming Ju","Xingrun Xing","Zhixiong Zeng"],"pdf_url":"https://arxiv.org/pdf/2309.16535v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.03791v2","updated":"2024-08-26T07:13:47Z","published":"2024-07-04T09:55:04Z","title":"M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal\n Models Across Multilingual and Multicultural Vision-Language Tasks","summary":" Since the release of ChatGPT, the field of Natural Language Processing has\nexperienced rapid advancements, particularly in Large Language Models (LLMs)\nand their multimodal counterparts, Large Multimodal Models (LMMs). Despite\ntheir impressive capabilities, LLMs often exhibit significant performance\ndisparities across different languages and cultural contexts, as demonstrated\nby various text-only benchmarks. However, current research lacks such\nbenchmarks for multimodal visio-linguistic settings. 
This work fills this gap\nby introducing M5, the first comprehensive benchmark designed to evaluate LMMs\non diverse vision-language tasks within a multilingual and multicultural\ncontext. M5 includes eight datasets covering five tasks and $41$ languages,\nwith a focus on underrepresented languages and culturally diverse images.\nFurthermore, we introduce two novel datasets, M5-VGR and M5-VLOD, including a\nnew Visio-Linguistic Outlier Detection task, in which all evaluated open-source\nmodels fail to significantly surpass the random baseline. Through extensive\nevaluation and analyses, we highlight substantial task-agnostic performance\ndisparities between high- and low-resource languages. Moreover, we show that\nlarger models do not necessarily outperform smaller ones in a multilingual\nsetting.\n","authors":["Florian Schneider","Sunayana Sitaram"],"pdf_url":"https://arxiv.org/pdf/2407.03791v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.06607v4","updated":"2024-08-26T06:57:51Z","published":"2023-11-11T16:37:41Z","title":"Monkey: Image Resolution and Text Label Are Important Things for Large\n Multi-modal Models","summary":" Large Multimodal Models (LMMs) have shown promise in vision-language tasks\nbut struggle with high-resolution input and detailed scene understanding.\nAddressing these challenges, we introduce Monkey to enhance LMM capabilities.\nFirstly, Monkey processes input images by dividing them into uniform patches,\neach matching the size (e.g., 448x448) used in the original training of the\nwell-trained vision encoder. Equipped with individual adapter for each patch,\nMonkey can handle higher resolutions up to 1344x896 pixels, enabling the\ndetailed capture of complex visual information. Secondly, it employs a\nmulti-level description generation method, enriching the context for\nscene-object associations. This two-part strategy ensures more effective\nlearning from generated data: the higher resolution allows for a more detailed\ncapture of visuals, which in turn enhances the effectiveness of comprehensive\ndescriptions. Extensive ablative results validate the effectiveness of our\ndesigns. Additionally, experiments on 18 datasets further demonstrate that\nMonkey surpasses existing LMMs in many tasks like Image Captioning and various\nVisual Question Answering formats. Specially, in qualitative tests focused on\ndense text question answering, Monkey has exhibited encouraging results\ncompared with GPT4V. Code is available at\nhttps://github.com/Yuliang-Liu/Monkey.\n","authors":["Zhang Li","Biao Yang","Qiang Liu","Zhiyin Ma","Shuo Zhang","Jingxu Yang","Yabo Sun","Yuliang Liu","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2311.06607v4.pdf","comment":"CVPR 2024 Highlight"},{"id":"http://arxiv.org/abs/2408.03633v3","updated":"2024-08-26T06:19:53Z","published":"2024-08-07T08:44:44Z","title":"CARE: A Clue-guided Assistant for CSRs to Read User Manuals","summary":" It is time-saving to build a reading assistant for customer service\nrepresentations (CSRs) when reading user manuals, especially information-rich\nones. Current solutions don't fit the online custom service scenarios well due\nto the lack of attention to user questions and possible responses. Hence, we\npropose to develop a time-saving and careful reading assistant for CSRs, named\nCARE. It can help the CSRs quickly find proper responses from the user manuals\nvia explicit clue chains. 
Specifically, each of the clue chains is formed by\ninferring over the user manuals, starting from the question clue aligned with\nthe user question and ending at a possible response. To overcome the shortage\nof supervised data, we adopt the self-supervised strategy for model learning.\nThe offline experiment shows that CARE is efficient in automatically inferring\naccurate responses from the user manual. The online experiment further\ndemonstrates the superiority of CARE to reduce CSRs' reading burden and keep\nhigh service quality, in particular with >35% decrease in time spent and\nkeeping a >0.75 ICC score.\n","authors":["Weihong Du","Jia Liu","Zujie Wen","Dingnan Jin","Hongru Liang","Wenqiang Lei"],"pdf_url":"https://arxiv.org/pdf/2408.03633v3.pdf","comment":"Accepted to The 62nd Annual Meeting of the Association for\n Computational Linguistics (ACL 2024)"},{"id":"http://arxiv.org/abs/2408.14028v1","updated":"2024-08-26T05:38:27Z","published":"2024-08-26T05:38:27Z","title":"SurGen: Text-Guided Diffusion Model for Surgical Video Generation","summary":" Diffusion-based video generation models have made significant strides,\nproducing outputs with improved visual fidelity, temporal coherence, and user\ncontrol. These advancements hold great promise for improving surgical education\nby enabling more realistic, diverse, and interactive simulation environments.\nIn this study, we introduce SurGen, a text-guided diffusion model tailored for\nsurgical video synthesis, producing the highest resolution and longest duration\nvideos among existing surgical video generation models. We validate the visual\nand temporal quality of the outputs using standard image and video generation\nmetrics. Additionally, we assess their alignment to the corresponding text\nprompts through a deep learning classifier trained on surgical data. Our\nresults demonstrate the potential of diffusion models to serve as valuable\neducational tools for surgical trainees.\n","authors":["Joseph Cho","Samuel Schmidgall","Cyril Zakka","Mrudang Mathur","Rohan Shad","William Hiesinger"],"pdf_url":"https://arxiv.org/pdf/2408.14028v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14026v1","updated":"2024-08-26T05:36:35Z","published":"2024-08-26T05:36:35Z","title":"Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling","summary":" In this study, we tackle the challenge of limited labeled data for\nlow-resource languages in ASR, focusing on Hindi. Specifically, we explore\npseudo-labeling, by proposing a generic framework combining multiple ideas from\nexisting works. Our framework integrates multiple base models for transcription\nand evaluators for assessing audio-transcript pairs, resulting in robust\npseudo-labeling for low resource languages. We validate our approach with a new\nbenchmark, IndicYT, comprising diverse YouTube audio files from multiple\ncontent categories. Our findings show that augmenting pseudo labeled data from\nYouTube with existing training data leads to significant performance\nimprovements on IndicYT, without affecting performance on out-of-domain\nbenchmarks, demonstrating the efficacy of pseudo-labeled data in enhancing ASR\ncapabilities for low-resource languages. The benchmark, code and models\ndeveloped as a part of this work will be made publicly available.\n","authors":["Kaushal Santosh Bhogale","Deovrat Mehendale","Niharika Parasa","Sathish Kumar Reddy G","Tahir Javed","Pratyush Kumar","Mitesh M. 
Khapra"],"pdf_url":"https://arxiv.org/pdf/2408.14026v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.05561v5","updated":"2024-08-26T05:31:38Z","published":"2024-01-10T22:07:21Z","title":"TrustLLM: Trustworthiness in Large Language Models","summary":" Large language models (LLMs), exemplified by ChatGPT, have gained\nconsiderable attention for their excellent natural language processing\ncapabilities. Nonetheless, these LLMs present many challenges, particularly in\nthe realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs\nemerges as an important topic. This paper introduces TrustLLM, a comprehensive\nstudy of trustworthiness in LLMs, including principles for different dimensions\nof trustworthiness, established benchmark, evaluation, and analysis of\ntrustworthiness for mainstream LLMs, and discussion of open challenges and\nfuture directions. Specifically, we first propose a set of principles for\ntrustworthy LLMs that span eight different dimensions. Based on these\nprinciples, we further establish a benchmark across six dimensions including\ntruthfulness, safety, fairness, robustness, privacy, and machine ethics. We\nthen present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of\nover 30 datasets. Our findings firstly show that in general trustworthiness and\nutility (i.e., functional effectiveness) are positively related. Secondly, our\nobservations reveal that proprietary LLMs generally outperform most open-source\ncounterparts in terms of trustworthiness, raising concerns about the potential\nrisks of widely accessible open-source LLMs. However, a few open-source LLMs\ncome very close to proprietary ones. Thirdly, it is important to note that some\nLLMs may be overly calibrated towards exhibiting trustworthiness, to the extent\nthat they compromise their utility by mistakenly treating benign prompts as\nharmful and consequently not responding. Finally, we emphasize the importance\nof ensuring transparency not only in the models themselves but also in the\ntechnologies that underpin trustworthiness. Knowing the specific trustworthy\ntechnologies that have been employed is crucial for analyzing their\neffectiveness.\n","authors":["Yue Huang","Lichao Sun","Haoran Wang","Siyuan Wu","Qihui Zhang","Yuan Li","Chujie Gao","Yixin Huang","Wenhan Lyu","Yixuan Zhang","Xiner Li","Zhengliang Liu","Yixin Liu","Yijue Wang","Zhikun Zhang","Bertie Vidgen","Bhavya Kailkhura","Caiming Xiong","Chaowei Xiao","Chunyuan Li","Eric Xing","Furong Huang","Hao Liu","Heng Ji","Hongyi Wang","Huan Zhang","Huaxiu Yao","Manolis Kellis","Marinka Zitnik","Meng Jiang","Mohit Bansal","James Zou","Jian Pei","Jian Liu","Jianfeng Gao","Jiawei Han","Jieyu Zhao","Jiliang Tang","Jindong Wang","Joaquin Vanschoren","John Mitchell","Kai Shu","Kaidi Xu","Kai-Wei Chang","Lifang He","Lifu Huang","Michael Backes","Neil Zhenqiang Gong","Philip S. 
Yu","Pin-Yu Chen","Quanquan Gu","Ran Xu","Rex Ying","Shuiwang Ji","Suman Jana","Tianlong Chen","Tianming Liu","Tianyi Zhou","William Wang","Xiang Li","Xiangliang Zhang","Xiao Wang","Xing Xie","Xun Chen","Xuyu Wang","Yan Liu","Yanfang Ye","Yinzhi Cao","Yong Chen","Yue Zhao"],"pdf_url":"https://arxiv.org/pdf/2401.05561v5.pdf","comment":"This work is still under work and we welcome your contribution"},{"id":"http://arxiv.org/abs/2405.14213v2","updated":"2024-08-26T04:59:05Z","published":"2024-05-23T06:17:23Z","title":"From Text to Pixel: Advancing Long-Context Understanding in MLLMs","summary":" The rapid progress in Multimodal Large Language Models (MLLMs) has\nsignificantly advanced their ability to process and understand complex visual\nand textual information. However, the integration of multiple images and\nextensive textual contexts remains a challenge due to the inherent limitation\nof the models' capacity to handle long input sequences efficiently. In this\npaper, we introduce SEEKER, a multimodal large language model designed to\ntackle this issue. SEEKER aims to optimize the compact encoding of long text by\ncompressing the text sequence into the visual pixel space via images, enabling\nthe model to handle long text within a fixed token-length budget efficiently.\nOur empirical experiments on six long-context multimodal tasks demonstrate that\nSEEKER can leverage fewer image tokens to convey the same amount of textual\ninformation compared with the OCR-based approach, and is more efficient in\nunderstanding long-form multimodal input and generating long-form textual\noutput, outperforming all existing proprietary and open-source MLLMs by large\nmargins.\n","authors":["Yujie Lu","Xiujun Li","Tsu-Jui Fu","Miguel Eckstein","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2405.14213v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.12529v2","updated":"2024-08-26T04:28:41Z","published":"2024-07-17T13:11:28Z","title":"Crafting the Path: Robust Query Rewriting for Information Retrieval","summary":" Query rewriting aims to generate a new query that can complement the original\nquery to improve the information retrieval system. Recent studies on query\nrewriting, such as query2doc, query2expand and querey2cot, rely on the internal\nknowledge of Large Language Models (LLMs) to generate a relevant passage to add\ninformation to the query. Nevertheless, the efficacy of these methodologies may\nmarkedly decline in instances where the requisite knowledge is not encapsulated\nwithin the model's intrinsic parameters. In this paper, we propose a novel\nstructured query rewriting method called Crafting the Path tailored for\nretrieval systems. Crafting the Path involves a three-step process that crafts\nquery-related information necessary for finding the passages to be searched in\neach step. Specifically, the Crafting the Path begins with Query Concept\nComprehension, proceeds to Query Type Identification, and finally conducts\nExpected Answer Extraction. Experimental results show that our method\noutperforms previous rewriting methods, especially in less familiar domains for\nLLMs. We demonstrate that our method is less dependent on the internal\nparameter knowledge of the model and generates queries with fewer factual\ninaccuracies. 
Furthermore, we observe that \\name{} demonstrates superior\nperformance in the retrieval-augmented generation scenarios.\n","authors":["Ingeol Baek","Jimin Lee","Joonho Yang","Hwanhee Lee"],"pdf_url":"https://arxiv.org/pdf/2407.12529v2.pdf","comment":"3 figures, 13 tables"},{"id":"http://arxiv.org/abs/2408.12321v2","updated":"2024-08-26T04:27:54Z","published":"2024-08-22T11:57:16Z","title":"MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework\n for Multimodal Large Language Model","summary":" This paper presents MaVEn, an innovative Multi-granularity Visual Encoding\nframework designed to enhance the capabilities of Multimodal Large Language\nModels (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on\nsingle-image visual understanding, limiting their ability to interpret and\nintegrate information across multiple images. MaVEn addresses this limitation\nby combining discrete visual symbol sequences, which abstract coarse-grained\nsemantic concepts, with traditional continuous representation sequences that\nmodel fine-grained features. This dual approach bridges the semantic gap\nbetween visual and textual data, thereby improving the model's ability to\nprocess and interpret information from multiple images effectively.\nAdditionally, we design a dynamic reduction mechanism by for long-sequence\ncontinuous features to enhance multi-image processing efficiency. Experimental\nresults demonstrate that MaVEn significantly enhances MLLMs' understanding in\ncomplex multi-image scenarios, while also improving performance in single-image\ncontexts.\n","authors":["Chaoya Jiang","Jia Hongrui","Haiyang Xu","Wei Ye","Mengfan Dong","Ming Yan","Ji Zhang","Fei Huang","Shikun Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.12321v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13282v1","updated":"2024-08-26T02:53:55Z","published":"2024-08-26T02:53:55Z","title":"Question answering system of bridge design specification based on large\n language model","summary":" This paper constructs question answering system for bridge design\nspecification based on large language model. Three implementation schemes are\ntried: full fine-tuning of the Bert pretrained model, parameter-efficient\nfine-tuning of the Bert pretrained model, and self-built language model from\nscratch. Through the self-built question and answer task dataset, based on the\ntensorflow and keras deep learning platform framework, the model is constructed\nand trained to predict the start position and end position of the answer in the\nbridge design specification given by the user. The experimental results show\nthat full fine-tuning of the Bert pretrained model achieves 100% accuracy in\nthe training-dataset, validation-dataset and test-dataset, and the system can\nextract the answers from the bridge design specification given by the user to\nanswer various questions of the user; While parameter-efficient fine-tuning of\nthe Bert pretrained model and self-built language model from scratch perform\nwell in the training-dataset, their generalization ability in the test-dataset\nneeds to be improved. 
The research of this paper provides a useful reference\nfor the development of question answering system in professional field.\n","authors":["Leye Zhang","Xiangxiang Tian","Hongjun Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.13282v1.pdf","comment":"10 pages, 7 figures"},{"id":"http://arxiv.org/abs/2408.13987v1","updated":"2024-08-26T02:53:24Z","published":"2024-08-26T02:53:24Z","title":"Focused Large Language Models are Stable Many-Shot Learners","summary":" In-Context Learning (ICL) enables large language models (LLMs) to achieve\nrapid task adaptation by learning from demonstrations. With the increase in\navailable context length of LLMs, recent experiments have shown that the\nperformance of ICL does not necessarily scale well in many-shot (demonstration)\nsettings. We theoretically and experimentally confirm that the reason lies in\nmore demonstrations dispersing the model attention from the query, hindering\nits understanding of key content. Inspired by how humans learn from examples,\nwe propose a training-free method FocusICL, which conducts triviality filtering\nto avoid attention being diverted by unimportant contents at token-level and\noperates hierarchical attention to further ensure sufficient attention towards\ncurrent query at demonstration-level. We also design an efficient\nhyperparameter searching strategy for FocusICL based on model perplexity of\ndemonstrations. Comprehensive experiments validate that FocusICL achieves an\naverage performance improvement of 5.2% over vanilla ICL and scales well with\nmany-shot demonstrations.\n","authors":["Peiwen Yuan","Shaoxiong Feng","Yiwei Li","Xinglin Wang","Yueqi Zhang","Chuyi Tan","Boyuan Pan","Heda Wang","Yao Hu","Kan Li"],"pdf_url":"https://arxiv.org/pdf/2408.13987v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2305.07895v7","updated":"2024-08-26T02:37:14Z","published":"2023-05-13T11:28:37Z","title":"OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models","summary":" Large models have recently played a dominant role in natural language\nprocessing and multimodal vision-language learning. However, their\neffectiveness in text-related visual tasks remains relatively unexplored. In\nthis paper, we conducted a comprehensive evaluation of Large Multimodal Models,\nsuch as GPT4V and Gemini, in various text-related visual tasks including Text\nRecognition, Scene Text-Centric Visual Question Answering (VQA),\nDocument-Oriented VQA, Key Information Extraction (KIE), and Handwritten\nMathematical Expression Recognition (HMER). To facilitate the assessment of\nOptical Character Recognition (OCR) capabilities in Large Multimodal Models, we\npropose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29\ndatasets, making it the most comprehensive OCR evaluation benchmark available.\nFurthermore, our study reveals both the strengths and weaknesses of these\nmodels, particularly in handling multilingual text, handwritten text,\nnon-semantic text, and mathematical expression recognition. Most importantly,\nthe baseline results presented in this study could provide a foundational\nframework for the conception and assessment of innovative strategies targeted\nat enhancing zero-shot multimodal techniques. 
The evaluation pipeline and\nbenchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.\n","authors":["Yuliang Liu","Zhang Li","Mingxin Huang","Biao Yang","Wenwen Yu","Chunyuan Li","Xucheng Yin","Cheng-lin Liu","Lianwen Jin","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2305.07895v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13986v1","updated":"2024-08-26T02:36:55Z","published":"2024-08-26T02:36:55Z","title":"AgentMove: Predicting Human Mobility Anywhere Using Large Language Model\n based Agentic Framework","summary":" Human mobility prediction plays a crucial role in various real-world\napplications. Although deep learning based models have shown promising results\nover the past decade, their reliance on extensive private mobility data for\ntraining and their inability to perform zero-shot predictions, have hindered\nfurther advancements. Recently, attempts have been made to apply large language\nmodels (LLMs) to mobility prediction task. However, their performance has been\nconstrained by the absence of a systematic design of workflow. They directly\ngenerate the final output using LLMs, which limits the potential of LLMs to\nuncover complex mobility patterns and underestimates their extensive reserve of\nglobal geospatial knowledge. In this paper, we introduce AgentMove, a\nsystematic agentic prediction framework to achieve generalized mobility\nprediction for any cities worldwide. In AgentMove, we first decompose the\nmobility prediction task into three sub-tasks and then design corresponding\nmodules to complete these subtasks, including spatial-temporal memory for\nindividual mobility pattern mining, world knowledge generator for modeling the\neffects of urban structure and collective knowledge extractor for capturing the\nshared patterns among population. Finally, we combine the results of three\nmodules and conduct a reasoning step to generate the final predictions.\nExtensive experiments on mobility data from two sources in 12 cities\ndemonstrate that AgentMove outperforms the best baseline more than 8% in\nvarious metrics and it shows robust predictions with various LLMs as base and\nalso less geographical bias across cities. Codes and data can be found in\nhttps://github.com/tsinghua-fib-lab/AgentMove.\n","authors":["Jie Feng","Yuwei Du","Jie Zhao","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2408.13986v1.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2408.13985v1","updated":"2024-08-26T02:35:37Z","published":"2024-08-26T02:35:37Z","title":"TF-Attack: Transferable and Fast Adversarial Attacks on Large Language\n Models","summary":" With the great advancements in large language models (LLMs), adversarial\nattacks against LLMs have recently attracted increasing attention. We found\nthat pre-existing adversarial attack methodologies exhibit limited\ntransferability and are notably inefficient, particularly when applied to LLMs.\nIn this paper, we analyze the core mechanisms of previous predominant\nadversarial attack methods, revealing that 1) the distributions of importance\nscore differ markedly among victim models, restricting the transferability; 2)\nthe sequential attack processes induces substantial time overheads. Based on\nthe above two insights, we introduce a new scheme, named TF-Attack, for\nTransferable and Fast adversarial attacks on LLMs. TF-Attack employs an\nexternal LLM as a third-party overseer rather than the victim model to identify\ncritical units within sentences. 
Moreover, TF-Attack introduces the concept of\nImportance Level, which allows for parallel substitutions of attacks. We\nconduct extensive experiments on 6 widely adopted benchmarks, evaluating the\nproposed method through both automatic and human metrics. Results show that our\nmethod consistently surpasses previous methods in transferability and delivers\nsignificant speed improvements, up to 20 times faster than earlier attack\nstrategies.\n","authors":["Zelin Li","Kehai Chen","Xuefeng Bai","Lemao Liu","Mingming Yang","Yang Xiang","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.13985v1.pdf","comment":"14 pages, 6 figures. arXiv admin note: text overlap with\n arXiv:2305.17440 by other authors"},{"id":"http://arxiv.org/abs/2408.12095v2","updated":"2024-08-26T02:26:31Z","published":"2024-08-22T03:08:49Z","title":"uMedSum: A Unified Framework for Advancing Medical Abstractive\n Summarization","summary":" Medical abstractive summarization faces the challenge of balancing\nfaithfulness and informativeness. Current methods often sacrifice key\ninformation for faithfulness or introduce confabulations when prioritizing\ninformativeness. While recent advancements in techniques like in-context\nlearning (ICL) and fine-tuning have improved medical summarization, they often\noverlook crucial aspects such as faithfulness and informativeness without\nconsidering advanced methods like model reasoning and self-improvement.\nMoreover, the field lacks a unified benchmark, hindering systematic evaluation\ndue to varied metrics and datasets. This paper addresses these gaps by\npresenting a comprehensive benchmark of six advanced abstractive summarization\nmethods across three diverse datasets using five standardized metrics. Building\non these findings, we propose uMedSum, a modular hybrid summarization framework\nthat introduces novel approaches for sequential confabulation removal followed\nby key missing information addition, ensuring both faithfulness and\ninformativeness. Our work improves upon previous GPT-4-based state-of-the-art\n(SOTA) medical summarization methods, significantly outperforming them in both\nquantitative metrics and qualitative domain expert evaluations. Notably, we\nachieve an average relative performance improvement of 11.8% in reference-free\nmetrics over the previous SOTA. Doctors prefer uMedSum's summaries 6 times more\nthan previous SOTA in difficult cases where there are chances of confabulations\nor missing information. These results highlight uMedSum's effectiveness and\ngeneralizability across various datasets and metrics, marking a significant\nadvancement in medical summarization.\n","authors":["Aishik Nagar","Yutong Liu","Andy T. Liu","Viktor Schlegel","Vijay Prakash Dwivedi","Arun-Kumar Kaliya-Perumal","Guna Pratheep Kalanchiam","Yili Tang","Robby T. Tan"],"pdf_url":"https://arxiv.org/pdf/2408.12095v2.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2310.03328v3","updated":"2024-08-26T02:05:37Z","published":"2023-10-05T05:55:06Z","title":"Reformulating Domain Adaptation of Large Language Models as\n Adapt-Retrieve-Revise: A Case Study on Chinese Legal Domain","summary":" While large language models (LLMs) like GPT-4 have recently demonstrated\nastonishing zero-shot capabilities in general domain tasks, they often generate\ncontent with hallucinations in specific domains such as Chinese law, hindering\ntheir application in these areas. 
This is typically due to the absence of\ntraining data that encompasses such a specific domain, preventing GPT-4 from\nacquiring in-domain knowledge. A pressing challenge is that it's not plausible\nto continue training LLMs of such scale on in-domain data.\n This paper introduces a simple and effective domain adaptation framework for\nGPT-4 by reformulating generation as an \\textbf{adapt-retrieve-revise} process.\nThe initial step is to \\textbf{adapt} an affordable 7B LLM to the target domain\nby continuing learning on in-domain data. When solving a task, we leverage the\nadapted LLM to generate a draft answer given a task query. Then, the draft\nanswer will be used to \\textbf{retrieve} supporting evidence candidates from an\nexternal in-domain knowledge base. Finally, the draft answer and retrieved\nevidence are concatenated into a whole prompt to let GPT-4 assess the evidence\nand \\textbf{revise} the draft answer to generate the final answer.\n Our proposal combines the advantages of the efficiency of adapting a smaller\n7B model with the evidence-assessing capability of GPT-4 and effectively\nprevents GPT-4 from generating hallucinatory content. In the zero-shot setting\nof four Chinese legal tasks, our method improves accuracy by 33.3\\% compared to\nthe direct generation by GPT-4. When compared to two stronger retrieval-based\nbaselines, our method outperforms them by 15.4\\% and 23.9\\%. Our code will be\nreleased\n","authors":["Zhen wan","Yating Zhang","Yexiang Wang","Fei Cheng","Sadao Kurohashi"],"pdf_url":"https://arxiv.org/pdf/2310.03328v3.pdf","comment":"Accepted by ACL 2024 Findings"},{"id":"http://arxiv.org/abs/2407.07275v2","updated":"2024-08-26T00:55:01Z","published":"2024-07-09T23:39:37Z","title":"Remastering Divide and Remaster: A Cinematic Audio Source Separation\n Dataset with Multilingual Support","summary":" Cinematic audio source separation (CASS), as a problem of extracting the\ndialogue, music, and effects stems from their mixture, is a relatively new\nsubtask of audio source separation. To date, only one publicly available\ndataset exists for CASS, that is, the Divide and Remaster (DnR) dataset, which\nis currently at version 2. While DnR v2 has been an incredibly useful resource\nfor CASS, several areas of improvement have been identified, particularly\nthrough its use in the 2023 Sound Demixing Challenge. In this work, we develop\nversion 3 of the DnR dataset, addressing issues relating to vocal content in\nnon-dialogue stems, loudness distributions, mastering process, and linguistic\ndiversity. In particular, the dialogue stem of DnR v3 includes speech content\nfrom more than 30 languages from multiple families including but not limited to\nthe Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu\nfamilies. Benchmark results using the Bandit model indicated that training on\nmultilingual data yields significant generalizability to the model even in\nlanguages with low data availability. Even in languages with high data\navailability, the multilingual model often performs on par or better than\ndedicated models trained on monolingual CASS datasets. Dataset and model\nimplementation will be made available at\nhttps://github.com/kwatcharasupat/source-separation-landing.\n","authors":["Karn N. Watcharasupat","Chih-Wei Wu","Iroro Orife"],"pdf_url":"https://arxiv.org/pdf/2407.07275v2.pdf","comment":"Accepted to the 5th IEEE International Symposium on the Internet of\n Sounds. 
Camera-ready version"},{"id":"http://arxiv.org/abs/2408.13966v1","updated":"2024-08-26T00:23:56Z","published":"2024-08-26T00:23:56Z","title":"Reducing the Cost: Cross-Prompt Pre-Finetuning for Short Answer Scoring","summary":" Automated Short Answer Scoring (SAS) is the task of automatically scoring a\ngiven input to a prompt based on rubrics and reference answers. Although SAS is\nuseful in real-world applications, both rubrics and reference answers differ\nbetween prompts, thus requiring a need to acquire new data and train a model\nfor each new prompt. Such requirements are costly, especially for schools and\nonline courses where resources are limited and only a few prompts are used. In\nthis work, we attempt to reduce this cost through a two-phase approach: train a\nmodel on existing rubrics and answers with gold score signals and finetune it\non a new prompt. Specifically, given that scoring rubrics and reference answers\ndiffer for each prompt, we utilize key phrases, or representative expressions\nthat the answer should contain to increase scores, and train a SAS model to\nlearn the relationship between key phrases and answers using already annotated\nprompts (i.e., cross-prompts). Our experimental results show that finetuning on\nexisting cross-prompt data with key phrases significantly improves scoring\naccuracy, especially when the training data is limited. Finally, our extensive\nanalysis shows that it is crucial to design the model so that it can learn the\ntask's general property.\n","authors":["Hiroaki Funayama","Yuya Asazuma","Yuichiroh Matsubayashi","Tomoya Mizumoto","Kentaro Inui"],"pdf_url":"https://arxiv.org/pdf/2408.13966v1.pdf","comment":"This is the draft submitted to AIED 2023. For the latest version,\n please visit: https://link.springer.com/chapter/10.1007/978-3-031-36272-9_7"},{"id":"http://arxiv.org/abs/2408.14698v1","updated":"2024-08-26T23:52:27Z","published":"2024-08-26T23:52:27Z","title":"Smart Multi-Modal Search: Contextual Sparse and Dense Embedding\n Integration in Adobe Express","summary":" As user content and queries become increasingly multi-modal, the need for\neffective multi-modal search systems has grown. Traditional search systems\noften rely on textual and metadata annotations for indexed images, while\nmulti-modal embeddings like CLIP enable direct search using text and image\nembeddings. However, embedding-based approaches face challenges in integrating\ncontextual features such as user locale and recency. Building a scalable\nmulti-modal search system requires fine-tuning several components. This paper\npresents a multi-modal search architecture and a series of AB tests that\noptimize embeddings and multi-modal technologies in Adobe Express template\nsearch. We address considerations such as embedding model selection, the roles\nof embeddings in matching and ranking, and the balance between dense and sparse\nembeddings. Our iterative approach demonstrates how utilizing sparse, dense,\nand contextual features enhances short and long query search, significantly\nreduces null rates (over 70\\%), and increases click-through rates (CTR). 
Our\nfindings provide insights into developing robust multi-modal search systems,\nthereby enhancing relevance for complex queries.\n","authors":["Cherag Aroraa","Tracy Holloway King","Jayant Kumar","Yi Lu","Sanat Sharma","Arvind Srikantan","David Uvalle","Josep Valls-Vargas","Harsha Vardhan"],"pdf_url":"https://arxiv.org/pdf/2408.14698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14690v1","updated":"2024-08-26T23:30:15Z","published":"2024-08-26T23:30:15Z","title":"Training-Free Activation Sparsity in Large Language Models","summary":" Activation sparsity can enable practical inference speedups in large language\nmodels (LLMs) by reducing the compute and memory-movement required for matrix\nmultiplications during the forward pass. However, existing methods face\nlimitations that inhibit widespread adoption. Some approaches are tailored\ntowards older models with ReLU-based sparsity, while others require extensive\ncontinued pre-training on up to hundreds of billions of tokens. This paper\ndescribes TEAL, a simple training-free method that applies magnitude-based\nactivation sparsity to hidden states throughout the entire model. TEAL achieves\n40-50% model-wide sparsity with minimal performance degradation across Llama-2,\nLlama-3, and Mistral families, with sizes varying from 7B to 70B. We improve\nexisting sparse kernels and demonstrate wall-clock decoding speed-ups of up to\n1.53$\\times$ and 1.8$\\times$ at 40% and 50% model-wide sparsity. TEAL is\ncompatible with weight quantization, enabling further efficiency gains.\n","authors":["James Liu","Pragaash Ponnusamy","Tianle Cai","Han Guo","Yoon Kim","Ben Athiwaratkun"],"pdf_url":"https://arxiv.org/pdf/2408.14690v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.06264v2","updated":"2024-08-26T22:53:51Z","published":"2024-02-09T09:25:18Z","title":"LLaVA-Docent: Instruction Tuning with Multimodal Large Language Model to\n Support Art Appreciation Education","summary":" Art appreciation is vital in nurturing critical thinking and emotional\nintelligence among learners. However, traditional art appreciation education\nhas often been hindered by limited access to art resources, especially for\ndisadvantaged students, and an imbalanced emphasis on STEM subjects in\nmainstream education. In response to these challenges, recent technological\nadvancements have paved the way for innovative solutions. This study explores\nthe application of multi-modal large language models (MLLMs) in art\nappreciation education, focusing on developing LLaVA-Docent, a model that\nleverages these advancements. Our approach involved a comprehensive literature\nreview and consultations with experts in the field, leading to developing a\nrobust data framework. Utilizing this framework, we generated a virtual\ndialogue dataset that was leveraged by GPT-4. This dataset was instrumental in\ntraining the MLLM, named LLaVA-Docent. Six researchers conducted quantitative\nand qualitative evaluations of LLaVA-Docent to assess its effectiveness,\nbenchmarking it against the GPT-4 model in a few-shot setting. The evaluation\nprocess revealed distinct strengths and weaknesses of the LLaVA-Docent model.\nOur findings highlight the efficacy of LLaVA-Docent in enhancing the\naccessibility and engagement of art appreciation education. 
By harnessing the\npotential of MLLMs, this study makes a significant contribution to the field of\nart education, proposing a novel methodology that reimagines the way art\nappreciation is taught and experienced.\n","authors":["Unggi Lee","Minji Jeon","Yunseo Lee","Gyuri Byun","Yoorim Son","Jaeyoon Shin","Hongkyu Ko","Hyeoncheol Kim"],"pdf_url":"https://arxiv.org/pdf/2402.06264v2.pdf","comment":"37 pages, 4 figures, 10 tables"},{"id":"http://arxiv.org/abs/2406.10774v2","updated":"2024-08-26T21:01:02Z","published":"2024-06-16T01:33:02Z","title":"Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference","summary":" As the demand for long-context large language models (LLMs) increases, models\nwith context windows of up to 128K or 1M tokens are becoming increasingly\nprevalent. However, long-context LLM inference is challenging since the\ninference speed decreases significantly as the sequence length grows. This\nslowdown is primarily caused by loading a large KV cache during self-attention.\nPrevious works have shown that a small portion of critical tokens will dominate\nthe attention outcomes. However, we observe the criticality of a token highly\ndepends on the query. To this end, we propose Quest, a query-aware KV cache\nselection algorithm. Quest keeps track of the minimal and maximal Key values in\nKV cache pages and estimates the criticality of a given page using Query\nvectors. By only loading the Top-K critical KV cache pages for attention, Quest\nsignificantly speeds up self-attention without sacrificing accuracy. We show\nthat Quest can achieve up to 2.23x self-attention speedup, which reduces\ninference latency by 7.03x while performing well on tasks with long\ndependencies with negligible accuracy loss. Code is available at\nhttp://github.com/mit-han-lab/Quest .\n","authors":["Jiaming Tang","Yilong Zhao","Kan Zhu","Guangxuan Xiao","Baris Kasikci","Song Han"],"pdf_url":"https://arxiv.org/pdf/2406.10774v2.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2408.14636v1","updated":"2024-08-26T21:00:25Z","published":"2024-08-26T21:00:25Z","title":"Relationships are Complicated! An Analysis of Relationships Between\n Datasets on the Web","summary":" The Web today has millions of datasets, and the number of datasets continues\nto grow at a rapid pace. These datasets are not standalone entities; rather,\nthey are intricately connected through complex relationships. Semantic\nrelationships between datasets provide critical insights for research and\ndecision-making processes. In this paper, we study dataset relationships from\nthe perspective of users who discover, use, and share datasets on the Web: what\nrelationships are important for different tasks? What contextual information\nmight users want to know? We first present a comprehensive taxonomy of\nrelationships between datasets on the Web and map these relationships to user\ntasks performed during dataset discovery. We develop a series of methods to\nidentify these relationships and compare their performance on a large corpus of\ndatasets generated from Web pages with schema.org markup. We demonstrate that\nmachine-learning based methods that use dataset metadata achieve multi-class\nclassification accuracy of 90%. Finally, we highlight gaps in available\nsemantic markup for datasets and discuss how incorporating comprehensive\nsemantics can facilitate the identification of dataset relationships. 
By\nproviding a comprehensive overview of dataset relationships at scale, this\npaper sets a benchmark for future research.\n","authors":["Kate Lin","Tarfah Alrashed","Natasha Noy"],"pdf_url":"https://arxiv.org/pdf/2408.14636v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.03605v2","updated":"2024-08-26T20:48:19Z","published":"2024-04-04T17:25:30Z","title":"Mitigating the Impact of Outlier Channels for Language Model\n Quantization with Activation Regularization","summary":" We consider the problem of accurate quantization for language models, where\nboth the weights and activations are uniformly quantized to 4 bits per\nparameter, the lowest bitwidth format natively supported by GPU hardware. In\nthis context, the key challenge is activation quantization: it is known that\nlanguage models contain outlier channels whose values on average are orders of\nmagnitude higher than than other channels, which prevents accurate low-bitwidth\nquantization with known techniques. We systematically study this phenomena and\nfind that these outlier channels emerge early in training, and that they occur\nmore frequently in layers with residual streams. We then propose a simple\nstrategy which regularizes a layer's inputs via quantization-aware training\n(QAT) and its outputs via activation kurtosis regularization. We show that\nregularizing both the inputs and outputs is crucial for preventing a model's\n\"migrating\" the difficulty in input quantization to the weights, which makes\npost-training quantization (PTQ) of weights more difficult. When combined with\nweight PTQ, we show that our approach can obtain a W4A4 model that performs\ncompetitively to the standard-precision W16A16 baseline.\n","authors":["Aniruddha Nrusimha","Mayank Mishra","Naigang Wang","Dan Alistarh","Rameswar Panda","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2404.03605v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14623v1","updated":"2024-08-26T20:36:52Z","published":"2024-08-26T20:36:52Z","title":"MODOC: A Modular Interface for Flexible Interlinking of Text Retrieval\n and Text Generation Functions","summary":" Large Language Models (LLMs) produce eloquent texts but often the content\nthey generate needs to be verified. Traditional information retrieval systems\ncan assist with this task, but most systems have not been designed with\nLLM-generated queries in mind. As such, there is a compelling need for\nintegrated systems that provide both retrieval and generation functionality\nwithin a single user interface.\n We present MODOC, a modular user interface that leverages the capabilities of\nLLMs and provides assistance with detecting their confabulations, promoting\nintegrity in scientific writing. MODOC represents a significant step forward in\nscientific writing assistance. Its modular architecture supports flexible\nfunctions for retrieving information and for writing and generating text in a\nsingle, user-friendly interface.\n","authors":["Yingqiang Gao","Jhony Prada","Nianlong Gu","Jessica Lam","Richard H. R. Hahnloser"],"pdf_url":"https://arxiv.org/pdf/2408.14623v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14622v1","updated":"2024-08-26T20:35:42Z","published":"2024-08-26T20:35:42Z","title":"What Makes a Good Story and How Can We Measure It? A Comprehensive\n Survey of Story Evaluation","summary":" With the development of artificial intelligence, particularly the success of\nLarge Language Models (LLMs), the quantity and quality of automatically\ngenerated stories have significantly increased. 
This has led to the need for\nautomatic story evaluation to assess the generative capabilities of computing\nsystems and analyze the quality of both automatically generated and human-written\nstories. Evaluating a story can be more challenging than other generation\nevaluation tasks. While tasks like machine translation primarily focus on\nassessing the aspects of fluency and accuracy, story evaluation demands complex\nadditional measures such as overall coherence, character development,\ninterestingness, etc. This requires a thorough review of relevant research. In\nthis survey, we first summarize existing storytelling tasks, including\ntext-to-text, visual-to-text, and text-to-visual. We highlight their evaluation\nchallenges, identify various human criteria to measure stories, and present\nexisting benchmark datasets. Then, we propose a taxonomy to organize evaluation\nmetrics that have been developed or can be adopted for story evaluation. We\nalso provide descriptions of these metrics, along with the discussion of their\nmerits and limitations. Later, we discuss the human-AI collaboration for story\nevaluation and generation. Finally, we suggest potential future research\ndirections, extending from story evaluation to general evaluations.\n","authors":["Dingyi Yang","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2408.14622v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.01663v3","updated":"2024-08-26T20:30:40Z","published":"2024-04-02T06:07:35Z","title":"CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small\n Language Models","summary":" Open large language models (LLMs) have significantly advanced the field of\nnatural language processing, showcasing impressive performance across various\ntasks. Despite the significant advancements in LLMs, their effective operation\nstill relies heavily on human input to accurately guide the dialogue flow, with\nagent tuning being a crucial optimization technique that involves human\nadjustments to the model for better response to such guidance. Addressing this\ndependency, our work introduces the TinyAgent model, trained on a meticulously\ncurated high-quality dataset. We also present the Collaborative Multi-Agent\nTuning (CMAT) framework, an innovative system designed to augment language\nagent capabilities through adaptive weight updates based on environmental\nfeedback. This framework fosters collaborative learning and real-time\nadaptation among multiple intelligent agents, enhancing their context-awareness\nand long-term memory. In this research, we propose a new communication agent\nframework that integrates multi-agent systems with environmental feedback\nmechanisms, offering a scalable method to explore cooperative behaviors.\nNotably, our TinyAgent-7B model exhibits performance on par with GPT-3.5,\ndespite having fewer parameters, signifying a substantial improvement in the\nefficiency and effectiveness of LLMs.\n","authors":["Xuechen Liang","Meiling Tao","Yinghui Xia","Tianyu Shi","Jun Wang","JingSong Yang"],"pdf_url":"https://arxiv.org/pdf/2404.01663v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.06494v2","updated":"2024-08-26T20:10:52Z","published":"2024-08-12T21:04:16Z","title":"What Color Scheme is More Effective in Assisting Readers to Locate\n Information in a Color-Coded Article?","summary":" Color coding, a technique assigning specific colors to cluster information\ntypes, has proven advantages in aiding human cognitive activities, especially\nreading and comprehension. 
The rise of Large Language Models (LLMs) has\nstreamlined document coding, enabling simple automatic text labeling with\nvarious schemes. This has the potential to make color-coding more accessible\nand benefit more users. However, the impact of color choice on information\nseeking is understudied. We conducted a user study assessing various color\nschemes' effectiveness in LLM-coded text documents, standardizing contrast\nratios to approximately 5.55:1 across schemes. Participants performed timed\ninformation-seeking tasks in color-coded scholarly abstracts. Results showed\nnon-analogous and yellow-inclusive color schemes improved performance, with the\nlatter also being more preferred by participants. These findings can inform\nbetter color scheme choices for text annotation. As LLMs advance document\ncoding, we advocate for more research focusing on the \"color\" aspect of\ncolor-coding techniques.\n","authors":["Ho Yin Ng","Zeyu He","Ting-Hao 'Kenneth' Huang"],"pdf_url":"https://arxiv.org/pdf/2408.06494v2.pdf","comment":"This paper will appear at IEEE VIS 2024"},{"id":"http://arxiv.org/abs/2405.11083v2","updated":"2024-08-26T20:02:16Z","published":"2024-05-17T20:30:49Z","title":"Prompt Exploration with Prompt Regression","summary":" In the advent of democratized usage of large language models (LLMs), there is\na growing desire to systematize LLM prompt creation and selection processes\nbeyond iterative trial-and-error. Prior works majorly focus on searching the\nspace of prompts without accounting for relations between prompt variations.\nHere we propose a framework, Prompt Exploration with Prompt Regression (PEPR),\nto predict the effect of prompt combinations given results for individual\nprompt elements as well as a simple method to select an effective prompt for a\ngiven use-case. We evaluate our approach with open-source LLMs of different\nsizes on several different tasks.\n","authors":["Michael Feffer","Ronald Xu","Yuekai Sun","Mikhail Yurochkin"],"pdf_url":"https://arxiv.org/pdf/2405.11083v2.pdf","comment":"COLM 2024"},{"id":"http://arxiv.org/abs/2406.06484v2","updated":"2024-08-26T19:50:37Z","published":"2024-06-10T17:24:42Z","title":"Parallelizing Linear Transformers with the Delta Rule over Sequence\n Length","summary":" Transformers with linear attention (i.e., linear transformers) and\nstate-space models have recently been suggested as a viable linear-time\nalternative to transformers with softmax attention. However, these models still\nunderperform transformers especially on tasks that require in-context\nretrieval. While more expressive variants of linear transformers which replace\nthe additive outer-product update in linear transformers with the delta rule\nhave been found to be more effective at associative recall, existing algorithms\nfor training such models do not parallelize over sequence length and are thus\ninefficient to train on modern hardware. This work describes a\nhardware-efficient algorithm for training linear transformers with the delta\nrule, which exploits a memory-efficient representation for computing products\nof Householder matrices. This algorithm allows us to scale up DeltaNet to\nstandard language modeling settings. We train a 1.3B model for 100B tokens and\nfind that it outperforms recent linear-time baselines such as Mamba and GLA in\nterms of perplexity and zero-shot performance on downstream tasks (including on\ntasks that focus on recall). 
We also experiment with two hybrid models which\ncombine DeltaNet layers with (1) sliding-window attention layers every other\nlayer or (2) two global attention layers, and find that these hybrid models\noutperform strong transformer baselines.\n","authors":["Songlin Yang","Bailin Wang","Yu Zhang","Yikang Shen","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2406.06484v2.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2408.14595v1","updated":"2024-08-26T19:26:55Z","published":"2024-08-26T19:26:55Z","title":"Surprisingly Fragile: Assessing and Addressing Prompt Instability in\n Multimodal Foundation Models","summary":" Multimodal foundation models (MFMs) such as OFASys show the potential to\nunlock analysis of complex data such as images, videos, and audio data via text\nprompts alone. However, their performance may suffer in the face of text input\nthat differs even slightly from their training distribution, which is\nsurprising considering the use of modality-specific data to \"ground\" the text\ninput. This study demonstrates that prompt instability is a major concern for\nMFMs, leading to a consistent drop in performance across all modalities, but\nthat instability can be mitigated with additional training with augmented data.\nWe evaluate several methods for grounded prompt perturbation, where we generate\nperturbations and filter based on similarity to text and/or modality data.\nAfter re-training the models on the augmented data, we find improved accuracy\nand more stable performance on the perturbed test data regardless of\nperturbation condition, suggesting that the data augmentation strategy helps\nthe models handle domain shifts more effectively. In error analysis, we find\nconsistent patterns of performance improvement across domains, suggesting that\nretraining on prompt perturbations tends to help general reasoning capabilities\nin MFMs.\n","authors":["Ian Stewart","Sameera Horawalavithana","Brendan Kennedy","Sai Munikoti","Karl Pazdernik"],"pdf_url":"https://arxiv.org/pdf/2408.14595v1.pdf","comment":"in submission"},{"id":"http://arxiv.org/abs/2402.17700v2","updated":"2024-08-26T19:26:06Z","published":"2024-02-27T17:25:37Z","title":"RAVEL: Evaluating Interpretability Methods on Disentangling Language\n Model Representations","summary":" Individual neurons participate in the representation of multiple high-level\nconcepts. To what extent can different interpretability methods successfully\ndisentangle these roles? To help address this question, we introduce RAVEL\n(Resolving Attribute-Value Entanglements in Language Models), a dataset that\nenables tightly controlled, quantitative comparisons between a variety of\nexisting interpretability methods. We use the resulting conceptual framework to\ndefine the new method of Multi-task Distributed Alignment Search (MDAS), which\nallows us to find distributed representations satisfying multiple causal\ncriteria. With Llama2-7B as the target language model, MDAS achieves\nstate-of-the-art results on RAVEL, demonstrating the importance of going beyond\nneuron-level analyses to identify features distributed across activations. 
We\nrelease our benchmark at https://github.com/explanare/ravel.\n","authors":["Jing Huang","Zhengxuan Wu","Christopher Potts","Mor Geva","Atticus Geiger"],"pdf_url":"https://arxiv.org/pdf/2402.17700v2.pdf","comment":"Proceedings of the 62nd Annual Meeting of the Association for\n Computational Linguistics (ACL 2024)"},{"id":"http://arxiv.org/abs/2408.14572v1","updated":"2024-08-26T18:42:59Z","published":"2024-08-26T18:42:59Z","title":"CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting\n Mitigation","summary":" This paper introduces CURLoRA, a novel approach to fine-tuning large language\nmodels (LLMs) that leverages CUR matrix decomposition in the context of\nLow-Rank Adaptation (LoRA). Our method addresses two critical challenges in LLM\nfine-tuning: mitigating catastrophic forgetting during continual learning and\nreducing the number of trainable parameters. We propose a unique modification\nto the CUR decomposition process, utilizing inverted probabilities for column\nand row selection which acts as an implicit regularization, and initializing\nthe $U$ matrix as a zero matrix, and only fine-tuning it. We demonstrate\nthrough experiments on multiple datasets that CURLoRA outperforms standard LoRA\nin mitigating catastrophic forgetting. It maintains model stability and\nperformance across tasks while significantly reducing the number of trainable\nparameters. Our results show that CURLoRA achieves very good and stable task\naccuracy while maintaining base model's perplexity scores fixed compared to\nLoRA upon continual fine-tuning, particularly in scenarios with limited data.\n","authors":["Muhammad Fawi"],"pdf_url":"https://arxiv.org/pdf/2408.14572v1.pdf","comment":"Code available at https://github.com/MNoorFawi/curlora"},{"id":"http://arxiv.org/abs/2408.14568v1","updated":"2024-08-26T18:39:31Z","published":"2024-08-26T18:39:31Z","title":"Improving Clinical Note Generation from Complex Doctor-Patient\n Conversation","summary":" Writing clinical notes and documenting medical exams is a critical task for\nhealthcare professionals, serving as a vital component of patient care\ndocumentation. However, manually writing these notes is time-consuming and can\nimpact the amount of time clinicians can spend on direct patient interaction\nand other tasks. Consequently, the development of automated clinical note\ngeneration systems has emerged as a clinically meaningful area of research\nwithin AI for health. In this paper, we present three key contributions to the\nfield of clinical note generation using large language models (LLMs). First, we\nintroduce CliniKnote, a comprehensive dataset consisting of 1,200 complex\ndoctor-patient conversations paired with their full clinical notes. This\ndataset, created and curated by medical experts with the help of modern neural\nnetworks, provides a valuable resource for training and evaluating models in\nclinical note generation tasks. Second, we propose the K-SOAP (Keyword,\nSubjective, Objective, Assessment, and Plan) note format, which enhances\ntraditional SOAP~\\cite{podder2023soap} (Subjective, Objective, Assessment, and\nPlan) notes by adding a keyword section at the top, allowing for quick\nidentification of essential information. Third, we develop an automatic\npipeline to generate K-SOAP notes from doctor-patient conversations and\nbenchmark various modern LLMs using various metrics. 
Our results demonstrate\nsignificant improvements in efficiency and performance compared to standard LLM\nfinetuning methods.\n","authors":["Yizhan Li","Sifan Wu","Christopher Smith","Thomas Lo","Bang Liu"],"pdf_url":"https://arxiv.org/pdf/2408.14568v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14547v1","updated":"2024-08-26T18:00:33Z","published":"2024-08-26T18:00:33Z","title":"Revisiting Image Captioning Training Paradigm via Direct CLIP-based\n Optimization","summary":" The conventional training approach for image captioning involves pre-training\na network using teacher forcing and subsequent fine-tuning with Self-Critical\nSequence Training to maximize hand-crafted captioning metrics. However, when\nattempting to optimize modern and higher-quality metrics like CLIP-Score and\nPAC-Score, this training method often encounters instability and fails to\nacquire the genuine descriptive capabilities needed to produce fluent and\ninformative captions. In this paper, we propose a new training paradigm termed\nDirect CLIP-Based Optimization (DiCO). Our approach jointly learns and\noptimizes a reward model that is distilled from a learnable captioning\nevaluator with high human correlation. This is done by solving a weighted\nclassification problem directly inside the captioner. At the same time, DiCO\nprevents divergence from the original model, ensuring that fluency is\nmaintained. DiCO not only exhibits improved stability and enhanced quality in\nthe generated captions but also aligns more closely with human preferences\ncompared to existing methods, especially in modern metrics. Additionally, it\nmaintains competitive performance in traditional metrics. Our source code and\ntrained models are publicly available at https://github.com/aimagelab/DiCO.\n","authors":["Nicholas Moratelli","Davide Caffagni","Marcella Cornia","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2408.14547v1.pdf","comment":"BMVC 2024"},{"id":"http://arxiv.org/abs/2012.12311v4","updated":"2024-08-26T15:34:13Z","published":"2020-12-22T19:32:52Z","title":"Unboxing Engagement in YouTube Influencer Videos: An Attention-Based\n Approach","summary":" Influencer marketing videos have surged in popularity, yet significant gaps\nremain in understanding the relationship between video features and engagement.\nThis challenge is intensified by the complexities of interpreting unstructured\ndata. While deep learning models effectively leverage unstructured data to\npredict business outcomes, they often function as black boxes with limited\ninterpretability, particularly when human validation is hindered by the absence\nof a known ground truth. To address this issue, the authors develop an\n\"interpretable deep learning framework\" that not only makes good out-of-sample\npredictions using unstructured data but also provides insights into the\ncaptured relationships. Inspired by visual attention in print advertising, the\ninterpretation approach uses measures of model attention to video features,\neliminating spurious associations through a two-step process and shortlisting\nrelationships for formal causal testing. This method is applicable across\nwell-known attention mechanisms - additive attention, scaled dot-product\nattention, and gradient-based attention - when analyzing text, audio, or video\nimage data. Validated using simulations, this approach outperforms benchmark\nfeature selection methods. 
This framework is applied to YouTube influencer\nvideos, linking video features to measures of shallow and deep engagement\ndeveloped based on the dual-system framework of thinking. The findings guide\ninfluencers and brands in prioritizing video features associated with deep\nengagement.\n","authors":["Prashant Rajaram","Puneet Manchanda"],"pdf_url":"https://arxiv.org/pdf/2012.12311v4.pdf","comment":"50 pages, Online Appendix"},{"id":"http://arxiv.org/abs/2408.13985v1","updated":"2024-08-26T02:35:37Z","published":"2024-08-26T02:35:37Z","title":"TF-Attack: Transferable and Fast Adversarial Attacks on Large Language\n Models","summary":" With the great advancements in large language models (LLMs), adversarial\nattacks against LLMs have recently attracted increasing attention. We found\nthat pre-existing adversarial attack methodologies exhibit limited\ntransferability and are notably inefficient, particularly when applied to LLMs.\nIn this paper, we analyze the core mechanisms of previous predominant\nadversarial attack methods, revealing that 1) the distributions of importance\nscore differ markedly among victim models, restricting the transferability; 2)\nthe sequential attack processes induces substantial time overheads. Based on\nthe above two insights, we introduce a new scheme, named TF-Attack, for\nTransferable and Fast adversarial attacks on LLMs. TF-Attack employs an\nexternal LLM as a third-party overseer rather than the victim model to identify\ncritical units within sentences. Moreover, TF-Attack introduces the concept of\nImportance Level, which allows for parallel substitutions of attacks. We\nconduct extensive experiments on 6 widely adopted benchmarks, evaluating the\nproposed method through both automatic and human metrics. Results show that our\nmethod consistently surpasses previous methods in transferability and delivers\nsignificant speed improvements, up to 20 times faster than earlier attack\nstrategies.\n","authors":["Zelin Li","Kehai Chen","Xuefeng Bai","Lemao Liu","Mingming Yang","Yang Xiang","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.13985v1.pdf","comment":"14 pages, 6 figures"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2408.14471v1","updated":"2024-08-26T17:59:01Z","published":"2024-08-26T17:59:01Z","title":"A Practitioner's Guide to Continual Multimodal Pretraining","summary":" Multimodal foundation models serve numerous applications at the intersection\nof vision and language. Still, despite being pretrained on extensive data, they\nbecome outdated over time. To keep models updated, research into continual\npretraining mainly explores scenarios with either (1) infrequent,\nindiscriminate updates on large-scale new data, or (2) frequent, sample-level\nupdates. However, practical model deployment often operates in the gap between\nthese two limit cases, as real-world applications often demand adaptation to\nspecific subdomains, tasks or concepts -- spread over the entire, varying life\ncycle of a model. In this work, we complement current perspectives on continual\npretraining through a research test bed as well as provide comprehensive\nguidance for effective continual model updates in such scenarios. We first\nintroduce FoMo-in-Flux, a continual multimodal pretraining benchmark with\nrealistic compute constraints and practical deployment requirements,\nconstructed over 63 datasets with diverse visual and semantic coverage. 
Using\nFoMo-in-Flux, we explore the complex landscape of practical continual\npretraining through multiple perspectives: (1) A data-centric investigation of\ndata mixtures and stream orderings that emulate real-world deployment\nsituations, (2) a method-centric investigation ranging from simple fine-tuning\nand traditional continual learning strategies to parameter-efficient updates\nand model merging, (3) meta learning rate schedules and mechanistic design\nchoices, and (4) the influence of model and compute scaling. Together, our\ninsights provide a practitioner's guide to continual multimodal pretraining for\nreal-world deployment. Our benchmark and code is here:\nhttps://github.com/ExplainableML/fomo_in_flux.\n","authors":["Karsten Roth","Vishaal Udandarao","Sebastian Dziadzio","Ameya Prabhu","Mehdi Cherti","Oriol Vinyals","Olivier Hénaff","Samuel Albanie","Matthias Bethge","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2408.14471v1.pdf","comment":"Technical Report. 52 pages"},{"id":"http://arxiv.org/abs/2408.14469v1","updated":"2024-08-26T17:58:47Z","published":"2024-08-26T17:58:47Z","title":"Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos","summary":" This paper considers the problem of Multi-Hop Video Question Answering\n(MH-VidQA) in long-form egocentric videos. This task not only requires to\nanswer visual questions, but also to localize multiple relevant time intervals\nwithin the video as visual evidences. We develop an automated pipeline to\ncreate multi-hop question-answering pairs with associated temporal evidence,\nenabling to construct a large-scale dataset for instruction-tuning. To monitor\nthe progress of this new task, we further curate a high-quality benchmark,\nMultiHop-EgoQA, with careful manual verification and refinement. Experimental\nresults reveal that existing multi-modal systems exhibit inadequate multi-hop\ngrounding and reasoning abilities, resulting in unsatisfactory performance. We\nthen propose a novel architecture, termed as Grounding Scattered Evidence with\nLarge Language Model (GeLM), that enhances multi-modal large language models\n(MLLMs) by incorporating a grounding module to retrieve temporal evidence from\nvideos using flexible grounding tokens. Trained on our visual instruction data,\nGeLM demonstrates improved multi-hop grounding and reasoning capabilities,\nsetting a new baseline for this challenging task. Furthermore, when trained on\nthird-person view videos, the same architecture also achieves state-of-the-art\nperformance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating\nits effectiveness.\n","authors":["Qirui Chen","Shangzhe Di","Weidi Xie"],"pdf_url":"https://arxiv.org/pdf/2408.14469v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14457v1","updated":"2024-08-26T17:49:27Z","published":"2024-08-26T17:49:27Z","title":"Dense Center-Direction Regression for Object Counting and Localization\n with Point Supervision","summary":" Object counting and localization problems are commonly addressed with point\nsupervised learning, which allows the use of less labor-intensive point\nannotations. However, learning based on point annotations poses challenges due\nto the high imbalance between the sets of annotated and unannotated pixels,\nwhich is often treated with Gaussian smoothing of point annotations and focal\nloss. However, these approaches still focus on the pixels in the immediate\nvicinity of the point annotations and exploit the rest of the data only\nindirectly. 
In this work, we propose a novel approach termed CeDiRNet for\npoint-supervised learning that uses a dense regression of directions pointing\ntowards the nearest object centers, i.e. center-directions. This provides\ngreater support for each center point arising from many surrounding pixels\npointing towards the object center. We propose a formulation of\ncenter-directions that allows the problem to be split into the domain-specific\ndense regression of center-directions and the final localization task based on\na small, lightweight, and domain-agnostic localization network that can be\ntrained with synthetic data completely independent of the target domain. We\ndemonstrate the performance of the proposed method on six different datasets\nfor object counting and localization, and show that it outperforms the existing\nstate-of-the-art methods. The code is accessible on GitHub at\nhttps://github.com/vicoslab/CeDiRNet.git.\n","authors":["Domen Tabernik","Jon Muhovič","Danijel Skočaj"],"pdf_url":"https://arxiv.org/pdf/2408.14457v1.pdf","comment":"Published in Pattern Recognition"},{"id":"http://arxiv.org/abs/2408.14456v1","updated":"2024-08-26T17:49:05Z","published":"2024-08-26T17:49:05Z","title":"Center Direction Network for Grasping Point Localization on Cloths","summary":" Object grasping is a fundamental challenge in robotics and computer vision,\ncritical for advancing robotic manipulation capabilities. Deformable objects,\nlike fabrics and cloths, pose additional challenges due to their non-rigid\nnature. In this work, we introduce CeDiRNet-3DoF, a deep-learning model for\ngrasp point detection, with a particular focus on cloth objects. CeDiRNet-3DoF\nemploys center direction regression alongside a localization network, attaining\nfirst place in the perception task of ICRA 2023's Cloth Manipulation Challenge.\nRecognizing the lack of standardized benchmarks in the literature that hinder\neffective method comparison, we present the ViCoS Towel Dataset. This extensive\nbenchmark dataset comprises 8,000 real and 12,000 synthetic images, serving as\na robust resource for training and evaluating contemporary data-driven\ndeep-learning approaches. Extensive evaluation revealed CeDiRNet-3DoF's\nrobustness in real-world performance, outperforming state-of-the-art methods,\nincluding the latest transformer-based models. Our work bridges a crucial gap,\noffering a robust solution and benchmark for cloth grasping in computer vision\nand robotics. Code and dataset are available at:\nhttps://github.com/vicoslab/CeDiRNet-3DoF\n","authors":["Domen Tabernik","Jon Muhovič","Matej Urbas","Danijel Skočaj"],"pdf_url":"https://arxiv.org/pdf/2408.14456v1.pdf","comment":"Accepted for publication in IEEE Robotics and Automation Letters"},{"id":"http://arxiv.org/abs/2408.14442v1","updated":"2024-08-26T17:35:01Z","published":"2024-08-26T17:35:01Z","title":"Model Parallel Training and Transfer Learning for Convolutional Neural\n Networks by Domain Decomposition","summary":" Deep convolutional neural networks (CNNs) have been shown to be very\nsuccessful in a wide range of image processing applications. However, due to\ntheir increasing number of model parameters and an increasing availability of\nlarge amounts of training data, parallelization strategies to efficiently train\ncomplex CNNs are necessary. In previous work by the authors, a novel model\nparallel CNN architecture was proposed which is loosely inspired by domain\ndecomposition. 
In particular, the novel network architecture is based on a\ndecomposition of the input data into smaller subimages. For each of these\nsubimages, local CNNs with a proportionally smaller number of parameters are\ntrained in parallel and the resulting local classifications are then aggregated\nin a second step by a dense feedforward neural network (DNN). In the present\nwork, we compare the resulting CNN-DNN architecture to less costly alternatives\nto combine the local classifications into a final, global decision.\nAdditionally, we investigate the performance of the CNN-DNN trained as one\ncoherent model as well as using a transfer learning strategy, where the\nparameters of the pre-trained local CNNs are used as initial values for a\nsubsequently trained global coherent CNN-DNN model.\n","authors":["Axel Klawonn","Martin Lanser","Janine Weber"],"pdf_url":"https://arxiv.org/pdf/2408.14442v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14441v1","updated":"2024-08-26T17:33:47Z","published":"2024-08-26T17:33:47Z","title":"Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification","summary":" Exploiting both audio and visual modalities for video classification is a\nchallenging task, as the existing methods require large model architectures,\nleading to high computational complexity and resource requirements. Smaller\narchitectures, on the other hand, struggle to achieve optimal performance. In\nthis paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that\nintroduces a compact model architecture specifically designed to capture\nintricate audio-visual relationships in video data. Through extensive\nexperiments on the challenging YouTube-8M dataset, we demonstrate that\nAttend-Fusion achieves an F1 score of 75.64\\% with only 72M parameters, which\nis comparable to the performance of larger baseline models such as\nFully-Connected Late Fusion (75.96\\% F1 score, 341M parameters). Attend-Fusion\nachieves similar performance to the larger baseline model while reducing the\nmodel size by nearly 80\\%, highlighting its efficiency in terms of model\ncomplexity. Our work demonstrates that the Attend-Fusion model effectively\ncombines audio and visual information for video classification, achieving\ncompetitive performance with significantly reduced model size. This approach\nopens new possibilities for deploying high-performance video understanding\nsystems in resource-constrained environments across various applications.\n","authors":["Mahrukh Awan","Asmar Nadeem","Muhammad Junaid Awan","Armin Mustafa","Syed Sameed Husain"],"pdf_url":"https://arxiv.org/pdf/2408.14441v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14435v1","updated":"2024-08-26T17:21:54Z","published":"2024-08-26T17:21:54Z","title":"Social perception of faces in a vision-language model","summary":" We explore social perception of human faces in CLIP, a widely used\nopen-source vision-language model. To this end, we compare the similarity in\nCLIP embeddings between different textual prompts and a set of face images. Our\ntextual prompts are constructed from well-validated social psychology terms\ndenoting social perception. The face images are synthetic and are\nsystematically and independently varied along six dimensions: the legally\nprotected attributes of age, gender, and race, as well as facial expression,\nlighting, and pose. 
Independently and systematically manipulating face\nattributes allows us to study the effect of each on social perception and\navoids confounds that can occur in wild-collected data due to uncontrolled\nsystematic correlations between attributes. Thus, our findings are experimental\nrather than observational. Our main findings are three. First, while CLIP is\ntrained on the widest variety of images and texts, it is able to make\nfine-grained human-like social judgments on face images. Second, age, gender,\nand race do systematically impact CLIP's social perception of faces, suggesting\nan undesirable bias in CLIP vis-a-vis legally protected attributes. Most\nstrikingly, we find a strong pattern of bias concerning the faces of Black\nwomen, where CLIP produces extreme values of social perception across different\nages and facial expressions. Third, facial expression impacts social perception\nmore than age and lighting as much as age. The last finding predicts that\nstudies that do not control for unprotected visual attributes may reach the\nwrong conclusions on bias. Our novel method of investigation, which is founded\non the social psychology literature and on the experiments involving the\nmanipulation of individual attributes, yields sharper and more reliable\nobservations than previous observational methods and may be applied to study\nbiases in any vision-language model.\n","authors":["Carina I. Hausladen","Manuel Knott","Colin F. Camerer","Pietro Perona"],"pdf_url":"https://arxiv.org/pdf/2408.14435v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14427v1","updated":"2024-08-26T17:15:37Z","published":"2024-08-26T17:15:37Z","title":"Few-Shot 3D Volumetric Segmentation with Multi-Surrogate Fusion","summary":" Conventional 3D medical image segmentation methods typically require learning\nheavy 3D networks (e.g., 3D-UNet), as well as large amounts of in-domain data\nwith accurate pixel/voxel-level labels to avoid overfitting. These solutions\nare thus extremely time- and labor-expensive, but also may easily fail to\ngeneralize to unseen objects during training. To alleviate this issue, we\npresent MSFSeg, a novel few-shot 3D segmentation framework with a lightweight\nmulti-surrogate fusion (MSF). MSFSeg is able to automatically segment unseen 3D\nobjects/organs (during training) provided with one or a few annotated 2D slices\nor 3D sequence segments, via learning dense query-support organ/lesion anatomy\ncorrelations across patient populations. Our proposed MSF module mines\ncomprehensive and diversified morphology correlations between unlabeled and the\nfew labeled slices/sequences through multiple designated surrogates, making it\nable to generate accurate cross-domain 3D segmentation masks given annotated\nslices or sequences. We demonstrate the effectiveness of our proposed framework\nby showing superior performance on conventional few-shot segmentation\nbenchmarks compared to prior art, and remarkable cross-domain cross-volume\nsegmentation performance on proprietary 3D segmentation datasets for\nchallenging entities, i.e., tubular structures, with only limited 2D or 3D\nlabels.\n","authors":["Meng Zheng","Benjamin Planche","Zhongpai Gao","Terrence Chen","Richard J. 
Radke","Ziyan Wu"],"pdf_url":"https://arxiv.org/pdf/2408.14427v1.pdf","comment":"Accepted to MICCAI 2024"},{"id":"http://arxiv.org/abs/2408.14421v1","updated":"2024-08-26T17:04:52Z","published":"2024-08-26T17:04:52Z","title":"Evaluating saliency scores in point clouds of natural environments by\n learning surface anomalies","summary":" In recent years, three-dimensional point clouds are used increasingly to\ndocument natural environments. Each dataset contains a diverse set of objects,\nat varying shapes and sizes, distributed throughout the data and intricately\nintertwined with the topography. Therefore, regions of interest are difficult\nto find and consequent analyses become a challenge. Inspired from visual\nperception principles, we propose to differentiate objects of interest from the\ncluttered environment by evaluating how much they stand out from their\nsurroundings, i.e., their geometric salience. Previous saliency detection\napproaches suggested mostly handcrafted attributes for the task. However, such\nmethods fail when the data are too noisy or have high levels of texture. Here\nwe propose a learning-based mechanism that accommodates noise and textured\nsurfaces. We assume that within the natural environment any change from the\nprevalent surface would suggest a salient object. Thus, we first learn the\nunderlying surface and then search for anomalies within it. Initially, a deep\nneural network is trained to reconstruct the surface. Regions where the\nreconstructed part deviates significantly from the original point cloud yield a\nsubstantial reconstruction error, signifying an anomaly, i.e., saliency. We\ndemonstrate the effectiveness of the proposed approach by searching for salient\nfeatures in various natural scenarios, which were acquired by different\nacquisition platforms. We show the strong correlation between the\nreconstruction error and salient objects.\n","authors":["Reuma Arav","Dennis Wittich","Franz Rottensteiner"],"pdf_url":"https://arxiv.org/pdf/2408.14421v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14419v1","updated":"2024-08-26T17:04:23Z","published":"2024-08-26T17:04:23Z","title":"CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language\n Models","summary":" We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large\nlanguage models. CHARTOM consists of specially designed data visualizing\ncharts. Given a chart, a language model needs to not only correctly comprehend\nthe chart (the FACT question) but also judge if the chart will be misleading to\na human reader (the MIND question). Both questions have significant societal\nbenefits. We detail the construction of the CHARTOM benchmark including its\ncalibration on human performance.\n","authors":["Shubham Bharti","Shiyun Cheng","Jihyun Rho","Martina Rao","Xiaojin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.14419v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14415v1","updated":"2024-08-26T17:02:25Z","published":"2024-08-26T17:02:25Z","title":"LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation","summary":" Mamba, a State Space Model (SSM), has recently shown competitive performance\nto Convolutional Neural Networks (CNNs) and Transformers in Natural Language\nProcessing and general sequence modeling. Various attempts have been made to\nadapt Mamba to Computer Vision tasks, including medical image segmentation\n(MIS). 
Vision Mamba (VM)-based networks are particularly attractive due to\ntheir ability to achieve global receptive fields, similar to Vision\nTransformers, while also maintaining linear complexity in the number of tokens.\nHowever, the existing VM models still struggle to maintain both spatially local\nand global dependencies of tokens in high dimensional arrays due to their\nsequential nature. Employing multiple and/or complicated scanning strategies is\ncomputationally costly, which hinders applications of SSMs to high-dimensional\n2D and 3D images that are common in MIS problems. In this work, we propose\nLocal-Global Vision Mamba, LoG-VMamba, that explicitly enforces spatially\nadjacent tokens to remain nearby on the channel axis, and retains the global\ncontext in a compressed form. Our method allows the SSMs to access the local\nand global contexts even before reaching the last token while requiring only a\nsimple scanning strategy. Our segmentation models are computationally efficient\nand substantially outperform both CNN and Transformers-based baselines on a\ndiverse set of 2D and 3D MIS tasks. The implementation of LoG-VMamba is\navailable at \\url{https://github.com/Oulu-IMEDS/LoG-VMamba}.\n","authors":["Trung Dinh Quoc Dang","Huy Hoang Nguyen","Aleksei Tiulpin"],"pdf_url":"https://arxiv.org/pdf/2408.14415v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2310.05873v6","updated":"2024-08-26T16:55:02Z","published":"2023-10-09T17:13:10Z","title":"Implicit Concept Removal of Diffusion Models","summary":" Text-to-image (T2I) diffusion models often inadvertently generate unwanted\nconcepts such as watermarks and unsafe images. These concepts, termed as the\n\"implicit concepts\", could be unintentionally learned during training and then\nbe generated uncontrollably during inference. Existing removal methods still\nstruggle to eliminate implicit concepts primarily due to their dependency on\nthe model's ability to recognize concepts it actually can not discern. To\naddress this, we utilize the intrinsic geometric characteristics of implicit\nconcepts and present the Geom-Erasing, a novel concept removal method based on\nthe geometric-driven control. Specifically, once an unwanted implicit concept\nis identified, we integrate the existence and geometric information of the\nconcept into the text prompts with the help of an accessible classifier or\ndetector model. Subsequently, the model is optimized to identify and\ndisentangle this information, which is then adopted as negative prompts during\ngeneration. Moreover, we introduce the Implicit Concept Dataset (ICD), a novel\nimage-text dataset imbued with three typical implicit concepts (i.e., QR codes,\nwatermarks, and text), reflecting real-life situations where implicit concepts\nare easily injected. 
Geom-Erasing effectively mitigates the generation of\nimplicit concepts, achieving the state-of-the-art results on the Inappropriate\nImage Prompts (I2P) and our challenging Implicit Concept Dataset (ICD)\nbenchmarks.\n","authors":["Zhili Liu","Kai Chen","Yifan Zhang","Jianhua Han","Lanqing Hong","Hang Xu","Zhenguo Li","Dit-Yan Yeung","James Kwok"],"pdf_url":"https://arxiv.org/pdf/2310.05873v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02060v3","updated":"2024-08-26T16:42:45Z","published":"2023-10-03T14:03:20Z","title":"Global Attractor for a Reaction-Diffusion Model Arising in Biological\n Dynamic in 3D Soil Structure","summary":" Partial Differential Equations (PDEs) play a crucial role as tools for\nmodeling and comprehending intricate natural processes, notably within the\ndomain of biology. This research explores the domain of microbial activity\nwithin the complex matrix of 3D soil structures, providing valuable\nunderstanding into both the existence and uniqueness of solutions and the\nasymptotic behavior of the corresponding PDE model. Our investigation results\nin the discovery of a global attractor, a fundamental feature with significant\nimplications for long-term system behavior. To enhance the clarity of our\nfindings, numerical simulations are employed to visually illustrate the\nattributes of this global attractor.\n","authors":["Mohamed Elghandouri","Khalil Ezzinbi","Mouad Klai","Olivier Monga"],"pdf_url":"https://arxiv.org/pdf/2310.02060v3.pdf","comment":"Preprint submitted to Mathematical Modeling in Natural Phenomena"},{"id":"http://arxiv.org/abs/2408.14400v1","updated":"2024-08-26T16:34:13Z","published":"2024-08-26T16:34:13Z","title":"Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation\n for Global Solar Mapping","summary":" The transition to renewable energy, particularly solar, is key to mitigating\nclimate change. Google's Solar API aids this transition by estimating solar\npotential from aerial imagery, but its impact is constrained by geographical\ncoverage. This paper proposes expanding the API's reach using satellite\nimagery, enabling global solar potential assessment. We tackle challenges\ninvolved in building a Digital Surface Model (DSM) and roof instance\nsegmentation from lower resolution and single oblique views using deep learning\nmodels. Our models, trained on aligned satellite and aerial datasets, produce\n25cm DSMs and roof segments. With ~1m DSM MAE on buildings, ~5deg roof pitch\nerror and ~56% IOU on roof segmentation, they significantly enhance the Solar\nAPI's potential to promote solar adoption.\n","authors":["Vishal Batchu","Alex Wilson","Betty Peng","Carl Elkin","Umangi Jain","Christopher Van Arsdale","Ross Goroshin","Varun Gulshan"],"pdf_url":"https://arxiv.org/pdf/2408.14400v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2408.14397v1","updated":"2024-08-26T16:28:56Z","published":"2024-08-26T16:28:56Z","title":"Uncovering Knowledge Gaps in Radiology Report Generation Models through\n Knowledge Graphs","summary":" Recent advancements in artificial intelligence have significantly improved\nthe automatic generation of radiology reports. However, existing evaluation\nmethods fail to reveal the models' understanding of radiological images and\ntheir capacity to achieve human-level granularity in descriptions. To bridge\nthis gap, we introduce a system, named ReXKG, which extracts structured\ninformation from processed reports to construct a comprehensive radiology\nknowledge graph. 
We then propose three metrics to evaluate the similarity of\nnodes (ReXKG-NSC), distribution of edges (ReXKG-AMS), and coverage of subgraphs\n(ReXKG-SCS) across various knowledge graphs. We conduct an in-depth comparative\nanalysis of AI-generated and human-written radiology reports, assessing the\nperformance of both specialist and generalist models. Our study provides a\ndeeper understanding of the capabilities and limitations of current AI models\nin radiology report generation, offering valuable insights for improving model\nperformance and clinical applicability.\n","authors":["Xiaoman Zhang","Julián N. Acosta","Hong-Yu Zhou","Pranav Rajpurkar"],"pdf_url":"https://arxiv.org/pdf/2408.14397v1.pdf","comment":"Code is available at: https://github.com/rajpurkarlab/ReXKG"},{"id":"http://arxiv.org/abs/2402.00752v4","updated":"2024-08-26T16:27:42Z","published":"2024-02-01T16:43:58Z","title":"On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection\n Strategy","summary":" 3D Gaussian Splatting has garnered extensive attention and application in\nreal-time neural rendering. Concurrently, concerns have been raised about the\nlimitations of this technology in aspects such as point cloud storage,\nperformance, and robustness in sparse viewpoints, leading to various\nimprovements. However, there has been a notable lack of attention to the\nfundamental problem of projection errors introduced by the local affine\napproximation inherent in the splatting itself, and the consequential impact of\nthese errors on the quality of photo-realistic rendering. This paper addresses\nthe projection error function of 3D Gaussian Splatting, commencing with the\nresidual error from the first-order Taylor expansion of the projection\nfunction. The analysis establishes a correlation between the error and the\nGaussian mean position. Subsequently, leveraging function optimization theory,\nthis paper analyzes the function's minima to provide an optimal projection\nstrategy for Gaussian Splatting referred to Optimal Gaussian Splatting, which\ncan accommodate a variety of camera models. Experimental validation further\nconfirms that this projection methodology reduces artifacts, resulting in a\nmore convincingly realistic rendering.\n","authors":["Letian Huang","Jiayang Bai","Jie Guo","Yuanqi Li","Yanwen Guo"],"pdf_url":"https://arxiv.org/pdf/2402.00752v4.pdf","comment":"Accepted by ECCV2024; Project Page:\n https://letianhuang.github.io/op43dgs/"},{"id":"http://arxiv.org/abs/2404.03507v3","updated":"2024-08-26T16:22:35Z","published":"2024-04-04T15:10:24Z","title":"DQ-DETR: DETR with Dynamic Query for Tiny Object Detection","summary":" Despite previous DETR-like methods having performed successfully in generic\nobject detection, tiny object detection is still a challenging task for them\nsince the positional information of object queries is not customized for\ndetecting tiny objects, whose scale is extraordinarily smaller than general\nobjects. Also, DETR-like methods using a fixed number of queries make them\nunsuitable for aerial datasets, which only contain tiny objects, and the\nnumbers of instances are imbalanced between different images. Thus, we present\na simple yet effective model, named DQ-DETR, which consists of three different\ncomponents: categorical counting module, counting-guided feature enhancement,\nand dynamic query selection to solve the above-mentioned problems. 
DQ-DETR uses\nthe prediction and density maps from the categorical counting module to\ndynamically adjust the number of object queries and improve the positional\ninformation of queries. Our model DQ-DETR outperforms previous CNN-based and\nDETR-like methods, achieving state-of-the-art mAP 30.2% on the AI-TOD-V2\ndataset, which mostly consists of tiny objects.\n","authors":["Yi-Xin Huang","Hou-I Liu","Hong-Han Shuai","Wen-Huang Cheng"],"pdf_url":"https://arxiv.org/pdf/2404.03507v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.03771v2","updated":"2024-08-26T16:15:57Z","published":"2024-07-04T09:32:12Z","title":"SpikeGS: Reconstruct 3D scene via fast-moving bio-inspired sensors","summary":" 3D Gaussian Splatting (3DGS) demonstrates unparalleled superior performance\nin 3D scene reconstruction. However, 3DGS heavily relies on sharp images.\nFulfilling this requirement can be challenging in real-world scenarios\nespecially when the camera moves fast, which severely limits the application of\n3DGS. To address these challenges, we propose Spike Gaussian Splatting\n(SpikeGS), the first framework that integrates the spike streams into the 3DGS\npipeline to reconstruct 3D scenes via a fast-moving bio-inspired camera. With\naccumulation rasterization, interval supervision, and a specially designed\npipeline, SpikeGS extracts detailed geometry and texture from high temporal\nresolution but texture-lacking spike streams, and reconstructs 3D scenes captured in\n1 second. Extensive experiments on multiple synthetic and real-world datasets\ndemonstrate the superiority of SpikeGS compared with existing spike-based and\ndeblur 3D scene reconstruction methods. Codes and data will be released soon.\n","authors":["Yijia Guo","Liwen Hu","Lei Ma","Tiejun Huang"],"pdf_url":"https://arxiv.org/pdf/2407.03771v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.05180v2","updated":"2024-08-26T16:13:30Z","published":"2024-04-08T04:10:50Z","title":"GloSoFarID: Global multispectral dataset for Solar Farm IDentification\n in satellite imagery","summary":" Solar Photovoltaic (PV) technology is increasingly recognized as a pivotal\nsolution in the global pursuit of clean and renewable energy. This technology\naddresses the urgent need for sustainable energy alternatives by converting\nsolar power into electricity without greenhouse gas emissions. It not only\ncurtails global carbon emissions but also reduces reliance on finite,\nnon-renewable energy sources. In this context, monitoring solar panel farms\nbecomes essential for understanding and facilitating the worldwide shift toward\nclean energy. This study contributes to this effort by developing the first\ncomprehensive global dataset of multispectral satellite imagery of solar panel\nfarms. This dataset is intended to form the basis for training robust machine\nlearning models, which can accurately map and analyze the expansion and\ndistribution of solar panel farms globally. The insights gained from this\nendeavor will be instrumental in guiding informed decision-making for a\nsustainable energy future. https://github.com/yzyly1992/GloSoFarID\n","authors":["Zhiyuan Yang","Ryan Rad"],"pdf_url":"https://arxiv.org/pdf/2404.05180v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14381v1","updated":"2024-08-26T16:04:13Z","published":"2024-08-26T16:04:13Z","title":"Learning Tree-Structured Composition of Data Augmentation","summary":" Data augmentation is widely used for training a neural network given little\nlabeled data. 
A common practice of augmentation training is applying a\ncomposition of multiple transformations sequentially to the data. Existing\naugmentation methods such as RandAugment randomly sample from a list of\npre-selected transformations, while methods such as AutoAugment apply advanced\nsearch to optimize over an augmentation set of size $k^d$, which is the number\nof transformation sequences of length $d$, given a list of $k$ transformations.\n In this paper, we design efficient algorithms whose running time complexity\nis much faster than the worst-case complexity of $O(k^d)$, provably. We propose\na new algorithm to search for a binary tree-structured composition of $k$\ntransformations, where each tree node corresponds to one transformation. The\nbinary tree generalizes sequential augmentations, such as the SimCLR\naugmentation scheme for contrastive learning. Using a top-down, recursive\nsearch procedure, our algorithm achieves a runtime complexity of $O(2^d k)$,\nwhich is much faster than $O(k^d)$ as $k$ increases above $2$. We apply our\nalgorithm to tackle data distributions with heterogeneous subpopulations by\nsearching for one tree in each subpopulation and then learning a weighted\ncombination, resulting in a forest of trees.\n We validate our proposed algorithms on numerous graph and image datasets,\nincluding a multi-label graph classification dataset we collected. The dataset\nexhibits significant variations in the sizes of graphs and their average\ndegrees, making it ideal for studying data augmentation. We show that our\napproach can reduce the computation cost by 43% over existing search methods\nwhile improving performance by 4.3%. The tree structures can be used to\ninterpret the relative importance of each transformation, such as identifying\nthe important transformations on small vs. large graphs.\n","authors":["Dongyue Li","Kailai Chen","Predrag Radivojac","Hongyang R. Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.14381v1.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2408.14371v1","updated":"2024-08-26T15:53:50Z","published":"2024-08-26T15:53:50Z","title":"SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery","summary":" In this paper, we address Generalized Category Discovery, aiming to\nsimultaneously uncover novel categories and accurately classify known ones.\nTraditional methods, which lean heavily on self-supervision and contrastive\nlearning, often fall short when distinguishing between fine-grained categories.\nTo address this, we introduce a novel concept called `self-expertise', which\nenhances the model's ability to recognize subtle differences and uncover\nunknown categories. Our approach combines unsupervised and supervised\nself-expertise strategies to refine the model's discernment and generalization.\nInitially, hierarchical pseudo-labeling is used to provide `soft supervision',\nimproving the effectiveness of self-expertise. Our supervised technique differs\nfrom traditional methods by utilizing more abstract positive and negative\nsamples, aiding in the formation of clusters that can generalize to novel\ncategories. Meanwhile, our unsupervised strategy encourages the model to\nsharpen its category distinctions by considering within-category examples as\n`hard' negatives. Supported by theoretical insights, our empirical results\nshowcase that our method outperforms existing state-of-the-art techniques in\nGeneralized Category Discovery across several fine-grained datasets. 
Our code\nis available at: https://github.com/SarahRastegar/SelEx.\n","authors":["Sarah Rastegar","Mohammadreza Salehi","Yuki M. Asano","Hazel Doughty","Cees G. M. Snoek"],"pdf_url":"https://arxiv.org/pdf/2408.14371v1.pdf","comment":"Accepted by ECCV 2024"},{"id":"http://arxiv.org/abs/2405.03762v2","updated":"2024-08-26T15:40:18Z","published":"2024-05-06T18:01:13Z","title":"Swin transformers are robust to distribution and concept drift in\n endoscopy-based longitudinal rectal cancer assessment","summary":" Endoscopic images are used at various stages of rectal cancer treatment\nstarting from cancer screening, diagnosis, during treatment to assess response\nand toxicity from treatments such as colitis, and at follow up to detect new\ntumor or local regrowth (LR). However, subjective assessment is highly variable\nand can underestimate the degree of response in some patients, subjecting them\nto unnecessary surgery, or overestimate response that places patients at risk\nof disease spread. Advances in deep learning have shown the ability to produce\nconsistent and objective response assessment for endoscopic images. However,\nmethods for detecting cancers, regrowth, and monitoring response during the\nentire course of patient treatment and follow-up are lacking. This is because\nautomated diagnosis and rectal cancer response assessment require methods that\nare robust to inherent imaging illumination variations and confounding\nconditions (blood, scope, blurring) present in endoscopy images as well as\nchanges to the normal lumen and tumor during treatment. Hence, a hierarchical\nshifted window (Swin) transformer was trained to distinguish rectal cancer from\nnormal lumen using endoscopy images. Swin as well as two convolutional\n(ResNet-50, WideResNet-50), and vision transformer (ViT) models were trained\nand evaluated on follow-up longitudinal images to detect LR on a private dataset\nas well as on out-of-distribution (OOD) public colonoscopy datasets to detect\npre/non-cancerous polyps. Color shifts were applied using optimal transport to\nsimulate distribution shifts. Swin and ResNet models were similarly accurate in\nthe in-distribution dataset. Swin was more accurate than other methods\n(follow-up: 0.84, OOD: 0.83) even when subject to color shifts (follow-up:\n0.83, OOD: 0.87), indicating capability to provide robust performance for\nlongitudinal cancer assessment.\n","authors":["Jorge Tapias Gomez","Aneesh Rangnekar","Hannah Williams","Hannah Thompson","Julio Garcia-Aguilar","Joshua Jesse Smith","Harini Veeraraghavan"],"pdf_url":"https://arxiv.org/pdf/2405.03762v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14358v1","updated":"2024-08-26T15:32:31Z","published":"2024-08-26T15:32:31Z","title":"An Embedding is Worth a Thousand Noisy Labels","summary":" The performance of deep neural networks scales with dataset size and label\nquality, rendering the efficient mitigation of low-quality data annotations\ncrucial for building robust and cost-effective systems. Existing strategies to\naddress label noise exhibit severe limitations due to computational complexity\nand application dependency. In this work, we propose WANN, a Weighted Adaptive\nNearest Neighbor approach that builds on self-supervised feature\nrepresentations obtained from foundation models. To guide the weighted voting\nscheme, we introduce a reliability score, which measures the likelihood of a\ndata label being correct. 
WANN outperforms reference methods, including a\nlinear layer trained with robust loss functions, on diverse datasets of varying\nsize and under various noise types and severities. WANN also exhibits superior\ngeneralization on imbalanced data compared to both Adaptive-NNs (ANN) and fixed\nk-NNs. Furthermore, the proposed weighting scheme enhances supervised\ndimensionality reduction under noisy labels. This yields a significant boost in\nclassification performance with 10x and 100x smaller image embeddings,\nminimizing latency and storage requirements. Our approach, emphasizing\nefficiency and explainability, emerges as a simple, robust solution to overcome\nthe inherent limitations of deep neural network training. The code is available\nat https://github.com/francescodisalvo05/wann-noisy-labels .\n","authors":["Francesco Di Salvo","Sebastian Doerrich","Ines Rieger","Christian Ledig"],"pdf_url":"https://arxiv.org/pdf/2408.14358v1.pdf","comment":"Preprint submitted to the International Journal of Computer Vision\n (IJCV)"},{"id":"http://arxiv.org/abs/2408.14348v1","updated":"2024-08-26T15:26:27Z","published":"2024-08-26T15:26:27Z","title":"Deep learning-based ecological analysis of camera trap images is\n impacted by training data quality and size","summary":" Large wildlife image collections from camera traps are crucial for\nbiodiversity monitoring, offering insights into species richness, occupancy,\nand activity patterns. However, manual processing of these data is\ntime-consuming, hindering analytical processes. To address this, deep neural\nnetworks have been widely adopted to automate image analysis. Despite their\ngrowing use, the impact of model training decisions on downstream ecological\nmetrics remains unclear. Here, we analyse camera trap data from an African\nsavannah and an Asian sub-tropical dry forest to compare key ecological metrics\nderived from expert-generated species identifications with those generated from\ndeep neural networks. We assess the impact of model architecture, training data\nnoise, and dataset size on ecological metrics, including species richness,\noccupancy, and activity patterns. Our results show that while model\narchitecture has minimal impact, large amounts of noise and reduced dataset\nsize significantly affect these metrics. Nonetheless, estimated ecological\nmetrics are resilient to considerable noise, tolerating up to 10% error in\nspecies labels and a 50% reduction in training set size without changing\nsignificantly. We also highlight that conventional metrics like classification\nerror may not always be representative of a model's ability to accurately\nmeasure ecological metrics. We conclude that ecological metrics derived from\ndeep neural network predictions closely match those calculated from expert\nlabels and remain robust to variations in the factors explored. However,\ntraining decisions for deep neural networks can impact downstream ecological\nanalysis. Therefore, practitioners should prioritize creating large, clean\ntraining sets and evaluate deep neural network solutions based on their ability\nto measure the ecological metrics of interest.\n","authors":["Omiros Pantazis","Peggy Bevan","Holly Pringle","Guilherme Braga Ferreira","Daniel J. Ingram","Emily Madsen","Liam Thomas","Dol Raj Thanet","Thakur Silwal","Santosh Rayamajhi","Gabriel Brostow","Oisin Mac Aodha","Kate E. 
Jones"],"pdf_url":"https://arxiv.org/pdf/2408.14348v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11126v2","updated":"2024-08-26T15:19:12Z","published":"2024-08-20T18:26:09Z","title":"Binocular Model: A deep learning solution for online melt pool\n temperature analysis using dual-wavelength Imaging Pyrometry","summary":" In metal Additive Manufacturing (AM), monitoring the temperature of the Melt\nPool (MP) is crucial for ensuring part quality, process stability, defect\nprevention, and overall process optimization. Traditional methods, are slow to\nconverge and require extensive manual effort to translate data into actionable\ninsights, rendering them impractical for real-time monitoring and control. To\naddress this challenge, we propose an Artificial Intelligence (AI)-based\nsolution aimed at reducing manual data processing reliance and improving the\nefficiency of transitioning from data to insight. In our study, we utilize a\ndataset comprising dual-wavelength real-time process monitoring data and\ncorresponding temperature maps. We introduce a deep learning model called the\n\"Binocular model,\" which exploits dual input observations to perform a precise\nanalysis of MP temperature in Laser Powder Bed Fusion (L-PBF). Through advanced\ndeep learning techniques, we seamlessly convert raw data into temperature maps,\nsignificantly streamlining the process and enabling batch processing at a rate\nof up to 750 frames per second, approximately 1000 times faster than\nconventional methods. Our Binocular model achieves high accuracy in temperature\nestimation, evidenced by a 0.95 R-squared score, while simultaneously enhancing\nprocessing efficiency by a factor of $\\sim1000x$ times. This model directly\naddresses the challenge of real-time MP temperature monitoring and offers\ninsights into the encountered constraints and the benefits of our Deep\nLearning-based approach. By combining efficiency and precision, our work\ncontributes to the advancement of temperature monitoring in L-PBF, thus driving\nprogress in the field of metal AM.\n","authors":["Javid Akhavan","Chaitanya Krishna Vallabh","Xiayun Zhao","Souran Manoochehri"],"pdf_url":"https://arxiv.org/pdf/2408.11126v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14343v1","updated":"2024-08-26T15:16:28Z","published":"2024-08-26T15:16:28Z","title":"A Brief Analysis of the Iterative Next Boundary Detection Network for\n Tree Rings Delineation in Images of Pinus taeda","summary":" This work presents the INBD network proposed by Gillert et al. in CVPR-2023\nand studies its application for delineating tree rings in RGB images of Pinus\ntaeda cross sections captured by a smartphone (UruDendro dataset), which are\nimages with different characteristics from the ones used to train the method.\nThe INBD network operates in two stages: first, it segments the background,\npith, and ring boundaries. In the second stage, the image is transformed into\npolar coordinates, and ring boundaries are iteratively segmented from the pith\nto the bark. Both stages are based on the U-Net architecture. The method\nachieves an F-Score of 77.5, a mAR of 0.540, and an ARAND of 0.205 on the\nevaluation set. 
The code for the experiments is available at\nhttps://github.com/hmarichal93/mlbrief_inbd.\n","authors":["Henry Marichal","Gregory Randall"],"pdf_url":"https://arxiv.org/pdf/2408.14343v1.pdf","comment":"Submitted to IPOL ad an MLBriefs paper"},{"id":"http://arxiv.org/abs/2408.14339v1","updated":"2024-08-26T15:08:12Z","published":"2024-08-26T15:08:12Z","title":"ConceptMix: A Compositional Image Generation Benchmark with Controllable\n Difficulty","summary":" Compositionality is a critical capability in Text-to-Image (T2I) models, as\nit reflects their ability to understand and combine multiple concepts from text\ndescriptions. Existing evaluations of compositional capability rely heavily on\nhuman-designed text prompts or fixed templates, limiting their diversity and\ncomplexity, and yielding low discriminative power. We propose ConceptMix, a\nscalable, controllable, and customizable benchmark which automatically\nevaluates compositional generation ability of T2I models. This is done in two\nstages. First, ConceptMix generates the text prompts: concretely, using\ncategories of visual concepts (e.g., objects, colors, shapes, spatial\nrelationships), it randomly samples an object and k-tuples of visual concepts,\nthen uses GPT4-o to generate text prompts for image generation based on these\nsampled concepts. Second, ConceptMix evaluates the images generated in response\nto these prompts: concretely, it checks how many of the k concepts actually\nappeared in the image by generating one question per visual concept and using a\nstrong VLM to answer them. Through administering ConceptMix to a diverse set of\nT2I models (proprietary as well as open ones) using increasing values of k, we\nshow that our ConceptMix has higher discrimination power than earlier\nbenchmarks. Specifically, ConceptMix reveals that the performance of several\nmodels, especially open models, drops dramatically with increased k.\nImportantly, it also provides insight into the lack of prompt diversity in\nwidely-used training datasets. Additionally, we conduct extensive human studies\nto validate the design of ConceptMix and compare our automatic grading with\nhuman judgement. We hope it will guide future T2I model development.\n","authors":["Xindi Wu","Dingli Yu","Yangsibo Huang","Olga Russakovsky","Sanjeev Arora"],"pdf_url":"https://arxiv.org/pdf/2408.14339v1.pdf","comment":"43 pages"},{"id":"http://arxiv.org/abs/2408.14336v1","updated":"2024-08-26T15:07:01Z","published":"2024-08-26T15:07:01Z","title":"Equivariant Reinforcement Learning under Partial Observability","summary":" Incorporating inductive biases is a promising approach for tackling\nchallenging robot learning domains with sample-efficient solutions. This paper\nidentifies partially observable domains where symmetries can be a useful\ninductive bias for efficient learning. Specifically, by encoding the\nequivariance regarding specific group symmetries into the neural networks, our\nactor-critic reinforcement learning agents can reuse solutions in the past for\nrelated scenarios. 
Consequently, our equivariant agents outperform\nnon-equivariant approaches significantly in terms of sample efficiency and\nfinal performance, demonstrated through experiments on a range of robotic tasks\nin simulation and real hardware.\n","authors":["Hai Nguyen","Andrea Baisero","David Klee","Dian Wang","Robert Platt","Christopher Amato"],"pdf_url":"https://arxiv.org/pdf/2408.14336v1.pdf","comment":"Conference on Robot Learning, 2023"},{"id":"http://arxiv.org/abs/2408.14329v1","updated":"2024-08-26T14:55:23Z","published":"2024-08-26T14:55:23Z","title":"PHEVA: A Privacy-preserving Human-centric Video Anomaly Detection\n Dataset","summary":" PHEVA, a Privacy-preserving Human-centric Ethical Video Anomaly detection\ndataset. By removing pixel information and providing only de-identified human\nannotations, PHEVA safeguards personally identifiable information. The dataset\nincludes seven indoor/outdoor scenes, featuring one novel, context-specific\ncamera, and offers over 5x the pose-annotated frames compared to the largest\nprevious dataset. This study benchmarks state-of-the-art methods on PHEVA using\na comprehensive set of metrics, including the 10% Error Rate (10ER), a metric\nused for anomaly detection for the first time providing insights relevant to\nreal-world deployment. As the first of its kind, PHEVA bridges the gap between\nconventional training and real-world deployment by introducing continual\nlearning benchmarks, with models outperforming traditional methods in 82.14% of\ncases. The dataset is publicly available at\nhttps://github.com/TeCSAR-UNCC/PHEVA.git.\n","authors":["Ghazal Alinezhad Noghre","Shanle Yao","Armin Danesh Pazho","Babak Rahimi Ardabili","Vinit Katariya","Hamed Tabkhi"],"pdf_url":"https://arxiv.org/pdf/2408.14329v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14326v1","updated":"2024-08-26T14:54:14Z","published":"2024-08-26T14:54:14Z","title":"Streamline tractography of the fetal brain in utero with machine\n learning","summary":" Diffusion-weighted magnetic resonance imaging (dMRI) is the only non-invasive\ntool for studying white matter tracts and structural connectivity of the brain.\nThese assessments rely heavily on tractography techniques, which reconstruct\nvirtual streamlines representing white matter fibers. Much effort has been\ndevoted to improving tractography methodology for adult brains, while\ntractography of the fetal brain has been largely neglected. Fetal tractography\nfaces unique difficulties due to low dMRI signal quality, immature and rapidly\ndeveloping brain structures, and paucity of reference data. This work presents\nthe first machine learning model for fetal tractography. The model input\nconsists of five sources of information: (1) Fiber orientation, inferred from a\ndiffusion tensor fit to the dMRI signal; (2) Directions of recent propagation\nsteps; (3) Global spatial information, encoded as distances to keypoints in the\nbrain cortex; (4) Tissue segmentation information; and (5) Prior information\nabout the expected local fiber orientations supplied with an atlas. In order to\nmitigate the local tensor estimation error, a large spatial context around the\ncurrent point in the diffusion tensor image is encoded using convolutional and\nattention neural network modules. Moreover, the diffusion tensor information at\na hypothetical next point is included in the model input. Filtering rules based\non anatomically constrained tractography are applied to prune implausible\nstreamlines. 
We trained the model on manually-refined whole-brain fetal\ntractograms and validated the trained model on an independent set of 11 test\nscans with gestational ages between 23 and 36 weeks. Results show that our\nproposed method achieves superior performance across all evaluated tracts. The\nnew method can significantly advance the capabilities of dMRI for studying\nnormal and abnormal brain development in utero.\n","authors":["Weide Liu","Camilo Calixto","Simon K. Warfield","Davood Karimi"],"pdf_url":"https://arxiv.org/pdf/2408.14326v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15699v3","updated":"2024-08-26T14:39:33Z","published":"2023-05-25T04:14:49Z","title":"Cross-view Action Recognition Understanding From Exocentric to\n Egocentric Perspective","summary":" Understanding action recognition in egocentric videos has emerged as a vital\nresearch topic with numerous practical applications. With the limitation in the\nscale of egocentric data collection, learning robust deep learning-based action\nrecognition models remains difficult. Transferring knowledge learned from the\nlarge-scale exocentric data to the egocentric data is challenging due to the\ndifference in videos across views. Our work introduces a novel cross-view\nlearning approach to action recognition (CVAR) that effectively transfers\nknowledge from the exocentric to the selfish view. First, we present a novel\ngeometric-based constraint into the self-attention mechanism in Transformer\nbased on analyzing the camera positions between two views. Then, we propose a\nnew cross-view self-attention loss learned on unpaired cross-view data to\nenforce the self-attention mechanism learning to transfer knowledge across\nviews. Finally, to further improve the performance of our cross-view learning\napproach, we present the metrics to measure the correlations in videos and\nattention maps effectively. Experimental results on standard egocentric action\nrecognition benchmarks, i.e., Charades-Ego, EPIC-Kitchens-55, and\nEPIC-Kitchens-100, have shown our approach's effectiveness and state-of-the-art\nperformance.\n","authors":["Thanh-Dat Truong","Khoa Luu"],"pdf_url":"https://arxiv.org/pdf/2305.15699v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14284v1","updated":"2024-08-26T14:09:40Z","published":"2024-08-26T14:09:40Z","title":"May the Forgetting Be with You: Alternate Replay for Learning with Noisy\n Labels","summary":" Forgetting presents a significant challenge during incremental training,\nmaking it particularly demanding for contemporary AI systems to assimilate new\nknowledge in streaming data environments. To address this issue, most\napproaches in Continual Learning (CL) rely on the replay of a restricted buffer\nof past data. However, the presence of noise in real-world scenarios, where\nhuman annotation is constrained by time limitations or where data is\nautomatically gathered from the web, frequently renders these strategies\nvulnerable. In this study, we address the problem of CL under Noisy Labels\n(CLN) by introducing Alternate Experience Replay (AER), which takes advantage\nof forgetting to maintain a clear distinction between clean, complex, and noisy\nsamples in the memory buffer. The idea is that complex or mislabeled examples,\nwhich hardly fit the previously learned data distribution, are most likely to\nbe forgotten. 
To grasp the benefits of such a separation, we equip AER with\nAsymmetric Balanced Sampling (ABS): a new sample selection strategy that\nprioritizes purity on the current task while retaining relevant samples from\nthe past. Through extensive computational comparisons, we demonstrate the\neffectiveness of our approach in terms of both accuracy and purity of the\nobtained buffer, resulting in a remarkable average gain of 4.71 percentage points in\naccuracy with respect to existing loss-based purification strategies. Code is\navailable at https://github.com/aimagelab/mammoth.\n","authors":["Monica Millunzi","Lorenzo Bonicelli","Angelo Porrello","Jacopo Credi","Petter N. Kolm","Simone Calderara"],"pdf_url":"https://arxiv.org/pdf/2408.14284v1.pdf","comment":"25 pages, 5 figures. Accepted at the 35th British Machine Vision\n Conference 2024 (BMVC 2024), Glasgow, UK"},{"id":"http://arxiv.org/abs/2408.12615v2","updated":"2024-08-26T14:06:59Z","published":"2024-08-08T14:11:06Z","title":"Pediatric TSC-Related Epilepsy Classification from Clinical MR Images\n Using Quantum Neural Network","summary":" Tuberous sclerosis complex (TSC) manifests as a multisystem disorder with\nsignificant neurological implications. This study addresses the critical need\nfor robust classification models tailored to TSC in pediatric patients,\nintroducing QResNet, a novel deep learning model seamlessly integrating\nconventional convolutional neural networks with quantum neural networks. The\nmodel incorporates a two-layer quantum layer (QL), comprising ZZFeatureMap and\nAnsatz layers, strategically designed for processing classical data within a\nquantum framework. A comprehensive evaluation demonstrates the superior\nperformance of QResNet in TSC MRI image classification compared to conventional\n3D-ResNet models. These compelling findings underscore the potential of quantum\ncomputing to revolutionize medical imaging and diagnostics. Remarkably, this\nmethod surpasses conventional CNNs in accuracy and Area Under the Curve (AUC)\nmetrics with the current dataset. Future research endeavors may focus on\nexploring the scalability and practical implementation of quantum algorithms in\nreal-world medical imaging scenarios.\n","authors":["Ling Lin","Yihang Zhou","Zhanqi Hu","Dian Jiang","Congcong Liu","Shuo Zhou","Yanjie Zhu","Jianxiang Liao","Dong Liang","Hairong Zheng","Haifeng Wang"],"pdf_url":"https://arxiv.org/pdf/2408.12615v2.pdf","comment":"5 pages, 4 figures, 2 tables, presented at ISBI 2024"},{"id":"http://arxiv.org/abs/2408.14281v1","updated":"2024-08-26T14:02:30Z","published":"2024-08-26T14:02:30Z","title":"Uncertainties of Latent Representations in Computer Vision","summary":" Uncertainty quantification is a key pillar of trustworthy machine learning.\nIt enables safe reactions under unsafe inputs, like predicting only when the\nmachine learning model detects sufficient evidence, discarding anomalous data,\nor emitting warnings when an error is likely to be inbound. This is\nparticularly crucial in safety-critical areas like medical image classification\nor self-driving cars. Despite the plethora of proposed uncertainty\nquantification methods achieving increasingly higher scores on performance\nbenchmarks, uncertainty estimates are often shied away from in practice. Many\nmachine learning projects start from pretrained latent representations that\ncome without uncertainty estimates. 
Uncertainties would need to be trained by\npractitioners on their own, which is notoriously difficult and\nresource-intense.\n This thesis makes uncertainty estimates easily accessible by adding them to\nthe latent representation vectors of pretrained computer vision models. Besides\nproposing approaches rooted in probability and decision theory, such as\nMonte-Carlo InfoNCE (MCInfoNCE) and loss prediction, we delve into both\ntheoretical and empirical questions. We show that these unobservable\nuncertainties about unobservable latent representations are indeed provably\ncorrect. We also provide an uncertainty-aware representation learning (URL)\nbenchmark to compare these unobservables against observable ground-truths.\nFinally, we compile our findings to pretrain lightweight representation\nuncertainties on large-scale computer vision models that transfer to unseen\ndatasets in a zero-shot manner.\n Our findings do not only advance the current theoretical understanding of\nuncertainties over latent variables, but also facilitate the access to\nuncertainty quantification for future researchers inside and outside the field,\nenabling straightforward but trustworthy machine learning.\n","authors":["Michael Kirchhof"],"pdf_url":"https://arxiv.org/pdf/2408.14281v1.pdf","comment":"Doctoral thesis"},{"id":"http://arxiv.org/abs/2403.05451v2","updated":"2024-08-26T13:58:16Z","published":"2024-03-08T16:57:47Z","title":"Attention-guided Feature Distillation for Semantic Segmentation","summary":" In contrast to existing complex methodologies commonly employed for\ndistilling knowledge from a teacher to a student, this paper showcases the\nefficacy of a simple yet powerful method for utilizing refined feature maps to\ntransfer attention. The proposed method has proven to be effective in\ndistilling rich information, outperforming existing methods in semantic\nsegmentation as a dense prediction task. The proposed Attention-guided Feature\nDistillation (AttnFD) method, employs the Convolutional Block Attention Module\n(CBAM), which refines feature maps by taking into account both channel-specific\nand spatial information content. Simply using the Mean Squared Error (MSE) loss\nfunction between the refined feature maps of the teacher and the student,\nAttnFD demonstrates outstanding performance in semantic segmentation, achieving\nstate-of-the-art results in terms of improving the mean Intersection over Union\n(mIoU) of the student network on the PascalVoc 2012, Cityscapes, COCO, and\nCamVid datasets.\n","authors":["Amir M. Mansourian","Arya Jalali","Rozhan Ahmadi","Shohreh Kasaei"],"pdf_url":"https://arxiv.org/pdf/2403.05451v2.pdf","comment":"9 pages, 8 figures, and 6 tables"},{"id":"http://arxiv.org/abs/2408.09869v2","updated":"2024-08-26T13:55:59Z","published":"2024-08-19T10:20:06Z","title":"Docling Technical Report","summary":" This technical report introduces Docling, an easy to use, self-contained,\nMIT-licensed open-source package for PDF document conversion. It is powered by\nstate-of-the-art specialized AI models for layout analysis (DocLayNet) and\ntable structure recognition (TableFormer), and runs efficiently on commodity\nhardware in a small resource budget. 
The code interface allows for easy\nextensibility and addition of new features and models.\n","authors":["Christoph Auer","Maksym Lysak","Ahmed Nassar","Michele Dolfi","Nikolaos Livathinos","Panos Vagenas","Cesar Berrospi Ramis","Matteo Omenetti","Fabian Lindlbauer","Kasper Dinkla","Valery Weber","Lucas Morin","Ingmar Meijer","Viktor Kuropiatnyk","Peter W. J. Staar"],"pdf_url":"https://arxiv.org/pdf/2408.09869v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14279v1","updated":"2024-08-26T13:55:42Z","published":"2024-08-26T13:55:42Z","title":"Learning Local Pattern Modularization for Point Cloud Reconstruction\n from Unseen Classes","summary":" It is challenging to reconstruct 3D point clouds in unseen classes from\nsingle 2D images. Instead of an object-centered coordinate system, current methods\ngeneralize global priors learned in seen classes to reconstruct 3D shapes from\nunseen classes in a viewer-centered coordinate system. However, the\nreconstruction accuracy and interpretability still leave room for improvement.\nTo resolve this issue, we propose learning local pattern modularization for\nreconstructing 3D shapes in unseen classes, which achieves both good\ngeneralization ability and high reconstruction accuracy. Our insight is to\nlearn a local prior which is class-agnostic and easy to generalize in an\nobject-centered coordinate system. Specifically, the local prior is learned via\na process of learning and customizing local pattern modularization in seen\nclasses. During this process, we first learn a set of patterns in local\nregions, which is the basis in the object-centered coordinate system to\nrepresent an arbitrary region on shapes across different classes. Then, we\nmodularize each region on an initially reconstructed shape using the learned\nlocal patterns. Based on that, we customize the local pattern modularization\nusing the input image by refining the reconstruction with more details. Our\nmethod enables reconstructing high-fidelity point clouds from unseen classes in\nan object-centered coordinate system without requiring a large number of patterns\nor any additional information, such as segmentation supervision or camera\nposes. Our experimental results under widely used benchmarks show that our\nmethod achieves the state-of-the-art reconstruction accuracy for shapes from\nunseen classes. The code is available at https://github.com/chenchao15/Unseen.\n","authors":["Chao Chen","Zhizhong Han","Yu-Shen Liu"],"pdf_url":"https://arxiv.org/pdf/2408.14279v1.pdf","comment":"14 pages, 11 figures, accepted by ECCV 2024"},{"id":"http://arxiv.org/abs/2312.06726v3","updated":"2024-08-26T13:52:18Z","published":"2023-12-11T05:57:09Z","title":"Filter & Align: Curating Image-Text Data with Human Knowledge","summary":" The increasing availability of image-text pairs has largely fueled the rapid\nadvancement in vision-language foundation models. However, the vast scale of\nthese datasets inevitably introduces significant variability in data quality,\nwhich can adversely affect the model performance. This highlights the critical\nrole of data filtering, not only to enhance training efficiency but also to\nimprove overall data quality. Existing methods typically rely on metrics such\nas CLIP Score and BLIP Score, which are derived from pre-trained models.\nHowever, these models are often trained on uncurated, noisy datasets, which can\nperpetuate errors and misalignments in the filtered dataset. 
We present a novel\nalgorithm that incorporates human knowledge on image-text alignment to guide\nfiltering vast corpus of web-crawled image-text datasets into a compact and\nhigh-quality form. To systemically capture human preferences on image-text\nalignments, we collect a diverse image-text dataset where each image is\nassociated with multiple captions from various sources, and establish a\ncomprehensive set of both subjective and objective criteria for critically\nguiding the alignment assessment from labelers. Additionally, we train a reward\nmodel on these human-preference annotations to internalize the nuanced human\nunderstanding of image-text alignment. The resulting reward model thus can act\nas a human-like referee to filter image-text pairs. Extensive experiments\ndemonstrate that we can maintain, sometimes even improve, model performance\nwhile compressing the image-text datasets up to ~90%. An impressive example is\nthat, by aggressively reducing the total training sample from 130M to only\n15.5M, our BLIP-B/16 models consistently show an average improvement of 2.9% on\nretrieval tasks and 11.5% on captioning tasks compared to full-size-dataset\ncounterparts.\n","authors":["Lei Zhang","Fangxun Shu","Tianyang Liu","Sucheng Ren","Hao Jiang","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2312.06726v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14270v1","updated":"2024-08-26T13:45:58Z","published":"2024-08-26T13:45:58Z","title":"Reliable Multi-modal Medical Image-to-image Translation Independent of\n Pixel-wise Aligned Data","summary":" The current mainstream multi-modal medical image-to-image translation methods\nface a contradiction. Supervised methods with outstanding performance rely on\npixel-wise aligned training data to constrain the model optimization. However,\nobtaining pixel-wise aligned multi-modal medical image datasets is challenging.\nUnsupervised methods can be trained without paired data, but their reliability\ncannot be guaranteed. At present, there is no ideal multi-modal medical\nimage-to-image translation method that can generate reliable translation\nresults without the need for pixel-wise aligned data. This work aims to develop\na novel medical image-to-image translation model that is independent of\npixel-wise aligned data (MITIA), enabling reliable multi-modal medical\nimage-to-image translation under the condition of misaligned training data. The\nproposed MITIA model utilizes a prior extraction network composed of a\nmulti-modal medical image registration module and a multi-modal misalignment\nerror detection module to extract pixel-level prior information from training\ndata with misalignment errors to the largest extent. The extracted prior\ninformation is then used to construct a regularization term to constrain the\noptimization of the unsupervised cycle-consistent GAN model, restricting its\nsolution space and thereby improving the performance and reliability of the\ngenerator. We trained the MITIA model using six datasets containing different\nmisalignment errors and two well-aligned datasets. Subsequently, we compared\nthe proposed method with six other state-of-the-art image-to-image translation\nmethods. 
The results of both quantitative analysis and qualitative visual\ninspection indicate that MITIA achieves superior performance compared to the\ncompeting state-of-the-art methods, both on misaligned data and aligned data.\n","authors":["Langrui Zhou","Guang Li"],"pdf_url":"https://arxiv.org/pdf/2408.14270v1.pdf","comment":"This paper has been accepted as a research article by Medical Physics"},{"id":"http://arxiv.org/abs/2210.07182v7","updated":"2024-08-26T13:43:46Z","published":"2022-10-13T17:03:36Z","title":"PDEBENCH: An Extensive Benchmark for Scientific Machine Learning","summary":" Machine learning-based modeling of physical systems has experienced increased\ninterest in recent years. Despite some impressive progress, there is still a\nlack of benchmarks for Scientific ML that are easy to use but still challenging\nand representative of a wide range of problems. We introduce PDEBench, a\nbenchmark suite of time-dependent simulation tasks based on Partial\nDifferential Equations (PDEs). PDEBench comprises both code and data to\nbenchmark the performance of novel machine learning models against both\nclassical numerical simulations and machine learning baselines. Our proposed\nset of benchmark problems contribute the following unique features: (1) A much\nwider range of PDEs compared to existing benchmarks, ranging from relatively\ncommon examples to more realistic and difficult problems; (2) much larger\nready-to-use datasets compared to prior work, comprising multiple simulation\nruns across a larger number of initial and boundary conditions and PDE\nparameters; (3) more extensible source codes with user-friendly APIs for data\ngeneration and baseline results with popular machine learning models (FNO,\nU-Net, PINN, Gradient-Based Inverse Method). PDEBench allows researchers to\nextend the benchmark freely for their own purposes using a standardized API and\nto compare the performance of new models to existing baseline methods. We also\npropose new evaluation metrics with the aim to provide a more holistic\nunderstanding of learning methods in the context of Scientific ML. With those\nmetrics we identify tasks which are challenging for recent ML methods and\npropose these tasks as future challenges for the community. The code is\navailable at https://github.com/pdebench/PDEBench.\n","authors":["Makoto Takamoto","Timothy Praditia","Raphael Leiteritz","Dan MacKinlay","Francesco Alesiani","Dirk Pflüger","Mathias Niepert"],"pdf_url":"https://arxiv.org/pdf/2210.07182v7.pdf","comment":"16 pages (main body) + 34 pages (supplemental material), accepted for\n publication in NeurIPS 2022 Track Datasets and Benchmarks"},{"id":"http://arxiv.org/abs/2408.14267v1","updated":"2024-08-26T13:42:43Z","published":"2024-08-26T13:42:43Z","title":"1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit","summary":" Fully quantized training (FQT) accelerates the training of deep neural\nnetworks by quantizing the activations, weights, and gradients into lower\nprecision. To explore the ultimate limit of FQT (the lowest achievable\nprecision), we make a first attempt to 1-bit FQT. We provide a theoretical\nanalysis of FQT based on Adam and SGD, revealing that the gradient variance\ninfluences the convergence of FQT. Building on these theoretical results, we\nintroduce an Activation Gradient Pruning (AGP) strategy. The strategy leverages\nthe heterogeneity of gradients by pruning less informative gradients and\nenhancing the numerical precision of remaining gradients to mitigate gradient\nvariance. 
Additionally, we propose Sample Channel joint Quantization (SCQ),\nwhich utilizes different quantization strategies in the computation of weight\ngradients and activation gradients to ensure that the method is friendly to\nlow-bitwidth hardware. Finally, we present a framework to deploy our algorithm.\nFor fine-tuning VGGNet-16 and ResNet-18 on multiple datasets, our algorithm\nachieves an average accuracy improvement of approximately 6%, compared to\nper-sample quantization. Moreover, our training speedup can reach a maximum of\n5.13x compared to full precision training.\n","authors":["Chang Gao","Jianfei Chen","Kang Zhao","Jiaqi Wang","Liping Jing"],"pdf_url":"https://arxiv.org/pdf/2408.14267v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.09431v2","updated":"2024-08-26T13:41:14Z","published":"2024-04-15T03:12:12Z","title":"VFMM3D: Releasing the Potential of Image by Vision Foundation Model for\n Monocular 3D Object Detection","summary":" Due to its cost-effectiveness and widespread availability, monocular 3D\nobject detection, which relies solely on a single camera during inference,\nholds significant importance across various applications, including autonomous\ndriving and robotics. Nevertheless, directly predicting the coordinates of\nobjects in 3D space from monocular images poses challenges. Therefore, an\neffective solution involves transforming monocular images into LiDAR-like\nrepresentations and employing a LiDAR-based 3D object detector to predict the\n3D coordinates of objects. The key step in this method is accurately converting\nthe monocular image into a reliable point cloud form. In this paper, we present\nVFMM3D, an innovative framework that leverages the capabilities of Vision\nFoundation Models (VFMs) to accurately transform single-view images into LiDAR\npoint cloud representations. VFMM3D utilizes the Segment Anything Model (SAM)\nand Depth Anything Model (DAM) to generate high-quality pseudo-LiDAR data\nenriched with rich foreground information. Specifically, the Depth Anything\nModel (DAM) is employed to generate dense depth maps. Subsequently, the Segment\nAnything Model (SAM) is utilized to differentiate foreground and background\nregions by predicting instance masks. These predicted instance masks and depth\nmaps are then combined and projected into 3D space to generate pseudo-LiDAR\npoints. Finally, any object detectors based on point clouds can be utilized to\npredict the 3D coordinates of objects. Comprehensive experiments are conducted\non two challenging 3D object detection datasets, KITTI and Waymo. Our VFMM3D\nestablishes a new state-of-the-art performance on both datasets. Additionally,\nexperimental results demonstrate the generality of VFMM3D, showcasing its\nseamless integration into various LiDAR-based 3D object detectors.\n","authors":["Bonan Ding","Jin Xie","Jing Nie","Jiale Cao","Xuelong Li","Yanwei Pang"],"pdf_url":"https://arxiv.org/pdf/2404.09431v2.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2408.14253v1","updated":"2024-08-26T13:16:03Z","published":"2024-08-26T13:16:03Z","title":"Text3DAug -- Prompted Instance Augmentation for LiDAR Perception","summary":" LiDAR data of urban scenarios poses unique challenges, such as heterogeneous\ncharacteristics and inherent class imbalance. Therefore, large-scale datasets\nare necessary to apply deep learning methods. Instance augmentation has emerged\nas an efficient method to increase dataset diversity. 
However, current methods\nrequire the time-consuming curation of 3D models or costly manual data\nannotation. To overcome these limitations, we propose Text3DAug, a novel\napproach leveraging generative models for instance augmentation. Text3DAug does\nnot depend on labeled data and is the first of its kind to generate instances\nand annotations from text. This allows for a fully automated pipeline,\neliminating the need for manual effort in practical applications. Additionally,\nText3DAug is sensor agnostic and can be applied regardless of the LiDAR sensor\nused. Comprehensive experimental analysis on LiDAR segmentation, detection and\nnovel class discovery demonstrates that Text3DAug is effective in supplementing\nexisting methods or as a standalone method, performing on par or better than\nestablished methods, however while overcoming their specific drawbacks. The\ncode is publicly available.\n","authors":["Laurenz Reichardt","Luca Uhr","Oliver Wasenmüller"],"pdf_url":"https://arxiv.org/pdf/2408.14253v1.pdf","comment":"Accepted at the 2024 IEEE/RSJ International Conference on Intelligent\n Robots and Systems (IROS 2024)"},{"id":"http://arxiv.org/abs/2408.14249v1","updated":"2024-08-26T13:09:23Z","published":"2024-08-26T13:09:23Z","title":"Beyond Few-shot Object Detection: A Detailed Survey","summary":" Object detection is a critical field in computer vision focusing on\naccurately identifying and locating specific objects in images or videos.\nTraditional methods for object detection rely on large labeled training\ndatasets for each object category, which can be time-consuming and expensive to\ncollect and annotate. To address this issue, researchers have introduced\nfew-shot object detection (FSOD) approaches that merge few-shot learning and\nobject detection principles. These approaches allow models to quickly adapt to\nnew object categories with only a few annotated samples. While traditional FSOD\nmethods have been studied before, this survey paper comprehensively reviews\nFSOD research with a specific focus on covering different FSOD settings such as\nstandard FSOD, generalized FSOD, incremental FSOD, open-set FSOD, and domain\nadaptive FSOD. These approaches play a vital role in reducing the reliance on\nextensive labeled datasets, particularly as the need for efficient machine\nlearning models continues to rise. This survey paper aims to provide a\ncomprehensive understanding of the above-mentioned few-shot settings and\nexplore the methodologies for each FSOD task. It thoroughly compares\nstate-of-the-art methods across different FSOD settings, analyzing them in\ndetail based on their evaluation protocols. Additionally, it offers insights\ninto their applications, challenges, and potential future directions in the\nevolving field of object detection with limited data.\n","authors":["Vishal Chudasama","Hiran Sarkar","Pankaj Wasnik","Vineeth N Balasubramanian","Jayateja Kalla"],"pdf_url":"https://arxiv.org/pdf/2408.14249v1.pdf","comment":"43 pages, 8 figures"},{"id":"http://arxiv.org/abs/2406.08282v3","updated":"2024-08-26T13:01:39Z","published":"2024-06-12T14:47:51Z","title":"Interpretable Representation Learning of Cardiac MRI via Attribute\n Regularization","summary":" Interpretability is essential in medical imaging to ensure that clinicians\ncan comprehend and trust artificial intelligence models. Several approaches\nhave been recently considered to encode attributes in the latent space to\nenhance its interpretability. 
Notably, attribute regularization aims to encode\na set of attributes along the dimensions of a latent representation. However,\nthis approach is based on Variational AutoEncoder and suffers from blurry\nreconstruction. In this paper, we propose an Attributed-regularized Soft\nIntrospective Variational Autoencoder that combines attribute regularization of\nthe latent space within the framework of an adversarially trained variational\nautoencoder. We demonstrate on short-axis cardiac Magnetic Resonance images of\nthe UK Biobank the ability of the proposed method to address blurry\nreconstruction issues of variational autoencoder methods while preserving the\nlatent space interpretability.\n","authors":["Maxime Di Folco","Cosmin I. Bercea","Emily Chan","Julia A. Schnabel"],"pdf_url":"https://arxiv.org/pdf/2406.08282v3.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2312.08915"},{"id":"http://arxiv.org/abs/2408.14244v1","updated":"2024-08-26T12:59:32Z","published":"2024-08-26T12:59:32Z","title":"Cascaded Temporal Updating Network for Efficient Video Super-Resolution","summary":" Existing video super-resolution (VSR) methods generally adopt a recurrent\npropagation network to extract spatio-temporal information from the entire\nvideo sequences, exhibiting impressive performance. However, the key components\nin recurrent-based VSR networks significantly impact model efficiency, e.g.,\nthe alignment module occupies a substantial portion of model parameters, while\nthe bidirectional propagation mechanism significantly amplifies the inference\ntime. Consequently, developing a compact and efficient VSR method that can be\ndeployed on resource-constrained devices, e.g., smartphones, remains\nchallenging. To this end, we propose a cascaded temporal updating network\n(CTUN) for efficient VSR. We first develop an implicit cascaded alignment\nmodule to explore spatio-temporal correspondences from adjacent frames.\nMoreover, we propose a unidirectional propagation updating network to\nefficiently explore long-range temporal information, which is crucial for\nhigh-quality video reconstruction. Specifically, we develop a simple yet\neffective hidden updater that can leverage future information to update hidden\nfeatures during forward propagation, significantly reducing inference time\nwhile maintaining performance. Finally, we formulate all of these components\ninto an end-to-end trainable VSR network. Extensive experimental results show\nthat our CTUN achieves a favorable trade-off between efficiency and performance\ncompared to existing methods. Notably, compared with BasicVSR, our method\nobtains better results while employing only about 30% of the parameters and\nrunning time. The source code and pre-trained models will be available at\nhttps://github.com/House-Leo/CTUN.\n","authors":["Hao Li","Jiangxin Dong","Jinshan Pan"],"pdf_url":"https://arxiv.org/pdf/2408.14244v1.pdf","comment":"Project website: https://github.com/House-Leo/CTUN"},{"id":"http://arxiv.org/abs/2403.12848v2","updated":"2024-08-26T12:55:44Z","published":"2024-03-19T15:54:48Z","title":"Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit\n regularization","summary":" Compositional 3D scene synthesis has diverse applications across a spectrum\nof industries such as robotics, films, and video games, as it closely mirrors\nthe complexity of real-world multi-object environments. Conventional works\ntypically employ shape retrieval based frameworks which naturally suffer from\nlimited shape diversity. 
Recent progress has been made in object shape\ngeneration with generative models such as diffusion models, which increases the\nshape fidelity. However, these approaches separately treat 3D shape generation\nand layout generation. The synthesized scenes are usually hampered by layout\ncollision, which suggests that the scene-level fidelity is still\nunder-explored. In this paper, we aim at generating realistic and reasonable 3D\nindoor scenes from a scene graph. To enrich the priors of the given scene graph\ninputs, a large language model is utilized to aggregate the global-wise features\nwith local node-wise and edge-wise features. With a unified graph encoder,\ngraph features are extracted to guide joint layout-shape generation. Additional\nregularization is introduced to explicitly constrain the produced 3D layouts.\nBenchmarked on the SG-FRONT dataset, our method achieves better 3D scene\nsynthesis, especially in terms of scene-level fidelity. The source code will be\nreleased after publication.\n","authors":["Yao Wei","Martin Renqiang Min","George Vosselman","Li Erran Li","Michael Ying Yang"],"pdf_url":"https://arxiv.org/pdf/2403.12848v2.pdf","comment":"16 pages, 10 figures"},{"id":"http://arxiv.org/abs/2401.16712v2","updated":"2024-08-26T12:52:25Z","published":"2024-01-30T03:17:02Z","title":"LF Tracy: A Unified Single-Pipeline Approach for Salient Object\n Detection in Light Field Cameras","summary":" Leveraging rich information is crucial for dense prediction tasks. Light\nfield (LF) cameras are instrumental in this regard, as they allow data to be\nsampled from various perspectives. This capability provides valuable spatial,\ndepth, and angular information, enhancing scene-parsing tasks. However, we have\nidentified two overlooked issues for the LF salient object detection (SOD)\ntask. (1): Previous approaches predominantly employ a customized two-stream\ndesign to discover the spatial and depth features within light field images.\nThe network struggles to learn the implicit angular information between\ndifferent images due to a lack of intra-network data connectivity. (2): Little\nresearch has been directed towards the data augmentation strategy for LF SOD.\nResearch on inter-network data connectivity is scant. In this study, we propose\nan efficient paradigm (LF Tracy) to address those issues. This comprises a\nsingle-pipeline encoder paired with a highly efficient information aggregation\n(IA) module (around 8M parameters) to establish an intra-network connection.\nThen, a simple yet effective data augmentation strategy called MixLD is\ndesigned to bridge the inter-network connections. Owing to this innovative\nparadigm, our model surpasses the existing state-of-the-art method through\nextensive experiments. In particular, LF Tracy demonstrates a 23% improvement over\nprevious results on the latest large-scale PKU dataset. The source code is\npublicly available at: https://github.com/FeiBryantkit/LF-Tracy.\n","authors":["Fei Teng","Jiaming Zhang","Jiawei Liu","Kunyu Peng","Xina Cheng","Zhiyong Li","Kailun Yang"],"pdf_url":"https://arxiv.org/pdf/2401.16712v2.pdf","comment":"Accepted to ICPR 2024. 
The source code is publicly available at:\n https://github.com/FeiBryantkit/LF-Tracy"},{"id":"http://arxiv.org/abs/2408.14229v1","updated":"2024-08-26T12:44:17Z","published":"2024-08-26T12:44:17Z","title":"Gallery-Aware Uncertainty Estimation For Open-Set Face Recognition","summary":" Accurately estimating image quality and model robustness improvement are\ncritical challenges in unconstrained face recognition, which can be addressed\nthrough uncertainty estimation via probabilistic face embeddings. Previous\nresearch mainly focused on uncertainty estimation in face verification, leaving\nthe open-set face recognition task underexplored. In open-set face recognition,\none seeks to classify an image, which could also be unknown. Here, the low\nvariance of probabilistic embedding does not imply a low error probability: an\nimage embedding could be close to several classes in a gallery, thus yielding\nhigh uncertainty. We propose a method aware of two sources of ambiguity in the\nopen-set recognition system: (1) the gallery uncertainty caused by overlapping\nclasses and (2) the uncertainty of the face embeddings. To detect both types,\nwe use a Bayesian probabilistic model of embedding distribution, which provides\na principled uncertainty estimate. Challenging open-set face recognition\ndatasets, such as IJB-C, serve as a testbed for our method. We also propose a\nnew open-set recognition protocol for whale and dolphin identification. The\nproposed approach better identifies recognition errors than uncertainty\nestimation methods based solely on image quality.\n","authors":["Leonid Erlygin","Alexey Zaytsev"],"pdf_url":"https://arxiv.org/pdf/2408.14229v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14227v1","updated":"2024-08-26T12:43:48Z","published":"2024-08-26T12:43:48Z","title":"TC-PDM: Temporally Consistent Patch Diffusion Models for\n Infrared-to-Visible Video Translation","summary":" Infrared imaging offers resilience against changing lighting conditions by\ncapturing object temperatures. Yet, in few scenarios, its lack of visual\ndetails compared to daytime visible images, poses a significant challenge for\nhuman and machine interpretation. This paper proposes a novel diffusion method,\ndubbed Temporally Consistent Patch Diffusion Models (TC-DPM), for\ninfrared-to-visible video translation. Our method, extending the Patch\nDiffusion Model, consists of two key components. Firstly, we propose a\nsemantic-guided denoising, leveraging the strong representations of\nfoundational models. As such, our method faithfully preserves the semantic\nstructure of generated visible images. Secondly, we propose a novel temporal\nblending module to guide the denoising trajectory, ensuring the temporal\nconsistency between consecutive frames. Experiment shows that TC-PDM\noutperforms state-of-the-art methods by 35.3% in FVD for infrared-to-visible\nvideo translation and by 6.1% in AP50 for day-to-night object detection. 
Our\ncode is publicly available at https://github.com/dzungdoan6/tc-pdm\n","authors":["Anh-Dzung Doan","Vu Minh Hieu Phan","Surabhi Gupta","Markus Wagner","Tat-Jun Chin","Ian Reid"],"pdf_url":"https://arxiv.org/pdf/2408.14227v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2408.14211v1","updated":"2024-08-26T12:10:52Z","published":"2024-08-26T12:10:52Z","title":"MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware\n Diffusion and Iterative Refinement","summary":" Existing works in single-image human reconstruction suffer from weak\ngeneralizability due to insufficient training data or 3D inconsistencies for a\nlack of comprehensive multi-view knowledge. In this paper, we introduce\nMagicMan, a human-specific multi-view diffusion model designed to generate\nhigh-quality novel view images from a single reference image. As its core, we\nleverage a pre-trained 2D diffusion model as the generative prior for\ngeneralizability, with the parametric SMPL-X model as the 3D body prior to\npromote 3D awareness. To tackle the critical challenge of maintaining\nconsistency while achieving dense multi-view generation for improved 3D human\nreconstruction, we first introduce hybrid multi-view attention to facilitate\nboth efficient and thorough information interchange across different views.\nAdditionally, we present a geometry-aware dual branch to perform concurrent\ngeneration in both RGB and normal domains, further enhancing consistency via\ngeometry cues. Last but not least, to address ill-shaped issues arising from\ninaccurate SMPL-X estimation that conflicts with the reference image, we\npropose a novel iterative refinement strategy, which progressively optimizes\nSMPL-X accuracy while enhancing the quality and consistency of the generated\nmulti-views. Extensive experimental results demonstrate that our method\nsignificantly outperforms existing approaches in both novel view synthesis and\nsubsequent 3D human reconstruction tasks.\n","authors":["Xu He","Xiaoyu Li","Di Kang","Jiangnan Ye","Chaopeng Zhang","Liyang Chen","Xiangjun Gao","Han Zhang","Zhiyong Wu","Haolin Zhuang"],"pdf_url":"https://arxiv.org/pdf/2408.14211v1.pdf","comment":"Project Page: https://thuhcsi.github.io/MagicMan"},{"id":"http://arxiv.org/abs/2408.14197v1","updated":"2024-08-26T11:53:09Z","published":"2024-08-26T11:53:09Z","title":"Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting\n and Planning via World Models for Autonomous Driving","summary":" World models envision potential future states based on various ego actions.\nThey embed extensive knowledge about the driving environment, facilitating safe\nand scalable autonomous driving. Most existing methods primarily focus on\neither data generation or the pretraining paradigms of world models. Unlike the\naforementioned prior works, we propose Drive-OccWorld, which adapts a\nvision-centric 4D forecasting world model to end-to-end planning for autonomous\ndriving. Specifically, we first introduce a semantic and motion-conditional\nnormalization in the memory module, which accumulates semantic and dynamic\ninformation from historical BEV embeddings. These BEV features are then\nconveyed to the world decoder for future occupancy and flow forecasting,\nconsidering both geometry and spatiotemporal modeling. 
Additionally, we propose\ninjecting flexible action conditions, such as velocity, steering angle,\ntrajectory, and commands, into the world model to enable controllable\ngeneration and facilitate a broader range of downstream applications.\nFurthermore, we explore integrating the generative capabilities of the 4D world\nmodel with end-to-end planning, enabling continuous forecasting of future\nstates and the selection of optimal trajectories using an occupancy-based cost\nfunction. Extensive experiments on the nuScenes dataset demonstrate that our\nmethod can generate plausible and controllable 4D occupancy, opening new\navenues for driving world generation and end-to-end planning.\n","authors":["Yu Yang","Jianbiao Mei","Yukai Ma","Siliang Du","Wenqing Chen","Yijie Qian","Yuxiang Feng","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2408.14197v1.pdf","comment":"18 pages, 10 figures"},{"id":"http://arxiv.org/abs/2408.14192v1","updated":"2024-08-26T11:36:38Z","published":"2024-08-26T11:36:38Z","title":"Feature Aligning Few shot Learning Method Using Local Descriptors\n Weighted Rules","summary":" Few-shot classification involves identifying new categories using a limited\nnumber of labeled samples. Current few-shot classification methods based on\nlocal descriptors primarily leverage underlying consistent features across\nvisible and invisible classes, facing challenges including redundant\nneighboring information, noisy representations, and limited interpretability.\nThis paper proposes a Feature Aligning Few-shot Learning Method Using Local\nDescriptors Weighted Rules (FAFD-LDWR). It innovatively introduces a\ncross-normalization method into few-shot image classification to preserve the\ndiscriminative information of local descriptors as much as possible; and\nenhances classification performance by aligning key local descriptors of\nsupport and query sets to remove background noise. FAFD-LDWR performs\nexcellently on three benchmark datasets , outperforming state-of-the-art\nmethods in both 1-shot and 5-shot settings. The designed visualization\nexperiments also demonstrate FAFD-LDWR's improvement in prediction\ninterpretability.\n","authors":["Bingchen Yan"],"pdf_url":"https://arxiv.org/pdf/2408.14192v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14189v1","updated":"2024-08-26T11:26:27Z","published":"2024-08-26T11:26:27Z","title":"EMDFNet: Efficient Multi-scale and Diverse Feature Network for Traffic\n Sign Detection","summary":" The detection of small objects, particularly traffic signs, is a critical\nsubtask within object detection and autonomous driving. Despite the notable\nachievements in previous research, two primary challenges persist. Firstly, the\nmain issue is the singleness of feature extraction. Secondly, the detection\nprocess fails to effectively integrate with objects of varying sizes or scales.\nThese issues are also prevalent in generic object detection. Motivated by these\nchallenges, in this paper, we propose a novel object detection network named\nEfficient Multi-scale and Diverse Feature Network (EMDFNet) for traffic sign\ndetection that integrates an Augmented Shortcut Module and an Efficient Hybrid\nEncoder to address the aforementioned issues simultaneously. Specifically, the\nAugmented Shortcut Module utilizes multiple branches to integrate various\nspatial semantic information and channel semantic information, thereby\nenhancing feature diversity. 
The Efficient Hybrid Encoder utilizes global\nfeature fusion and local feature interaction based on various features to\ngenerate distinctive classification features by integrating feature information\nin an adaptable manner. Extensive experiments on the Tsinghua-Tencent 100K\n(TT100K) benchmark and the German Traffic Sign Detection Benchmark (GTSDB)\ndemonstrate that our EMDFNet outperforms other state-of-the-art detectors in\nperformance while retaining the real-time processing capabilities of\nsingle-stage models. This substantiates the effectiveness of EMDFNet in\ndetecting small traffic signs.\n","authors":["Pengyu Li","Chenhe Liu","Tengfei Li","Xinyu Wang","Shihui Zhang","Dongyang Yu"],"pdf_url":"https://arxiv.org/pdf/2408.14189v1.pdf","comment":"15 pages,5 figures,accepted to ICANN"},{"id":"http://arxiv.org/abs/2408.14187v1","updated":"2024-08-26T11:24:13Z","published":"2024-08-26T11:24:13Z","title":"Ensemble Predicate Decoding for Unbiased Scene Graph Generation","summary":" Scene Graph Generation (SGG) aims to generate a comprehensive graphical\nrepresentation that accurately captures the semantic information of a given\nscenario. However, the SGG model's performance in predicting more fine-grained\npredicates is hindered by a significant predicate bias. According to existing\nworks, the long-tail distribution of predicates in training data results in the\nbiased scene graph. However, the semantic overlap between predicate categories\nmakes predicate prediction difficult, and there is a significant difference in\nthe sample size of semantically similar predicates, making the predicate\nprediction more difficult. Therefore, higher requirements are placed on the\ndiscriminative ability of the model. In order to address this problem, this\npaper proposes Ensemble Predicate Decoding (EPD), which employs multiple\ndecoders to attain unbiased scene graph generation. Two auxiliary decoders\ntrained on lower-frequency predicates are used to improve the discriminative\nability of the model. Extensive experiments are conducted on the VG, and the\nexperiment results show that EPD enhances the model's representation capability\nfor predicates. In addition, we find that our approach ensures a relatively\nsuperior predictive capability for more frequent predicates compared to\nprevious unbiased SGG methods.\n","authors":["Jiasong Feng","Lichun Wang","Hongbo Xu","Kai Xu","Baocai Yin"],"pdf_url":"https://arxiv.org/pdf/2408.14187v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14186v1","updated":"2024-08-26T11:22:52Z","published":"2024-08-26T11:22:52Z","title":"Affine steerers for structured keypoint description","summary":" We propose a way to train deep learning based keypoint descriptors that makes\nthem approximately equivariant for locally affine transformations of the image\nplane. The main idea is to use the representation theory of GL(2) to generalize\nthe recently introduced concept of steerers from rotations to affine\ntransformations. Affine steerers give high control over how keypoint\ndescriptions transform under image transformations. We demonstrate the\npotential of using this control for image matching. Finally, we propose a way\nto finetune keypoint descriptors with a set of steerers on upright images and\nobtain state-of-the-art results on several standard benchmarks. 
Code will be\npublished at github.com/georg-bn/affine-steerers.\n","authors":["Georg Bökman","Johan Edstedt","Michael Felsberg","Fredrik Kahl"],"pdf_url":"https://arxiv.org/pdf/2408.14186v1.pdf","comment":"To be presented at ECCV 2024"},{"id":"http://arxiv.org/abs/2408.14180v1","updated":"2024-08-26T11:08:44Z","published":"2024-08-26T11:08:44Z","title":"I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing","summary":" Significant progress has been made in the field of Instruction-based Image\nEditing (IIE). However, evaluating these models poses a significant challenge.\nA crucial requirement in this field is the establishment of a comprehensive\nevaluation benchmark for accurately assessing editing results and providing\nvaluable insights for its further development. In response to this need, we\npropose I2EBench, a comprehensive benchmark designed to automatically evaluate\nthe quality of edited images produced by IIE models from multiple dimensions.\nI2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding\noriginal and diverse instructions. It offers three distinctive characteristics:\n1) Comprehensive Evaluation Dimensions: I2EBench comprises 16 evaluation\ndimensions that cover both high-level and low-level aspects, providing a\ncomprehensive assessment of each IIE model. 2) Human Perception Alignment: To\nensure the alignment of our benchmark with human perception, we conducted an\nextensive user study for each evaluation dimension. 3) Valuable Research\nInsights: By analyzing the advantages and disadvantages of existing IIE models\nacross the 16 dimensions, we offer valuable research insights to guide future\ndevelopment in the field. We will open-source I2EBench, including all\ninstructions, input images, human annotations, edited images from all evaluated\nmethods, and a simple script for evaluating the results from new IIE models.\nThe code, dataset and generated images from all IIE models are provided on\nGitHub: https://github.com/cocoshe/I2EBench.\n","authors":["Yiwei Ma","Jiayi Ji","Ke Ye","Weihuang Lin","Zhibin Wang","Yonghan Zheng","Qiang Zhou","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2408.14180v1.pdf","comment":"Tech report, 39 pages, 41 figures"},{"id":"http://arxiv.org/abs/2408.13149v2","updated":"2024-08-26T11:05:58Z","published":"2024-08-23T15:16:01Z","title":"Focus on Neighbors and Know the Whole: Towards Consistent Dense\n Multiview Text-to-Image Generator for 3D Creation","summary":" Generating dense multiview images from text prompts is crucial for creating\nhigh-fidelity 3D assets. Nevertheless, existing methods struggle with\nspace-view correspondences, resulting in sparse and low-quality outputs. In\nthis paper, we introduce CoSER, a novel consistent dense Multiview\nText-to-Image Generator for Text-to-3D, achieving both efficiency and quality\nby meticulously learning neighbor-view coherence and further alleviating\nambiguity through the swift traversal of all views. For achieving neighbor-view\nconsistency, each viewpoint densely interacts with adjacent viewpoints to\nperceive the global spatial structure, and aggregates information along motion\npaths explicitly defined by physical principles to refine details. To further\nenhance cross-view consistency and alleviate content drift, CoSER rapidly scans\nall views in a spiral bidirectional manner to capture holistic information and then\nscores each point based on semantic material.
Subsequently, we conduct weighted\ndown-sampling along the spatial dimension based on scores, thereby facilitating\nprominent information fusion across all views with lightweight computation.\nTechnically, the core module is built by integrating the attention mechanism\nwith a selective state space model, exploiting the robust learning capabilities\nof the former and the low overhead of the latter. Extensive evaluation shows\nthat CoSER is capable of producing dense, high-fidelity, content-consistent\nmultiview images that can be flexibly integrated into various 3D generation\nmodels.\n","authors":["Bonan Li","Zicheng Zhang","Xingyi Yang","Xinchao Wang"],"pdf_url":"https://arxiv.org/pdf/2408.13149v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.02474v3","updated":"2024-08-26T11:04:45Z","published":"2024-02-04T13:09:13Z","title":"Deep Spectral Improvement for Unsupervised Image Instance Segmentation","summary":" Deep spectral methods reframe the image decomposition process as a graph\npartitioning task by extracting features using self-supervised learning and\nutilizing the Laplacian of the affinity matrix to obtain eigensegments.\nHowever, instance segmentation has received less attention compared to other\ntasks within the context of deep spectral methods. This paper addresses the\nfact that not all channels of the feature map extracted from a self-supervised\nbackbone contain sufficient information for instance segmentation purposes. In\nfact, some channels are noisy and hinder the accuracy of the task. To overcome\nthis issue, this paper proposes two channel reduction modules: Noise Channel\nReduction (NCR) and Deviation-based Channel Reduction (DCR). The NCR retains\nchannels with lower entropy, as they are less likely to be noisy, while DCR\nprunes channels with low standard deviation, as they lack sufficient\ninformation for effective instance segmentation. Furthermore, the paper\ndemonstrates that the dot product, commonly used in deep spectral methods, is\nnot suitable for instance segmentation due to its sensitivity to feature map\nvalues, potentially leading to incorrect instance segments. A new similarity\nmetric called Bray-Curtis over Chebyshev (BoC) is proposed to address this\nissue. It takes into account the distribution of features in addition to their\nvalues, providing a more robust similarity measure for instance segmentation.\nQuantitative and qualitative results on the Youtube-VIS2019 dataset highlight\nthe improvements achieved by the proposed channel reduction methods and the use\nof BoC instead of the conventional dot product for creating the affinity\nmatrix. These improvements are observed in terms of mean Intersection over\nUnion and extracted instance segments, demonstrating enhanced instance\nsegmentation performance. The code is available at:\nhttps://github.com/farnooshar/SpecUnIIS\n","authors":["Farnoosh Arefi","Amir M. Mansourian","Shohreh Kasaei"],"pdf_url":"https://arxiv.org/pdf/2402.02474v3.pdf","comment":"11 pages, 13 figures and 5 tables"},{"id":"http://arxiv.org/abs/2408.04249v2","updated":"2024-08-26T10:57:15Z","published":"2024-08-08T06:29:32Z","title":"InstantStyleGaussian: Efficient Art Style Transfer with 3D Gaussian\n Splatting","summary":" We present InstantStyleGaussian, an innovative 3D style transfer method based\non the 3D Gaussian Splatting (3DGS) scene representation. By inputting a\ntarget-style image, it quickly generates new 3D GS scenes.
Our method operates\non pre-reconstructed GS scenes, combining diffusion models with an improved\niterative dataset update strategy. It utilizes diffusion models to generate\ntarget style images, adds these new images to the training dataset, and uses\nthis dataset to iteratively update and optimize the GS scenes, significantly\naccelerating the style editing process while ensuring the quality of the\ngenerated scenes. Extensive experimental results demonstrate that our method\nensures high-quality stylized scenes while offering significant advantages in\nstyle transfer speed and consistency.\n","authors":["Xin-Yi Yu","Jun-Xin Yu","Li-Bo Zhou","Yan Wei","Lin-Lin Ou"],"pdf_url":"https://arxiv.org/pdf/2408.04249v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14177v1","updated":"2024-08-26T10:50:14Z","published":"2024-08-26T10:50:14Z","title":"NimbleD: Enhancing Self-supervised Monocular Depth Estimation with\n Pseudo-labels and Large-scale Video Pre-training","summary":" We introduce NimbleD, an efficient self-supervised monocular depth estimation\nlearning framework that incorporates supervision from pseudo-labels generated\nby a large vision model. This framework does not require camera intrinsics,\nenabling large-scale pre-training on publicly available videos. Our\nstraightforward yet effective learning strategy significantly enhances the\nperformance of fast and lightweight models without introducing any overhead,\nallowing them to achieve performance comparable to state-of-the-art\nself-supervised monocular depth estimation models. This advancement is\nparticularly beneficial for virtual and augmented reality applications\nrequiring low latency inference. The source code, model weights, and\nacknowledgments are available at https://github.com/xapaxca/nimbled .\n","authors":["Albert Luginov","Muhammad Shahzad"],"pdf_url":"https://arxiv.org/pdf/2408.14177v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.17428v2","updated":"2024-08-26T10:49:29Z","published":"2023-11-29T08:09:01Z","title":"SigFormer: Sparse Signal-Guided Transformer for Multi-Modal Human Action\n Segmentation","summary":" Multi-modal human action segmentation is a critical and challenging task with\na wide range of applications. Nowadays, the majority of approaches concentrate\non the fusion of dense signals (i.e., RGB, optical flow, and depth maps).\nHowever, the potential contributions of sparse IoT sensor signals, which can be\ncrucial for achieving accurate recognition, have not been fully explored. To\nmake up for this, we introduce a Sparse signal-guided Transformer (SigFormer) to\ncombine both dense and sparse signals. We employ mask attention to fuse\nlocalized features by constraining cross-attention within the regions where\nsparse signals are valid. However, since sparse signals are discrete, they lack\nsufficient information about the temporal action boundaries. Therefore, in\nSigFormer, we propose to emphasize the boundary information at two stages to\nalleviate this problem. In the first feature extraction stage, we introduce an\nintermediate bottleneck module to jointly learn both category and boundary\nfeatures of each dense modality through the inner loss functions. After the\nfusion of dense modalities and sparse signals, we then devise a two-branch\narchitecture that explicitly models the interrelationship between action\ncategory and temporal boundary.
Experimental results demonstrate that SigFormer\noutperforms the state-of-the-art approaches on a multi-modal action\nsegmentation dataset from real industrial environments, reaching an outstanding\nF1 score of 0.958. The codes and pre-trained models have been available at\nhttps://github.com/LIUQI-creat/SigFormer.\n","authors":["Qi Liu","Xinchen Liu","Kun Liu","Xiaoyan Gu","Wu Liu"],"pdf_url":"https://arxiv.org/pdf/2311.17428v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14176v1","updated":"2024-08-26T10:42:53Z","published":"2024-08-26T10:42:53Z","title":"SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its\n Teacher","summary":" In this paper, we aim to enhance the performance of SwiftBrush, a prominent\none-step text-to-image diffusion model, to be competitive with its multi-step\nStable Diffusion counterpart. Initially, we explore the quality-diversity\ntrade-off between SwiftBrush and SD Turbo: the former excels in image\ndiversity, while the latter excels in image quality. This observation motivates\nour proposed modifications in the training methodology, including better weight\ninitialization and efficient LoRA training. Moreover, our introduction of a\nnovel clamped CLIP loss enhances image-text alignment and results in improved\nimage quality. Remarkably, by combining the weights of models trained with\nefficient LoRA and full training, we achieve a new state-of-the-art one-step\ndiffusion model, achieving an FID of 8.14 and surpassing all GAN-based and\nmulti-step Stable Diffusion models. The evaluation code is available at:\nhttps://github.com/vinairesearch/swiftbrushv2.\n","authors":["Trung Dao","Thuan Hoang Nguyen","Thanh Le","Duc Vu","Khoi Nguyen","Cuong Pham","Anh Tran"],"pdf_url":"https://arxiv.org/pdf/2408.14176v1.pdf","comment":"Accepted to ECCV'24"},{"id":"http://arxiv.org/abs/2402.11237v2","updated":"2024-08-26T10:39:22Z","published":"2024-02-17T10:02:22Z","title":"Be Persistent: Towards a Unified Solution for Mitigating Shortcuts in\n Deep Learning","summary":" Deep neural networks (DNNs) are vulnerable to shortcut learning: rather than\nlearning the intended task, they tend to draw inconclusive relationships\nbetween their inputs and outputs. Shortcut learning is ubiquitous among many\nfailure cases of neural networks, and traces of this phenomenon can be seen in\ntheir generalizability issues, domain shift, adversarial vulnerability, and\neven bias towards majority groups. In this paper, we argue that this\ncommonality in the cause of various DNN issues creates a significant\nopportunity that should be leveraged to find a unified solution for shortcut\nlearning. To this end, we outline the recent advances in topological data\nanalysis (TDA), and persistent homology (PH) in particular, to sketch a unified\nroadmap for detecting shortcuts in deep learning. We demonstrate our arguments\nby investigating the topological features of computational graphs in DNNs using\ntwo cases of unlearnable examples and bias in decision-making as our test\nstudies. Our analysis of these two failure cases of DNNs reveals that finding a\nunified solution for shortcut learning in DNNs is not out of reach, and TDA can\nplay a significant role in forming such a framework.\n","authors":["Hadi M. Dolatabadi","Sarah M. 
Erfani","Christopher Leckie"],"pdf_url":"https://arxiv.org/pdf/2402.11237v2.pdf","comment":"Accepted to the 2024 European Conference on Artificial Intelligence\n (ECAI)"},{"id":"http://arxiv.org/abs/2408.14173v1","updated":"2024-08-26T10:39:01Z","published":"2024-08-26T10:39:01Z","title":"BackFlip: The Impact of Local and Global Data Augmentations on Artistic\n Image Aesthetic Assessment","summary":" Assessing the aesthetic quality of artistic images presents unique challenges\ndue to the subjective nature of aesthetics and the complex visual\ncharacteristics inherent to artworks. Basic data augmentation techniques\ncommonly applied to natural images in computer vision may not be suitable for\nart images in aesthetic evaluation tasks, as they can change the composition of\nthe art images. In this paper, we explore the impact of local and global data\naugmentation techniques on artistic image aesthetic assessment (IAA). We\nintroduce BackFlip, a local data augmentation technique designed specifically\nfor artistic IAA. We evaluate the performance of BackFlip across three artistic\nimage datasets and four neural network architectures, comparing it with the\ncommonly used data augmentation techniques. Then, we analyze the effects of\ncomponents within the BackFlip pipeline through an ablation study. Our findings\ndemonstrate that local augmentations, such as BackFlip, tend to outperform\nglobal augmentations on artistic IAA in most cases, probably because they do\nnot perturb the composition of the art images. These results emphasize the\nimportance of considering both local and global augmentations in future\ncomputational aesthetics research.\n","authors":["Ombretta Strafforello","Gonzalo Muradas Odriozola","Fatemeh Behrad","Li-Wei Chen","Anne-Sofie Maerten","Derya Soydaner","Johan Wagemans"],"pdf_url":"https://arxiv.org/pdf/2408.14173v1.pdf","comment":"Published at the VISART VII workshop at ECCV 2024. Ombretta\n Strafforello, Gonzalo Muradas Odriozola, Fatemeh Behrad, Li-Wei Chen,\n Anne-Sofie Maerten and Derya Soydaner contributed equally to this work"},{"id":"http://arxiv.org/abs/2408.01224v3","updated":"2024-08-26T09:59:55Z","published":"2024-08-02T12:27:15Z","title":"Multi-head Spatial-Spectral Mamba for Hyperspectral Image Classification","summary":" Spatial-Spectral Mamba (SSM) improves computational efficiency and captures\nlong-range dependencies, addressing Transformer limitations. However,\ntraditional Mamba models overlook rich spectral information in HSIs and\nstruggle with high dimensionality and sequential data. To address these issues,\nwe propose the SSM with multi-head self-attention and token enhancement\n(MHSSMamba). This model integrates spectral and spatial information by\nenhancing spectral tokens and using multi-head attention to capture complex\nrelationships between spectral bands and spatial locations. It also manages\nlong-range dependencies and the sequential nature of HSI data, preserving\ncontextual information across spectral bands. MHSSMamba achieved remarkable\nclassification accuracies of 97.62\\% on Pavia University, 96.92\\% on the\nUniversity of Houston, 96.85\\% on Salinas, and 99.49\\% on Wuhan-longKou\ndatasets. 
The source code is available at\n\\href{https://github.com/MHassaanButt/MHA\\_SS\\_Mamba}{GitHub}.\n","authors":["Muhammad Ahmad","Muhammad Hassaan Farooq Butt","Muhammad Usama","Hamad Ahmed Altuwaijri","Manuel Mazzara","Salvatore Distefano"],"pdf_url":"https://arxiv.org/pdf/2408.01224v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14153v1","updated":"2024-08-26T09:55:34Z","published":"2024-08-26T09:55:34Z","title":"Explaining Vision-Language Similarities in Dual Encoders with\n Feature-Pair Attributions","summary":" Dual encoder architectures like CLIP models map two types of inputs into a\nshared embedding space and learn similarities between them. However, it is not\nunderstood how such models compare two inputs. Here, we address this research\ngap with two contributions. First, we derive a method to attribute predictions\nof any differentiable dual encoder onto feature-pair interactions between its\ninputs. Second, we apply our method to CLIP-type models and show that they\nlearn fine-grained correspondences between parts of captions and regions in\nimages. They match objects across input modes and also account for mismatches.\nHowever, this visual-linguistic grounding ability heavily varies between object\nclasses, depends on the training data distribution, and largely improves after\nin-domain training. Using our method we can identify knowledge gaps about\nspecific object classes in individual models and can monitor their improvement\nupon fine-tuning.\n","authors":["Lucas Möller","Pascal Tilli","Ngoc Thang Vu","Sebastian Padó"],"pdf_url":"https://arxiv.org/pdf/2408.14153v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14152v1","updated":"2024-08-26T09:55:32Z","published":"2024-08-26T09:55:32Z","title":"Application of Disentanglement to Map Registration Problem","summary":" Geospatial data come from various sources, such as satellites, aircraft, and\nLiDAR. The variability of the source is not limited to the types of data\nacquisition techniques, as we have maps from different time periods. To\nincorporate these data for a coherent analysis, it is essential to first align\ndifferent \"styles\" of geospatial data to its matching images that point to the\nsame location on the surface of the Earth. In this paper, we approach the image\nregistration as a two-step process of (1) extracting geospatial contents\ninvariant to visual (and any other non-content-related) information, and (2)\nmatching the data based on such (purely) geospatial contents. We hypothesize\nthat a combination of $\\beta$-VAE-like architecture [2] and adversarial\ntraining will achieve both the disentanglement of the geographic information\nand artistic styles and generation of new map tiles by composing the encoded\ngeographic information with any artistic style.\n","authors":["Hae Jin Song","Patrycja Krawczuk","Po-Hsuan Huang"],"pdf_url":"https://arxiv.org/pdf/2408.14152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14143v1","updated":"2024-08-26T09:41:40Z","published":"2024-08-26T09:41:40Z","title":"2D-Malafide: Adversarial Attacks Against Face Deepfake Detection Systems","summary":" We introduce 2D-Malafide, a novel and lightweight adversarial attack designed\nto deceive face deepfake detection systems. Building upon the concept of 1D\nconvolutional perturbations explored in the speech domain, our method leverages\n2D convolutional filters to craft perturbations which significantly degrade the\nperformance of state-of-the-art face deepfake detectors. 
Unlike traditional\nadditive noise approaches, 2D-Malafide optimises a small number of filter\ncoefficients to generate robust adversarial perturbations which are\ntransferable across different face images. Experiments, conducted using the\nFaceForensics++ dataset, demonstrate that 2D-Malafide substantially degrades\ndetection performance in both white-box and black-box settings, with larger\nfilter sizes having the greatest impact. Additionally, we report an\nexplainability analysis using GradCAM which illustrates how 2D-Malafide\nmisleads detection systems by altering the image areas used most for\nclassification. Our findings highlight the vulnerability of current deepfake\ndetection systems to convolutional adversarial attacks as well as the need for\nfuture work to enhance detection robustness through improved image fidelity\nconstraints.\n","authors":["Chiara Galdi","Michele Panariello","Massimiliano Todisco","Nicholas Evans"],"pdf_url":"https://arxiv.org/pdf/2408.14143v1.pdf","comment":"Accepted at BIOSIG 2024"},{"id":"http://arxiv.org/abs/2408.14135v1","updated":"2024-08-26T09:32:16Z","published":"2024-08-26T09:32:16Z","title":"Foodfusion: A Novel Approach for Food Image Composition via Diffusion\n Models","summary":" Food image composition requires the use of existing dish images and\nbackground images to synthesize a natural new image, while diffusion models\nhave made significant advancements in image generation, enabling the\nconstruction of end-to-end architectures that yield promising results. However,\nexisting diffusion models face challenges in processing and fusing information\nfrom multiple images and lack access to high-quality publicly available\ndatasets, which prevents the application of diffusion models in food image\ncomposition. In this paper, we introduce a large-scale, high-quality food image\ncomposite dataset, FC22k, which comprises 22,000 foreground, background, and\nground truth ternary image pairs. Additionally, we propose a novel food image\ncomposition method, Foodfusion, which leverages the capabilities of the\npre-trained diffusion models and incorporates a Fusion Module for processing\nand integrating foreground and background information. This fused information\naligns the foreground features with the background structure by merging the\nglobal structural information at the cross-attention layer of the denoising\nUNet. To further enhance the content and structure of the background, we also\nintegrate a Content-Structure Control Module. Extensive experiments demonstrate\nthe effectiveness and scalability of our proposed method.\n","authors":["Chaohua Shi","Xuan Wang","Si Shi","Xule Wang","Mingrui Zhu","Nannan Wang","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2408.14135v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2408.14131v1","updated":"2024-08-26T09:26:08Z","published":"2024-08-26T09:26:08Z","title":"GenFormer -- Generated Images are All You Need to Improve Robustness of\n Transformers on Small Datasets","summary":" Recent studies showcase the competitive accuracy of Vision Transformers\n(ViTs) in relation to Convolutional Neural Networks (CNNs), along with their\nremarkable robustness. However, ViTs demand a large amount of data to achieve\nadequate performance, which makes their application to small datasets\nchallenging, falling behind CNNs. 
To overcome this, we propose GenFormer, a\ndata augmentation strategy utilizing generated images, thereby improving\ntransformer accuracy and robustness on small-scale image classification tasks.\nIn our comprehensive evaluation we propose Tiny ImageNetV2, -R, and -A as new\ntest set variants of Tiny ImageNet by transferring established ImageNet\ngeneralization and robustness benchmarks to the small-scale data domain.\nSimilarly, we introduce MedMNIST-C and EuroSAT-C as corrupted test set variants\nof established fine-grained datasets in the medical and aerial domain. Through\na series of experiments conducted on small datasets of various domains,\nincluding Tiny ImageNet, CIFAR, EuroSAT and MedMNIST datasets, we demonstrate\nthe synergistic power of our method, in particular when combined with common\ntrain and test time augmentations, knowledge distillation, and architectural\ndesign choices. Additionally, we prove the effectiveness of our approach under\nchallenging conditions with limited training data, demonstrating significant\nimprovements in both accuracy and robustness, bridging the gap between CNNs and\nViTs in the small-scale dataset domain.\n","authors":["Sven Oehri","Nikolas Ebert","Ahmed Abdullah","Didier Stricker","Oliver Wasenmüller"],"pdf_url":"https://arxiv.org/pdf/2408.14131v1.pdf","comment":"This paper has been accepted at International Conference on Pattern\n Recognition (ICPR), 2023"},{"id":"http://arxiv.org/abs/2406.02978v2","updated":"2024-08-26T09:23:44Z","published":"2024-06-05T06:21:54Z","title":"Self-Supervised Skeleton-Based Action Representation Learning: A\n Benchmark and Beyond","summary":" Self-supervised learning (SSL), which aims to learn meaningful prior\nrepresentations from unlabeled data, has been proven effective for\nskeleton-based action understanding. Different from the image domain, skeleton\ndata possesses sparser spatial structures and diverse representation forms,\nwith the absence of background clues and the additional temporal dimension,\npresenting new challenges for spatial-temporal motion pretext task design.\nRecently, many endeavors have been made for skeleton-based SSL, achieving\nremarkable progress. However, a systematic and thorough review is still\nlacking. In this paper, we conduct, for the first time, a comprehensive survey\non self-supervised skeleton-based action representation learning. Following the\ntaxonomy of context-based, generative learning, and contrastive learning\napproaches, we make a thorough review and benchmark of existing works and shed\nlight on the future possible directions. Remarkably, our investigation\ndemonstrates that most SSL works rely on the single paradigm, learning\nrepresentations of a single level, and are evaluated on the action recognition\ntask solely, which leaves the generalization power of skeleton SSL models\nunder-explored. To this end, a novel and effective SSL method for skeleton is\nfurther proposed, which integrates versatile representation learning objectives\nof different granularity, substantially boosting the generalization capacity\nfor multiple skeleton downstream tasks. 
Extensive experiments under three\nlarge-scale datasets demonstrate our method achieves superior generalization\nperformance on various downstream tasks, including recognition, retrieval,\ndetection, and few-shot learning.\n","authors":["Jiahang Zhang","Lilang Lin","Shuai Yang","Jiaying Liu"],"pdf_url":"https://arxiv.org/pdf/2406.02978v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.15318v2","updated":"2024-08-26T09:21:35Z","published":"2024-04-03T13:33:07Z","title":"VASARI-auto: equitable, efficient, and economical featurisation of\n glioma MRI","summary":" The VASARI MRI feature set is a quantitative system designed to standardise\nglioma imaging descriptions. Though effective, deriving VASARI is\ntime-consuming and seldom used in clinical practice. This is a problem that\nmachine learning could plausibly automate. Using glioma data from 1172\npatients, we developed VASARI-auto, an automated labelling software applied to\nboth open-source lesion masks and our openly available tumour segmentation\nmodel. In parallel, two consultant neuroradiologists independently quantified\nVASARI features in a subsample of 100 glioblastoma cases. We quantified: 1)\nagreement across neuroradiologists and VASARI-auto; 2) calibration of\nperformance equity; 3) an economic workforce analysis; and 4) fidelity in\npredicting patient survival. Tumour segmentation was compatible with the\ncurrent state of the art and equally performant regardless of age or sex. A\nmodest inter-rater variability between in-house neuroradiologists was\ncomparable to between neuroradiologists and VASARI-auto, with far higher\nagreement between VASARI-auto methods. The time taken for neuroradiologists to\nderive VASARI was substantially higher than VASARI-auto (mean time per case 317\nvs. 3 seconds). A UK hospital workforce analysis forecast that three years of\nVASARI featurisation would demand 29,777 consultant neuroradiologist workforce\nhours ({\\pounds}1,574,935), reducible to 332 hours of computing time (and\n{\\pounds}146 of power) with VASARI-auto. The best-performing survival model\nutilised VASARI-auto features as opposed to those derived by neuroradiologists.\nVASARI-auto is a highly efficient automated labelling system with equitable\nperformance across patient age or sex, a favourable economic profile if used as\na decision support tool, and with non-inferior fidelity in downstream patient\nsurvival prediction. Future work should iterate upon and integrate such tools\nto enhance patient care.\n","authors":["James K Ruffle","Samia Mohinta","Kelly Pegoretti Baruteau","Rebekah Rajiah","Faith Lee","Sebastian Brandner","Parashkev Nachev","Harpreet Hyare"],"pdf_url":"https://arxiv.org/pdf/2404.15318v2.pdf","comment":"36 pages, 8 figures, 2 tables"},{"id":"http://arxiv.org/abs/2401.07729v2","updated":"2024-08-26T09:16:57Z","published":"2024-01-15T14:43:40Z","title":"SSL-Interactions: Pretext Tasks for Interactive Trajectory Prediction","summary":" This paper addresses motion forecasting in multi-agent environments, pivotal\nfor ensuring safety of autonomous vehicles. Traditional as well as recent\ndata-driven marginal trajectory prediction methods struggle to properly learn\nnon-linear agent-to-agent interactions. We present SSL-Interactions that\nproposes pretext tasks to enhance interaction modeling for trajectory\nprediction. 
We introduce four interaction-aware pretext tasks to encapsulate\nvarious aspects of agent interactions: range gap prediction, closest distance\nprediction, direction of movement prediction, and type of interaction\nprediction. We further propose an approach to curate interaction-heavy\nscenarios from datasets. This curated data has two advantages: it provides a\nstronger learning signal to the interaction model, and facilitates generation\nof pseudo-labels for interaction-centric pretext tasks. We also propose three\nnew metrics specifically designed to evaluate predictions in interactive\nscenes. Our empirical evaluations indicate SSL-Interactions outperforms\nstate-of-the-art motion forecasting methods quantitatively with up to 8%\nimprovement, and qualitatively, for interaction-heavy scenarios.\n","authors":["Prarthana Bhattacharyya","Chengjie Huang","Krzysztof Czarnecki"],"pdf_url":"https://arxiv.org/pdf/2401.07729v2.pdf","comment":"Accepted at IV-2024. 13 pages, 5 figures"},{"id":"http://arxiv.org/abs/2407.05206v4","updated":"2024-08-26T09:15:11Z","published":"2024-07-06T23:16:41Z","title":"Helios: An extremely low power event-based gesture recognition for\n always-on smart eyewear","summary":" This paper introduces Helios, the first extremely low-power, real-time,\nevent-based hand gesture recognition system designed for all-day use on smart\neyewear. As augmented reality (AR) evolves, current smart glasses like the Meta\nRay-Bans prioritize visual and wearable comfort at the expense of\nfunctionality. Existing human-machine interfaces (HMIs) in these devices, such\nas capacitive touch and voice controls, present limitations in ergonomics,\nprivacy and power consumption. Helios addresses these challenges by leveraging\nnatural hand interactions for a more intuitive and comfortable user experience.\nOur system utilizes an extremely low-power and compact 3mmx4mm/20mW event camera\nto perform natural hand-based gesture recognition for always-on smart eyewear.\nThe camera's output is processed by a convolutional neural network (CNN)\nrunning on an NXP Nano UltraLite compute platform, consuming less than 350mW.\nHelios can recognize seven classes of gestures, including subtle microgestures\nlike swipes and pinches, with 91% accuracy. We also demonstrate real-time\nperformance across 20 users at a remarkably low latency of 60ms. Our user\ntesting results align with the positive feedback we received during our recent\nsuccessful demo at AWE-USA-2024.\n","authors":["Prarthana Bhattacharyya","Joshua Mitton","Ryan Page","Owen Morgan","Ben Menzies","Gabriel Homewood","Kemi Jacobs","Paolo Baesso","David Trickett","Chris Mair","Taru Muhonen","Rory Clark","Louis Berridge","Richard Vigars","Iain Wallace"],"pdf_url":"https://arxiv.org/pdf/2407.05206v4.pdf","comment":"Accepted at ECCV-Integrating Computer Vision in Smart Eyewear, 2024.\n 18 pages, 10 figures. First three authors contributed equally to this paper"},{"id":"http://arxiv.org/abs/2211.07546v2","updated":"2024-08-26T09:01:23Z","published":"2022-11-14T17:11:15Z","title":"Vision meets algae: A novel way for microalgae recognization and health\n monitor","summary":" Marine microalgae are widespread in the ocean and play a crucial role in the\necosystem. Automatic identification and location of marine microalgae in\nmicroscopy images would help establish marine ecological environment monitoring\nand water quality evaluation systems.
We proposed a new dataset for the\ndetection of marine microalgae and a range of detection methods; the dataset\nincludes images of different genera of algae and the same genus in different\nstates. We set the number of unbalanced classes in the dataset and added\nimages of mixed water samples in the test set to simulate the actual situation\nin the field. Then we trained, validated and tested TOOD, YOLOv5, YOLOv8\nand variants of RCNN algorithms on this dataset. The results showed that both\none-stage and two-stage object detection models can achieve high mean average\nprecision, which proves the ability of computer vision in multi-object\ndetection of microalgae, and provides basic data and models for real-time\ndetection of microalgal cells.\n","authors":["Shizheng Zhou","Juntao Jiang","Xiaohan Hong","Yan Hong","Pengcheng Fu"],"pdf_url":"https://arxiv.org/pdf/2211.07546v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14114v1","updated":"2024-08-26T08:59:22Z","published":"2024-08-26T08:59:22Z","title":"ShapeMamba-EM: Fine-Tuning Foundation Model with Local Shape Descriptors\n and Mamba Blocks for 3D EM Image Segmentation","summary":" Electron microscopy (EM) imaging offers unparalleled resolution for analyzing\nneural tissues, crucial for uncovering the intricacies of synaptic connections\nand neural processes fundamental to understanding behavioral mechanisms.\nRecently, foundation models have demonstrated impressive performance across\nnumerous natural and medical image segmentation tasks. However, applying these\nfoundation models to EM segmentation faces significant challenges due to domain\ndisparities. This paper presents ShapeMamba-EM, a specialized fine-tuning\nmethod for 3D EM segmentation, which employs adapters for long-range dependency\nmodeling and an encoder for local shape description within the original\nfoundation model. This approach effectively addresses the unique volumetric and\nmorphological complexities of EM data. Tested over a wide range of EM images,\ncovering five segmentation tasks and 10 datasets, ShapeMamba-EM outperforms\nexisting methods, establishing a new standard in EM image segmentation and\nenhancing the understanding of neural tissue architecture.\n","authors":["Ruohua Shi","Qiufan Pang","Lei Ma","Lingyu Duan","Tiejun Huang","Tingting Jiang"],"pdf_url":"https://arxiv.org/pdf/2408.14114v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14111v1","updated":"2024-08-26T08:55:16Z","published":"2024-08-26T08:55:16Z","title":"Bengali Sign Language Recognition through Hand Pose Estimation using\n Multi-Branch Spatial-Temporal Attention Model","summary":" Hand gesture-based sign language recognition (SLR) is one of the most\nadvanced applications of machine learning and computer vision using hand\ngestures. Although, in the past few years, many researchers have widely\nexplored and studied how to address BSL problems, specific unaddressed issues\nremain, such as skeleton and transformer-based BSL recognition. In addition,\nthe lack of evaluation of BSL models under various concealed environmental\nconditions leaves the generalisation of existing models to daily-life signs\nunproven. As a consequence, existing BSL recognition systems provide a\nlimited perspective of their generalisation ability as they are tested on\ndatasets containing few BSL alphabets that have a wide disparity in gestures\nand are easy to differentiate.
To overcome these limitations, we propose a\nspatial-temporal attention-based BSL recognition model considering hand joint\nskeletons extracted from the sequence of images. The main aim of utilising hand\nskeleton-based BSL data is to ensure the privacy and low-resolution sequence of\nimages, which need minimum computational cost and low hardware configurations.\nOur model captures discriminative structural displacements and short-range\ndependency based on unified joint features projected onto high-dimensional\nfeature space. Specifically, the use of Separable TCN combined with a powerful\nmulti-head spatial-temporal attention architecture generated high-performance\naccuracy. The extensive experiments with a proposed dataset and two benchmark\nBSL datasets with a wide range of evaluations, such as intra- and inter-dataset\nevaluation settings, demonstrated that our proposed models achieve competitive\nperformance with extremely low computational complexity and run faster than\nexisting models.\n","authors":["Abu Saleh Musa Miah","Md. Al Mehedi Hasan","Md Hadiuzzaman","Muhammad Nazrul Islam","Jungpil Shin"],"pdf_url":"https://arxiv.org/pdf/2408.14111v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.11311v2","updated":"2024-08-26T08:47:00Z","published":"2024-06-17T08:18:41Z","title":"Syn-to-Real Unsupervised Domain Adaptation for Indoor 3D Object\n Detection","summary":" The use of synthetic data in indoor 3D object detection offers the potential\nof greatly reducing the manual labor involved in 3D annotations and training\neffective zero-shot detectors. However, the complicated domain shifts across\nsyn-to-real indoor datasets remains underexplored. In this paper, we propose a\nnovel Object-wise Hierarchical Domain Alignment (OHDA) framework for\nsyn-to-real unsupervised domain adaptation in indoor 3D object detection. Our\napproach includes an object-aware augmentation strategy to effectively\ndiversify the source domain data, and we introduce a two-branch adaptation\nframework consisting of an adversarial training branch and a pseudo labeling\nbranch, in order to simultaneously reach holistic-level and class-level domain\nalignment. The pseudo labeling is further refined through two proposed schemes\nspecifically designed for indoor UDA. Our adaptation results from synthetic\ndataset 3D-FRONT to real-world datasets ScanNetV2 and SUN RGB-D demonstrate\nremarkable mAP25 improvements of 9.7% and 9.1% over Source-Only baselines,\nrespectively, and consistently outperform the methods adapted from 2D and 3D\noutdoor scenarios. The code will be publicly available upon paper acceptance.\n","authors":["Yunsong Wang","Na Zhao","Gim Hee Lee"],"pdf_url":"https://arxiv.org/pdf/2406.11311v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14087v1","updated":"2024-08-26T08:16:58Z","published":"2024-08-26T08:16:58Z","title":"LSM-YOLO: A Compact and Effective ROI Detector for Medical Detection","summary":" In existing medical Region of Interest (ROI) detection, there lacks an\nalgorithm that can simultaneously satisfy both real-time performance and\naccuracy, not meeting the growing demand for automatic detection in medicine.\nAlthough the basic YOLO framework ensures real-time detection due to its fast\nspeed, it still faces challenges in maintaining precision concurrently. To\nalleviate the above problems, we propose a novel model named Lightweight Shunt\nMatching-YOLO (LSM-YOLO), with Lightweight Adaptive Extraction (LAE) and\nMultipath Shunt Feature Matching (MSFM). 
Firstly, by using LAE to refine\nfeature extraction, the model can obtain more contextual information and\nhigh-resolution details from multiscale feature maps, thereby extracting\ndetailed features of ROI in medical images while reducing the influence of\nnoise. Secondly, MSFM is utilized to further refine the fusion of high-level\nsemantic features and low-level visual features, enabling better fusion between\nROI features and neighboring features, thereby improving the detection rate for\nbetter diagnostic assistance. Experimental results demonstrate that LSM-YOLO\nachieves 48.6% AP on a private dataset of pancreatic tumors, 65.1% AP on the\nBCCD blood cell detection public dataset, and 73.0% AP on the Br35h brain tumor\ndetection public dataset. Our model achieves state-of-the-art performance with\nminimal parameter cost on the above three datasets. The source codes are at:\nhttps://github.com/VincentYuuuuuu/LSM-YOLO.\n","authors":["Zhongwen Yu","Qiu Guan","Jianmin Yang","Zhiqiang Yang","Qianwei Zhou","Yang Chen","Feng Chen"],"pdf_url":"https://arxiv.org/pdf/2408.14087v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14084v1","updated":"2024-08-26T08:11:35Z","published":"2024-08-26T08:11:35Z","title":"HABD: a houma alliance book ancient handwritten character recognition\n database","summary":" The Houma Alliance Book, one of history's earliest calligraphic examples, was\nunearthed in the 1970s. These artifacts were meticulously organized,\nreproduced, and copied by the Shanxi Provincial Institute of Cultural Relics.\nHowever, because of their ancient origins and severe ink erosion, identifying\ncharacters in the Houma Alliance Book is challenging, necessitating the use of\ndigital technology. In this paper, we propose a new ancient handwritten\ncharacter recognition database for the Houma alliance book, along with a novel\nbenchmark based on deep learning architectures. More specifically, a collection\nof 26,732 characters samples from the Houma Alliance Book were gathered,\nencompassing 327 different types of ancient characters through iterative\nannotation. Furthermore, benchmark algorithms were proposed by combining four\ndeep neural network classifiers with two data augmentation methods. This\nresearch provides valuable resources and technical support for further studies\non the Houma Alliance Book and other ancient characters. This contributes to\nour understanding of ancient culture and history, as well as the preservation\nand inheritance of humanity's cultural heritage.\n","authors":["Xiaoyu Yuan","Xiaohua Huang","Zibo Zhang","Yabo Sun"],"pdf_url":"https://arxiv.org/pdf/2408.14084v1.pdf","comment":"8 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.14080v1","updated":"2024-08-26T08:02:57Z","published":"2024-08-26T08:02:57Z","title":"SONICS: Synthetic Or Not -- Identifying Counterfeit Songs","summary":" The recent surge in AI-generated songs presents exciting possibilities and\nchallenges. While these tools democratize music creation, they also necessitate\nthe ability to distinguish between human-composed and AI-generated songs for\nsafeguarding artistic integrity and content curation. Existing research and\ndatasets in fake song detection only focus on singing voice deepfake detection\n(SVDD), where the vocals are AI-generated but the instrumental music is sourced\nfrom real songs. However, this approach is inadequate for contemporary\nend-to-end AI-generated songs where all components (vocals, lyrics, music, and\nstyle) could be AI-generated. 
Additionally, existing datasets lack lyrics-music\ndiversity, long-duration songs, and open fake songs. To address these gaps, we\nintroduce SONICS, a novel dataset for end-to-end Synthetic Song Detection\n(SSD), comprising over 97k songs with over 49k synthetic songs from popular\nplatforms like Suno and Udio. Furthermore, we highlight the importance of\nmodeling long-range temporal dependencies in songs for effective authenticity\ndetection, an aspect overlooked in existing methods. To capture these patterns,\nwe propose a novel model, SpecTTTra, that is up to 3 times faster and 6 times\nmore memory efficient compared to popular CNN and Transformer-based models\nwhile maintaining competitive performance. Finally, we offer both AI-based and\nhuman evaluation benchmarks, addressing another deficiency in current research.\n","authors":["Md Awsafur Rahman","Zaber Ibn Abdul Hakim","Najibul Haque Sarker","Bishmoy Paul","Shaikh Anowarul Fattah"],"pdf_url":"https://arxiv.org/pdf/2408.14080v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14060v1","updated":"2024-08-26T07:37:59Z","published":"2024-08-26T07:37:59Z","title":"Evaluating the Visual Similarity of Southwest China's Ethnic Minority\n Brocade Based on Deep Learning","summary":" This paper employs deep learning methods to investigate the visual similarity\nof ethnic minority patterns in Southwest China. A customized SResNet-18 network\nwas developed, achieving an accuracy of 98.7% on the test set, outperforming\nResNet-18, VGGNet-16, and AlexNet. The extracted feature vectors from\nSResNet-18 were evaluated using three metrics: cosine similarity, Euclidean\ndistance, and Manhattan distance. The analysis results were visually\nrepresented on an ethnic thematic map, highlighting the connections between\nethnic patterns and their regional distributions.\n","authors":["Shichen Liu","Huaxing Lu"],"pdf_url":"https://arxiv.org/pdf/2408.14060v1.pdf","comment":"8 pages, 2 tables, 5 figures"},{"id":"http://arxiv.org/abs/2405.17137v3","updated":"2024-08-26T07:36:03Z","published":"2024-05-27T12:54:09Z","title":"Jump-teaching: Ultra Efficient and Robust Learning with Noisy Label","summary":" Sample selection is the most straightforward technique to combat label noise,\naiming to distinguish mislabeled samples during training and avoid the\ndegradation of the robustness of the model. In the workflow, $\\textit{selecting\npossibly clean data}$ and $\\textit{model update}$ are iterative. However, their\ninterplay and intrinsic characteristics hinder the robustness and efficiency of\nlearning with noisy labels: 1) The model chooses clean data with selection\nbias, leading to the accumulated error in the model update. 2) Most selection\nstrategies leverage partner networks or supplementary information to mitigate\nlabel corruption, albeit with increased computation resources and lower\nthroughput speed. Therefore, we employ only one network with a jump-manner\nupdate to decouple the interplay and mine more semantic information from the\nloss for a more precise selection. Specifically, the selection of clean data\nfor each model update is based on one of the prior models, excluding the last\niteration. The strategy of model update exhibits a jump behavior in form.\nMoreover, we map the outputs of the network and labels into the same semantic\nfeature space, respectively. In this space, a detailed and simple loss\ndistribution is generated to distinguish clean samples more effectively.
Our\nproposed approach achieves almost up to $2.53\\times$ speedup, $0.46\\times$ peak\nmemory footprint, and superior robustness over state-of-the-art works with\nvarious noise settings.\n","authors":["Kangye Ji","Fei Cheng","Zeqing Wang","Bohu Huang"],"pdf_url":"https://arxiv.org/pdf/2405.17137v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14051v1","updated":"2024-08-26T07:17:05Z","published":"2024-08-26T07:17:05Z","title":"Let Video Teaches You More: Video-to-Image Knowledge Distillation using\n DEtection TRansformer for Medical Video Lesion Detection","summary":" AI-assisted lesion detection models play a crucial role in the early\nscreening of cancer. However, previous image-based models ignore the\ninter-frame contextual information present in videos. On the other hand,\nvideo-based models capture the inter-frame context but are computationally\nexpensive. To mitigate this contradiction, we delve into Video-to-Image\nknowledge distillation leveraging DEtection TRansformer (V2I-DETR) for the task\nof medical video lesion detection. V2I-DETR adopts a teacher-student network\nparadigm. The teacher network aims at extracting temporal contexts from\nmultiple frames and transferring them to the student network, and the student\nnetwork is an image-based model dedicated to fast prediction in inference. By\ndistilling multi-frame contexts into a single frame, the proposed V2I-DETR\ncombines the advantages of utilizing temporal contexts from video-based models\nand the inference speed of image-based models. Through extensive experiments,\nV2I-DETR outperforms previous state-of-the-art methods by a large margin while\nachieving the real-time inference speed (30 FPS) as the image-based model.\n","authors":["Yuncheng Jiang","Zixun Zhang","Jun Wei","Chun-Mei Feng","Guanbin Li","Xiang Wan","Shuguang Cui","Zhen Li"],"pdf_url":"https://arxiv.org/pdf/2408.14051v1.pdf","comment":"BIBM2024"},{"id":"http://arxiv.org/abs/2305.12236v2","updated":"2024-08-26T07:09:52Z","published":"2023-05-20T17:01:52Z","title":"Searching a Compact Architecture for Robust Multi-Exposure Image Fusion","summary":" In recent years, learning-based methods have achieved significant\nadvancements in multi-exposure image fusion. However, two major stumbling\nblocks hinder the development, including pixel misalignment and inefficient\ninference. Reliance on aligned image pairs in existing methods causes\nsusceptibility to artifacts due to device motion. Additionally, existing\ntechniques often rely on handcrafted architectures with huge network\nengineering, resulting in redundant parameters, adversely impacting inference\nefficiency and flexibility. To mitigate these limitations, this study\nintroduces an architecture search-based paradigm incorporating self-alignment\nand detail repletion modules for robust multi-exposure image fusion.\n Specifically, targeting the extreme discrepancy of exposure, we propose the\nself-alignment module, leveraging scene relighting to constrain the\nillumination degree for following alignment and feature extraction. Detail\nrepletion is proposed to enhance the texture details of scenes. Additionally,\nincorporating a hardware-sensitive constraint, we present the fusion-oriented\narchitecture search to explore compact and efficient networks for fusion. The\nproposed method outperforms various competitive schemes, achieving a noteworthy\n3.19\\% improvement in PSNR for general scenarios and an impressive 23.5\\%\nenhancement in misaligned scenarios. 
Moreover, it significantly reduces\ninference time by 69.1\\%. The code will be available at\nhttps://github.com/LiuZhu-CV/CRMEF.\n","authors":["Zhu Liu","Jinyuan Liu","Guanyao Wu","Zihang Chen","Xin Fan","Risheng Liu"],"pdf_url":"https://arxiv.org/pdf/2305.12236v2.pdf","comment":"14 pages, 11 figures"},{"id":"http://arxiv.org/abs/2408.14047v1","updated":"2024-08-26T07:02:17Z","published":"2024-08-26T07:02:17Z","title":"Alleviating Class Imbalance in Semi-supervised Multi-organ Segmentation\n via Balanced Subclass Regularization","summary":" Semi-supervised learning (SSL) has shown notable potential in relieving the\nheavy demand of dense prediction tasks on large-scale well-annotated datasets,\nespecially for the challenging multi-organ segmentation (MoS). However, the\nprevailing class-imbalance problem in MoS, caused by the substantial variations\nin organ size, exacerbates the learning difficulty of the SSL network. To\nalleviate this issue, we present a two-phase semi-supervised network (BSR-Net)\nwith balanced subclass regularization for MoS. Concretely, in Phase I, we\nintroduce a class-balanced subclass generation strategy based on balanced\nclustering to effectively generate multiple balanced subclasses from original\nbiased ones according to their pixel proportions. Then, in Phase II, we design\nan auxiliary subclass segmentation (SCS) task within the multi-task framework\nof the main MoS task. The SCS task contributes a balanced subclass\nregularization to the main MoS task and transfers unbiased knowledge to the MoS\nnetwork, thus alleviating the influence of the class-imbalance problem.\nExtensive experiments conducted on two publicly available datasets, i.e., the\nMICCAI FLARE 2022 dataset and the WORD dataset, verify the superior performance\nof our method compared with other methods.\n","authors":["Zhenghao Feng","Lu Wen","Binyu Yan","Jiaqi Cui","Yan Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14047v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.06607v4","updated":"2024-08-26T06:57:51Z","published":"2023-11-11T16:37:41Z","title":"Monkey: Image Resolution and Text Label Are Important Things for Large\n Multi-modal Models","summary":" Large Multimodal Models (LMMs) have shown promise in vision-language tasks\nbut struggle with high-resolution input and detailed scene understanding.\nAddressing these challenges, we introduce Monkey to enhance LMM capabilities.\nFirstly, Monkey processes input images by dividing them into uniform patches,\neach matching the size (e.g., 448x448) used in the original training of the\nwell-trained vision encoder. Equipped with individual adapter for each patch,\nMonkey can handle higher resolutions up to 1344x896 pixels, enabling the\ndetailed capture of complex visual information. Secondly, it employs a\nmulti-level description generation method, enriching the context for\nscene-object associations. This two-part strategy ensures more effective\nlearning from generated data: the higher resolution allows for a more detailed\ncapture of visuals, which in turn enhances the effectiveness of comprehensive\ndescriptions. Extensive ablative results validate the effectiveness of our\ndesigns. Additionally, experiments on 18 datasets further demonstrate that\nMonkey surpasses existing LMMs in many tasks like Image Captioning and various\nVisual Question Answering formats. Specially, in qualitative tests focused on\ndense text question answering, Monkey has exhibited encouraging results\ncompared with GPT4V. 
Code is available at\nhttps://github.com/Yuliang-Liu/Monkey.\n","authors":["Zhang Li","Biao Yang","Qiang Liu","Zhiyin Ma","Shuo Zhang","Jingxu Yang","Yabo Sun","Yuliang Liu","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2311.06607v4.pdf","comment":"CVPR 2024 Highlight"},{"id":"http://arxiv.org/abs/2406.10737v2","updated":"2024-08-26T06:37:24Z","published":"2024-06-15T20:47:38Z","title":"Dynamic Domains, Dynamic Solutions: DPCore for Continual Test-Time\n Adaptation","summary":" Continual Test-Time Adaptation (CTTA) seeks to adapt a source pre-trained\nmodel to continually changing, unlabeled target domains. Existing TTA methods\nare typically designed for environments where domain changes occur sequentially\nand can struggle in more dynamic scenarios, as illustrated in Figure\n\\ref{fig:settings}. Inspired by the principles of online K-Means, we introduce\na novel approach to CTTA through visual prompting. We propose a \\emph{Dynamic\nPrompt Coreset} that not only preserves knowledge from previously visited\ndomains but also accommodates learning from new potential domains. This is\ncomplemented by a distance-based \\emph{Weight Updating Mechanism} that ensures\nthe coreset remains current and relevant. Our approach employs a fixed model\narchitecture alongside the coreset and an innovative updating system to\neffectively mitigate challenges such as catastrophic forgetting and error\naccumulation. Extensive testing on four widely-used benchmarks demonstrates\nthat our method consistently outperforms state-of-the-art alternatives in both\nclassification and segmentation CTTA tasks across the structured and dynamic\nCTTA settings, with $99\\%$ fewer trainable parameters.\n","authors":["Yunbei Zhang","Akshay Mehra","Jihun Hamm"],"pdf_url":"https://arxiv.org/pdf/2406.10737v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2108.08158v4","updated":"2024-08-26T06:30:52Z","published":"2021-08-18T14:04:52Z","title":"Practical X-ray Gastric Cancer Screening Using Refined Stochastic Data\n Augmentation and Hard Boundary Box Training","summary":" Endoscopy is widely used to diagnose gastric cancer and has a high diagnostic\nperformance, but it must be performed by a physician, which limits the number\nof people who can be diagnosed. In contrast, gastric X-rays can be performed by\ntechnicians and screen a much larger number of patients, but accurate diagnosis\nrequires experience. We propose an unprecedented and practical gastric cancer\ndiagnosis support system for gastric X-ray images, enabling more people to be\nscreened. The system is based on a general deep learning-based object detection\nmodel and incorporates two novel techniques: refined probabilistic stomach\nimage augmentation (R-sGAIA) and hard boundary box training (HBBT). R-sGAIA\nenhances the probabilistic gastric fold region, providing more learning\npatterns for cancer detection models. HBBT is an efficient training method that\nimproves model performance by allowing the use of unannotated negative (i.e.,\nhealthy control) samples, which are typically unusable in conventional\ndetection models. The proposed system achieves a sensitivity (SE) for gastric\ncancer of 90.2%, higher than that of an expert (85.5%). Additionally, two out\nof five detected candidate boxes are cancerous, maintaining high precision\nwhile processing images at a speed of 0.51 seconds per image. The system also\noutperforms methods using the same object detection model and state-of-the-art\ndata augmentation, showing a 5.9-point improvement in the F1 score. 
In summary,\nthis system efficiently identifies areas for radiologists to examine within a\npractical timeframe, significantly reducing their workload.\n","authors":["Hideaki Okamoto","Quan Huu Cap","Takakiyo Nomura","Kazuhito Nabeshima","Jun Hashimoto","Hitoshi Iyatomi"],"pdf_url":"https://arxiv.org/pdf/2108.08158v4.pdf","comment":"20 pages, 6 figures"},{"id":"http://arxiv.org/abs/2408.14039v1","updated":"2024-08-26T06:22:54Z","published":"2024-08-26T06:22:54Z","title":"Collaborative Perception in Multi-Robot Systems: Case Studies in\n Household Cleaning and Warehouse Operations","summary":" This paper explores the paradigm of Collaborative Perception (CP), where\nmultiple robots and sensors in the environment share and integrate sensor data\nto construct a comprehensive representation of the surroundings. By aggregating\ndata from various sensors and utilizing advanced algorithms, the collaborative\nperception framework improves task efficiency, coverage, and safety. Two case\nstudies are presented to showcase the benefits of collaborative perception in\nmulti-robot systems. The first case study illustrates the benefits and\nadvantages of using CP for the task of household cleaning with a team of\ncleaning robots. The second case study performs a comparative analysis of the\nperformance of CP versus Standalone Perception (SP) for Autonomous Mobile\nRobots operating in a warehouse environment. The case studies validate the\neffectiveness of CP in enhancing multi-robot coordination, task completion, and\noverall system performance and its potential to impact operations in other\napplications as well. Future investigations will focus on optimizing the\nframework and validating its performance through empirical testing.\n","authors":["Bharath Rajiv Nair"],"pdf_url":"https://arxiv.org/pdf/2408.14039v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.02462v2","updated":"2024-08-26T06:12:05Z","published":"2024-04-03T05:04:55Z","title":"A Unified Membership Inference Method for Visual Self-supervised Encoder\n via Part-aware Capability","summary":" Self-supervised learning shows promise in harnessing extensive unlabeled\ndata, but it also confronts significant privacy concerns, especially in vision.\nIn this paper, we aim to perform membership inference on visual self-supervised\nmodels in a more realistic setting: self-supervised training method and details\nare unknown for an adversary when attacking as he usually faces a black-box\nsystem in practice. In this setting, considering that self-supervised model\ncould be trained by completely different self-supervised paradigms, e.g.,\nmasked image modeling and contrastive learning, with complex training details,\nwe propose a unified membership inference method called PartCrop. It is\nmotivated by the shared part-aware capability among models and stronger part\nresponse on the training data. Specifically, PartCrop crops parts of objects in\nan image to query responses with the image in representation space. We conduct\nextensive attacks on self-supervised models with different training protocols\nand structures using three widely used image datasets. The results verify the\neffectiveness and generalization of PartCrop. Moreover, to defend against\nPartCrop, we evaluate two common approaches, i.e., early stop and differential\nprivacy, and propose a tailored method called shrinking crop scale range. The\ndefense experiments indicate that all of them are effective. 
Our code is\navailable at https://github.com/JiePKU/PartCrop.\n","authors":["Jie Zhu","Jirong Zha","Ding Li","Leye Wang"],"pdf_url":"https://arxiv.org/pdf/2404.02462v2.pdf","comment":"Accepted by ACM CCS2024, Full version"},{"id":"http://arxiv.org/abs/2408.14035v1","updated":"2024-08-26T06:01:54Z","published":"2024-08-26T06:01:54Z","title":"FAST-LIVO2: Fast, Direct LiDAR-Inertial-Visual Odometry","summary":" This paper proposes FAST-LIVO2: a fast, direct LiDAR-inertial-visual odometry\nframework to achieve accurate and robust state estimation in SLAM tasks and\nprovide great potential in real-time, onboard robotic applications. FAST-LIVO2\nfuses the IMU, LiDAR and image measurements efficiently through an ESIKF. To\naddress the dimension mismatch between the heterogeneous LiDAR and image\nmeasurements, we use a sequential update strategy in the Kalman filter. To\nenhance the efficiency, we use direct methods for both the visual and LiDAR\nfusion, where the LiDAR module registers raw points without extracting edge or\nplane features and the visual module minimizes direct photometric errors\nwithout extracting ORB or FAST corner features. The fusion of both visual and\nLiDAR measurements is based on a single unified voxel map where the LiDAR\nmodule constructs the geometric structure for registering new LiDAR scans and\nthe visual module attaches image patches to the LiDAR points. To enhance the\naccuracy of image alignment, we use plane priors from the LiDAR points in the\nvoxel map (and even refine the plane prior) and update the reference patch\ndynamically after new images are aligned. Furthermore, to enhance the\nrobustness of image alignment, FAST-LIVO2 employs an on-demanding raycast\noperation and estimates the image exposure time in real time. Lastly, we detail\nthree applications of FAST-LIVO2: UAV onboard navigation demonstrating the\nsystem's computation efficiency for real-time onboard navigation, airborne\nmapping showcasing the system's mapping accuracy, and 3D model rendering\n(mesh-based and NeRF-based) underscoring the suitability of our reconstructed\ndense map for subsequent rendering tasks. We open source our code, dataset and\napplication on GitHub to benefit the robotics community.\n","authors":["Chunran Zheng","Wei Xu","Zuhao Zou","Tong Hua","Chongjian Yuan","Dongjiao He","Bingyang Zhou","Zheng Liu","Jiarong Lin","Fangcheng Zhu","Yunfan Ren","Rong Wang","Fanle Meng","Fu Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.14035v1.pdf","comment":"30 pages, 31 figures, due to the limitation that 'The abstract field\n cannot exceed 1,920 characters', the abstract presented here is shorter than\n the one in the PDF file"},{"id":"http://arxiv.org/abs/2403.18878v2","updated":"2024-08-26T05:54:21Z","published":"2024-03-27T10:46:24Z","title":"Teaching AI the Anatomy Behind the Scan: Addressing Anatomical Flaws in\n Medical Image Segmentation with Learnable Prior","summary":" Imposing key anatomical features, such as the number of organs, their shapes\nand relative positions, is crucial for building a robust multi-organ\nsegmentation model. Current attempts to incorporate anatomical features include\nbroadening the effective receptive field (ERF) size with data-intensive\nmodules, or introducing anatomical constraints that scales poorly to\nmulti-organ segmentation. We introduce a novel architecture called the\nAnatomy-Informed Cascaded Segmentation Network (AIC-Net). 
AIC-Net incorporates\na learnable input termed \"Anatomical Prior\", which can be adapted to\npatient-specific anatomy using a differentiable spatial deformation. The\ndeformed prior later guides decoder layers towards more anatomy-informed\npredictions. We repeat this process at a local patch level to enhance the\nrepresentation of intricate objects, resulting in a cascaded network structure.\nAIC-Net is a general method that enhances any existing segmentation models to\nbe more anatomy-aware. We have validated the performance of AIC-Net, with\nvarious backbones, on two multi-organ segmentation tasks: abdominal organs and\nvertebrae. For each respective task, our benchmarks demonstrate improved dice\nscore and Hausdorff distance.\n","authors":["Young Seok Jeon","Hongfei Yang","Huazhu Fu","Mengling Feng"],"pdf_url":"https://arxiv.org/pdf/2403.18878v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14032v1","updated":"2024-08-26T05:52:35Z","published":"2024-08-26T05:52:35Z","title":"More Pictures Say More: Visual Intersection Network for Open Set Object\n Detection","summary":" Open Set Object Detection has seen rapid development recently, but it\ncontinues to pose significant challenges. Language-based methods, grappling\nwith the substantial modal disparity between textual and visual modalities,\nrequire extensive computational resources to bridge this gap. Although\nintegrating visual prompts into these frameworks shows promise for enhancing\nperformance, it always comes with constraints related to textual semantics. In\ncontrast, viusal-only methods suffer from the low-quality fusion of multiple\nvisual prompts. In response, we introduce a strong DETR-based model, Visual\nIntersection Network for Open Set Object Detection (VINO), which constructs a\nmulti-image visual bank to preserve the semantic intersections of each category\nacross all time steps. Our innovative multi-image visual updating mechanism\nlearns to identify the semantic intersections from various visual prompts,\nenabling the flexible incorporation of new information and continuous\noptimization of feature representations. Our approach guarantees a more precise\nalignment between target category semantics and region semantics, while\nsignificantly reducing pre-training time and resource demands compared to\nlanguage-based methods. Furthermore, the integration of a segmentation head\nillustrates the broad applicability of visual intersection in various visual\ntasks. VINO, which requires only 7 RTX4090 GPU days to complete one epoch on\nthe Objects365v1 dataset, achieves competitive performance on par with\nvision-language models on benchmarks such as LVIS and ODinW35.\n","authors":["Bingcheng Dong","Yuning Ding","Jinrong Zhang","Sifan Zhang","Shenglan Liu"],"pdf_url":"https://arxiv.org/pdf/2408.14032v1.pdf","comment":"7pages"},{"id":"http://arxiv.org/abs/2408.14028v1","updated":"2024-08-26T05:38:27Z","published":"2024-08-26T05:38:27Z","title":"SurGen: Text-Guided Diffusion Model for Surgical Video Generation","summary":" Diffusion-based video generation models have made significant strides,\nproducing outputs with improved visual fidelity, temporal coherence, and user\ncontrol. 
These advancements hold great promise for improving surgical education\nby enabling more realistic, diverse, and interactive simulation environments.\nIn this study, we introduce SurGen, a text-guided diffusion model tailored for\nsurgical video synthesis, producing the highest resolution and longest duration\nvideos among existing surgical video generation models. We validate the visual\nand temporal quality of the outputs using standard image and video generation\nmetrics. Additionally, we assess their alignment to the corresponding text\nprompts through a deep learning classifier trained on surgical data. Our\nresults demonstrate the potential of diffusion models to serve as valuable\neducational tools for surgical trainees.\n","authors":["Joseph Cho","Samuel Schmidgall","Cyril Zakka","Mrudang Mathur","Rohan Shad","William Hiesinger"],"pdf_url":"https://arxiv.org/pdf/2408.14028v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14023v1","updated":"2024-08-26T05:27:14Z","published":"2024-08-26T05:27:14Z","title":"Video-CCAM: Enhancing Video-Language Understanding with Causal\n Cross-Attention Masks for Short and Long Videos","summary":" Multi-modal large language models (MLLMs) have demonstrated considerable\npotential across various downstream tasks that require cross-domain knowledge.\nMLLMs capable of processing videos, known as Video-MLLMs, have attracted broad\ninterest in video-language understanding. However, videos, especially long\nvideos, contain more visual tokens than images, making them difficult for LLMs\nto process. Existing works either downsample visual features or extend the LLM\ncontext size, risking the loss of high-resolution information or slowing down\ninference speed. To address these limitations, we apply cross-attention layers\nin the intermediate projector between the visual encoder and the large language\nmodel (LLM). As the naive cross-attention mechanism is insensitive to temporal\norder, we further introduce causal cross-attention masks (CCAMs) within the\ncross-attention layers. This Video-MLLM, named Video-CCAM, is trained in a\nstraightforward two-stage fashion: feature alignment and visual instruction\ntuning. We develop several Video-CCAM models based on LLMs of different sizes\n(4B, 9B, and 14B). Video-CCAM proves to be a robust Video-MLLM and shows\noutstanding performance from short videos to long ones. Among standard video\nbenchmarks like MVBench and VideoChatGPT-QA, Video-CCAM shows outstanding\nperformances (1st/2nd/3rd in MVBench and TGIF-QA, 2nd/3rd/4th in MSVD-QA,\nMSRVTT-QA, and ActivityNet-QA). In benchmarks encompassing long videos,\nVideo-CCAM models can be directly adapted to long video understanding and still\nachieve exceptional scores despite being trained solely with images and\n16-frame videos. Using 96 frames (6$\\times$ the training number of frames),\nVideo-CCAM models rank 1st/2nd/3rd in VideoVista and 1st/2nd/4th in MLVU among\nall open-source Video-MLLMs, respectively. 
The code is publicly available in\n\\url{https://github.com/QQ-MM/Video-CCAM}.\n","authors":["Jiajun Fei","Dian Li","Zhidong Deng","Zekun Wang","Gang Liu","Hui Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14023v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.11545v2","updated":"2024-08-26T05:21:35Z","published":"2024-08-21T11:53:53Z","title":"UNetMamba: An Efficient UNet-Like Mamba for Semantic Segmentation of\n High-Resolution Remote Sensing Images","summary":" Semantic segmentation of high-resolution remote sensing images is vital in\ndownstream applications such as land-cover mapping, urban planning and disaster\nassessment.Existing Transformer-based methods suffer from the constraint\nbetween accuracy and efficiency, while the recently proposed Mamba is renowned\nfor being efficient. Therefore, to overcome the dilemma, we propose UNetMamba,\na UNet-like semantic segmentation model based on Mamba. It incorporates a mamba\nsegmentation decoder (MSD) that can efficiently decode the complex information\nwithin high-resolution images, and a local supervision module (LSM), which is\ntrain-only but can significantly enhance the perception of local contents.\nExtensive experiments demonstrate that UNetMamba outperforms the\nstate-of-the-art methods with mIoU increased by 0.87% on LoveDA and 0.36% on\nISPRS Vaihingen, while achieving high efficiency through the lightweight\ndesign, less memory footprint and reduced computational cost. The source code\nis available at https://github.com/EnzeZhu2001/UNetMamba.\n","authors":["Enze Zhu","Zhan Chen","Dingkai Wang","Hanru Shi","Xiaoxuan Liu","Lei Wang"],"pdf_url":"https://arxiv.org/pdf/2408.11545v2.pdf","comment":"5 pages, 3 figures"},{"id":"http://arxiv.org/abs/2405.14213v2","updated":"2024-08-26T04:59:05Z","published":"2024-05-23T06:17:23Z","title":"From Text to Pixel: Advancing Long-Context Understanding in MLLMs","summary":" The rapid progress in Multimodal Large Language Models (MLLMs) has\nsignificantly advanced their ability to process and understand complex visual\nand textual information. However, the integration of multiple images and\nextensive textual contexts remains a challenge due to the inherent limitation\nof the models' capacity to handle long input sequences efficiently. In this\npaper, we introduce SEEKER, a multimodal large language model designed to\ntackle this issue. SEEKER aims to optimize the compact encoding of long text by\ncompressing the text sequence into the visual pixel space via images, enabling\nthe model to handle long text within a fixed token-length budget efficiently.\nOur empirical experiments on six long-context multimodal tasks demonstrate that\nSEEKER can leverage fewer image tokens to convey the same amount of textual\ninformation compared with the OCR-based approach, and is more efficient in\nunderstanding long-form multimodal input and generating long-form textual\noutput, outperforming all existing proprietary and open-source MLLMs by large\nmargins.\n","authors":["Yujie Lu","Xiujun Li","Tsu-Jui Fu","Miguel Eckstein","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2405.14213v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14016v1","updated":"2024-08-26T04:56:41Z","published":"2024-08-26T04:56:41Z","title":"Pixel-Aligned Multi-View Generation with Depth Guided Decoder","summary":" The task of image-to-multi-view generation refers to generating novel views\nof an instance from a single image. 
Recent methods achieve this by extending\ntext-to-image latent diffusion models to multi-view version, which contains an\nVAE image encoder and a U-Net diffusion model. Specifically, these generation\nmethods usually fix VAE and finetune the U-Net only. However, the significant\ndownscaling of the latent vectors computed from the input images and\nindependent decoding leads to notable pixel-level misalignment across multiple\nviews. To address this, we propose a novel method for pixel-level\nimage-to-multi-view generation. Unlike prior work, we incorporate attention\nlayers across multi-view images in the VAE decoder of a latent video diffusion\nmodel. Specifically, we introduce a depth-truncated epipolar attention,\nenabling the model to focus on spatially adjacent regions while remaining\nmemory efficient. Applying depth-truncated attn is challenging during inference\nas the ground-truth depth is usually difficult to obtain and pre-trained depth\nestimation models is hard to provide accurate depth. Thus, to enhance the\ngeneralization to inaccurate depth when ground truth depth is missing, we\nperturb depth inputs during training. During inference, we employ a rapid\nmulti-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the\ndepth-truncated epipolar attention. Our model enables better pixel alignment\nacross multi-view images. Moreover, we demonstrate the efficacy of our approach\nin improving downstream multi-view to 3D reconstruction tasks.\n","authors":["Zhenggang Tang","Peiye Zhuang","Chaoyang Wang","Aliaksandr Siarohin","Yash Kant","Alexander Schwing","Sergey Tulyakov","Hsin-Ying Lee"],"pdf_url":"https://arxiv.org/pdf/2408.14016v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14013v1","updated":"2024-08-26T04:36:10Z","published":"2024-08-26T04:36:10Z","title":"A Multiscale Gradient Fusion Method for Edge Detection in Color Images\n Utilizing the CBM3D Filter","summary":" In this paper, a color edge detection strategy based on collaborative\nfiltering combined with multiscale gradient fusion is proposed. The\nblock-matching and 3D (BM3D) filter are used to enhance the sparse\nrepresentation in the transform domain and achieve the effect of denoising,\nwhereas the multiscale gradient fusion makes up for the defect of loss of\ndetails in single-scale edge detection and improves the edge detection\nresolution and quality. First, the RGB images in the dataset are converted to\nXYZ color space images through mathematical operations. Second, the colored\nblock-matching and 3D (CBM3D) filter are used on the sparse images and to\nremove noise interference. Then, the vector gradients of the color image and\nthe anisotropic Gaussian directional derivative of the two scale parameters are\ncalculated and averaged pixel-by-pixel to obtain a new edge strength map.\nFinally, the edge features are enhanced by image normalization and non-maximum\nsuppression technology, and on that basis, the edge contour is obtained by\ndouble threshold selection and a new morphological refinement method. 
Through\nan experimental analysis of the edge detection dataset, the method proposed has\ngood noise robustness and high edge quality, which is better than the Color\nSobel, Color Canny, SE and Color AGDD as shown by the PR curve, AUC, PSNR, MSE,\nand FOM indicators.\n","authors":["Zhuoyue Wang","Yiyi Tao","Danqing Ma"],"pdf_url":"https://arxiv.org/pdf/2408.14013v1.pdf","comment":"1 figure, 2 tables"},{"id":"http://arxiv.org/abs/2408.14008v1","updated":"2024-08-26T04:29:52Z","published":"2024-08-26T04:29:52Z","title":"LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models","summary":" The explosive growth of videos on streaming media platforms has underscored\nthe urgent need for effective video quality assessment (VQA) algorithms to\nmonitor and perceptually optimize the quality of streaming videos. However, VQA\nremains an extremely challenging task due to the diverse video content and the\ncomplex spatial and temporal distortions, thus necessitating more advanced\nmethods to address these issues. Nowadays, large multimodal models (LMMs), such\nas GPT-4V, have exhibited strong capabilities for various visual understanding\ntasks, motivating us to leverage the powerful multimodal representation ability\nof LMMs to solve the VQA task. Therefore, we propose the first Large\nMulti-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel\nspatiotemporal visual modeling strategy for quality-aware feature extraction.\nSpecifically, we first reformulate the quality regression problem into a\nquestion and answering (Q&A) task and construct Q&A prompts for VQA instruction\ntuning. Then, we design a spatiotemporal vision encoder to extract spatial and\ntemporal features to represent the quality characteristics of videos, which are\nsubsequently mapped into the language space by the spatiotemporal projector for\nmodality alignment. Finally, the aligned visual tokens and the quality-inquired\ntext tokens are aggregated as inputs for the large language model (LLM) to\ngenerate the quality score and level. Extensive experiments demonstrate that\nLMM-VQA achieves state-of-the-art performance across five VQA benchmarks,\nexhibiting an average improvement of $5\\%$ in generalization ability over\nexisting methods. Furthermore, due to the advanced design of the spatiotemporal\nencoder and projector, LMM-VQA also performs exceptionally well on general\nvideo understanding tasks, further validating its effectiveness. Our code will\nbe released at https://github.com/Sueqk/LMM-VQA.\n","authors":["Qihang Ge","Wei Sun","Yu Zhang","Yunhao Li","Zhongpeng Ji","Fengyu Sun","Shangling Jui","Xiongkuo Min","Guangtao Zhai"],"pdf_url":"https://arxiv.org/pdf/2408.14008v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12321v2","updated":"2024-08-26T04:27:54Z","published":"2024-08-22T11:57:16Z","title":"MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework\n for Multimodal Large Language Model","summary":" This paper presents MaVEn, an innovative Multi-granularity Visual Encoding\nframework designed to enhance the capabilities of Multimodal Large Language\nModels (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on\nsingle-image visual understanding, limiting their ability to interpret and\nintegrate information across multiple images. MaVEn addresses this limitation\nby combining discrete visual symbol sequences, which abstract coarse-grained\nsemantic concepts, with traditional continuous representation sequences that\nmodel fine-grained features. 
This dual approach bridges the semantic gap\nbetween visual and textual data, thereby improving the model's ability to\nprocess and interpret information from multiple images effectively.\nAdditionally, we design a dynamic reduction mechanism by for long-sequence\ncontinuous features to enhance multi-image processing efficiency. Experimental\nresults demonstrate that MaVEn significantly enhances MLLMs' understanding in\ncomplex multi-image scenarios, while also improving performance in single-image\ncontexts.\n","authors":["Chaoya Jiang","Jia Hongrui","Haiyang Xu","Wei Ye","Mengfan Dong","Ming Yan","Ji Zhang","Fei Huang","Shikun Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.12321v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.08549v2","updated":"2024-08-26T03:59:17Z","published":"2024-04-12T15:45:26Z","title":"Practical Guidelines for Cell Segmentation Models Under Optical\n Aberrations in Microscopy","summary":" Cell segmentation is essential in biomedical research for analyzing cellular\nmorphology and behavior. Deep learning methods, particularly convolutional\nneural networks (CNNs), have revolutionized cell segmentation by extracting\nintricate features from images. However, the robustness of these methods under\nmicroscope optical aberrations remains a critical challenge. This study\nevaluates cell image segmentation models under optical aberrations from\nfluorescence and bright field microscopy. By simulating different types of\naberrations, including astigmatism, coma, spherical aberration, trefoil, and\nmixed aberrations, we conduct a thorough evaluation of various cell instance\nsegmentation models using the DynamicNuclearNet (DNN) and LIVECell datasets,\nrepresenting fluorescence and bright field microscopy cell datasets,\nrespectively. We train and test several segmentation models, including the Otsu\nthreshold method and Mask R-CNN with different network heads (FPN, C3) and\nbackbones (ResNet, VGG, Swin Transformer), under aberrated conditions.\nAdditionally, we provide usage recommendations for the Cellpose 2.0 Toolbox on\ncomplex cell degradation images. The results indicate that the combination of\nFPN and SwinS demonstrates superior robustness in handling simple cell images\naffected by minor aberrations. In contrast, Cellpose 2.0 proves effective for\ncomplex cell images under similar conditions. Furthermore, we innovatively\npropose the Point Spread Function Image Label Classification Model (PLCM). This\nmodel can quickly and accurately identify aberration types and amplitudes from\nPSF images, assisting researchers without optical training. Through PLCM,\nresearchers can better apply our proposed cell segmentation guidelines.\n","authors":["Boyuan Peng","Jiaju Chen","P. Bilha Githinji","Ijaz Gul","Qihui Ye","Minjiang Chen","Peiwu Qin","Xingru Huang","Chenggang Yan","Dongmei Yu","Jiansong Ji","Zhenglin Chen"],"pdf_url":"https://arxiv.org/pdf/2404.08549v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12616v2","updated":"2024-08-26T03:47:06Z","published":"2024-08-08T16:46:14Z","title":"Semantic Communication based on Large Language Model for Underwater\n Image Transmission","summary":" Underwater communication is essential for environmental monitoring, marine\nbiology research, and underwater exploration. 
Traditional underwater\ncommunication faces limitations like low bandwidth, high latency, and\nsusceptibility to noise, while semantic communication (SC) offers a promising\nsolution by focusing on the exchange of semantics rather than symbols or bits.\nHowever, SC encounters challenges in underwater environments, including\nsemantic information mismatch and difficulties in accurately identifying and\ntransmitting critical information that aligns with the diverse requirements of\nunderwater applications. To address these challenges, we propose a novel\nSemantic Communication (SC) framework based on Large Language Models (LLMs).\nOur framework leverages visual LLMs to perform semantic compression and\nprioritization of underwater image data according to the query from users. By\nidentifying and encoding key semantic elements within the images, the system\nselectively transmits high-priority information while applying higher\ncompression rates to less critical regions. On the receiver side, an LLM-based\nrecovery mechanism, along with Global Vision ControlNet and Key Region\nControlNet networks, aids in reconstructing the images, thereby enhancing\ncommunication efficiency and robustness. Our framework reduces the overall data\nsize to 0.8\\% of the original. Experimental results demonstrate that our method\nsignificantly outperforms existing approaches, ensuring high-quality,\nsemantically accurate image reconstruction.\n","authors":["Weilong Chen","Wenxuan Xu","Haoran Chen","Xinran Zhang","Zhijin Qin","Yanru Zhang","Zhu Han"],"pdf_url":"https://arxiv.org/pdf/2408.12616v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13995v1","updated":"2024-08-26T03:35:13Z","published":"2024-08-26T03:35:13Z","title":"Avatar Concept Slider: Manipulate Concepts In Your Human Avatar With\n Fine-grained Control","summary":" Language based editing of 3D human avatars to precisely match user\nrequirements is challenging due to the inherent ambiguity and limited\nexpressiveness of natural language. To overcome this, we propose the Avatar\nConcept Slider (ACS), a 3D avatar editing method that allows precise\nmanipulation of semantic concepts in human avatars towards a specified\nintermediate point between two extremes of concepts, akin to moving a knob\nalong a slider track. To achieve this, our ACS has three designs. 1) A Concept\nSliding Loss based on Linear Discriminant Analysis to pinpoint the\nconcept-specific axis for precise editing. 2) An Attribute Preserving Loss\nbased on Principal Component Analysis for improved preservation of avatar\nidentity during editing. 3) A 3D Gaussian Splatting primitive selection\nmechanism based on concept-sensitivity, which updates only the primitives that\nare the most sensitive to our target concept, to improve efficiency. 
Results\ndemonstrate that our ACS enables fine-grained 3D avatar editing with efficient\nfeedback, without harming the avatar quality or compromising the avatar's\nidentifying attributes.\n","authors":["Yixuan He","Lin Geng Foo","Ajmal Saeed Mian","Hossein Rahmani","Jun Jiu"],"pdf_url":"https://arxiv.org/pdf/2408.13995v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.06886v7","updated":"2024-08-26T03:25:12Z","published":"2024-07-09T14:14:47Z","title":"Aligning Cyber Space with Physical World: A Comprehensive Survey on\n Embodied AI","summary":" Embodied Artificial Intelligence (Embodied AI) is crucial for achieving\nArtificial General Intelligence (AGI) and serves as a foundation for various\napplications that bridge cyberspace and the physical world. Recently, the\nemergence of Multi-modal Large Models (MLMs) and World Models (WMs) have\nattracted significant attention due to their remarkable perception,\ninteraction, and reasoning capabilities, making them a promising architecture\nfor the brain of embodied agents. However, there is no comprehensive survey for\nEmbodied AI in the era of MLMs. In this survey, we give a comprehensive\nexploration of the latest advancements in Embodied AI. Our analysis firstly\nnavigates through the forefront of representative works of embodied robots and\nsimulators, to fully understand the research focuses and their limitations.\nThen, we analyze four main research targets: 1) embodied perception, 2)\nembodied interaction, 3) embodied agent, and 4) sim-to-real adaptation,\ncovering the state-of-the-art methods, essential paradigms, and comprehensive\ndatasets. Additionally, we explore the complexities of MLMs in virtual and real\nembodied agents, highlighting their significance in facilitating interactions\nin dynamic digital and physical environments. Finally, we summarize the\nchallenges and limitations of embodied AI and discuss their potential future\ndirections. We hope this survey will serve as a foundational reference for the\nresearch community and inspire continued innovation. The associated project can\nbe found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.\n","authors":["Yang Liu","Weixing Chen","Yongjie Bai","Xiaodan Liang","Guanbin Li","Wen Gao","Liang Lin"],"pdf_url":"https://arxiv.org/pdf/2407.06886v7.pdf","comment":"The first comprehensive review of Embodied AI in the era of MLMs, 39\n pages. We also provide the paper list for Embodied AI:\n https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List"},{"id":"http://arxiv.org/abs/2303.17117v2","updated":"2024-08-26T03:22:08Z","published":"2023-03-30T03:09:25Z","title":"Reliable Representations Learning for Incomplete Multi-View Partial\n Multi-Label Classification","summary":" As a cross-topic of multi-view learning and multi-label classification,\nmulti-view multi-label classification has gradually gained traction in recent\nyears. The application of multi-view contrastive learning has further\nfacilitated this process, however, the existing multi-view contrastive learning\nmethods crudely separate the so-called negative pair, which largely results in\nthe separation of samples belonging to the same category or similar ones.\nBesides, plenty of multi-view multi-label learning methods ignore the possible\nabsence of views and labels. 
To address these issues, in this paper, we propose\nan incomplete multi-view partial multi-label classification network named RANK.\nIn this network, a label-driven multi-view contrastive learning strategy is\nproposed to leverage supervised information to preserve the structure within\nview and perform consistent alignment across views. Furthermore, we break\nthrough the view-level weights inherent in existing methods and propose a\nquality-aware sub-network to dynamically assign quality scores to each view of\neach sample. The label correlation information is fully utilized in the final\nmulti-label cross-entropy classification loss, effectively improving the\ndiscriminative power. Last but not least, our model is not only able to handle\ncomplete multi-view multi-label datasets, but also works on datasets with\nmissing instances and labels. Extensive experiments confirm that our RANK\noutperforms existing state-of-the-art methods.\n","authors":["Chengliang Liu","Jie Wen","Yong Xu","Bob Zhang","Liqiang Nie","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2303.17117v2.pdf","comment":"Please contact me if you have any questions: liucl1996@163.com"},{"id":"http://arxiv.org/abs/2408.10848v2","updated":"2024-08-26T03:19:45Z","published":"2024-08-20T13:40:25Z","title":"Perception-guided Jailbreak against Text-to-Image Models","summary":" In recent years, Text-to-Image (T2I) models have garnered significant\nattention due to their remarkable advancements. However, security concerns have\nemerged due to their potential to generate inappropriate or Not-Safe-For-Work\n(NSFW) images. In this paper, inspired by the observation that texts with\ndifferent semantics can lead to similar human perceptions, we propose an\nLLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box\njailbreak method that requires no specific T2I model (model-free) and generates\nhighly natural attack prompts. Specifically, we propose identifying a safe\nphrase that is similar in human perception yet inconsistent in text semantics\nwith the target unsafe word and using it as a substitution. The experiments\nconducted on six open-source models and commercial online services with\nthousands of prompts have verified the effectiveness of PGJ.\n","authors":["Yihao Huang","Le Liang","Tianlin Li","Xiaojun Jia","Run Wang","Weikai Miao","Geguang Pu","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2408.10848v2.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2408.13988v1","updated":"2024-08-26T03:02:41Z","published":"2024-08-26T03:02:41Z","title":"Automatic Medical Report Generation: Methods and Applications","summary":" The increasing demand for medical imaging has surpassed the capacity of\navailable radiologists, leading to diagnostic delays and potential\nmisdiagnoses. Artificial intelligence (AI) techniques, particularly in\nautomatic medical report generation (AMRG), offer a promising solution to this\ndilemma. This review comprehensively examines AMRG methods from 2021 to 2024.\nIt (i) presents solutions to primary challenges in this field, (ii) explores\nAMRG applications across various imaging modalities, (iii) introduces publicly\navailable datasets, (iv) outlines evaluation metrics, (v) identifies techniques\nthat significantly enhance model performance, and (vi) discusses unresolved\nissues and potential future research directions. This paper aims to provide a\ncomprehensive understanding of the existing literature and inspire valuable\nfuture research.\n","authors":["Li Guo","Anas M. 
Tahir","Dong Zhang","Z. Jane Wang","Rabab K. Ward"],"pdf_url":"https://arxiv.org/pdf/2408.13988v1.pdf","comment":"42 pages and 9 figures"},{"id":"http://arxiv.org/abs/2401.12452v2","updated":"2024-08-26T02:50:28Z","published":"2024-01-23T02:41:06Z","title":"Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural\n Calibration","summary":" This paper introduces a novel self-supervised learning framework for\nenhancing 3D perception in autonomous driving scenes. Specifically, our\napproach, namely NCLR, focuses on 2D-3D neural calibration, a novel pretext\ntask that estimates the rigid pose aligning camera and LiDAR coordinate\nsystems. First, we propose the learnable transformation alignment to bridge the\ndomain gap between image and point cloud data, converting features into a\nunified representation space for effective comparison and matching. Second, we\nidentify the overlapping area between the image and point cloud with the fused\nfeatures. Third, we establish dense 2D-3D correspondences to estimate the rigid\npose. The framework not only learns fine-grained matching from points to pixels\nbut also achieves alignment of the image and point cloud at a holistic level,\nunderstanding their relative pose. We demonstrate the efficacy of NCLR by\napplying the pre-trained backbone to downstream tasks, such as LiDAR-based 3D\nsemantic segmentation, object detection, and panoptic segmentation.\nComprehensive experiments on various datasets illustrate the superiority of\nNCLR over existing self-supervised methods. The results confirm that joint\nlearning from different modalities significantly enhances the network's\nunderstanding abilities and effectiveness of learned representation. The code\nis publicly available at https://github.com/Eaphan/NCLR.\n","authors":["Yifan Zhang","Siyu Ren","Junhui Hou","Jinjian Wu","Yixuan Yuan","Guangming Shi"],"pdf_url":"https://arxiv.org/pdf/2401.12452v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2305.07895v7","updated":"2024-08-26T02:37:14Z","published":"2023-05-13T11:28:37Z","title":"OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models","summary":" Large models have recently played a dominant role in natural language\nprocessing and multimodal vision-language learning. However, their\neffectiveness in text-related visual tasks remains relatively unexplored. In\nthis paper, we conducted a comprehensive evaluation of Large Multimodal Models,\nsuch as GPT4V and Gemini, in various text-related visual tasks including Text\nRecognition, Scene Text-Centric Visual Question Answering (VQA),\nDocument-Oriented VQA, Key Information Extraction (KIE), and Handwritten\nMathematical Expression Recognition (HMER). To facilitate the assessment of\nOptical Character Recognition (OCR) capabilities in Large Multimodal Models, we\npropose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29\ndatasets, making it the most comprehensive OCR evaluation benchmark available.\nFurthermore, our study reveals both the strengths and weaknesses of these\nmodels, particularly in handling multilingual text, handwritten text,\nnon-semantic text, and mathematical expression recognition. Most importantly,\nthe baseline results presented in this study could provide a foundational\nframework for the conception and assessment of innovative strategies targeted\nat enhancing zero-shot multimodal techniques. 
The evaluation pipeline and\nbenchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.\n","authors":["Yuliang Liu","Zhang Li","Mingxin Huang","Biao Yang","Wenwen Yu","Chunyuan Li","Xucheng Yin","Cheng-lin Liu","Lianwen Jin","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2305.07895v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13983v1","updated":"2024-08-26T02:33:47Z","published":"2024-08-26T02:33:47Z","title":"Dual-Path Adversarial Lifting for Domain Shift Correction in Online\n Test-time Adaptation","summary":" Transformer-based methods have achieved remarkable success in various machine\nlearning tasks. How to design efficient test-time adaptation methods for\ntransformer models becomes an important research task. In this work, motivated\nby the dual-subband wavelet lifting scheme developed in multi-scale signal\nprocessing which is able to efficiently separate the input signals into\nprincipal components and noise components, we introduce a dual-path token\nlifting for domain shift correction in test time adaptation. Specifically, we\nintroduce an extra token, referred to as \\textit{domain shift token}, at each\nlayer of the transformer network. We then perform dual-path lifting with\ninterleaved token prediction and update between the path of domain shift tokens\nand the path of class tokens at all network layers. The prediction and update\nnetworks are learned in an adversarial manner. Specifically, the task of the\nprediction network is to learn the residual noise of domain shift which should\nbe largely invariant across all classes and all samples in the target domain.\nIn other words, the predicted domain shift noise should be indistinguishable\nbetween all sample classes. On the other hand, the task of the update network\nis to update the class tokens by removing the domain shift from the input image\nsamples so that input samples become more discriminative between different\nclasses in the feature space. To effectively learn the prediction and update\nnetworks with two adversarial tasks, both theoretically and practically, we\ndemonstrate that it is necessary to use smooth optimization for the update\nnetwork but non-smooth optimization for the prediction network. Experimental\nresults on the benchmark datasets demonstrate that our proposed method\nsignificantly improves the online fully test-time domain adaptation\nperformance. Code is available at \\url{https://github.com/yushuntang/DPAL}.\n","authors":["Yushun Tang","Shuoshuo Chen","Zhihe Lu","Xinchao Wang","Zhihai He"],"pdf_url":"https://arxiv.org/pdf/2408.13983v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13981v1","updated":"2024-08-26T02:26:09Z","published":"2024-08-26T02:26:09Z","title":"ARANet: Attention-based Residual Adversarial Network with Deep\n Supervision for Radiotherapy Dose Prediction of Cervical Cancer","summary":" Radiation therapy is the mainstay treatment for cervical cancer, and its\nultimate goal is to ensure the planning target volume (PTV) reaches the\nprescribed dose while reducing dose deposition of organs-at-risk (OARs) as much\nas possible. To achieve these clinical requirements, the medical physicist\nneeds to manually tweak the radiotherapy plan repeatedly in a trial-anderror\nmanner until finding the optimal one in the clinic. However, such\ntrial-and-error processes are quite time-consuming, and the quality of plans\nhighly depends on the experience of the medical physicist. 
In this paper, we\npropose an end-to-end Attentionbased Residual Adversarial Network with deep\nsupervision, namely ARANet, to automatically predict the 3D dose distribution\nof cervical cancer. Specifically, given the computer tomography (CT) images and\ntheir corresponding segmentation masks of PTV and OARs, ARANet employs a\nprediction network to generate the dose maps. We also utilize a multi-scale\nresidual attention module and deep supervision mechanism to enforce the\nprediction network to extract more valuable dose features while suppressing\nirrelevant information. Our proposed method is validated on an in-house dataset\nincluding 54 cervical cancer patients, and experimental results have\ndemonstrated its obvious superiority compared to other state-of-the-art\nmethods.\n","authors":["Lu Wen","Wenxia Yin","Zhenghao Feng","Xi Wu","Deng Xiong","Yan Wang"],"pdf_url":"https://arxiv.org/pdf/2408.13981v1.pdf","comment":"Accepted by 2024 IEEE International Conference on Cybernetics and\n Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and\n Mechatronics (RAM)"},{"id":"http://arxiv.org/abs/2408.13980v1","updated":"2024-08-26T02:20:55Z","published":"2024-08-26T02:20:55Z","title":"FusionSAM: Latent Space driven Segment Anything Model for Multimodal\n Fusion and Segmentation","summary":" Multimodal image fusion and segmentation enhance scene understanding in\nautonomous driving by integrating data from various sensors. However, current\nmodels struggle to efficiently segment densely packed elements in such scenes,\ndue to the absence of comprehensive fusion features that can guide mid-process\nfine-tuning and focus attention on relevant areas. The Segment Anything Model\n(SAM) has emerged as a transformative segmentation method. It provides more\neffective prompts through its flexible prompt encoder, compared to transformers\nlacking fine-tuned control. Nevertheless, SAM has not been extensively studied\nin the domain of multimodal fusion for natural images. In this paper, we\nintroduce SAM into multimodal image segmentation for the first time, proposing\na novel framework that combines Latent Space Token Generation (LSTG) and Fusion\nMask Prompting (FMP) modules to enhance SAM's multimodal fusion and\nsegmentation capabilities. Specifically, we first obtain latent space features\nof the two modalities through vector quantization and embed them into a\ncross-attention-based inter-domain fusion module to establish long-range\ndependencies between modalities. Then, we use these comprehensive fusion\nfeatures as prompts to guide precise pixel-level segmentation. Extensive\nexperiments on several public datasets demonstrate that the proposed method\nsignificantly outperforms SAM and SAM2 in multimodal autonomous driving\nscenarios, achieving at least 3.9$\\%$ higher segmentation mIoU than the\nstate-of-the-art approaches.\n","authors":["Daixun Li","Weiying Xie","Mingxiang Cao","Yunke Wang","Jiaqing Zhang","Yunsong Li","Leyuan Fang","Chang Xu"],"pdf_url":"https://arxiv.org/pdf/2408.13980v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.10650v2","updated":"2024-08-26T02:19:11Z","published":"2024-03-15T19:35:10Z","title":"PALM: Pushing Adaptive Learning Rate Mechanisms for Continual Test-Time\n Adaptation","summary":" Real-world vision models in dynamic environments face rapid shifts in domain\ndistributions, leading to decreased recognition performance. 
Using unlabeled\ntest data, continual test-time adaptation (CTTA) directly adjusts a pre-trained\nsource discriminative model to these changing domains. A highly effective CTTA\nmethod involves applying layer-wise adaptive learning rates for selectively\nadapting pre-trained layers. However, it suffers from the poor estimation of\ndomain shift and the inaccuracies arising from the pseudo-labels. This work\naims to overcome these limitations by identifying layers for adaptation via\nquantifying model prediction uncertainty without relying on pseudo-labels. We\nutilize the magnitude of gradients as a metric, calculated by backpropagating\nthe KL divergence between the softmax output and a uniform distribution, to\nselect layers for further adaptation. Subsequently, for the parameters\nexclusively belonging to these selected layers, with the remaining ones frozen,\nwe evaluate their sensitivity to approximate the domain shift and adjust their\nlearning rates accordingly. We conduct extensive image classification\nexperiments on CIFAR-10C, CIFAR-100C, and ImageNet-C, demonstrating the\nsuperior efficacy of our method compared to prior approaches.\n","authors":["Sarthak Kumar Maharana","Baoming Zhang","Yunhui Guo"],"pdf_url":"https://arxiv.org/pdf/2403.10650v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04560v3","updated":"2024-08-26T02:11:58Z","published":"2024-01-09T13:56:37Z","title":"Phase-shifted remote photoplethysmography for estimating heart rate and\n blood pressure from facial video","summary":" Human health can be critically affected by cardiovascular diseases, such as\nhypertension, arrhythmias, and stroke. Heart rate and blood pressure are\nimportant biometric information for the monitoring of cardiovascular system and\nearly diagnosis of cardiovascular diseases. Existing methods for estimating the\nheart rate are based on electrocardiography and photoplethyomography, which\nrequire contacting the sensor to the skin surface. Moreover, catheter and\ncuff-based methods for measuring blood pressure cause inconvenience and have\nlimited applicability. Therefore, in this thesis, we propose a vision-based\nmethod for estimating the heart rate and blood pressure. This thesis proposes a\n2-stage deep learning framework consisting of a dual remote\nphotoplethysmography network (DRP-Net) and bounded blood pressure network\n(BBP-Net). In the first stage, DRP-Net infers remote photoplethysmography\n(rPPG) signals for the acral and facial regions, and these phase-shifted rPPG\nsignals are utilized to estimate the heart rate. In the second stage, BBP-Net\nintegrates temporal features and analyzes phase discrepancy between the acral\nand facial rPPG signals to estimate SBP and DBP values. To improve the accuracy\nof estimating the heart rate, we employed a data augmentation method based on a\nframe interpolation model. Moreover, we designed BBP-Net to infer blood\npressure within a predefined range by incorporating a scaled sigmoid function.\nOur method resulted in estimating the heart rate with the mean absolute error\n(MAE) of 1.78 BPM, reducing the MAE by 34.31 % compared to the recent method,\non the MMSE-HR dataset. The MAE for estimating the systolic blood pressure\n(SBP) and diastolic blood pressure (DBP) were 10.19 mmHg and 7.09 mmHg. 
On the\nV4V dataset, the MAE for the heart rate, SBP, and DBP were 3.83 BPM, 13.64\nmmHg, and 9.4 mmHg, respectively.\n","authors":["Gyutae Hwang","Sang Jun Lee"],"pdf_url":"https://arxiv.org/pdf/2401.04560v3.pdf","comment":"33 pages, 7 figures"},{"id":"http://arxiv.org/abs/2408.13979v1","updated":"2024-08-26T02:09:05Z","published":"2024-08-26T02:09:05Z","title":"Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models","summary":" With the prevalence of large-scale pretrained vision-language models (VLMs),\nsuch as CLIP, soft-prompt tuning has become a popular method for adapting these\nmodels to various downstream tasks. However, few works delve into the inherent\nproperties of learnable soft-prompt vectors, specifically the impact of their\nnorms to the performance of VLMs. This motivates us to pose an unexplored\nresearch question: ``Do we need to normalize the soft prompts in VLMs?'' To\nfill this research gap, we first uncover a phenomenon, called the\n\\textbf{Low-Norm Effect} by performing extensive corruption experiments,\nsuggesting that reducing the norms of certain learned prompts occasionally\nenhances the performance of VLMs, while increasing them often degrades it. To\nharness this effect, we propose a novel method named \\textbf{N}ormalizing\nth\\textbf{e} soft-pro\\textbf{m}pt v\\textbf{e}ctors of vi\\textbf{si}on-language\nmodel\\textbf{s} (\\textbf{Nemesis}) to normalize soft-prompt vectors in VLMs. To\nthe best of our knowledge, our work is the first to systematically investigate\nthe role of norms of soft-prompt vector in VLMs, offering valuable insights for\nfuture research in soft-prompt tuning. The code is available at\n\\texttt{\\href{https://github.com/ShyFoo/Nemesis}{https://github.com/ShyFoo/Nemesis}}.\n","authors":["Shuai Fu","Xiequn Wang","Qiushi Huang","Yu Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.13979v1.pdf","comment":"Accepted at ICLR 2024 (Spotlight)"},{"id":"http://arxiv.org/abs/2408.13978v1","updated":"2024-08-26T01:54:37Z","published":"2024-08-26T01:54:37Z","title":"Histology Virtual Staining with Mask-Guided Adversarial Transfer\n Learning for Tertiary Lymphoid Structure Detection","summary":" Histological Tertiary Lymphoid Structures (TLSs) are increasingly recognized\nfor their correlation with the efficacy of immunotherapy in various solid\ntumors. Traditionally, the identification and characterization of TLSs rely on\nimmunohistochemistry (IHC) staining techniques, utilizing markers such as CD20\nfor B cells. Despite the specificity of IHC, Hematoxylin-Eosin (H&E) staining\noffers a more accessible and cost-effective choice. Capitalizing on the\nprevalence of H&E staining slides, we introduce a novel Mask-Guided Adversarial\nTransfer Learning method designed for virtual pathological staining. This\nmethod adeptly captures the nuanced color variations across diverse tissue\ntypes under various staining conditions, such as nucleus, red blood cells,\npositive reaction regions, without explicit label information, and adeptly\nsynthesizes realistic IHC-like virtual staining patches, even replicating the\npositive reaction. Further, we propose the Virtual IHC Pathology Analysis\nNetwork (VIPA-Net), an integrated framework encompassing a Mask-Guided Transfer\nModule and an H&E-Based Virtual Staining TLS Detection Module. VIPA-Net\nsynergistically harnesses both H\\&E staining slides and the synthesized virtual\nIHC patches to enhance the detection of TLSs within H&E Whole Slide Images\n(WSIs). 
We evaluate the network with a comprehensive dataset comprising 1019\nannotated slides from The Cancer Genome Atlas (TCGA). Experimental results\ncompellingly illustrate that the VIPA-Net substantially elevates TLS detection\naccuracy, effectively circumventing the need for actual CD20 staining across\nthe public dataset.\n","authors":["Qiuli Wang","Yongxu Liu","Li Ma","Xianqi Wang","Wei Chen","Xiaohong Yao"],"pdf_url":"https://arxiv.org/pdf/2408.13978v1.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2408.13972v1","updated":"2024-08-26T01:36:46Z","published":"2024-08-26T01:36:46Z","title":"DynaSurfGS: Dynamic Surface Reconstruction with Planar-based Gaussian\n Splatting","summary":" Dynamic scene reconstruction has garnered significant attention in recent\nyears due to its capabilities in high-quality and real-time rendering. Among\nvarious methodologies, constructing a 4D spatial-temporal representation, such\nas 4D-GS, has gained popularity for its high-quality rendered images. However,\nthese methods often produce suboptimal surfaces, as the discrete 3D Gaussian\npoint clouds fail to align with the object's surface precisely. To address this\nproblem, we propose DynaSurfGS to achieve both photorealistic rendering and\nhigh-fidelity surface reconstruction of dynamic scenarios. Specifically, the\nDynaSurfGS framework first incorporates Gaussian features from 4D neural voxels\nwith the planar-based Gaussian Splatting to facilitate precise surface\nreconstruction. It leverages normal regularization to enforce the smoothness of\nthe surface of dynamic objects. It also incorporates the as-rigid-as-possible\n(ARAP) constraint to maintain the approximate rigidity of local neighborhoods\nof 3D Gaussians between timesteps and ensure that adjacent 3D Gaussians remain\nclosely aligned throughout. Extensive experiments demonstrate that DynaSurfGS\nsurpasses state-of-the-art methods in both high-fidelity surface reconstruction\nand photorealistic rendering.\n","authors":["Weiwei Cai","Weicai Ye","Peng Ye","Tong He","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2408.13972v1.pdf","comment":"homepage: https://open3dvlab.github.io/DynaSurfGS/, code:\n https://github.com/Open3DVLab/DynaSurfGS"},{"id":"http://arxiv.org/abs/2408.09403v2","updated":"2024-08-26T01:26:35Z","published":"2024-08-18T08:23:51Z","title":"Obtaining Optimal Spiking Neural Network in Sequence Learning via\n CRNN-SNN Conversion","summary":" Spiking neural networks (SNNs) are becoming a promising alternative to\nconventional artificial neural networks (ANNs) due to their rich neural\ndynamics and the implementation of energy-efficient neuromorphic chips.\nHowever, the non-differential binary communication mechanism makes SNN hard to\nconverge to an ANN-level accuracy. When SNN encounters sequence learning, the\nsituation becomes worse due to the difficulties in modeling long-range\ndependencies. To overcome these difficulties, researchers developed variants of\nLIF neurons and different surrogate gradients but still failed to obtain good\nresults when the sequence became longer (e.g., $>$500). Unlike them, we obtain\nan optimal SNN in sequence learning by directly mapping parameters from a\nquantized CRNN. We design two sub-pipelines to support the end-to-end\nconversion of different structures in neural networks, which is called\nCNN-Morph (CNN $\\rightarrow$ QCNN $\\rightarrow$ BIFSNN) and RNN-Morph (RNN\n$\\rightarrow$ QRNN $\\rightarrow$ RBIFSNN). 
Using conversion pipelines and the\ns-analog encoding method, the conversion error of our framework is zero.\nFurthermore, we give the theoretical and experimental demonstration of the\nlossless CRNN-SNN conversion. Our results show the effectiveness of our method\nover short and long timescales tasks compared with the state-of-the-art\nlearning- and conversion-based methods. We reach the highest accuracy of 99.16%\n(0.46 $\\uparrow$) on S-MNIST, 94.95% (3.95 $\\uparrow$) on PS-MNIST (sequence\nlength of 784) respectively, and the lowest loss of 0.057 (0.013 $\\downarrow$)\nwithin 8 time-steps in collision avoidance dataset.\n","authors":["Jiahao Su","Kang You","Zekai Xu","Weizhi Xu","Zhezhi He"],"pdf_url":"https://arxiv.org/pdf/2408.09403v2.pdf","comment":"Accepted by 33rd International Conference on Artificial Neural\n Networks"},{"id":"http://arxiv.org/abs/2408.12677v2","updated":"2024-08-26T01:08:36Z","published":"2024-08-22T18:32:50Z","title":"GSFusion: Online RGB-D Mapping Where Gaussian Splatting Meets TSDF\n Fusion","summary":" Traditional volumetric fusion algorithms preserve the spatial structure of 3D\nscenes, which is beneficial for many tasks in computer vision and robotics.\nHowever, they often lack realism in terms of visualization. Emerging 3D\nGaussian splatting bridges this gap, but existing Gaussian-based reconstruction\nmethods often suffer from artifacts and inconsistencies with the underlying 3D\nstructure, and struggle with real-time optimization, unable to provide users\nwith immediate feedback in high quality. One of the bottlenecks arises from the\nmassive amount of Gaussian parameters that need to be updated during\noptimization. Instead of using 3D Gaussian as a standalone map representation,\nwe incorporate it into a volumetric mapping system to take advantage of\ngeometric information and propose to use a quadtree data structure on images to\ndrastically reduce the number of splats initialized. In this way, we\nsimultaneously generate a compact 3D Gaussian map with fewer artifacts and a\nvolumetric map on the fly. Our method, GSFusion, significantly enhances\ncomputational efficiency without sacrificing rendering quality, as demonstrated\non both synthetic and real datasets. Code will be available at\nhttps://github.com/goldoak/GSFusion.\n","authors":["Jiaxin Wei","Stefan Leutenegger"],"pdf_url":"https://arxiv.org/pdf/2408.12677v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14698v1","updated":"2024-08-26T23:52:27Z","published":"2024-08-26T23:52:27Z","title":"Smart Multi-Modal Search: Contextual Sparse and Dense Embedding\n Integration in Adobe Express","summary":" As user content and queries become increasingly multi-modal, the need for\neffective multi-modal search systems has grown. Traditional search systems\noften rely on textual and metadata annotations for indexed images, while\nmulti-modal embeddings like CLIP enable direct search using text and image\nembeddings. However, embedding-based approaches face challenges in integrating\ncontextual features such as user locale and recency. Building a scalable\nmulti-modal search system requires fine-tuning several components. This paper\npresents a multi-modal search architecture and a series of AB tests that\noptimize embeddings and multi-modal technologies in Adobe Express template\nsearch. We address considerations such as embedding model selection, the roles\nof embeddings in matching and ranking, and the balance between dense and sparse\nembeddings. 
Our iterative approach demonstrates how utilizing sparse, dense,\nand contextual features enhances short and long query search, significantly\nreduces null rates (over 70\\%), and increases click-through rates (CTR). Our\nfindings provide insights into developing robust multi-modal search systems,\nthereby enhancing relevance for complex queries.\n","authors":["Cherag Aroraa","Tracy Holloway King","Jayant Kumar","Yi Lu","Sanat Sharma","Arvind Srikantan","David Uvalle","Josep Valls-Vargas","Harsha Vardhan"],"pdf_url":"https://arxiv.org/pdf/2408.14698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14681v1","updated":"2024-08-26T23:10:42Z","published":"2024-08-26T23:10:42Z","title":"Enhancing Neural Network Interpretability Through Conductance-Based\n Information Plane Analysis","summary":" The Information Plane is a conceptual framework used to analyze the flow of\ninformation in neural networks, but traditional methods based on activations\nmay not fully capture the dynamics of information processing. This paper\nintroduces a new approach that uses layer conductance, a measure of sensitivity\nto input features, to enhance the Information Plane analysis. By incorporating\ngradient-based contributions, we provide a more precise characterization of\ninformation dynamics within the network. The proposed conductance-based\nInformation Plane and a new Information Transformation Efficiency (ITE) metric\nare evaluated on pretrained ResNet50 and VGG16 models using the ImageNet\ndataset. Our results demonstrate the ability to identify critical hidden layers\nthat contribute significantly to model performance and interpretability, giving\ninsights into information compression, preservation, and utilization across\nlayers. The conductance-based approach offers a granular perspective on feature\nattribution, enhancing our understanding of the decision-making processes\nwithin neural networks. Furthermore, our empirical findings challenge certain\ntheoretical predictions of the Information Bottleneck theory, highlighting the\ncomplexities of information dynamics in real-world data scenarios. The proposed\nmethod not only advances our understanding of information dynamics in neural\nnetworks but also has the potential to significantly impact the broader field\nof Artificial Intelligence by enabling the development of more interpretable,\nefficient, and robust models.\n","authors":["Jaouad Dabounou","Amine Baazzouz"],"pdf_url":"https://arxiv.org/pdf/2408.14681v1.pdf","comment":"16 pages, 10 figures"},{"id":"http://arxiv.org/abs/2407.02534v2","updated":"2024-08-26T22:56:28Z","published":"2024-07-01T16:58:55Z","title":"Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything","summary":" Large Visual Language Models (VLMs) such as GPT-4V have achieved\nremarkable success in generating comprehensive and nuanced responses.\nResearchers have proposed various benchmarks for evaluating the capabilities of\nVLMs. With the integration of visual and text inputs in VLMs, new security\nissues emerge, as malicious attackers can exploit multiple modalities to\nachieve their objectives. This has led to increasing attention on the\nvulnerabilities of VLMs to jailbreak. Most existing research focuses on\ngenerating adversarial images or nonsensical images to jailbreak these models.\nHowever, no researchers have evaluated whether the logic understanding capabilities of\nVLMs in flowcharts can influence jailbreak. 
Therefore, to fill this gap, this\npaper first introduces a novel dataset Flow-JD specifically designed to\nevaluate the logic-based flowchart jailbreak capabilities of VLMs. We conduct\nan extensive evaluation on GPT-4o, GPT-4V, and 5 other SOTA open-source VLMs, and\nthe jailbreak rate is up to 92.8%. Our research reveals significant\nvulnerabilities in current VLMs concerning image-to-text jailbreak and these\nfindings underscore the urgency for the development of robust and effective\nfuture defenses.\n","authors":["Xiaotian Zou","Ke Li","Yongkang Chen"],"pdf_url":"https://arxiv.org/pdf/2407.02534v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14674v1","updated":"2024-08-26T22:50:59Z","published":"2024-08-26T22:50:59Z","title":"gWaveNet: Classification of Gravity Waves from Noisy Satellite Data\n using Custom Kernel Integrated Deep Learning Method","summary":" Atmospheric gravity waves occur in the Earth's atmosphere, caused by an\ninterplay between gravity and buoyancy forces. These waves have profound\nimpacts on various aspects of the atmosphere, including the patterns of\nprecipitation, cloud formation, ozone distribution, aerosols, and pollutant\ndispersion. Therefore, understanding gravity waves is essential to comprehend\nand monitor changes in a wide range of atmospheric behaviors. Limited studies\nhave been conducted to identify gravity waves from satellite data using machine\nlearning techniques. Particularly, without applying noise removal techniques,\nit remains an underexplored area of research. This study presents a novel\nkernel design aimed at identifying gravity waves within satellite images. The\nproposed kernel is seamlessly integrated into a deep convolutional neural\nnetwork, denoted as gWaveNet. Our proposed model exhibits impressive\nproficiency in detecting images containing gravity waves from noisy satellite\ndata without any feature engineering. The empirical results show our model\noutperforms related approaches by achieving over 98% training accuracy and over\n94% test accuracy, which is known to be the best result for gravity wave\ndetection up to the time of this work. We open-sourced our code at\nhttps://rb.gy/qn68ku.\n","authors":["Seraj Al Mahmud Mostafa","Omar Faruque","Chenxi Wang","Jia Yue","Sanjay Purushotham","Jianwu Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14674v1.pdf","comment":"This paper has been accepted at the 27th International Conference on\n Pattern Recognition (ICPR) 2024"},{"id":"http://arxiv.org/abs/2408.14672v1","updated":"2024-08-26T22:39:08Z","published":"2024-08-26T22:39:08Z","title":"Physically Feasible Semantic Segmentation","summary":" State-of-the-art semantic segmentation models are typically optimized in a\ndata-driven fashion, minimizing solely per-pixel classification objectives on\ntheir training data. This purely data-driven paradigm often leads to absurd\nsegmentations, especially when the domain of input images is shifted from the\none encountered during training. For instance, state-of-the-art models may\nassign the label ``road'' to a segment which is located above a segment that is\nrespectively labeled as ``sky'', although our knowledge of the physical world\ndictates that such a configuration is not feasible for images captured by\nforward-facing upright cameras. 
Our method, Physically Feasible Semantic\nSegmentation (PhyFea), extracts explicit physical constraints that govern\nspatial class relations from the training sets of semantic segmentation\ndatasets and enforces a differentiable loss function that penalizes violations\nof these constraints to promote prediction feasibility. PhyFea yields\nsignificant performance improvements in mIoU over each state-of-the-art network\nwe use as a baseline across ADE20K, Cityscapes and ACDC, notably a $1.5\\%$\nimprovement on ADE20K and a $2.1\\%$ improvement on ACDC.\n","authors":["Shamik Basu","Christos Sakaridis","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2408.14672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14659v1","updated":"2024-08-26T21:55:06Z","published":"2024-08-26T21:55:06Z","title":"Comparative Analysis: Violence Recognition from Videos using Transfer\n Learning","summary":" Action recognition has become a hot topic in computer vision. However, the\nmain applications of computer vision in video processing have focused on\ndetection of relatively simple actions while complex events such as violence\ndetection have been comparatively less investigated. This study focuses on the\nbenchmarking of various deep learning techniques on a complex dataset. Next, a\nlarger dataset is utilized to test the uplift from increasing volume of data.\nThe dataset size increase from 500 to 1,600 videos resulted in a notable\naverage accuracy improvement of 6% across four models.\n","authors":["Dursun Dashdamirov"],"pdf_url":"https://arxiv.org/pdf/2408.14659v1.pdf","comment":"6 pages, 5 figures, The paper will be published in IEEE AICT 2024\n Conference"},{"id":"http://arxiv.org/abs/2308.13651v5","updated":"2024-08-26T21:11:26Z","published":"2023-08-25T19:40:56Z","title":"PCNN: Probable-Class Nearest-Neighbor Explanations Improve Fine-Grained\n Image Classification Accuracy for AIs and Humans","summary":" Nearest neighbors (NN) are traditionally used to compute final decisions,\ne.g., in Support Vector Machines or k-NN classifiers, and to provide users with\nexplanations for the model's decision. In this paper, we show a novel utility\nof nearest neighbors: To improve predictions of a frozen, pretrained image\nclassifier C. We leverage an image comparator S that (1) compares the input\nimage with NN images from the top-K most probable classes given by C; and (2)\nuses scores from S to weight the confidence scores of C to refine predictions.\nOur method consistently improves fine-grained image classification accuracy on\nCUB-200, Cars-196, and Dogs-120. Also, a human study finds that showing users\nour probable-class nearest neighbors (PCNN) reduces over-reliance on AI, thus\nimproving their decision accuracy over prior work which shows only the\nmost-probable (top-1) class examples.\n","authors":[" Giang"," Nguyen","Valerie Chen","Mohammad Reza Taesiri","Anh Totti Nguyen"],"pdf_url":"https://arxiv.org/pdf/2308.13651v5.pdf","comment":"Accepted to Transaction on Machine Learning Research 2024; 50 pages,\n 35 Figures & 17 Tables"},{"id":"http://arxiv.org/abs/2408.11836v3","updated":"2024-08-26T20:31:08Z","published":"2024-08-06T22:09:50Z","title":"Analysis of Unstructured High-Density Crowded Scenes for Crowd\n Monitoring","summary":" We are interested in developing an automated system for detection of\norganized movements in human crowds. 
Computer vision algorithms can extract\ninformation from videos of crowded scenes and automatically detect and track\ngroups of individuals undergoing organized motion that represents an anomalous\nbehavior in the context of conflict aversion. Our system can detect organized\ncohorts against the background of randomly moving objects and we can estimate\nthe number of participants in an organized cohort, the speed and direction of\nmotion in real time, within three to four video frames, which is less than one\nsecond from the onset of motion captured on a CCTV. We have performed\npreliminary analysis in this context in biological cell data containing up to\nfour thousand objects per frame and will extend this numerically to a\nhundred-fold for public safety applications.\n We envisage using the existing infrastructure of video cameras for acquiring\nimage datasets on-the-fly and deploying an easy-to-use data-driven software\nsystem for parsing of significant events by analyzing image sequences taken\ninside and outside of sports stadiums or other public venues. Other prospective\nusers are organizers of political rallies, civic and wildlife organizations,\nsecurity firms, and the military. We will optimize the performance of the\nsoftware by implementing a classification method able to distinguish between\nactivities posing a threat and those not posing a threat.\n","authors":["Alexandre Matov"],"pdf_url":"https://arxiv.org/pdf/2408.11836v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.06494v2","updated":"2024-08-26T20:10:52Z","published":"2024-08-12T21:04:16Z","title":"What Color Scheme is More Effective in Assisting Readers to Locate\n Information in a Color-Coded Article?","summary":" Color coding, a technique assigning specific colors to cluster information\ntypes, has proven advantages in aiding human cognitive activities, especially\nreading and comprehension. The rise of Large Language Models (LLMs) has\nstreamlined document coding, enabling simple automatic text labeling with\nvarious schemes. This has the potential to make color-coding more accessible\nand benefit more users. However, the impact of color choice on information\nseeking is understudied. We conducted a user study assessing various color\nschemes' effectiveness in LLM-coded text documents, standardizing contrast\nratios to approximately 5.55:1 across schemes. Participants performed timed\ninformation-seeking tasks in color-coded scholarly abstracts. Results showed\nnon-analogous and yellow-inclusive color schemes improved performance, with the\nlatter also being more preferred by participants. These findings can inform\nbetter color scheme choices for text annotation. 
As LLMs advance document\ncoding, we advocate for more research focusing on the \"color\" aspect of\ncolor-coding techniques.\n","authors":["Ho Yin Ng","Zeyu He","Ting-Hao 'Kenneth' Huang"],"pdf_url":"https://arxiv.org/pdf/2408.06494v2.pdf","comment":"This paper will appear at IEEE VIS 2024"},{"id":"http://arxiv.org/abs/2408.14606v1","updated":"2024-08-26T19:59:20Z","published":"2024-08-26T19:59:20Z","title":"BreakNet: Discontinuity-Resilient Multi-Scale Transformer Segmentation\n of Retinal Layers","summary":" Visible light optical coherence tomography (vis-OCT) is gaining traction for\nretinal imaging due to its high resolution and functional capabilities.\nHowever, the significant absorption of hemoglobin in the visible light range\nleads to pronounced shadow artifacts from retinal blood vessels, posing\nchallenges for accurate layer segmentation. In this study, we present BreakNet,\na multi-scale Transformer-based segmentation model designed to address boundary\ndiscontinuities caused by these shadow artifacts. BreakNet utilizes\nhierarchical Transformer and convolutional blocks to extract multi-scale global\nand local feature maps, capturing essential contextual, textural, and edge\ncharacteristics. The model incorporates decoder blocks that expand pathways\nto enhance the extraction of fine details and semantic information, ensuring\nprecise segmentation. Evaluated on rodent retinal images acquired with\nprototype vis-OCT, BreakNet demonstrated superior performance over\nstate-of-the-art segmentation models, such as TCCT-BP and U-Net, even when\nfaced with limited-quality ground truth data. Our findings indicate that\nBreakNet has the potential to significantly improve retinal quantification and\nanalysis.\n","authors":["Razieh Ganjee","Bingjie Wang","Lingyun Wang","Chengcheng Zhao","José-Alain Sahel","Shaohua Pi"],"pdf_url":"https://arxiv.org/pdf/2408.14606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14601v1","updated":"2024-08-26T19:44:18Z","published":"2024-08-26T19:44:18Z","title":"3D Point Cloud Network Pruning: When Some Weights Do not Matter","summary":" A point cloud is a crucial geometric data structure utilized in numerous\napplications. The adoption of deep neural networks referred to as Point Cloud\nNeural Networks (PCNNs), for processing 3D point clouds, has significantly\nadvanced fields that rely on 3D geometric data to enhance the efficiency of\ntasks. Expanding the size of both neural network models and 3D point clouds\nintroduces significant challenges in minimizing computational and memory\nrequirements. This is essential for meeting the demanding requirements of\nreal-world applications, which prioritize minimal energy consumption and low\nlatency. Therefore, investigating redundancy in PCNNs is crucial yet\nchallenging due to their sensitivity to parameters. Additionally, traditional\npruning methods face difficulties as these networks rely heavily on weights and\npoints. Nonetheless, our research reveals a promising phenomenon that could\nrefine standard PCNN pruning techniques. Our findings suggest that preserving\nonly the top p% of the highest magnitude weights is crucial for accuracy\npreservation. For example, pruning 99% of the weights from the PointNet model\nstill results in accuracy close to the base level. Specifically, in the\nModelNet40 dataset, where the base accuracy with the PointNet model was 87.5%,\npreserving only 1% of the weights still achieves an accuracy of 86.8%. 
Codes\nare available in: https://github.com/apurba-nsu-rnd-lab/PCNN_Pruning\n","authors":["Amrijit Biswas","Md. Ismail Hossain","M M Lutfe Elahi","Ali Cheraghian","Fuad Rahman","Nabeel Mohammed","Shafin Rahman"],"pdf_url":"https://arxiv.org/pdf/2408.14601v1.pdf","comment":"Accepted in BMVC 2024"},{"id":"http://arxiv.org/abs/2408.14600v1","updated":"2024-08-26T19:43:01Z","published":"2024-08-26T19:43:01Z","title":"PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing\n for 3D Object Detection","summary":" The integration of point and voxel representations is becoming more common in\nLiDAR-based 3D object detection. However, this combination often struggles with\ncapturing semantic information effectively. Moreover, relying solely on point\nfeatures within regions of interest can lead to information loss and\nlimitations in local feature representation. To tackle these challenges, we\npropose a novel two-stage 3D object detector, called Point-Voxel Attention\nFusion Network (PVAFN). PVAFN leverages an attention mechanism to improve\nmulti-modal feature fusion during the feature extraction phase. In the\nrefinement stage, it utilizes a multi-pooling strategy to integrate both\nmulti-scale and region-specific information effectively. The point-voxel\nattention mechanism adaptively combines point cloud and voxel-based\nBird's-Eye-View (BEV) features, resulting in richer object representations that\nhelp to reduce false detections. Additionally, a multi-pooling enhancement\nmodule is introduced to boost the model's perception capabilities. This module\nemploys cluster pooling and pyramid pooling techniques to efficiently capture\nkey geometric details and fine-grained shape structures, thereby enhancing the\nintegration of local and global features. Extensive experiments on the KITTI\nand Waymo datasets demonstrate that the proposed PVAFN achieves competitive\nperformance. The code and models will be available.\n","authors":["Yidi Li","Jiahao Wen","Bin Ren","Wenhao Li","Zhenhuan Xu","Hao Guo","Hong Liu","Nicu Sebe"],"pdf_url":"https://arxiv.org/pdf/2408.14600v1.pdf","comment":"3D Object Detection"},{"id":"http://arxiv.org/abs/2404.00491v2","updated":"2024-08-26T19:39:19Z","published":"2024-03-30T23:19:40Z","title":"Denoising Monte Carlo Renders with Diffusion Models","summary":" Physically-based renderings contain Monte-Carlo noise, with variance that\nincreases as the number of rays per pixel decreases. This noise, while\nzero-mean for good modern renderers, can have heavy tails (most notably, for\nscenes containing specular or refractive objects). Learned methods for\nrestoring low fidelity renders are highly developed, because suppressing render\nnoise means one can save compute and use fast renders with few rays per pixel.\nWe demonstrate that a diffusion model can denoise low fidelity renders\nsuccessfully. Furthermore, our method can be conditioned on a variety of\nnatural render information, and this conditioning helps performance.\nQuantitative experiments show that our method is competitive with SOTA across a\nrange of sampling rates. 
Qualitative examination of the reconstructions\nsuggests that the image prior applied by a diffusion method strongly favors\nreconstructions that are like real images -- so have straight shadow\nboundaries, curved specularities and no fireflies.\n","authors":["Vaibhav Vavilala","Rahul Vasanth","David Forsyth"],"pdf_url":"https://arxiv.org/pdf/2404.00491v2.pdf","comment":"25 pages, 18 figures, 2 tables"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2408.14432v1","updated":"2024-08-26T17:20:34Z","published":"2024-08-26T17:20:34Z","title":"Contextual Bandit with Herding Effects: Algorithms and Recommendation\n Applications","summary":" Contextual bandits serve as a fundamental algorithmic framework for\noptimizing recommendation decisions online. Though extensive attention has been\npaid to tailoring contextual bandits for recommendation applications, the\n\"herding effects\" in user feedback have been ignored. These herding effects\nbias user feedback toward historical ratings, breaking down the assumption of\nunbiased feedback inherent in contextual bandits. This paper develops a novel\nvariant of the contextual bandit that is tailored to address the feedback bias\ncaused by the herding effects. A user feedback model is formulated to capture\nthis feedback bias. We design the TS-Conf (Thompson Sampling under Conformity)\nalgorithm, which employs posterior sampling to balance the exploration and\nexploitation tradeoff. We prove an upper bound for the regret of the algorithm,\nrevealing the impact of herding effects on learning speed. Extensive\nexperiments on datasets demonstrate that TS-Conf outperforms four benchmark\nalgorithms. Analysis reveals that TS-Conf effectively mitigates the negative\nimpact of herding effects, resulting in faster learning and improved\nrecommendation accuracy.\n","authors":["Luyue Xu","Liming Wang","Hong Xie","Mingqiang Zhou"],"pdf_url":"https://arxiv.org/pdf/2408.14432v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14393v1","updated":"2024-08-26T16:21:50Z","published":"2024-08-26T16:21:50Z","title":"CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper\n Influence","summary":" With increasing privacy concerns in artificial intelligence, regulations have\nmandated the right to be forgotten, granting individuals the right to withdraw\ntheir data from models. Machine unlearning has emerged as a potential solution\nto enable selective forgetting in models, particularly in recommender systems\nwhere historical data contains sensitive user information. Despite recent\nadvances in recommendation unlearning, evaluating unlearning methods\ncomprehensively remains challenging due to the absence of a unified evaluation\nframework and overlooked aspects of deeper influence, e.g., fairness. To\naddress these gaps, we propose CURE4Rec, the first comprehensive benchmark for\nrecommendation unlearning evaluation. CURE4Rec covers four aspects, i.e.,\nunlearning Completeness, recommendation Utility, unleaRning efficiency, and\nrecommendation fairnEss, under three data selection strategies, i.e., core\ndata, edge data, and random data. Specifically, we consider the deeper\ninfluence of unlearning on recommendation fairness and robustness towards data\nwith varying impact levels. We construct multiple datasets with CURE4Rec\nevaluation and conduct extensive experiments on existing recommendation\nunlearning methods. 
Our code is released at\nhttps://github.com/xiye7lai/CURE4Rec.\n","authors":["Chaochao Chen","Jiaming Zhang","Yizhao Zhang","Li Zhang","Lingjuan Lyu","Yuyuan Li","Biao Gong","Chenggang Yan"],"pdf_url":"https://arxiv.org/pdf/2408.14393v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14238v1","updated":"2024-08-26T12:52:02Z","published":"2024-08-26T12:52:02Z","title":"Are LLM-based Recommenders Already the Best? Simple Scaled Cross-entropy\n Unleashes the Potential of Traditional Sequential Recommenders","summary":" Large language models (LLMs) have been garnering increasing attention in the\nrecommendation community. Some studies have observed that LLMs, when fine-tuned\nby the cross-entropy (CE) loss with a full softmax, could achieve\n`state-of-the-art' performance in sequential recommendation. However, most of\nthe baselines used for comparison are trained using a pointwise/pairwise loss\nfunction. This inconsistent experimental setting leads to the underestimation\nof traditional methods and further fosters over-confidence in the ranking\ncapability of LLMs.\n In this study, we provide theoretical justification for the superiority of\nthe cross-entropy loss by demonstrating its two desirable properties: tightness\nand coverage. Furthermore, this study sheds light on additional novel insights:\n1) Taking into account only the recommendation performance, CE is not yet\noptimal as it is not a quite tight bound in terms of some ranking metrics. 2)\nIn scenarios that full softmax cannot be performed, an effective alternative is\nto scale up the sampled normalizing term. These findings then help unleash the\npotential of traditional recommendation models, allowing them to surpass\nLLM-based counterparts. Given the substantial computational burden, existing\nLLM-based methods are not as effective as claimed for sequential\nrecommendation. We hope that these theoretical understandings in conjunction\nwith the empirical results will facilitate an objective evaluation of LLM-based\nrecommendation in the future.\n","authors":["Cong Xu","Zhangchi Zhu","Mo Yu","Jun Wang","Jianyong Wang","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.14238v1.pdf","comment":"18 pages. arXiv admin note: substantial text overlap with\n arXiv:2402.06216"},{"id":"http://arxiv.org/abs/2408.05141v2","updated":"2024-08-26T10:53:28Z","published":"2024-08-09T15:53:55Z","title":"A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning","summary":" Retrieval-augmented generation (RAG) is a framework enabling large language\nmodels (LLMs) to enhance their accuracy and reduce hallucinations by\nintegrating external knowledge bases. In this paper, we introduce a hybrid RAG\nsystem enhanced through a comprehensive suite of optimizations that\nsignificantly improve retrieval quality, augment reasoning capabilities, and\nrefine numerical computation ability. We refined the text chunks and tables in\nweb pages, added attribute predictors to reduce hallucinations, conducted LLM\nKnowledge Extractor and Knowledge Graph Extractor, and finally built a\nreasoning strategy with all the references. We evaluated our system on the CRAG\ndataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and\nonline evaluations demonstrate that our system significantly enhances complex\nreasoning capabilities. In local evaluations, we have significantly improved\naccuracy and reduced error rates compared to the baseline model, achieving a\nnotable increase in scores. 
In the meanwhile, we have attained outstanding\nresults in online assessments, demonstrating the performance and generalization\ncapabilities of the proposed system. The source code for our system is released\nin \\url{https://gitlab.aicrowd.com/shizueyy/crag-new}.\n","authors":["Ye Yuan","Chengwu Liu","Jingyang Yuan","Gongbo Sun","Siqi Li","Ming Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.05141v2.pdf","comment":"Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024"},{"id":"http://arxiv.org/abs/2408.14118v1","updated":"2024-08-26T09:06:35Z","published":"2024-08-26T09:06:35Z","title":"Towards Lifelong Learning Embeddings: An Algorithmic Approach to\n Dynamically Extend Embeddings","summary":" The rapid evolution of technology has transformed business operations and\ncustomer interactions worldwide, with personalization emerging as a key\nopportunity for e-commerce companies to engage customers more effectively. The\napplication of machine learning, particularly that of deep learning models, has\ngained significant traction due to its ability to rapidly recognize patterns in\nlarge datasets, thereby offering numerous possibilities for personalization.\nThese models use embeddings to map discrete information, such as product IDs,\ninto a latent vector space, a method increasingly popular in recent years.\nHowever, e-commerce's dynamic nature, characterized by frequent new product\nintroductions, poses challenges for these embeddings, which typically require\nfixed dimensions and inputs, leading to the need for periodic retraining from\nscratch. This paper introduces a modular algorithm that extends embedding input\nsize while preserving learned knowledge, addressing the challenges posed by\ne-commerce's dynamism. The proposed algorithm also incorporates strategies to\nmitigate the cold start problem associated with new products. The results of\ninitial experiments suggest that this method outperforms traditional\nembeddings.\n","authors":["Miguel Alves Gomes","Philipp Meisen","Tobias Meisen"],"pdf_url":"https://arxiv.org/pdf/2408.14118v1.pdf","comment":"Accepted Extended Abstract for 3rd Workshop on End-End Customer\n Journey Optimization at KDD2024, Barcelona, Spain"},{"id":"http://arxiv.org/abs/2408.08713v2","updated":"2024-08-26T03:03:47Z","published":"2024-08-16T12:51:52Z","title":"Beyond KAN: Introducing KarSein for Adaptive High-Order Feature\n Interaction Modeling in CTR Prediction","summary":" Modeling feature interactions is crucial for click-through rate (CTR)\nprediction, particularly when it comes to high-order explicit interactions.\nTraditional methods struggle with this task because they often predefine a\nmaximum interaction order, which relies heavily on prior knowledge and can\nlimit the model's effectiveness. Additionally, modeling high-order interactions\ntypically leads to increased computational costs. Therefore, the challenge lies\nin adaptively modeling high-order feature interactions while maintaining\nefficiency. To address this issue, we introduce Kolmogorov-Arnold Represented\nSparse Efficient Interaction Network (KarSein), designed to optimize both\npredictive accuracy and computational efficiency. We firstly identify\nlimitations of directly applying Kolmogorov-Arnold Networks (KAN) to CTR and\nthen introduce KarSein to overcome these issues. It features a novel\narchitecture that reduces the computational costs of KAN and supports embedding\nvectors as feature inputs. 
Additionally, KarSein employs guided symbolic\nregression to address the challenge of KAN in spontaneously learning\nmultiplicative relationships. Extensive experiments demonstrate KarSein's\nsuperior performance, achieving significant predictive accuracy with minimal\ncomputational overhead. Furthermore, KarSein maintains strong global\nexplainability while enabling the removal of redundant features, resulting in a\nsparse network structure. These advantages also position KarSein as a promising\nmethod for efficient inference.\n","authors":["Yunxiao Shi","Wujiang Xu","Mingyu Jin","Haimin Zhang","Qiang Wu","Yongfeng Zhang","Min Xu"],"pdf_url":"https://arxiv.org/pdf/2408.08713v2.pdf","comment":"KarSein for CTR"},{"id":"http://arxiv.org/abs/2408.13986v1","updated":"2024-08-26T02:36:55Z","published":"2024-08-26T02:36:55Z","title":"AgentMove: Predicting Human Mobility Anywhere Using Large Language Model\n based Agentic Framework","summary":" Human mobility prediction plays a crucial role in various real-world\napplications. Although deep learning based models have shown promising results\nover the past decade, their reliance on extensive private mobility data for\ntraining and their inability to perform zero-shot predictions, have hindered\nfurther advancements. Recently, attempts have been made to apply large language\nmodels (LLMs) to mobility prediction task. However, their performance has been\nconstrained by the absence of a systematic design of workflow. They directly\ngenerate the final output using LLMs, which limits the potential of LLMs to\nuncover complex mobility patterns and underestimates their extensive reserve of\nglobal geospatial knowledge. In this paper, we introduce AgentMove, a\nsystematic agentic prediction framework to achieve generalized mobility\nprediction for any cities worldwide. In AgentMove, we first decompose the\nmobility prediction task into three sub-tasks and then design corresponding\nmodules to complete these subtasks, including spatial-temporal memory for\nindividual mobility pattern mining, world knowledge generator for modeling the\neffects of urban structure and collective knowledge extractor for capturing the\nshared patterns among population. Finally, we combine the results of three\nmodules and conduct a reasoning step to generate the final predictions.\nExtensive experiments on mobility data from two sources in 12 cities\ndemonstrate that AgentMove outperforms the best baseline more than 8% in\nvarious metrics and it shows robust predictions with various LLMs as base and\nalso less geographical bias across cities. Codes and data can be found in\nhttps://github.com/tsinghua-fib-lab/AgentMove.\n","authors":["Jie Feng","Yuwei Du","Jie Zhao","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2408.13986v1.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2406.18747v2","updated":"2024-08-26T01:07:11Z","published":"2024-06-26T20:25:53Z","title":"A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond\n Four Stems","summary":" Despite significant recent progress across multiple subtasks of audio source\nseparation, few music source separation systems support separation beyond the\nfour-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current\nsystems that support source separation beyond this setup, most continue to rely\non an inflexible decoder setup that can only support a fixed pre-defined set of\nstems. 
Increasing stem support in these inflexible systems correspondingly\nrequires increasing computational complexity, rendering extensions of these\nsystems computationally infeasible for long-tail instruments. In this work, we\npropose Banquet, a system that allows source separation of multiple stems using\njust one decoder. A bandsplit source separation model is extended to work in a\nquery-based setup in tandem with a music instrument recognition PaSST model. On\nthe MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached\nthe performance level of the significantly more complex 6-stem Hybrid\nTransformer Demucs on VDBO stems and outperformed it on guitar and piano. The\nquery-based setup allows for the separation of narrow instrument classes such\nas clean acoustic guitars, and can be successfully applied to the extraction of\nless common stems such as reeds and organs. Implementation is available at\nhttps://github.com/kwatcharasupat/query-bandit.\n","authors":["Karn N. Watcharasupat","Alexander Lerch"],"pdf_url":"https://arxiv.org/pdf/2406.18747v2.pdf","comment":"Accepted to the 25th International Society for Music Information\n Retrieval Conference (ISMIR 2024). Camera-ready version"},{"id":"http://arxiv.org/abs/2408.14698v1","updated":"2024-08-26T23:52:27Z","published":"2024-08-26T23:52:27Z","title":"Smart Multi-Modal Search: Contextual Sparse and Dense Embedding\n Integration in Adobe Express","summary":" As user content and queries become increasingly multi-modal, the need for\neffective multi-modal search systems has grown. Traditional search systems\noften rely on textual and metadata annotations for indexed images, while\nmulti-modal embeddings like CLIP enable direct search using text and image\nembeddings. However, embedding-based approaches face challenges in integrating\ncontextual features such as user locale and recency. Building a scalable\nmulti-modal search system requires fine-tuning several components. This paper\npresents a multi-modal search architecture and a series of AB tests that\noptimize embeddings and multi-modal technologies in Adobe Express template\nsearch. We address considerations such as embedding model selection, the roles\nof embeddings in matching and ranking, and the balance between dense and sparse\nembeddings. Our iterative approach demonstrates how utilizing sparse, dense,\nand contextual features enhances short and long query search, significantly\nreduces null rates (over 70\\%), and increases click-through rates (CTR). Our\nfindings provide insights into developing robust multi-modal search systems,\nthereby enhancing relevance for complex queries.\n","authors":["Cherag Aroraa","Tracy Holloway King","Jayant Kumar","Yi Lu","Sanat Sharma","Arvind Srikantan","David Uvalle","Josep Valls-Vargas","Harsha Vardhan"],"pdf_url":"https://arxiv.org/pdf/2408.14698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14689v1","updated":"2024-08-26T23:29:03Z","published":"2024-08-26T23:29:03Z","title":"Federated User Preference Modeling for Privacy-Preserving Cross-Domain\n Recommendation","summary":" Cross-domain recommendation (CDR) aims to address the data-sparsity problem\nby transferring knowledge across domains. Existing CDR methods generally assume\nthat the user-item interaction data is shareable between domains, which leads\nto privacy leakage. Recently, some privacy-preserving CDR (PPCDR) models have\nbeen proposed to solve this problem. 
However, they primarily transfer simple\nrepresentations learned only from user-item interaction histories, overlooking\nother useful side information, leading to inaccurate user preferences.\nAdditionally, they transfer differentially private user-item interaction\nmatrices or embeddings across domains to protect privacy. However, these\nmethods offer limited privacy protection, as attackers may exploit external\ninformation to infer the original data. To address these challenges, we propose\na novel Federated User Preference Modeling (FUPM) framework. In FUPM, first, a\nnovel comprehensive preference exploration module is proposed to learn users'\ncomprehensive preferences from both interaction data and additional data\nincluding review texts and potentially positive items. Next, a private\npreference transfer module is designed to first learn differentially private\nlocal and global prototypes, and then privately transfer the global prototypes\nusing a federated learning strategy. These prototypes are generalized\nrepresentations of user groups, making it difficult for attackers to infer\nindividual information. Extensive experiments on four CDR tasks conducted on\nthe Amazon and Douban datasets validate the superiority of FUPM over SOTA\nbaselines. Code is available at https://github.com/Lili1013/FUPM.\n","authors":["Li Wang","Shoujin Wang","Quangui Zhang","Qiang Wu","Min Xu"],"pdf_url":"https://arxiv.org/pdf/2408.14689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.12276v2","updated":"2024-08-26T23:05:42Z","published":"2024-02-19T16:40:38Z","title":"Explain then Rank: Scale Calibration of Neural Rankers Using Natural\n Language Explanations from LLMs","summary":" In search settings, calibrating the scores during the ranking process to\nquantities such as click-through rates or relevance levels enhances a system's\nusefulness and trustworthiness for downstream users. While previous research\nhas improved this notion of calibration for low complexity learning-to-rank\nmodels, the larger data demands and parameter count specific to modern neural\ntext rankers produce unique obstacles that hamper the efficacy of methods\nintended for the learning-to-rank setting.\n This paper proposes exploiting large language models (LLMs) to provide\nrelevance and uncertainty signals for these neural text rankers to produce\nscale-calibrated scores through Monte Carlo sampling of natural language\nexplanations (NLEs). Our approach transforms the neural ranking task from\nranking textual query-document pairs to ranking corresponding synthesized NLEs.\nComprehensive experiments on two popular document ranking datasets show that\nthe NLE-based calibration approach consistently outperforms past calibration\nmethods and LLM-based methods for ranking, calibration, and query performance\nprediction tasks.\n","authors":["Puxuan Yu","Daniel Cohen","Hemank Lamba","Joel Tetreault","Alex Jaimes"],"pdf_url":"https://arxiv.org/pdf/2402.12276v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14678v1","updated":"2024-08-26T23:01:48Z","published":"2024-08-26T23:01:48Z","title":"Bridging the Gap: Unpacking the Hidden Challenges in Knowledge\n Distillation for Online Ranking Systems","summary":" Knowledge Distillation (KD) is a powerful approach for compressing a large\nmodel into a smaller, more efficient model, particularly beneficial for\nlatency-sensitive applications like recommender systems. 
However, current KD\nresearch predominantly focuses on Computer Vision (CV) and NLP tasks,\noverlooking unique data characteristics and challenges inherent to recommender\nsystems. This paper addresses these overlooked challenges, specifically: (1)\nmitigating data distribution shifts between teacher and student models, (2)\nefficiently identifying optimal teacher configurations within time and\nbudgetary constraints, and (3) enabling computationally efficient and rapid\nsharing of teacher labels to support multiple students. We present a robust KD\nsystem developed and rigorously evaluated on multiple large-scale personalized\nvideo recommendation systems within Google. Our live experiment results\ndemonstrate significant improvements in student model performance while\nensuring consistent and reliable generation of high quality teacher labels from\na continuous data stream of data.\n","authors":["Nikhil Khani","Shuo Yang","Aniruddh Nath","Yang Liu","Pendo Abbo","Li Wei","Shawn Andrews","Maciej Kula","Jarrod Kahn","Zhe Zhao","Lichan Hong","Ed Chi"],"pdf_url":"https://arxiv.org/pdf/2408.14678v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14658v1","updated":"2024-08-26T21:47:49Z","published":"2024-08-26T21:47:49Z","title":"KGPrune: a Web Application to Extract Subgraphs of Interest from\n Wikidata with Analogical Pruning","summary":" Knowledge graphs (KGs) have become ubiquitous publicly available knowledge\nsources, and are nowadays covering an ever increasing array of domains.\nHowever, not all knowledge represented is useful or pertaining when considering\na new application or specific task. Also, due to their increasing size,\nhandling large KGs in their entirety entails scalability issues. These two\naspects asks for efficient methods to extract subgraphs of interest from\nexisting KGs. To this aim, we introduce KGPrune, a Web Application that, given\nseed entities of interest and properties to traverse, extracts their\nneighboring subgraphs from Wikidata. To avoid topical drift, KGPrune relies on\na frugal pruning algorithm based on analogical reasoning to only keep relevant\nneighbors while pruning irrelevant ones. The interest of KGPrune is illustrated\nby two concrete applications, namely, bootstrapping an enterprise KG and\nextracting knowledge related to looted artworks.\n","authors":["Pierre Monnin","Cherif-Hassan Nousradine","Lucas Jarnac","Laurel Zuckerman","Miguel Couceiro"],"pdf_url":"https://arxiv.org/pdf/2408.14658v1.pdf","comment":"Accepted as a demo paper at ECAI 2024"},{"id":"http://arxiv.org/abs/2408.14636v1","updated":"2024-08-26T21:00:25Z","published":"2024-08-26T21:00:25Z","title":"Relationships are Complicated! An Analysis of Relationships Between\n Datasets on the Web","summary":" The Web today has millions of datasets, and the number of datasets continues\nto grow at a rapid pace. These datasets are not standalone entities; rather,\nthey are intricately connected through complex relationships. Semantic\nrelationships between datasets provide critical insights for research and\ndecision-making processes. In this paper, we study dataset relationships from\nthe perspective of users who discover, use, and share datasets on the Web: what\nrelationships are important for different tasks? What contextual information\nmight users want to know? We first present a comprehensive taxonomy of\nrelationships between datasets on the Web and map these relationships to user\ntasks performed during dataset discovery. 
We develop a series of methods to\nidentify these relationships and compare their performance on a large corpus of\ndatasets generated from Web pages with schema.org markup. We demonstrate that\nmachine-learning based methods that use dataset metadata achieve multi-class\nclassification accuracy of 90%. Finally, we highlight gaps in available\nsemantic markup for datasets and discuss how incorporating comprehensive\nsemantics can facilitate the identification of dataset relationships. By\nproviding a comprehensive overview of dataset relationships at scale, this\npaper sets a benchmark for future research.\n","authors":["Kate Lin","Tarfah Alrashed","Natasha Noy"],"pdf_url":"https://arxiv.org/pdf/2408.14636v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14623v1","updated":"2024-08-26T20:36:52Z","published":"2024-08-26T20:36:52Z","title":"MODOC: A Modular Interface for Flexible Interlinking of Text Retrieval\n and Text Generation Functions","summary":" Large Language Models (LLMs) produce eloquent texts but often the content\nthey generate needs to be verified. Traditional information retrieval systems\ncan assist with this task, but most systems have not been designed with\nLLM-generated queries in mind. As such, there is a compelling need for\nintegrated systems that provide both retrieval and generation functionality\nwithin a single user interface.\n We present MODOC, a modular user interface that leverages the capabilities of\nLLMs and provides assistance with detecting their confabulations, promoting\nintegrity in scientific writing. MODOC represents a significant step forward in\nscientific writing assistance. Its modular architecture supports flexible\nfunctions for retrieving information and for writing and generating text in a\nsingle, user-friendly interface.\n","authors":["Yingqiang Gao","Jhony Prada","Nianlong Gu","Jessica Lam","Richard H. R. Hahnloser"],"pdf_url":"https://arxiv.org/pdf/2408.14623v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2408.14471v1","updated":"2024-08-26T17:59:01Z","published":"2024-08-26T17:59:01Z","title":"A Practitioner's Guide to Continual Multimodal Pretraining","summary":" Multimodal foundation models serve numerous applications at the intersection\nof vision and language. Still, despite being pretrained on extensive data, they\nbecome outdated over time. To keep models updated, research into continual\npretraining mainly explores scenarios with either (1) infrequent,\nindiscriminate updates on large-scale new data, or (2) frequent, sample-level\nupdates. However, practical model deployment often operates in the gap between\nthese two limit cases, as real-world applications often demand adaptation to\nspecific subdomains, tasks or concepts -- spread over the entire, varying life\ncycle of a model. In this work, we complement current perspectives on continual\npretraining through a research test bed as well as provide comprehensive\nguidance for effective continual model updates in such scenarios. We first\nintroduce FoMo-in-Flux, a continual multimodal pretraining benchmark with\nrealistic compute constraints and practical deployment requirements,\nconstructed over 63 datasets with diverse visual and semantic coverage. 
Using\nFoMo-in-Flux, we explore the complex landscape of practical continual\npretraining through multiple perspectives: (1) A data-centric investigation of\ndata mixtures and stream orderings that emulate real-world deployment\nsituations, (2) a method-centric investigation ranging from simple fine-tuning\nand traditional continual learning strategies to parameter-efficient updates\nand model merging, (3) meta learning rate schedules and mechanistic design\nchoices, and (4) the influence of model and compute scaling. Together, our\ninsights provide a practitioner's guide to continual multimodal pretraining for\nreal-world deployment. Our benchmark and code is here:\nhttps://github.com/ExplainableML/fomo_in_flux.\n","authors":["Karsten Roth","Vishaal Udandarao","Sebastian Dziadzio","Ameya Prabhu","Mehdi Cherti","Oriol Vinyals","Olivier Hénaff","Samuel Albanie","Matthias Bethge","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2408.14471v1.pdf","comment":"Technical Report. 52 pages"},{"id":"http://arxiv.org/abs/2408.14461v1","updated":"2024-08-26T17:50:47Z","published":"2024-08-26T17:50:47Z","title":"A domain decomposition-based autoregressive deep learning model for\n unsteady and nonlinear partial differential equations","summary":" In this paper, we propose a domain-decomposition-based deep learning (DL)\nframework, named transient-CoMLSim, for accurately modeling unsteady and\nnonlinear partial differential equations (PDEs). The framework consists of two\nkey components: (a) a convolutional neural network (CNN)-based autoencoder\narchitecture and (b) an autoregressive model composed of fully connected\nlayers. Unlike existing state-of-the-art methods that operate on the entire\ncomputational domain, our CNN-based autoencoder computes a lower-dimensional\nbasis for solution and condition fields represented on subdomains. Timestepping\nis performed entirely in the latent space, generating embeddings of the\nsolution variables from the time history of embeddings of solution and\ncondition variables. This approach not only reduces computational complexity\nbut also enhances scalability, making it well-suited for large-scale\nsimulations. Furthermore, to improve the stability of our rollouts, we employ a\ncurriculum learning (CL) approach during the training of the autoregressive\nmodel. The domain-decomposition strategy enables scaling to out-of-distribution\ndomain sizes while maintaining the accuracy of predictions -- a feature not\neasily integrated into popular DL-based approaches for physics simulations. We\nbenchmark our model against two widely-used DL architectures, Fourier Neural\nOperator (FNO) and U-Net, and demonstrate that our framework outperforms them\nin terms of accuracy, extrapolation to unseen timesteps, and stability for a\nwide range of use cases.\n","authors":["Sheel Nidhan","Haoliang Jiang","Lalit Ghule","Clancy Umphrey","Rishikesh Ranade","Jay Pathak"],"pdf_url":"https://arxiv.org/pdf/2408.14461v1.pdf","comment":"26 pages"},{"id":"http://arxiv.org/abs/2408.11796v2","updated":"2024-08-26T17:50:46Z","published":"2024-08-21T17:38:48Z","title":"LLM Pruning and Distillation in Practice: The Minitron Approach","summary":" We present a comprehensive report on compressing the Llama 3.1 8B and Mistral\nNeMo 12B models to 4B and 8B parameters, respectively, using pruning and\ndistillation. 
We explore two distinct pruning strategies: (1) depth pruning and\n(2) joint hidden/attention/MLP (width) pruning, and evaluate the results on\ncommon benchmarks from the LM Evaluation Harness. The models are then aligned\nwith NeMo Aligner and tested in instruct-tuned versions. This approach produces\na compelling 4B model from Llama 3.1 8B and a state-of-the-art\nMistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo\n12B. We found that with no access to the original data, it is beneficial to\nslightly fine-tune teacher models on the distillation dataset. We open-source\nour base model weights on Hugging Face with a permissive license.\n","authors":["Sharath Turuvekere Sreenivas","Saurav Muralidharan","Raviraj Joshi","Marcin Chochowski","Mostofa Patwary","Mohammad Shoeybi","Bryan Catanzaro","Jan Kautz","Pavlo Molchanov"],"pdf_url":"https://arxiv.org/pdf/2408.11796v2.pdf","comment":"v2: Added missing references. Cleaned up runtime performance section"},{"id":"http://arxiv.org/abs/2408.14453v1","updated":"2024-08-26T17:48:42Z","published":"2024-08-26T17:48:42Z","title":"Reconstructing physiological signals from fMRI across the adult lifespan","summary":" Interactions between the brain and body are of fundamental importance for\nhuman behavior and health. Functional magnetic resonance imaging (fMRI)\ncaptures whole-brain activity noninvasively, and modeling how fMRI signals\ninteract with physiological dynamics of the body can provide new insight into\nbrain function and offer potential biomarkers of disease. However,\nphysiological recordings are not always possible to acquire since they require\nextra equipment and setup, and even when they are, the recorded physiological\nsignals may contain substantial artifacts. To overcome this limitation, machine\nlearning models have been proposed to directly extract features of respiratory\nand cardiac activity from resting-state fMRI signals. To date, such work has\nbeen carried out only in healthy young adults and in a pediatric population,\nleaving open questions about the efficacy of these approaches on older adults.\nHere, we propose a novel framework that leverages Transformer-based\narchitectures for reconstructing two key physiological signals - low-frequency\nrespiratory volume (RV) and heart rate (HR) fluctuations - from fMRI data, and\ntest these models on a dataset of individuals aged 36-89 years old. Our\nframework outperforms previously proposed approaches (attaining median\ncorrelations between predicted and measured signals of r ~ .698 for RV and r ~\n.618 for HR), indicating the potential of leveraging attention mechanisms to\nmodel fMRI-physiological signal relationships. We also evaluate several model\ntraining and fine-tuning strategies, and find that incorporating young-adult\ndata during training improves the performance when predicting physiological\nsignals in the aging cohort. Overall, our approach successfully infers key\nphysiological variables directly from fMRI data from individuals across a wide\nrange of the adult lifespan.\n","authors":["Shiyu Wang","Ziyuan Xu","Yamin Li","Mara Mather","Roza G. Bayrak","Catie Chang"],"pdf_url":"https://arxiv.org/pdf/2408.14453v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14445v1","updated":"2024-08-26T17:36:51Z","published":"2024-08-26T17:36:51Z","title":"Symmetry & Critical Points","summary":" Critical points of an invariant function may or may not be symmetric. 
We\nprove, however, that if a symmetric critical point exists, those adjacent to it\nare generically symmetry breaking. This mathematical mechanism is shown to\ncarry important implications for our ability to efficiently minimize invariant\nnonconvex functions, in particular those associated with neural networks.\n","authors":["Yossi Arjevani"],"pdf_url":"https://arxiv.org/pdf/2408.14445v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14442v1","updated":"2024-08-26T17:35:01Z","published":"2024-08-26T17:35:01Z","title":"Model Parallel Training and Transfer Learning for Convolutional Neural\n Networks by Domain Decomposition","summary":" Deep convolutional neural networks (CNNs) have been shown to be very\nsuccessful in a wide range of image processing applications. However, due to\ntheir increasing number of model parameters and an increasing availability of\nlarge amounts of training data, parallelization strategies to efficiently train\ncomplex CNNs are necessary. In previous work by the authors, a novel model\nparallel CNN architecture was proposed which is loosely inspired by domain\ndecomposition. In particular, the novel network architecture is based on a\ndecomposition of the input data into smaller subimages. For each of these\nsubimages, local CNNs with a proportionally smaller number of parameters are\ntrained in parallel and the resulting local classifications are then aggregated\nin a second step by a dense feedforward neural network (DNN). In the present\nwork, we compare the resulting CNN-DNN architecture to less costly alternatives\nto combine the local classifications into a final, global decision.\nAdditionally, we investigate the performance of the CNN-DNN trained as one\ncoherent model as well as using a transfer learning strategy, where the\nparameters of the pre-trained local CNNs are used as initial values for a\nsubsequently trained global coherent CNN-DNN model.\n","authors":["Axel Klawonn","Martin Lanser","Janine Weber"],"pdf_url":"https://arxiv.org/pdf/2408.14442v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13840v3","updated":"2024-08-26T17:34:44Z","published":"2023-06-24T02:25:56Z","title":"Beyond Scale: The Diversity Coefficient as a Data Quality Metric for\n Variability in Natural Language Data","summary":" Current trends in pre-training Large Language Models (LLMs) primarily focus\non the scaling of model and dataset size. While the quality of pre-training\ndata is considered an important factor for training powerful LLMs, it remains a\nnebulous concept that has not been rigorously characterized. To this end, we\npropose a formalization of one key aspect of data quality -- measuring the\nvariability of natural language data -- specifically via a measure we call the\ndiversity coefficient. Our empirical analysis shows that the proposed diversity\ncoefficient aligns with the intuitive properties of diversity and variability,\ne.g., it increases as the number of latent concepts increases. Then, we measure\nthe diversity coefficient of publicly available pre-training datasets and\ndemonstrate that their formal diversity is high compared to theoretical lower\nand upper bounds. Finally, we conduct a comprehensive set of controlled\ninterventional experiments with GPT-2 and LLaMAv2 that demonstrate the\ndiversity coefficient of pre-training data characterizes useful aspects of\ndownstream model evaluation performance -- totaling 44 models of various sizes\n(51M to 7B parameters). 
We conclude that our formal notion of diversity is an\nimportant aspect of data quality that captures variability and causally leads\nto improved evaluation performance.\n","authors":["Brando Miranda","Alycia Lee","Sudharsan Sundar","Allison Casasola","Sanmi Koyejo"],"pdf_url":"https://arxiv.org/pdf/2306.13840v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.10844v2","updated":"2024-08-26T17:31:16Z","published":"2024-07-15T15:59:39Z","title":"Improved Uncertainty Estimation of Graph Neural Network Potentials Using\n Engineered Latent Space Distances","summary":" Graph neural networks (GNNs) have been shown to be astonishingly capable\nmodels for molecular property prediction, particularly as surrogates for\nexpensive density functional theory calculations of relaxed energy for novel\nmaterial discovery. However, one limitation of GNNs in this context is the lack\nof useful uncertainty prediction methods, as this is critical to the material\ndiscovery pipeline. In this work, we show that uncertainty quantification for\nrelaxed energy calculations is more complex than uncertainty quantification for\nother kinds of molecular property prediction, due to the effect that structure\noptimizations have on the error distribution. We propose that distribution-free\ntechniques are more useful tools for assessing calibration, recalibrating, and\ndeveloping uncertainty prediction methods for GNNs performing relaxed energy\ncalculations. We also develop a relaxed energy task for evaluating uncertainty\nmethods for equivariant GNNs, based on distribution-free recalibration and\nusing the Open Catalyst Project dataset. We benchmark a set of popular\nuncertainty prediction methods on this task, and show that latent distance\nmethods, with our novel improvements, are the most well-calibrated and\neconomical approach for relaxed energy calculations. Finally, we demonstrate\nthat our latent space distance method produces results which align with our\nexpectations on a clustering example, and on specific equation of state and\nadsorbate coverage examples from outside the training dataset.\n","authors":["Joseph Musielewicz","Janice Lan","Matt Uyttendaele","John R. Kitchin"],"pdf_url":"https://arxiv.org/pdf/2407.10844v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.10468v3","updated":"2024-08-26T17:28:23Z","published":"2024-08-20T00:40:49Z","title":"Tracing Privacy Leakage of Language Models to Training Data via Adjusted\n Influence Functions","summary":" The responses generated by Large Language Models (LLMs) can include sensitive\ninformation from individuals and organizations, leading to potential privacy\nleakage. This work implements Influence Functions (IFs) to trace privacy\nleakage back to the training data, thereby mitigating privacy concerns of\nLanguage Models (LMs). However, we notice that current IFs struggle to\naccurately estimate the influence of tokens with large gradient norms,\npotentially overestimating their influence. When tracing the most influential\nsamples, this leads to frequently tracing back to samples with large gradient\nnorm tokens, overshadowing the actual most influential samples even if their\ninfluences are well estimated. To address this issue, we propose Heuristically\nAdjusted IF (HAIF), which reduces the weight of tokens with large gradient\nnorms, thereby significantly improving the accuracy of tracing the most\ninfluential samples. 
To establish easily obtained groundtruth for tracing\nprivacy leakage, we construct two datasets, PII-E and PII-CR, representing two\ndistinct scenarios: one with identical text in the model outputs and\npre-training data, and the other where models leverage their reasoning\nabilities to generate text divergent from pre-training data. HAIF significantly\nimproves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E\ndataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA\nIFs against various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs\non real-world pretraining data CLUECorpus2020, demonstrating strong robustness\nregardless prompt and response lengths.\n","authors":["Jinxin Liu","Zao Yang"],"pdf_url":"https://arxiv.org/pdf/2408.10468v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14435v1","updated":"2024-08-26T17:21:54Z","published":"2024-08-26T17:21:54Z","title":"Social perception of faces in a vision-language model","summary":" We explore social perception of human faces in CLIP, a widely used\nopen-source vision-language model. To this end, we compare the similarity in\nCLIP embeddings between different textual prompts and a set of face images. Our\ntextual prompts are constructed from well-validated social psychology terms\ndenoting social perception. The face images are synthetic and are\nsystematically and independently varied along six dimensions: the legally\nprotected attributes of age, gender, and race, as well as facial expression,\nlighting, and pose. Independently and systematically manipulating face\nattributes allows us to study the effect of each on social perception and\navoids confounds that can occur in wild-collected data due to uncontrolled\nsystematic correlations between attributes. Thus, our findings are experimental\nrather than observational. Our main findings are three. First, while CLIP is\ntrained on the widest variety of images and texts, it is able to make\nfine-grained human-like social judgments on face images. Second, age, gender,\nand race do systematically impact CLIP's social perception of faces, suggesting\nan undesirable bias in CLIP vis-a-vis legally protected attributes. Most\nstrikingly, we find a strong pattern of bias concerning the faces of Black\nwomen, where CLIP produces extreme values of social perception across different\nages and facial expressions. Third, facial expression impacts social perception\nmore than age and lighting as much as age. The last finding predicts that\nstudies that do not control for unprotected visual attributes may reach the\nwrong conclusions on bias. Our novel method of investigation, which is founded\non the social psychology literature and on the experiments involving the\nmanipulation of individual attributes, yields sharper and more reliable\nobservations than previous observational methods and may be applied to study\nbiases in any vision-language model.\n","authors":["Carina I. Hausladen","Manuel Knott","Colin F. Camerer","Pietro Perona"],"pdf_url":"https://arxiv.org/pdf/2408.14435v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14434v1","updated":"2024-08-26T17:21:19Z","published":"2024-08-26T17:21:19Z","title":"Employing Artificial Intelligence to Steer Exascale Workflows with\n Colmena","summary":" Computational workflows are a common class of application on supercomputers,\nyet the loosely coupled and heterogeneous nature of workflows often fails to\ntake full advantage of their capabilities. 
We created Colmena to leverage the\nmassive parallelism of a supercomputer by using Artificial Intelligence (AI) to\nlearn from and adapt a workflow as it executes. Colmena allows scientists to\ndefine how their application should respond to events (e.g., task completion)\nas a series of cooperative agents. In this paper, we describe the design of\nColmena, the challenges we overcame while deploying applications on exascale\nsystems, and the science workflows we have enhanced through interweaving AI.\nThe scaling challenges we discuss include developing steering strategies that\nmaximize node utilization, introducing data fabrics that reduce communication\noverhead of data-intensive tasks, and implementing workflow tasks that cache\ncostly operations between invocations. These innovations coupled with a variety\nof application patterns accessible through our agent-based steering model have\nenabled science advances in chemistry, biophysics, and materials science using\ndifferent types of AI. Our vision is that Colmena will spur creative solutions\nthat harness AI across many domains of scientific computing.\n","authors":["Logan Ward","J. Gregory Pauloski","Valerie Hayot-Sasson","Yadu Babuji","Alexander Brace","Ryan Chard","Kyle Chard","Rajeev Thakur","Ian Foster"],"pdf_url":"https://arxiv.org/pdf/2408.14434v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14432v1","updated":"2024-08-26T17:20:34Z","published":"2024-08-26T17:20:34Z","title":"Contextual Bandit with Herding Effects: Algorithms and Recommendation\n Applications","summary":" Contextual bandits serve as a fundamental algorithmic framework for\noptimizing recommendation decisions online. Though extensive attention has been\npaid to tailoring contextual bandits for recommendation applications, the\n\"herding effects\" in user feedback have been ignored. These herding effects\nbias user feedback toward historical ratings, breaking down the assumption of\nunbiased feedback inherent in contextual bandits. This paper develops a novel\nvariant of the contextual bandit that is tailored to address the feedback bias\ncaused by the herding effects. A user feedback model is formulated to capture\nthis feedback bias. We design the TS-Conf (Thompson Sampling under Conformity)\nalgorithm, which employs posterior sampling to balance the exploration and\nexploitation tradeoff. We prove an upper bound for the regret of the algorithm,\nrevealing the impact of herding effects on learning speed. Extensive\nexperiments on datasets demonstrate that TS-Conf outperforms four benchmark\nalgorithms. Analysis reveals that TS-Conf effectively mitigates the negative\nimpact of herding effects, resulting in faster learning and improved\nrecommendation accuracy.\n","authors":["Luyue Xu","Liming Wang","Hong Xie","Mingqiang Zhou"],"pdf_url":"https://arxiv.org/pdf/2408.14432v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.03341v4","updated":"2024-08-26T17:12:07Z","published":"2024-06-05T14:58:32Z","title":"Tackling GenAI Copyright Issues: Originality Estimation and\n Genericization","summary":" The rapid progress of generative AI technology has sparked significant\ncopyright concerns, leading to numerous lawsuits filed against AI developers.\nWhile various techniques for mitigating copyright issues have been studied,\nsignificant risks remain. Here, we propose a genericization method that\nmodifies the outputs of a generative model to make them more generic and less\nlikely to infringe copyright. 
To achieve this, we introduce a metric for\nquantifying the level of originality of data in a manner that is consistent\nwith the legal framework. This metric can be practically estimated by drawing\nsamples from a generative model, which is then used for the genericization\nprocess. As a practical implementation, we introduce PREGen, which combines our\ngenericization method with an existing mitigation technique. Experiments\ndemonstrate that our genericization method successfully modifies the output of\na text-to-image generative model so that it produces more generic,\ncopyright-compliant images. Compared to the existing method, PREGen reduces the\nlikelihood of generating copyrighted characters by more than half when the\nnames of copyrighted characters are used as the prompt, dramatically improving\nthe performance. Additionally, while generative models can produce copyrighted\ncharacters even when their names are not directly mentioned in the prompt,\nPREGen almost entirely prevents the generation of such characters in these\ncases.\n","authors":["Hiroaki Chiba-Okabe","Weijie J. Su"],"pdf_url":"https://arxiv.org/pdf/2406.03341v4.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2405.12295v3","updated":"2024-08-26T17:10:41Z","published":"2024-05-20T18:01:15Z","title":"Efficient Model-Stealing Attacks Against Inductive Graph Neural Networks","summary":" Graph Neural Networks (GNNs) are recognized as potent tools for processing\nreal-world data organized in graph structures. Especially inductive GNNs, which\nallow for the processing of graph-structured data without relying on predefined\ngraph structures, are becoming increasingly important in a wide range of\napplications. As such these networks become attractive targets for\nmodel-stealing attacks where an adversary seeks to replicate the functionality\nof the targeted network. Significant efforts have been devoted to developing\nmodel-stealing attacks that extract models trained on images and texts.\nHowever, little attention has been given to stealing GNNs trained on graph\ndata. This paper identifies a new method of performing unsupervised\nmodel-stealing attacks against inductive GNNs, utilizing graph contrastive\nlearning and spectral graph augmentations to efficiently extract information\nfrom the targeted model. The new type of attack is thoroughly evaluated on six\ndatasets and the results show that our approach outperforms the current\nstate-of-the-art by Shen et al. (2021). In particular, our attack surpasses the\nbaseline across all benchmarks, attaining superior fidelity and downstream\naccuracy of the stolen model while necessitating fewer queries directed toward\nthe target model.\n","authors":["Marcin Podhajski","Jan Dubiński","Franziska Boenisch","Adam Dziedzic","Agnieszka Pregowska And Tomasz Michalak"],"pdf_url":"https://arxiv.org/pdf/2405.12295v3.pdf","comment":"Accepted at ECAI - 27TH EUROPEAN CONFERENCE ON ARTIFICIAL\n INTELLIGENCE"},{"id":"http://arxiv.org/abs/2408.14421v1","updated":"2024-08-26T17:04:52Z","published":"2024-08-26T17:04:52Z","title":"Evaluating saliency scores in point clouds of natural environments by\n learning surface anomalies","summary":" In recent years, three-dimensional point clouds are used increasingly to\ndocument natural environments. Each dataset contains a diverse set of objects,\nat varying shapes and sizes, distributed throughout the data and intricately\nintertwined with the topography. 
Therefore, regions of interest are difficult\nto find and consequent analyses become a challenge. Inspired from visual\nperception principles, we propose to differentiate objects of interest from the\ncluttered environment by evaluating how much they stand out from their\nsurroundings, i.e., their geometric salience. Previous saliency detection\napproaches suggested mostly handcrafted attributes for the task. However, such\nmethods fail when the data are too noisy or have high levels of texture. Here\nwe propose a learning-based mechanism that accommodates noise and textured\nsurfaces. We assume that within the natural environment any change from the\nprevalent surface would suggest a salient object. Thus, we first learn the\nunderlying surface and then search for anomalies within it. Initially, a deep\nneural network is trained to reconstruct the surface. Regions where the\nreconstructed part deviates significantly from the original point cloud yield a\nsubstantial reconstruction error, signifying an anomaly, i.e., saliency. We\ndemonstrate the effectiveness of the proposed approach by searching for salient\nfeatures in various natural scenarios, which were acquired by different\nacquisition platforms. We show the strong correlation between the\nreconstruction error and salient objects.\n","authors":["Reuma Arav","Dennis Wittich","Franz Rottensteiner"],"pdf_url":"https://arxiv.org/pdf/2408.14421v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14416v1","updated":"2024-08-26T17:03:14Z","published":"2024-08-26T17:03:14Z","title":"Hyperdimensional Computing Empowered Federated Foundation Model over\n Wireless Networks for Metaverse","summary":" The Metaverse, a burgeoning collective virtual space merging augmented\nreality and persistent virtual worlds, necessitates advanced artificial\nintelligence (AI) and communication technologies to support immersive and\ninteractive experiences. Federated learning (FL) has emerged as a promising\ntechnique for collaboratively training AI models while preserving data privacy.\nHowever, FL faces challenges such as high communication overhead and\nsubstantial computational demands, particularly for neural network (NN) models.\nTo address these issues, we propose an integrated federated split learning and\nhyperdimensional computing (FSL-HDC) framework for emerging foundation models.\nThis novel approach reduces communication costs, computation load, and privacy\nrisks, making it particularly suitable for resource-constrained edge devices in\nthe Metaverse, ensuring real-time responsive interactions. Additionally, we\nintroduce an optimization algorithm that concurrently optimizes transmission\npower and bandwidth to minimize the maximum transmission time among all users\nto the server. The simulation results based on the MNIST dataset indicate that\nFSL-HDC achieves an accuracy rate of approximately 87.5%, which is slightly\nlower than that of FL-HDC. However, FSL-HDC exhibits a significantly faster\nconvergence speed, approximately 3.733x that of FSL-NN, and demonstrates\nrobustness to non-IID data distributions. 
Moreover, our proposed optimization\nalgorithm can reduce the maximum transmission time by up to 64% compared with\nthe baseline.\n","authors":["Yahao Ding","Wen Shang","Minrui Xu","Zhaohui Yang","Ye Hu","Dusit Niyato","Mohammad Shikh-Bahaei"],"pdf_url":"https://arxiv.org/pdf/2408.14416v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14415v1","updated":"2024-08-26T17:02:25Z","published":"2024-08-26T17:02:25Z","title":"LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation","summary":" Mamba, a State Space Model (SSM), has recently shown competitive performance\nto Convolutional Neural Networks (CNNs) and Transformers in Natural Language\nProcessing and general sequence modeling. Various attempts have been made to\nadapt Mamba to Computer Vision tasks, including medical image segmentation\n(MIS). Vision Mamba (VM)-based networks are particularly attractive due to\ntheir ability to achieve global receptive fields, similar to Vision\nTransformers, while also maintaining linear complexity in the number of tokens.\nHowever, the existing VM models still struggle to maintain both spatially local\nand global dependencies of tokens in high dimensional arrays due to their\nsequential nature. Employing multiple and/or complicated scanning strategies is\ncomputationally costly, which hinders applications of SSMs to high-dimensional\n2D and 3D images that are common in MIS problems. In this work, we propose\nLocal-Global Vision Mamba, LoG-VMamba, that explicitly enforces spatially\nadjacent tokens to remain nearby on the channel axis, and retains the global\ncontext in a compressed form. Our method allows the SSMs to access the local\nand global contexts even before reaching the last token while requiring only a\nsimple scanning strategy. Our segmentation models are computationally efficient\nand substantially outperform both CNN and Transformers-based baselines on a\ndiverse set of 2D and 3D MIS tasks. The implementation of LoG-VMamba is\navailable at \\url{https://github.com/Oulu-IMEDS/LoG-VMamba}.\n","authors":["Trung Dinh Quoc Dang","Huy Hoang Nguyen","Aleksei Tiulpin"],"pdf_url":"https://arxiv.org/pdf/2408.14415v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2407.03194v5","updated":"2024-08-26T16:57:34Z","published":"2024-07-03T15:26:02Z","title":"Prediction Instability in Machine Learning Ensembles","summary":" In machine learning ensembles predictions from multiple models are\naggregated. Despite widespread use and strong performance of ensembles in\napplied problems little is known about the mathematical properties of\naggregating models and associated consequences for safe, explainable use of\nsuch models. In this paper we prove a theorem that shows that any ensemble will\nexhibit at least one of the following forms of prediction instability. It will\neither ignore agreement among all underlying models, change its mind when none\nof the underlying models have done so, or be manipulable through inclusion or\nexclusion of options it would never actually predict. As a consequence,\nensemble aggregation procedures will always need to balance the benefits of\ninformation use against the risk of these prediction instabilities. This\nanalysis also sheds light on what specific forms of prediction instability to\nexpect from particular ensemble algorithms; for example popular tree ensembles\nlike random forest, or xgboost will violate basic, intuitive fairness\nproperties. 
Finally, we show that this can be ameliorated by using consistent\nmodels in asymptotic conditions.\n","authors":["Jeremy Kedziora"],"pdf_url":"https://arxiv.org/pdf/2407.03194v5.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2408.14407v1","updated":"2024-08-26T16:49:42Z","published":"2024-08-26T16:49:42Z","title":"Spectrally Informed Learning of Fluid Flows","summary":" Accurate and efficient fluid flow models are essential for applications\nrelating to many physical phenomena including geophysical, aerodynamic, and\nbiological systems. While these flows may exhibit rich and multiscale dynamics,\nin many cases underlying low-rank structures exist which describe the bulk of\nthe motion. These structures tend to be spatially large and temporally slow,\nand may contain most of the energy in a given flow. The extraction and\nparsimonious representation of these low-rank dynamics from high-dimensional\ndata is a key challenge. Inspired by the success of physics-informed machine\nlearning methods, we propose a spectrally-informed approach to extract low-rank\nmodels of fluid flows by leveraging known spectral properties in the learning\nprocess. We incorporate this knowledge by imposing regularizations on the\nlearned dynamics, which bias the training process towards learning\nlow-frequency structures with corresponding higher power. We demonstrate the\neffectiveness of this method to improve prediction and produce learned models\nwhich better match the underlying spectral properties of prototypical fluid\nflows.\n","authors":["Benjamin D. Shaffer","Jeremy R. Vorenberg","M. Ani Hsieh"],"pdf_url":"https://arxiv.org/pdf/2408.14407v1.pdf","comment":"13 pages, 10 figures"},{"id":"http://arxiv.org/abs/2403.05720v2","updated":"2024-08-26T16:48:08Z","published":"2024-03-08T23:17:55Z","title":"A Dataset and Benchmark for Hospital Course Summarization with Adapted\n Large Language Models","summary":" Brief hospital course (BHC) summaries are clinical documents that summarize a\npatient's hospital stay. While large language models (LLMs) depict remarkable\ncapabilities in automating real-world tasks, their capabilities for healthcare\napplications such as synthesizing BHCs from clinical notes have not been shown.\nWe introduce a novel pre-processed dataset, the MIMIC-IV-BHC, encapsulating\nclinical note and brief hospital course (BHC) pairs to adapt LLMs for BHC\nsynthesis. Furthermore, we introduce a benchmark of the summarization\nperformance of two general-purpose LLMs and three healthcare-adapted LLMs.\n Using clinical notes as input, we apply prompting-based (using in-context\nlearning) and fine-tuning-based adaptation strategies to three open-source LLMs\n(Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5,\nGPT-4). We evaluate these LLMs across multiple context-length inputs using\nnatural language similarity metrics. We further conduct a clinical study with\nfive clinicians, comparing clinician-written and LLM-generated BHCs across 30\nsamples, focusing on their potential to enhance clinical decision-making\nthrough improved summary quality. We observe that the Llama2-13B fine-tuned LLM\noutperforms other domain-adapted models given quantitative evaluation metrics\nof BLEU and BERT-Score. 
GPT-4 with in-context learning shows more robustness to\nincreasing context lengths of clinical note inputs than fine-tuned Llama2-13B.\nDespite comparable quantitative metrics, the reader study depicts a significant\npreference for summaries generated by GPT-4 with in-context learning compared\nto both Llama2-13B fine-tuned summaries and the original summaries,\nhighlighting the need for qualitative clinical evaluation.\n","authors":["Asad Aali","Dave Van Veen","Yamin Ishraq Arefeen","Jason Hom","Christian Bluethgen","Eduardo Pontes Reis","Sergios Gatidis","Namuun Clifford","Joseph Daws","Arash S. Tehrani","Jangwon Kim","Akshay S. Chaudhari"],"pdf_url":"https://arxiv.org/pdf/2403.05720v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14404v1","updated":"2024-08-26T16:47:20Z","published":"2024-08-26T16:47:20Z","title":"Application of Neural Ordinary Differential Equations for ITER Burning\n Plasma Dynamics","summary":" The dynamics of burning plasmas in tokamaks are crucial for advancing\ncontrolled thermonuclear fusion. This study introduces the NeuralPlasmaODE, a\nmulti-region multi-timescale transport model to simulate the complex energy\ntransfer processes in ITER deuterium-tritium (D-T) plasmas. Our model captures\nthe interactions between energetic alpha particles, electrons, and ions, which\nare vital for understanding phenomena such as thermal runaway instability. We\nemploy neural ordinary differential equations (Neural ODEs) for the numerical\nderivation of diffusivity parameters, enabling precise modeling of energy\ninteractions between different plasma regions. By leveraging transfer learning,\nwe utilize model parameters derived from DIII-D experimental data, enhancing\nthe efficiency and accuracy of our simulations without training from scratch.\nApplying this model to ITER's inductive and non-inductive operational\nscenarios, our results demonstrate that radiation and transport processes\neffectively remove excess heat from the core plasma, preventing thermal runaway\ninstability. This study underscores the potential of machine learning in\nadvancing our understanding and control of burning plasma dynamics in fusion\nreactors.\n","authors":["Zefang Liu","Weston M. Stacey"],"pdf_url":"https://arxiv.org/pdf/2408.14404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14398v1","updated":"2024-08-26T16:29:13Z","published":"2024-08-26T16:29:13Z","title":"Language-specific Calibration for Pruning Multilingual Language Models","summary":" Recent advances in large language model (LLM) pruning have shown\nstate-of-the-art compression results in post-training and retraining-free\nsettings while maintaining high predictive performance. However, such research\nmainly considers calibrating pruning using English text, despite the\nmultilingual nature of modern LLMs and their frequent uses in non-English\nlanguages. In this paper, we set out to explore effective strategies for\ncalibrating the pruning of multilingual language models. We present the first\ncomprehensive empirical study, comparing different calibration languages for\npruning multilingual models across diverse tasks, models, and state-of-the-art\npruning techniques. Our results present practical suggestions, for example,\ncalibrating in the target language can efficiently yield lower perplexity, but\ndoes not necessarily benefit downstream tasks. 
Our further analysis experiments\nunveil that calibration in the target language mainly contributes to preserving\nlanguage-specific features related to fluency and coherence, but might not\ncontribute to capturing language-agnostic features such as language\nunderstanding and reasoning. Last, we provide practical recommendations for\nfuture practitioners.\n","authors":["Simon Kurz","Zhixue Zhao","Jian-Jia Chen","Lucie Flek"],"pdf_url":"https://arxiv.org/pdf/2408.14398v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14393v1","updated":"2024-08-26T16:21:50Z","published":"2024-08-26T16:21:50Z","title":"CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper\n Influence","summary":" With increasing privacy concerns in artificial intelligence, regulations have\nmandated the right to be forgotten, granting individuals the right to withdraw\ntheir data from models. Machine unlearning has emerged as a potential solution\nto enable selective forgetting in models, particularly in recommender systems\nwhere historical data contains sensitive user information. Despite recent\nadvances in recommendation unlearning, evaluating unlearning methods\ncomprehensively remains challenging due to the absence of a unified evaluation\nframework and overlooked aspects of deeper influence, e.g., fairness. To\naddress these gaps, we propose CURE4Rec, the first comprehensive benchmark for\nrecommendation unlearning evaluation. CURE4Rec covers four aspects, i.e.,\nunlearning Completeness, recommendation Utility, unleaRning efficiency, and\nrecommendation fairnEss, under three data selection strategies, i.e., core\ndata, edge data, and random data. Specifically, we consider the deeper\ninfluence of unlearning on recommendation fairness and robustness towards data\nwith varying impact levels. We construct multiple datasets with CURE4Rec\nevaluation and conduct extensive experiments on existing recommendation\nunlearning methods. Our code is released at\nhttps://github.com/xiye7lai/CURE4Rec.\n","authors":["Chaochao Chen","Jiaming Zhang","Yizhao Zhang","Li Zhang","Lingjuan Lyu","Yuyuan Li","Biao Gong","Chenggang Yan"],"pdf_url":"https://arxiv.org/pdf/2408.14393v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14387v1","updated":"2024-08-26T16:11:53Z","published":"2024-08-26T16:11:53Z","title":"Reprogramming Foundational Large Language Models(LLMs) for Enterprise\n Adoption for Spatio-Temporal Forecasting Applications: Unveiling a New Era in\n Copilot-Guided Cross-Modal Time Series Representation Learning","summary":" Spatio-temporal forecasting plays a crucial role in various sectors such as\ntransportation systems, logistics, and supply chain management. However,\nexisting methods are limited by their ability to handle large, complex\ndatasets. To overcome this limitation, we introduce a hybrid approach that\ncombines the strengths of open-source large and small-scale language models\n(LLMs and LMs) with traditional forecasting methods. We augment traditional\nmethods with dynamic prompting and a grouped-query, multi-head attention\nmechanism to more effectively capture both intra-series and inter-series\ndependencies in evolving nonlinear time series data. 
In addition, we facilitate\non-premises customization by fine-tuning smaller open-source LMs for time\nseries trend analysis utilizing descriptions generated by open-source large LMs\non consumer-grade hardware using Low-Rank Adaptation with Activation Memory\nReduction (LoRA-AMR) technique to reduce computational overhead and activation\nstorage memory demands while preserving inference latency. We combine language\nmodel processing for time series trend analysis with traditional time series\nrepresentation learning method for cross-modal integration, achieving robust\nand accurate forecasts. The framework effectiveness is demonstrated through\nextensive experiments on various real-world datasets, outperforming existing\nmethods by significant margins in terms of forecast accuracy.\n","authors":["Sakhinana Sagar Srinivas","Chidaksh Ravuru","Geethan Sannidhi","Venkataramana Runkana"],"pdf_url":"https://arxiv.org/pdf/2408.14387v1.pdf","comment":"Paper published at the Deployable AI (DAI) workshop at AAAI-2024"},{"id":"http://arxiv.org/abs/2408.14381v1","updated":"2024-08-26T16:04:13Z","published":"2024-08-26T16:04:13Z","title":"Learning Tree-Structured Composition of Data Augmentation","summary":" Data augmentation is widely used for training a neural network given little\nlabeled data. A common practice of augmentation training is applying a\ncomposition of multiple transformations sequentially to the data. Existing\naugmentation methods such as RandAugment randomly sample from a list of\npre-selected transformations, while methods such as AutoAugment apply advanced\nsearch to optimize over an augmentation set of size $k^d$, which is the number\nof transformation sequences of length $d$, given a list of $k$ transformations.\n In this paper, we design efficient algorithms whose running time complexity\nis much faster than the worst-case complexity of $O(k^d)$, provably. We propose\na new algorithm to search for a binary tree-structured composition of $k$\ntransformations, where each tree node corresponds to one transformation. The\nbinary tree generalizes sequential augmentations, such as the SimCLR\naugmentation scheme for contrastive learning. Using a top-down, recursive\nsearch procedure, our algorithm achieves a runtime complexity of $O(2^d k)$,\nwhich is much faster than $O(k^d)$ as $k$ increases above $2$. We apply our\nalgorithm to tackle data distributions with heterogeneous subpopulations by\nsearching for one tree in each subpopulation and then learning a weighted\ncombination, resulting in a forest of trees.\n We validate our proposed algorithms on numerous graph and image datasets,\nincluding a multi-label graph classification dataset we collected. The dataset\nexhibits significant variations in the sizes of graphs and their average\ndegrees, making it ideal for studying data augmentation. We show that our\napproach can reduce the computation cost by 43% over existing search methods\nwhile improving performance by 4.3%. The tree structures can be used to\ninterpret the relative importance of each transformation, such as identifying\nthe important transformations on small vs. large graphs.\n","authors":["Dongyue Li","Kailai Chen","Predrag Radivojac","Hongyang R. 
Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.14381v1.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2408.14371v1","updated":"2024-08-26T15:53:50Z","published":"2024-08-26T15:53:50Z","title":"SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery","summary":" In this paper, we address Generalized Category Discovery, aiming to\nsimultaneously uncover novel categories and accurately classify known ones.\nTraditional methods, which lean heavily on self-supervision and contrastive\nlearning, often fall short when distinguishing between fine-grained categories.\nTo address this, we introduce a novel concept called `self-expertise', which\nenhances the model's ability to recognize subtle differences and uncover\nunknown categories. Our approach combines unsupervised and supervised\nself-expertise strategies to refine the model's discernment and generalization.\nInitially, hierarchical pseudo-labeling is used to provide `soft supervision',\nimproving the effectiveness of self-expertise. Our supervised technique differs\nfrom traditional methods by utilizing more abstract positive and negative\nsamples, aiding in the formation of clusters that can generalize to novel\ncategories. Meanwhile, our unsupervised strategy encourages the model to\nsharpen its category distinctions by considering within-category examples as\n`hard' negatives. Supported by theoretical insights, our empirical results\nshowcase that our method outperforms existing state-of-the-art techniques in\nGeneralized Category Discovery across several fine-grained datasets. Our code\nis available at: https://github.com/SarahRastegar/SelEx.\n","authors":["Sarah Rastegar","Mohammadreza Salehi","Yuki M. Asano","Hazel Doughty","Cees G. M. Snoek"],"pdf_url":"https://arxiv.org/pdf/2408.14371v1.pdf","comment":"Accepted by ECCV 2024"},{"id":"http://arxiv.org/abs/2310.07979v2","updated":"2024-08-26T15:51:38Z","published":"2023-10-12T01:57:27Z","title":"Graph-SCP: Accelerating Set Cover Problems with Graph Neural Networks","summary":" Machine learning (ML) approaches are increasingly being used to accelerate\ncombinatorial optimization (CO) problems. We investigate the Set Cover Problem\n(SCP) and propose Graph-SCP, a graph neural network method that augments\nexisting optimization solvers by learning to identify a much smaller\nsub-problem that contains the solution space. Graph-SCP uses both supervised\nlearning from prior solved instances and unsupervised learning aimed at\nminimizing the SCP objective. We evaluate the performance of Graph-SCP on\nsynthetically weighted and unweighted SCP instances with diverse problem\ncharacteristics and complexities, and on instances from the OR Library, a\ncanonical benchmark for SCP. We show that Graph-SCP reduces the problem size by\n60-80% and achieves runtime speedups of up to 10x on average when compared to\nGurobi (a state-of-the-art commercial solver), while maintaining solution\nquality. This is in contrast to fast greedy solutions that significantly\ncompromise solution quality to achieve guaranteed polynomial runtime. We\nshowcase Graph-SCP's ability to generalize to larger problem sizes, training on\nSCP instances with up to 3,000 subsets and testing on SCP instances with up to\n10,000 subsets.\n","authors":["Zohair Shafi","Benjamin A. Miller","Tina Eliassi-Rad","Rajmonda S. 
Caceres"],"pdf_url":"https://arxiv.org/pdf/2310.07979v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14369v1","updated":"2024-08-26T15:49:31Z","published":"2024-08-26T15:49:31Z","title":"Exploiting Conjugate Label Information for Multi-Instance Partial-Label\n Learning","summary":" Multi-instance partial-label learning (MIPL) addresses scenarios where each\ntraining sample is represented as a multi-instance bag associated with a\ncandidate label set containing one true label and several false positives.\nExisting MIPL algorithms have primarily focused on mapping multi-instance bags\nto candidate label sets for disambiguation, disregarding the intrinsic\nproperties of the label space and the supervised information provided by\nnon-candidate label sets. In this paper, we propose an algorithm named ELIMIPL,\ni.e., Exploiting conjugate Label Information for Multi-Instance Partial-Label\nlearning, which exploits the conjugate label information to improve the\ndisambiguation performance. To achieve this, we extract the label information\nembedded in both candidate and non-candidate label sets, incorporating the\nintrinsic properties of the label space. Experimental results obtained from\nbenchmark and real-world datasets demonstrate the superiority of the proposed\nELIMIPL over existing MIPL algorithms and other well-established partial-label\nlearning algorithms.\n","authors":["Wei Tang","Weijia Zhang","Min-Ling Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.14369v1.pdf","comment":"Accepted at IJCAI 2024. The code can be found at\n https://github.com/tangw-seu/ELIMIPL"},{"id":"http://arxiv.org/abs/2408.14358v1","updated":"2024-08-26T15:32:31Z","published":"2024-08-26T15:32:31Z","title":"An Embedding is Worth a Thousand Noisy Labels","summary":" The performance of deep neural networks scales with dataset size and label\nquality, rendering the efficient mitigation of low-quality data annotations\ncrucial for building robust and cost-effective systems. Existing strategies to\naddress label noise exhibit severe limitations due to computational complexity\nand application dependency. In this work, we propose WANN, a Weighted Adaptive\nNearest Neighbor approach that builds on self-supervised feature\nrepresentations obtained from foundation models. To guide the weighted voting\nscheme, we introduce a reliability score, which measures the likelihood of a\ndata label being correct. WANN outperforms reference methods, including a\nlinear layer trained with robust loss functions, on diverse datasets of varying\nsize and under various noise types and severities. WANN also exhibits superior\ngeneralization on imbalanced data compared to both Adaptive-NNs (ANN) and fixed\nk-NNs. Furthermore, the proposed weighting scheme enhances supervised\ndimensionality reduction under noisy labels. This yields a significant boost in\nclassification performance with 10x and 100x smaller image embeddings,\nminimizing latency and storage requirements. Our approach, emphasizing\nefficiency and explainability, emerges as a simple, robust solution to overcome\nthe inherent limitations of deep neural network training. 
The code is available\nat https://github.com/francescodisalvo05/wann-noisy-labels .\n","authors":["Francesco Di Salvo","Sebastian Doerrich","Ines Rieger","Christian Ledig"],"pdf_url":"https://arxiv.org/pdf/2408.14358v1.pdf","comment":"Preprint submitted to the International Journal of Computer Vision\n (IJCV)"},{"id":"http://arxiv.org/abs/2408.14352v1","updated":"2024-08-26T15:29:34Z","published":"2024-08-26T15:29:34Z","title":"Assessing Contamination in Large Language Models: Introducing the\n LogProber method","summary":" In machine learning, contamination refers to situations where testing data\nleak into the training set. The issue is particularly relevant for the\nevaluation of the performance of Large Language Models (LLMs), which are\ngenerally trained on gargantuan, and generally opaque, corpora of text scraped\nfrom the world wide web. Developing tools to detect contamination is therefore\ncrucial to be able to fairly and properly track the evolution of the\nperformance of LLMs. Most recent works in the field are not tailored to\nquantify contamination on short sequences of text like we find in psychology\nquestionnaires. In the present paper we introduce LogProber, a novel,\nefficient, algorithm that we show able to detect contamination using token\nprobability in given sentences. In the second part we investigate the\nlimitations of the method and discuss how different training methods can\ncontaminate models without leaving traces in the token probabilities.\n","authors":["Nicolas Yax","Pierre-Yves Oudeyer","Stefano Palminteri"],"pdf_url":"https://arxiv.org/pdf/2408.14352v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11126v2","updated":"2024-08-26T15:19:12Z","published":"2024-08-20T18:26:09Z","title":"Binocular Model: A deep learning solution for online melt pool\n temperature analysis using dual-wavelength Imaging Pyrometry","summary":" In metal Additive Manufacturing (AM), monitoring the temperature of the Melt\nPool (MP) is crucial for ensuring part quality, process stability, defect\nprevention, and overall process optimization. Traditional methods, are slow to\nconverge and require extensive manual effort to translate data into actionable\ninsights, rendering them impractical for real-time monitoring and control. To\naddress this challenge, we propose an Artificial Intelligence (AI)-based\nsolution aimed at reducing manual data processing reliance and improving the\nefficiency of transitioning from data to insight. In our study, we utilize a\ndataset comprising dual-wavelength real-time process monitoring data and\ncorresponding temperature maps. We introduce a deep learning model called the\n\"Binocular model,\" which exploits dual input observations to perform a precise\nanalysis of MP temperature in Laser Powder Bed Fusion (L-PBF). Through advanced\ndeep learning techniques, we seamlessly convert raw data into temperature maps,\nsignificantly streamlining the process and enabling batch processing at a rate\nof up to 750 frames per second, approximately 1000 times faster than\nconventional methods. Our Binocular model achieves high accuracy in temperature\nestimation, evidenced by a 0.95 R-squared score, while simultaneously enhancing\nprocessing efficiency by a factor of $\\sim1000x$ times. This model directly\naddresses the challenge of real-time MP temperature monitoring and offers\ninsights into the encountered constraints and the benefits of our Deep\nLearning-based approach. 
By combining efficiency and precision, our work\ncontributes to the advancement of temperature monitoring in L-PBF, thus driving\nprogress in the field of metal AM.\n","authors":["Javid Akhavan","Chaitanya Krishna Vallabh","Xiayun Zhao","Souran Manoochehri"],"pdf_url":"https://arxiv.org/pdf/2408.11126v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.04522v3","updated":"2024-08-26T15:13:22Z","published":"2024-07-05T14:07:15Z","title":"Graph Reinforcement Learning for Power Grids: A Comprehensive Survey","summary":" The rise of renewable energy and distributed generation requires new\napproaches to overcome the limitations of traditional methods. In this context,\nGraph Neural Networks are promising due to their ability to learn from\ngraph-structured data. Combined with Reinforcement Learning, they can serve as\ncontrol approaches to determine remedial network actions. This review analyses\nhow Graph Reinforcement Learning (GRL) can improve representation learning and\ndecision making in power grid use cases. Although GRL has demonstrated\nadaptability to unpredictable events and noisy data, it is primarily at a\nproof-of-concept stage. We highlight open challenges and limitations with\nrespect to real-world applications.\n","authors":["Mohamed Hassouna","Clara Holzhüter","Pawel Lytaev","Josephine Thomas","Bernhard Sick","Christoph Scholz"],"pdf_url":"https://arxiv.org/pdf/2407.04522v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14340v1","updated":"2024-08-26T15:13:14Z","published":"2024-08-26T15:13:14Z","title":"Foundation Models for Music: A Survey","summary":" In recent years, foundation models (FMs) such as large language models (LLMs)\nand latent diffusion models (LDMs) have profoundly impacted diverse sectors,\nincluding music. This comprehensive review examines state-of-the-art (SOTA)\npre-trained models and foundation models in music, spanning from representation\nlearning, generative learning and multimodal learning. We first contextualise\nthe significance of music in various industries and trace the evolution of AI\nin music. By delineating the modalities targeted by foundation models, we\ndiscover many of the music representations are underexplored in FM development.\nThen, emphasis is placed on the lack of versatility of previous methods on\ndiverse music applications, along with the potential of FMs in music\nunderstanding, generation and medical application. By comprehensively exploring\nthe details of the model pre-training paradigm, architectural choices,\ntokenisation, finetuning methodologies and controllability, we emphasise the\nimportant topics that should have been well explored, like instruction tuning\nand in-context learning, scaling law and emergent ability, as well as\nlong-sequence modelling etc. A dedicated section presents insights into music\nagents, accompanied by a thorough analysis of datasets and evaluations\nessential for pre-training and downstream tasks. Finally, by underscoring the\nvital importance of ethical considerations, we advocate that following research\non FM for music should focus more on such issues as interpretability,\ntransparency, human responsibility, and copyright issues. 
The paper offers\ninsights into future challenges and trends on FMs for music, aiming to shape\nthe trajectory of human-AI collaboration in the music realm.\n","authors":["Yinghao Ma","Anders Øland","Anton Ragni","Bleiz MacSen Del Sette","Charalampos Saitis","Chris Donahue","Chenghua Lin","Christos Plachouras","Emmanouil Benetos","Elio Quinton","Elona Shatri","Fabio Morreale","Ge Zhang","György Fazekas","Gus Xia","Huan Zhang","Ilaria Manco","Jiawen Huang","Julien Guinot","Liwei Lin","Luca Marinelli","Max W. Y. Lam","Megha Sharma","Qiuqiang Kong","Roger B. Dannenberg","Ruibin Yuan","Shangda Wu","Shih-Lun Wu","Shuqi Dai","Shun Lei","Shiyin Kang","Simon Dixon","Wenhu Chen","Wehhao Huang","Xingjian Du","Xingwei Qu","Xu Tan","Yizhi Li","Zeyue Tian","Zhiyong Wu","Zhizheng Wu","Ziyang Ma","Ziyu Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14340v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14338v1","updated":"2024-08-26T15:07:35Z","published":"2024-08-26T15:07:35Z","title":"Machine Learning for Quantifier Selection in cvc5","summary":" In this work we considerably improve the state-of-the-art SMT solving on\nfirst-order quantified problems by efficient machine learning guidance of\nquantifier selection. Quantifiers represent a significant challenge for SMT and\nare technically a source of undecidability. In our approach, we train an\nefficient machine learning model that informs the solver which quantifiers\nshould be instantiated and which not. Each quantifier may be instantiated\nmultiple times and the set of the active quantifiers changes as the solving\nprogresses. Therefore, we invoke the ML predictor many times, during the whole\nrun of the solver. To make this efficient, we use fast ML models based on\ngradient boosting decision trees. We integrate our approach into the\nstate-of-the-art cvc5 SMT solver and show a considerable increase of the\nsystem's holdout-set performance after training it on a large set of\nfirst-order problems collected from the Mizar Mathematical Library.\n","authors":["Jan Jakubův","Mikoláš Janota","Jelle Piepenbrock","Josef Urban"],"pdf_url":"https://arxiv.org/pdf/2408.14338v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14332v1","updated":"2024-08-26T15:01:04Z","published":"2024-08-26T15:01:04Z","title":"One-layer transformers fail to solve the induction heads task","summary":" A simple communication complexity argument proves that no one-layer\ntransformer can solve the induction heads task unless its size is exponentially\nlarger than the size sufficient for a two-layer transformer.\n","authors":["Clayton Sanford","Daniel Hsu","Matus Telgarsky"],"pdf_url":"https://arxiv.org/pdf/2408.14332v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16528v2","updated":"2024-08-26T14:59:53Z","published":"2024-05-26T11:29:57Z","title":"LoQT: Low Rank Adapters for Quantized Training","summary":" Training of large neural networks requires significant computational\nresources. Despite advances using low-rank adapters and quantization,\npretraining of models such as LLMs on consumer hardware has not been possible\nwithout model sharding, offloading during training, or per-layer gradient\nupdates. To address these limitations, we propose LoQT, a method for\nefficiently training quantized models. LoQT uses gradient-based tensor\nfactorization to initialize low-rank trainable weight matrices that are\nperiodically merged into quantized full-rank weight matrices. 
Our approach is\nsuitable for both pretraining and fine-tuning of models, which we demonstrate\nexperimentally for language modeling and downstream task adaptation. We find\nthat LoQT enables efficient training of models up to 7B parameters on a\nconsumer-grade 24GB GPU. We also demonstrate the feasibility of training a 13B\nparameter model using per-layer gradient updates on the same hardware.\n","authors":["Sebastian Loeschcke","Mads Toftrup","Michael J. Kastoryano","Serge Belongie","Vésteinn Snæbjarnarson"],"pdf_url":"https://arxiv.org/pdf/2405.16528v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14331v1","updated":"2024-08-26T14:55:40Z","published":"2024-08-26T14:55:40Z","title":"Automated Machine Learning in Insurance","summary":" Machine Learning (ML) has gained popularity in actuarial research and\ninsurance industrial applications. However, the performance of most ML tasks\nheavily depends on data preprocessing, model selection, and hyperparameter\noptimization, which are considered to be intensive in terms of domain\nknowledge, experience, and manual labor. Automated Machine Learning (AutoML)\naims to automatically complete the full life-cycle of ML tasks and provides\nstate-of-the-art ML models without human intervention or supervision. This\npaper introduces an AutoML workflow that allows users without domain knowledge\nor prior experience to achieve robust and effortless ML deployment by writing\nonly a few lines of code. This proposed AutoML is specifically tailored for the\ninsurance application, with features like the balancing step in data\npreprocessing, ensemble pipelines, and customized loss functions. These\nfeatures are designed to address the unique challenges of the insurance domain,\nincluding the imbalanced nature of common insurance datasets. The full code and\ndocumentation are available on the GitHub repository.\n(https://github.com/PanyiDong/InsurAutoML)\n","authors":["Panyi Dong","Zhiyu Quan"],"pdf_url":"https://arxiv.org/pdf/2408.14331v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14326v1","updated":"2024-08-26T14:54:14Z","published":"2024-08-26T14:54:14Z","title":"Streamline tractography of the fetal brain in utero with machine\n learning","summary":" Diffusion-weighted magnetic resonance imaging (dMRI) is the only non-invasive\ntool for studying white matter tracts and structural connectivity of the brain.\nThese assessments rely heavily on tractography techniques, which reconstruct\nvirtual streamlines representing white matter fibers. Much effort has been\ndevoted to improving tractography methodology for adult brains, while\ntractography of the fetal brain has been largely neglected. Fetal tractography\nfaces unique difficulties due to low dMRI signal quality, immature and rapidly\ndeveloping brain structures, and paucity of reference data. This work presents\nthe first machine learning model for fetal tractography. The model input\nconsists of five sources of information: (1) Fiber orientation, inferred from a\ndiffusion tensor fit to the dMRI signal; (2) Directions of recent propagation\nsteps; (3) Global spatial information, encoded as distances to keypoints in the\nbrain cortex; (4) Tissue segmentation information; and (5) Prior information\nabout the expected local fiber orientations supplied with an atlas. In order to\nmitigate the local tensor estimation error, a large spatial context around the\ncurrent point in the diffusion tensor image is encoded using convolutional and\nattention neural network modules. 
Moreover, the diffusion tensor information at\na hypothetical next point is included in the model input. Filtering rules based\non anatomically constrained tractography are applied to prune implausible\nstreamlines. We trained the model on manually-refined whole-brain fetal\ntractograms and validated the trained model on an independent set of 11 test\nscans with gestational ages between 23 and 36 weeks. Results show that our\nproposed method achieves superior performance across all evaluated tracts. The\nnew method can significantly advance the capabilities of dMRI for studying\nnormal and abnormal brain development in utero.\n","authors":["Weide Liu","Camilo Calixto","Simon K. Warfield","Davood Karimi"],"pdf_url":"https://arxiv.org/pdf/2408.14326v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14325v1","updated":"2024-08-26T14:54:13Z","published":"2024-08-26T14:54:13Z","title":"Function-Space MCMC for Bayesian Wide Neural Networks","summary":" Bayesian Neural Networks represent a fascinating confluence of deep learning\nand probabilistic reasoning, offering a compelling framework for understanding\nuncertainty in complex predictive models. In this paper, we investigate the use\nof the preconditioned Crank-Nicolson algorithm and its Langevin version to\nsample from the reparametrised posterior distribution of the weights as the\nwidths of Bayesian Neural Networks grow larger. In addition to being robust in\nthe infinite-dimensional setting, we prove that the acceptance probabilities of\nthe proposed methods approach 1 as the width of the network increases,\nindependently of any stepsize tuning. Moreover, we examine and compare how the\nmixing speeds of the underdamped Langevin Monte Carlo, the preconditioned\nCrank-Nicolson and the preconditioned Crank-Nicolson Langevin samplers are\ninfluenced by changes in the network width in some real-world cases. Our\nfindings suggest that, in wide Bayesian Neural Networks configurations, the\npreconditioned Crank-Nicolson method allows for more efficient sampling of the\nreparametrised posterior distribution, as evidenced by a higher effective\nsample size and improved diagnostic results compared with the other analysed\nalgorithms.\n","authors":["Lucia Pezzetti","Stefano Favaro","Stefano Pelucchetti"],"pdf_url":"https://arxiv.org/pdf/2408.14325v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14319v1","updated":"2024-08-26T14:51:26Z","published":"2024-08-26T14:51:26Z","title":"Rethinking Knowledge Transfer in Learning Using Privileged Information","summary":" In supervised machine learning, privileged information (PI) is information\nthat is unavailable at inference, but is accessible during training time.\nResearch on learning using privileged information (LUPI) aims to transfer the\nknowledge captured in PI onto a model that can perform inference without PI. It\nseems that this extra bit of information ought to make the resulting model\nbetter. However, finding conclusive theoretical or empirical evidence that\nsupports the ability to transfer knowledge using PI has been challenging. In\nthis paper, we critically examine the assumptions underlying existing\ntheoretical analyses and argue that there is little theoretical justification\nfor when LUPI should work. We analyze LUPI methods and reveal that apparent\nimprovements in empirical risk of existing research may not directly result\nfrom PI. Instead, these improvements often stem from dataset anomalies or\nmodifications in model design misguidedly attributed to PI. 
Our experiments for\na wide variety of application domains further demonstrate that state-of-the-art\nLUPI approaches fail to effectively transfer knowledge from PI. Thus, we\nadvocate for practitioners to exercise caution when working with PI to avoid\nunintended inductive biases.\n","authors":["Danil Provodin","Bram van den Akker","Christina Katsimerou","Maurits Kaptein","Mykola Pechenizkiy"],"pdf_url":"https://arxiv.org/pdf/2408.14319v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.07437v3","updated":"2024-08-26T14:46:08Z","published":"2023-02-15T02:58:09Z","title":"Bridging the Usability Gap: Theoretical and Methodological Advances for\n Spectral Learning of Hidden Markov Models","summary":" The Baum-Welch (B-W) algorithm is the most widely accepted method for\ninferring hidden Markov models (HMM). However, it is prone to getting stuck in\nlocal optima, and can be too slow for many real-time applications. Spectral\nlearning of HMMs (SHMM), based on the method of moments (MOM) has been proposed\nin the literature to overcome these obstacles. Despite its promises, asymptotic\ntheory for SHMM has been elusive, and the long-run performance of SHMM can\ndegrade due to unchecked propagation of error. In this paper, we (1) provide an\nasymptotic distribution for the approximate error of the likelihood estimated\nby SHMM, (2) propose a novel algorithm called projected SHMM (PSHMM) that\nmitigates the problem of error propagation, and (3) develop online learning\nvariants of both SHMM and PSHMM that accommodate potential nonstationarity. We\ncompare the performance of SHMM with PSHMM and estimation through the B-W\nalgorithm on both simulated data and data from real world applications, and\nfind that PSHMM not only retains the computational advantages of SHMM, but also\nprovides more robust estimation and forecasting.\n","authors":["Xiaoyuan Ma","Jordan Rodu"],"pdf_url":"https://arxiv.org/pdf/2302.07437v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14307v1","updated":"2024-08-26T14:38:19Z","published":"2024-08-26T14:38:19Z","title":"LLM-3D Print: Large Language Models To Monitor and Control 3D Printing","summary":" Industry 4.0 has revolutionized manufacturing by driving digitalization and\nshifting the paradigm toward additive manufacturing (AM). Fused Deposition\nModeling (FDM), a key AM technology, enables the creation of highly customized,\ncost-effective products with minimal material waste through layer-by-layer\nextrusion, posing a significant challenge to traditional subtractive methods.\nHowever, the susceptibility of material extrusion techniques to errors often\nrequires expert intervention to detect and mitigate defects that can severely\ncompromise product quality. While automated error detection and machine\nlearning models exist, their generalizability across diverse 3D printer setups,\nfirmware, and sensors is limited, and deep learning methods require extensive\nlabeled datasets, hindering scalability and adaptability. To address these\nchallenges, we present a process monitoring and control framework that\nleverages pre-trained Large Language Models (LLMs) alongside 3D printers to\ndetect and address printing defects. The LLM evaluates print quality by\nanalyzing images captured after each layer or print segment, identifying\nfailure modes and querying the printer for relevant parameters. It then\ngenerates and executes a corrective action plan. 
We validated the effectiveness\nof the proposed framework in identifying defects by comparing it against a\ncontrol group of engineers with diverse AM expertise. Our evaluation\ndemonstrated that LLM-based agents not only accurately identify common 3D\nprinting errors, such as inconsistent extrusion, stringing, warping, and layer\nadhesion, but also effectively determine the parameters causing these failures\nand autonomously correct them without any need for human intervention.\n","authors":["Yayati Jadhav","Peter Pak","Amir Barati Farimani"],"pdf_url":"https://arxiv.org/pdf/2408.14307v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14284v1","updated":"2024-08-26T14:09:40Z","published":"2024-08-26T14:09:40Z","title":"May the Forgetting Be with You: Alternate Replay for Learning with Noisy\n Labels","summary":" Forgetting presents a significant challenge during incremental training,\nmaking it particularly demanding for contemporary AI systems to assimilate new\nknowledge in streaming data environments. To address this issue, most\napproaches in Continual Learning (CL) rely on the replay of a restricted buffer\nof past data. However, the presence of noise in real-world scenarios, where\nhuman annotation is constrained by time limitations or where data is\nautomatically gathered from the web, frequently renders these strategies\nvulnerable. In this study, we address the problem of CL under Noisy Labels\n(CLN) by introducing Alternate Experience Replay (AER), which takes advantage\nof forgetting to maintain a clear distinction between clean, complex, and noisy\nsamples in the memory buffer. The idea is that complex or mislabeled examples,\nwhich hardly fit the previously learned data distribution, are most likely to\nbe forgotten. To grasp the benefits of such a separation, we equip AER with\nAsymmetric Balanced Sampling (ABS): a new sample selection strategy that\nprioritizes purity on the current task while retaining relevant samples from\nthe past. Through extensive computational comparisons, we demonstrate the\neffectiveness of our approach in terms of both accuracy and purity of the\nobtained buffer, resulting in a remarkable average gain of 4.71% points in\naccuracy with respect to existing loss-based purification strategies. Code is\navailable at https://github.com/aimagelab/mammoth.\n","authors":["Monica Millunzi","Lorenzo Bonicelli","Angelo Porrello","Jacopo Credi","Petter N. Kolm","Simone Calderara"],"pdf_url":"https://arxiv.org/pdf/2408.14284v1.pdf","comment":"25 pages, 5 figures. Accepted at the The 35th British Machine Vision\n Conference 2024 (BMVC 2024), Glasgow, UK"},{"id":"http://arxiv.org/abs/2305.07715v2","updated":"2024-08-26T14:09:37Z","published":"2023-05-12T18:14:21Z","title":"Field theory for optimal signal propagation in ResNets","summary":" Residual networks have significantly better trainability and thus performance\nthan feed-forward networks at large depth. Introducing skip connections\nfacilitates signal propagation to deeper layers. In addition, previous works\nfound that adding a scaling parameter for the residual branch further improves\ngeneralization performance. While they empirically identified a particularly\nbeneficial range of values for this scaling parameter, the associated\nperformance improvement and its universality across network hyperparameters yet\nneed to be understood. For feed-forward networks, finite-size theories have led\nto important insights with regard to signal propagation and hyperparameter\ntuning. 
We here derive a systematic finite-size field theory for residual\nnetworks to study signal propagation and its dependence on the scaling for the\nresidual branch. We derive analytical expressions for the response function, a\nmeasure for the network's sensitivity to inputs, and show that for deep\nnetworks the empirically found values for the scaling parameter lie within the\nrange of maximal sensitivity. Furthermore, we obtain an analytical expression\nfor the optimal scaling parameter that depends only weakly on other network\nhyperparameters, such as the weight variance, thereby explaining its\nuniversality across hyperparameters. Overall, this work provides a theoretical\nframework to study ResNets at finite size.\n","authors":["Kirsten Fischer","David Dahmen","Moritz Helias"],"pdf_url":"https://arxiv.org/pdf/2305.07715v2.pdf","comment":"21 pages, 8 figures, under review"},{"id":"http://arxiv.org/abs/2408.12615v2","updated":"2024-08-26T14:06:59Z","published":"2024-08-08T14:11:06Z","title":"Pediatric TSC-Related Epilepsy Classification from Clinical MR Images\n Using Quantum Neural Network","summary":" Tuberous sclerosis complex (TSC) manifests as a multisystem disorder with\nsignificant neurological implications. This study addresses the critical need\nfor robust classification models tailored to TSC in pediatric patients,\nintroducing QResNet,a novel deep learning model seamlessly integrating\nconventional convolutional neural networks with quantum neural networks. The\nmodel incorporates a two-layer quantum layer (QL), comprising ZZFeatureMap and\nAnsatz layers, strategically designed for processing classical data within a\nquantum framework. A comprehensive evaluation, demonstrates the superior\nperformance of QResNet in TSC MRI image classification compared to conventional\n3D-ResNet models. These compelling findings underscore the potential of quantum\ncomputing to revolutionize medical imaging and diagnostics.Remarkably, this\nmethod surpasses conventional CNNs in accuracy and Area Under the Curve (AUC)\nmetrics with the current dataset. Future research endeavors may focus on\nexploring the scalability and practical implementation of quantum algorithms in\nreal-world medical imaging scenarios.\n","authors":["Ling Lin","Yihang Zhou","Zhanqi Hu","Dian Jiang","Congcong Liu","Shuo Zhou","Yanjie Zhu","Jianxiang Liao","Dong Liang","Hairong Zheng","Haifeng Wang"],"pdf_url":"https://arxiv.org/pdf/2408.12615v2.pdf","comment":"5 pages,4 figures,2 tables,presented at ISBI 2024"},{"id":"http://arxiv.org/abs/2408.14281v1","updated":"2024-08-26T14:02:30Z","published":"2024-08-26T14:02:30Z","title":"Uncertainties of Latent Representations in Computer Vision","summary":" Uncertainty quantification is a key pillar of trustworthy machine learning.\nIt enables safe reactions under unsafe inputs, like predicting only when the\nmachine learning model detects sufficient evidence, discarding anomalous data,\nor emitting warnings when an error is likely to be inbound. This is\nparticularly crucial in safety-critical areas like medical image classification\nor self-driving cars. Despite the plethora of proposed uncertainty\nquantification methods achieving increasingly higher scores on performance\nbenchmarks, uncertainty estimates are often shied away from in practice. Many\nmachine learning projects start from pretrained latent representations that\ncome without uncertainty estimates. 
Uncertainties would need to be trained by\npractitioners on their own, which is notoriously difficult and\nresource-intense.\n This thesis makes uncertainty estimates easily accessible by adding them to\nthe latent representation vectors of pretrained computer vision models. Besides\nproposing approaches rooted in probability and decision theory, such as\nMonte-Carlo InfoNCE (MCInfoNCE) and loss prediction, we delve into both\ntheoretical and empirical questions. We show that these unobservable\nuncertainties about unobservable latent representations are indeed provably\ncorrect. We also provide an uncertainty-aware representation learning (URL)\nbenchmark to compare these unobservables against observable ground-truths.\nFinally, we compile our findings to pretrain lightweight representation\nuncertainties on large-scale computer vision models that transfer to unseen\ndatasets in a zero-shot manner.\n Our findings do not only advance the current theoretical understanding of\nuncertainties over latent variables, but also facilitate the access to\nuncertainty quantification for future researchers inside and outside the field,\nenabling straightforward but trustworthy machine learning.\n","authors":["Michael Kirchhof"],"pdf_url":"https://arxiv.org/pdf/2408.14281v1.pdf","comment":"Doctoral thesis"},{"id":"http://arxiv.org/abs/2312.01210v4","updated":"2024-08-26T13:57:31Z","published":"2023-12-02T19:39:50Z","title":"When accurate prediction models yield harmful self-fulfilling prophecies","summary":" Prediction models are popular in medical research and practice. By predicting\nan outcome of interest for specific patients, these models may help inform\ndifficult treatment decisions, and are often hailed as the poster children for\npersonalized, data-driven healthcare. We show however, that using prediction\nmodels for decision making can lead to harmful decisions, even when the\npredictions exhibit good discrimination after deployment. These models are\nharmful self-fulfilling prophecies: their deployment harms a group of patients\nbut the worse outcome of these patients does not invalidate the predictive\npower of the model. Our main result is a formal characterization of a set of\nsuch prediction models. Next we show that models that are well calibrated\nbefore and after deployment are useless for decision making as they made no\nchange in the data distribution. These results point to the need to revise\nstandard practices for validation, deployment and evaluation of prediction\nmodels that are used in medical decisions.\n","authors":["Wouter A. C. van Amsterdam","Nan van Geloven","Jesse H. Krijthe","Rajesh Ranganath","Giovanni Ciná"],"pdf_url":"https://arxiv.org/pdf/2312.01210v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.09131v2","updated":"2024-08-26T13:48:32Z","published":"2024-06-13T14:02:18Z","title":"OLGA: One-cLass Graph Autoencoder","summary":" One-class learning (OCL) comprises a set of techniques applied when\nreal-world problems have a single class of interest. The usual procedure for\nOCL is learning a hypersphere that comprises instances of this class and,\nideally, repels unseen instances from any other classes. Besides, several OCL\nalgorithms for graphs have been proposed since graph representation learning\nhas succeeded in various fields. 
These methods may use a two-step strategy,\ninitially representing the graph and, in a second step, classifying its nodes.\nOn the other hand, end-to-end methods learn the node representations while\nclassifying the nodes in one learning process. We highlight three main gaps in\nthe literature on OCL for graphs: (i) non-customized representations for OCL;\n(ii) the lack of constraints on hypersphere parameters learning; and (iii) the\nmethods' lack of interpretability and visualization. We propose One-cLass Graph\nAutoencoder (OLGA). OLGA is end-to-end and learns the representations for the\ngraph nodes while encapsulating the interest instances by combining two loss\nfunctions. We propose a new hypersphere loss function to encapsulate the\ninterest instances. OLGA combines this new hypersphere loss with the graph\nautoencoder reconstruction loss to improve model learning. OLGA achieved\nstate-of-the-art results and outperformed six other methods with a\nstatistically significant difference from five methods. Moreover, OLGA learns\nlow-dimensional representations maintaining the classification performance with\nan interpretable model representation learning and results.\n","authors":["M. P. S. Gôlo","J. G. B. M. Junior","D. F. Silva","R. M. Marcacini"],"pdf_url":"https://arxiv.org/pdf/2406.09131v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.07182v7","updated":"2024-08-26T13:43:46Z","published":"2022-10-13T17:03:36Z","title":"PDEBENCH: An Extensive Benchmark for Scientific Machine Learning","summary":" Machine learning-based modeling of physical systems has experienced increased\ninterest in recent years. Despite some impressive progress, there is still a\nlack of benchmarks for Scientific ML that are easy to use but still challenging\nand representative of a wide range of problems. We introduce PDEBench, a\nbenchmark suite of time-dependent simulation tasks based on Partial\nDifferential Equations (PDEs). PDEBench comprises both code and data to\nbenchmark the performance of novel machine learning models against both\nclassical numerical simulations and machine learning baselines. Our proposed\nset of benchmark problems contribute the following unique features: (1) A much\nwider range of PDEs compared to existing benchmarks, ranging from relatively\ncommon examples to more realistic and difficult problems; (2) much larger\nready-to-use datasets compared to prior work, comprising multiple simulation\nruns across a larger number of initial and boundary conditions and PDE\nparameters; (3) more extensible source codes with user-friendly APIs for data\ngeneration and baseline results with popular machine learning models (FNO,\nU-Net, PINN, Gradient-Based Inverse Method). PDEBench allows researchers to\nextend the benchmark freely for their own purposes using a standardized API and\nto compare the performance of new models to existing baseline methods. We also\npropose new evaluation metrics with the aim to provide a more holistic\nunderstanding of learning methods in the context of Scientific ML. With those\nmetrics we identify tasks which are challenging for recent ML methods and\npropose these tasks as future challenges for the community. 
The code is\navailable at https://github.com/pdebench/PDEBench.\n","authors":["Makoto Takamoto","Timothy Praditia","Raphael Leiteritz","Dan MacKinlay","Francesco Alesiani","Dirk Pflüger","Mathias Niepert"],"pdf_url":"https://arxiv.org/pdf/2210.07182v7.pdf","comment":"16 pages (main body) + 34 pages (supplemental material), accepted for\n publication in NeurIPS 2022 Track Datasets and Benchmarks"},{"id":"http://arxiv.org/abs/2408.14267v1","updated":"2024-08-26T13:42:43Z","published":"2024-08-26T13:42:43Z","title":"1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit","summary":" Fully quantized training (FQT) accelerates the training of deep neural\nnetworks by quantizing the activations, weights, and gradients into lower\nprecision. To explore the ultimate limit of FQT (the lowest achievable\nprecision), we make a first attempt to 1-bit FQT. We provide a theoretical\nanalysis of FQT based on Adam and SGD, revealing that the gradient variance\ninfluences the convergence of FQT. Building on these theoretical results, we\nintroduce an Activation Gradient Pruning (AGP) strategy. The strategy leverages\nthe heterogeneity of gradients by pruning less informative gradients and\nenhancing the numerical precision of remaining gradients to mitigate gradient\nvariance. Additionally, we propose Sample Channel joint Quantization (SCQ),\nwhich utilizes different quantization strategies in the computation of weight\ngradients and activation gradients to ensure that the method is friendly to\nlow-bitwidth hardware. Finally, we present a framework to deploy our algorithm.\nFor fine-tuning VGGNet-16 and ResNet-18 on multiple datasets, our algorithm\nachieves an average accuracy improvement of approximately 6%, compared to\nper-sample quantization. Moreover, our training speedup can reach a maximum of\n5.13x compared to full precision training.\n","authors":["Chang Gao","Jianfei Chen","Kang Zhao","Jiaqi Wang","Liping Jing"],"pdf_url":"https://arxiv.org/pdf/2408.14267v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14266v1","updated":"2024-08-26T13:40:33Z","published":"2024-08-26T13:40:33Z","title":"HyperSBINN: A Hypernetwork-Enhanced Systems Biology-Informed Neural\n Network for Efficient Drug Cardiosafety Assessment","summary":" Mathematical modeling in systems toxicology enables a comprehensive\nunderstanding of the effects of pharmaceutical substances on cardiac health.\nHowever, the complexity of these models limits their widespread application in\nearly drug discovery. In this paper, we introduce a novel approach to solving\nparameterized models of cardiac action potentials by combining meta-learning\ntechniques with Systems Biology-Informed Neural Networks (SBINNs). The proposed\nmethod, HyperSBINN, effectively addresses the challenge of predicting the\neffects of various compounds at different concentrations on cardiac action\npotentials, outperforming traditional differential equation solvers in speed.\nOur model efficiently handles scenarios with limited data and complex\nparameterized differential equations. The HyperSBINN model demonstrates robust\nperformance in predicting APD90 values, indicating its potential as a reliable\ntool for modeling cardiac electrophysiology and aiding in preclinical drug\ndevelopment. 
This framework represents an advancement in computational\nmodeling, offering a scalable and efficient solution for simulating and\nunderstanding complex biological systems.\n","authors":["Inass Soukarieh","Gerhard Hessler","Hervé Minoux","Marcel Mohr","Friedemann Schmidt","Jan Wenzel","Pierre Barbillon","Hugo Gangloff","Pierre Gloaguen"],"pdf_url":"https://arxiv.org/pdf/2408.14266v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14254v1","updated":"2024-08-26T13:16:42Z","published":"2024-08-26T13:16:42Z","title":"Integrated Brain Connectivity Analysis with fMRI, DTI, and sMRI Powered\n by Interpretable Graph Neural Networks","summary":" Multimodal neuroimaging modeling has becomes a widely used approach but\nconfronts considerable challenges due to heterogeneity, which encompasses\nvariability in data types, scales, and formats across modalities. This\nvariability necessitates the deployment of advanced computational methods to\nintegrate and interpret these diverse datasets within a cohesive analytical\nframework. In our research, we amalgamate functional magnetic resonance\nimaging, diffusion tensor imaging, and structural MRI into a cohesive\nframework. This integration capitalizes on the unique strengths of each\nmodality and their inherent interconnections, aiming for a comprehensive\nunderstanding of the brain's connectivity and anatomical characteristics.\nUtilizing the Glasser atlas for parcellation, we integrate imaging derived\nfeatures from various modalities: functional connectivity from fMRI, structural\nconnectivity from DTI, and anatomical features from sMRI within consistent\nregions. Our approach incorporates a masking strategy to differentially weight\nneural connections, thereby facilitating a holistic amalgamation of multimodal\nimaging data. This technique enhances interpretability at connectivity level,\ntranscending traditional analyses centered on singular regional attributes. The\nmodel is applied to the Human Connectome Project's Development study to\nelucidate the associations between multimodal imaging and cognitive functions\nthroughout youth. The analysis demonstrates improved predictive accuracy and\nuncovers crucial anatomical features and essential neural connections,\ndeepening our understanding of brain structure and function.\n","authors":["Gang Qu","Ziyu Zhou","Vince D. Calhoun","Aiying Zhang","Yu-Ping Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14254v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14252v1","updated":"2024-08-26T13:14:26Z","published":"2024-08-26T13:14:26Z","title":"An Evaluation of Explanation Methods for Black-Box Detectors of\n Machine-Generated Text","summary":" The increasing difficulty to distinguish language-model-generated from\nhuman-written text has led to the development of detectors of machine-generated\ntext (MGT). However, in many contexts, a black-box prediction is not\nsufficient, it is equally important to know on what grounds a detector made\nthat prediction. Explanation methods that estimate feature importance promise\nto provide indications of which parts of an input are used by classifiers for\nprediction. However, the quality of different explanation methods has not\npreviously been assessed for detectors of MGT. This study conducts the first\nsystematic evaluation of explanation quality for this task. The dimensions of\nfaithfulness and stability are assessed with five automated experiments, and\nusefulness is evaluated in a user study. 
We use a dataset of ChatGPT-generated\nand human-written documents, and pair predictions of three existing\nlanguage-model-based detectors with the corresponding SHAP, LIME, and Anchor\nexplanations. We find that SHAP performs best in terms of faithfulness,\nstability, and in helping users to predict the detector's behavior. In\ncontrast, LIME, perceived as most useful by users, scores the worst in terms of\nuser performance at predicting the detectors' behavior.\n","authors":["Loris Schoenegger","Yuxi Xia","Benjamin Roth"],"pdf_url":"https://arxiv.org/pdf/2408.14252v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.03816v2","updated":"2024-08-26T13:12:45Z","published":"2024-08-07T14:52:06Z","title":"Early Prediction of Causes (not Effects) in Healthcare by Long-Term\n Clinical Time Series Forecasting","summary":" Machine learning for early syndrome diagnosis aims to solve the intricate\ntask of predicting a ground truth label that most often is the outcome (effect)\nof a medical consensus definition applied to observed clinical measurements\n(causes), given clinical measurements observed several hours before. Instead of\nfocusing on the prediction of the future effect, we propose to directly predict\nthe causes via time series forecasting (TSF) of clinical variables and\ndetermine the effect by applying the gold standard consensus definition to the\nforecasted values. This method has the invaluable advantage of being\nstraightforwardly interpretable to clinical practitioners, and because model\ntraining does not rely on a particular label anymore, the forecasted data can\nbe used to predict any consensus-based label. We exemplify our method by means\nof long-term TSF with Transformer models, with a focus on accurate prediction\nof sparse clinical variables involved in the SOFA-based Sepsis-3 definition and\nthe new Simplified Acute Physiology Score (SAPS-II) definition. Our experiments\nare conducted on two datasets and show that contrary to recent proposals which\nadvocate set function encoders for time series and direct multi-step decoders,\nbest results are achieved by a combination of standard dense encoders with\niterative multi-step decoders. The key for success of iterative multi-step\ndecoding can be attributed to its ability to capture cross-variate dependencies\nand to a student forcing training strategy that teaches the model to rely on\nits own previous time step predictions for the next time step prediction.\n","authors":["Michael Staniek","Marius Fracarolli","Michael Hagmann","Stefan Riezler"],"pdf_url":"https://arxiv.org/pdf/2408.03816v2.pdf","comment":"Published at Machine Learning for Healthcare (MLHC), Toronto, 2024"},{"id":"http://arxiv.org/abs/2408.12658v2","updated":"2024-08-26T13:02:46Z","published":"2024-08-22T18:04:29Z","title":"Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani\n Classical Music","summary":" Hindustani music is a performance-driven oral tradition that exhibits the\nrendition of rich melodic patterns. In this paper, we focus on generative\nmodeling of singers' vocal melodies extracted from audio recordings, as the\nvoice is musically prominent within the tradition. Prior generative work in\nHindustani music models melodies as coarse discrete symbols which fails to\ncapture the rich expressive melodic intricacies of singing. Thus, we propose to\nuse a finely quantized pitch contour, as an intermediate representation for\nhierarchical audio modeling. 
We propose GaMaDHaNi, a modular two-level\nhierarchy, consisting of a generative model on pitch contours, and a pitch\ncontour to audio synthesis model. We compare our approach to non-hierarchical\naudio models and hierarchical models that use a self-supervised intermediate\nrepresentation, through a listening test and qualitative analysis. We also\nevaluate audio model's ability to faithfully represent the pitch contour input\nusing Pearson correlation coefficient. By using pitch contours as an\nintermediate representation, we show that our model may be better equipped to\nlisten and respond to musicians in a human-AI collaborative setting by\nhighlighting two potential interaction use cases (1) primed generation, and (2)\ncoarse pitch conditioning.\n","authors":["Nithya Shikarpur","Krishna Maneesha Dendukuri","Yusong Wu","Antoine Caillon","Cheng-Zhi Anna Huang"],"pdf_url":"https://arxiv.org/pdf/2408.12658v2.pdf","comment":"Accepted at International Society for Music Information Retrieval\n (ISMIR) 2024"},{"id":"http://arxiv.org/abs/2408.14236v1","updated":"2024-08-26T12:50:27Z","published":"2024-08-26T12:50:27Z","title":"DSTI at LLMs4OL 2024 Task A: Intrinsic versus extrinsic knowledge for\n type classification","summary":" We introduce semantic towers, an extrinsic knowledge representation method,\nand compare it to intrinsic knowledge in large language models for ontology\nlearning. Our experiments show a trade-off between performance and semantic\ngrounding for extrinsic knowledge compared to a fine-tuned model intrinsic\nknowledge. We report our findings on the Large Language Models for Ontology\nLearning (LLMs4OL) 2024 challenge.\n","authors":["Hanna Abi Akl"],"pdf_url":"https://arxiv.org/pdf/2408.14236v1.pdf","comment":"8 pages, 4 figures, accepted for the LLMs4OL challenge at the\n International Semantic Web Conference (ISWC) 2024"},{"id":"http://arxiv.org/abs/2408.14234v1","updated":"2024-08-26T12:49:41Z","published":"2024-08-26T12:49:41Z","title":"FSDEM: Feature Selection Dynamic Evaluation Metric","summary":" Expressive evaluation metrics are indispensable for informative experiments\nin all areas, and while several metrics are established in some areas, in\nothers, such as feature selection, only indirect or otherwise limited\nevaluation metrics are found. In this paper, we propose a novel evaluation\nmetric to address several problems of its predecessors and allow for flexible\nand reliable evaluation of feature selection algorithms. The proposed metric is\na dynamic metric with two properties that can be used to evaluate both the\nperformance and the stability of a feature selection algorithm. We conduct\nseveral empirical experiments to illustrate the use of the proposed metric in\nthe successful evaluation of feature selection algorithms. We also provide a\ncomparison and analysis to show the different aspects involved in the\nevaluation of the feature selection algorithms. The results indicate that the\nproposed metric is successful in carrying out the evaluation task for feature\nselection algorithms.\n This paper is an extended version of a paper accepted at SISAP 2024.\n","authors":["Muhammad Rajabinasab","Anton D. 
Lautrup","Tobias Hyrup","Arthur Zimek"],"pdf_url":"https://arxiv.org/pdf/2408.14234v1.pdf","comment":"Short version of this paper is accepted at 17th International\n Conference on Similarity Search and Applications, SISAP 2024"},{"id":"http://arxiv.org/abs/2408.14229v1","updated":"2024-08-26T12:44:17Z","published":"2024-08-26T12:44:17Z","title":"Gallery-Aware Uncertainty Estimation For Open-Set Face Recognition","summary":" Accurately estimating image quality and model robustness improvement are\ncritical challenges in unconstrained face recognition, which can be addressed\nthrough uncertainty estimation via probabilistic face embeddings. Previous\nresearch mainly focused on uncertainty estimation in face verification, leaving\nthe open-set face recognition task underexplored. In open-set face recognition,\none seeks to classify an image, which could also be unknown. Here, the low\nvariance of probabilistic embedding does not imply a low error probability: an\nimage embedding could be close to several classes in a gallery, thus yielding\nhigh uncertainty. We propose a method aware of two sources of ambiguity in the\nopen-set recognition system: (1) the gallery uncertainty caused by overlapping\nclasses and (2) the uncertainty of the face embeddings. To detect both types,\nwe use a Bayesian probabilistic model of embedding distribution, which provides\na principled uncertainty estimate. Challenging open-set face recognition\ndatasets, such as IJB-C, serve as a testbed for our method. We also propose a\nnew open-set recognition protocol for whale and dolphin identification. The\nproposed approach better identifies recognition errors than uncertainty\nestimation methods based solely on image quality.\n","authors":["Leonid Erlygin","Alexey Zaytsev"],"pdf_url":"https://arxiv.org/pdf/2408.14229v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14225v1","updated":"2024-08-26T12:41:41Z","published":"2024-08-26T12:41:41Z","title":"Provable Imbalanced Point Clustering","summary":" We suggest efficient and provable methods to compute an approximation for\nimbalanced point clustering, that is, fitting $k$-centers to a set of points in\n$\\mathbb{R}^d$, for any $d,k\\geq 1$. To this end, we utilize \\emph{coresets},\nwhich, in the context of the paper, are essentially weighted sets of points in\n$\\mathbb{R}^d$ that approximate the fitting loss for every model in a given\nset, up to a multiplicative factor of $1\\pm\\varepsilon$. We provide [Section 3\nand Section E in the appendix] experiments that show the empirical contribution\nof our suggested methods for real images (novel and reference), synthetic data,\nand real-world data. We also propose choice clustering, which by combining\nclustering algorithms yields better performance than each one separately.\n","authors":["David Denisov","Dan Feldman","Shlomi Dolev","Michael Segal"],"pdf_url":"https://arxiv.org/pdf/2408.14225v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02378v2","updated":"2024-08-26T12:36:51Z","published":"2023-07-05T15:45:53Z","title":"Continuum Limits of Ollivier's Ricci Curvature on data clouds: pointwise\n consistency and global lower bounds","summary":" Let $M$ denote a low-dimensional manifold embedded in Euclidean space and let\n${X}= \\{ x_1, \\dots, x_n \\}$ be a collection of points uniformly sampled from\nit. We study the relationship between the curvature of a random geometric graph\nbuilt from ${X}$ and the curvature of the manifold $M$ via continuum limits of\nOllivier's discrete Ricci curvature. 
We prove pointwise, non-asymptotic\nconsistency results and also show that if $M$ has Ricci curvature bounded from\nbelow by a positive constant, then the random geometric graph will inherit this\nglobal structural property with high probability. We discuss applications of\nthe global discrete curvature bounds to contraction properties of heat kernels\non graphs, as well as implications for manifold learning from data clouds. In\nparticular, we show that our consistency results allow for estimating the\nintrinsic curvature of a manifold by first estimating concrete extrinsic\nquantities.\n","authors":["Nicolas Garcia Trillos","Melanie Weber"],"pdf_url":"https://arxiv.org/pdf/2307.02378v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.11341v2","updated":"2024-08-26T12:14:31Z","published":"2024-04-17T13:00:52Z","title":"The Causal Chambers: Real Physical Systems as a Testbed for AI\n Methodology","summary":" In some fields of AI, machine learning and statistics, the validation of new\nmethods and algorithms is often hindered by the scarcity of suitable real-world\ndatasets. Researchers must often turn to simulated data, which yields limited\ninformation about the applicability of the proposed methods to real problems.\nAs a step forward, we have constructed two devices that allow us to quickly and\ninexpensively produce large datasets from non-trivial but well-understood\nphysical systems. The devices, which we call causal chambers, are\ncomputer-controlled laboratories that allow us to manipulate and measure an\narray of variables from these physical systems, providing a rich testbed for\nalgorithms from a variety of fields. We illustrate potential applications\nthrough a series of case studies in fields such as causal discovery,\nout-of-distribution generalization, change point detection, independent\ncomponent analysis, and symbolic regression. For applications to causal\ninference, the chambers allow us to carefully perform interventions. We also\nprovide and empirically validate a causal model of each chamber, which can be\nused as ground truth for different tasks. All hardware and software is made\nopen source, and the datasets are publicly available at causalchamber.org or\nthrough the Python package causalchamber.\n","authors":["Juan L. Gamella","Jonas Peters","Peter Bühlmann"],"pdf_url":"https://arxiv.org/pdf/2404.11341v2.pdf","comment":"40 pages, 20 figures"},{"id":"http://arxiv.org/abs/2408.14206v1","updated":"2024-08-26T12:09:38Z","published":"2024-08-26T12:09:38Z","title":"Lemon and Orange Disease Classification using CNN-Extracted Features and\n Machine Learning Classifier","summary":" Lemons and oranges, both are the most economically significant citrus fruits\nglobally. The production of lemons and oranges is severely affected due to\ndiseases in its growth stages. Fruit quality has degraded due to the presence\nof flaws. Thus, it is necessary to diagnose the disease accurately so that we\ncan avoid major loss of lemons and oranges. To improve citrus farming, we\nproposed a disease classification approach for lemons and oranges. This\napproach would enable early disease detection and intervention, reduce yield\nlosses, and optimize resource allocation. For the initial modeling of disease\nclassification, the research uses innovative deep learning architectures such\nas VGG16, VGG19 and ResNet50. 
In addition, for achieving better accuracy, the\nbasic machine learning algorithms used for classification problems include\nRandom Forest, Naive Bayes, K-Nearest Neighbors (KNN) and Logistic Regression.\nThe lemon and orange fruits diseases are classified more accurately (95.0% for\nlemon and 99.69% for orange) by the model. The model's base features were\nextracted from the ResNet50 pre-trained model and the diseases are classified\nby the Logistic Regression which beats the performance given by VGG16 and VGG19\nfor other classifiers. Experimental outcomes show that the proposed model also\noutperforms existing models in which most of them classified the diseases using\nthe Softmax classifier without using any individual classifiers.\n","authors":["Khandoker Nosiba Arifin","Sayma Akter Rupa","Md Musfique Anwar","Israt Jahan"],"pdf_url":"https://arxiv.org/pdf/2408.14206v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.03846v2","updated":"2024-08-26T11:58:22Z","published":"2024-02-06T09:48:33Z","title":"Efficient Generation of Hidden Outliers for Improved Outlier Detection","summary":" Outlier generation is a popular technique used for solving important outlier\ndetection tasks. Generating outliers with realistic behavior is challenging.\nPopular existing methods tend to disregard the 'multiple views' property of\noutliers in high-dimensional spaces. The only existing method accounting for\nthis property falls short in efficiency and effectiveness. We propose BISECT, a\nnew outlier generation method that creates realistic outliers mimicking said\nproperty. To do so, BISECT employs a novel proposition introduced in this\narticle stating how to efficiently generate said realistic outliers. Our method\nhas better guarantees and complexity than the current methodology for\nrecreating 'multiple views'. We use the synthetic outliers generated by BISECT\nto effectively enhance outlier detection in diverse datasets, for multiple use\ncases. For instance, oversampling with BISECT reduced the error by up to 3\ntimes when compared with the baselines.\n","authors":["Jose Cribeiro-Ramallo","Vadim Arzamasov","Klemens Böhm"],"pdf_url":"https://arxiv.org/pdf/2402.03846v2.pdf","comment":"Preprint. Full paper is scheduled to appear in TKDD; Updated results\n in table 4"},{"id":"http://arxiv.org/abs/2408.14195v1","updated":"2024-08-26T11:47:52Z","published":"2024-08-26T11:47:52Z","title":"Representative Arm Identification: A fixed confidence approach to\n identify cluster representatives","summary":" We study the representative arm identification (RAI) problem in the\nmulti-armed bandits (MAB) framework, wherein we have a collection of arms, each\nassociated with an unknown reward distribution. An underlying instance is\ndefined by a partitioning of the arms into clusters of predefined sizes, such\nthat for any $j > i$, all arms in cluster $i$ have a larger mean reward than\nthose in cluster $j$. The goal in RAI is to reliably identify a certain\nprespecified number of arms from each cluster, while using as few arm pulls as\npossible. The RAI problem covers as special cases several well-studied MAB\nproblems such as identifying the best arm or any $M$ out of the top $K$, as\nwell as both full and coarse ranking. We start by providing an\ninstance-dependent lower bound on the sample complexity of any feasible\nalgorithm for this setting. 
We then propose two algorithms, based on the idea\nof confidence intervals, and provide high probability upper bounds on their\nsample complexity, which orderwise match the lower bound. Finally, we do an\nempirical comparison of both algorithms along with an LUCB-type alternative on\nboth synthetic and real-world datasets, and demonstrate the superior\nperformance of our proposed schemes in most cases.\n","authors":["Sarvesh Gharat","Aniket Yadav","Nikhil Karamchandani","Jayakrishnan Nair"],"pdf_url":"https://arxiv.org/pdf/2408.14195v1.pdf","comment":"We analyse a clustered multi-armed bandit formulation, where the\n learning objective is to identify representative arms from each cluster, in a\n fixed confidence setting"},{"id":"http://arxiv.org/abs/2408.05920v3","updated":"2024-08-26T11:41:28Z","published":"2024-08-12T05:00:23Z","title":"Urban Region Pre-training and Prompting: A Graph-based Approach","summary":" Urban region representation is crucial for various urban downstream tasks.\nHowever, despite the proliferation of methods and their success, acquiring\ngeneral urban region knowledge and adapting to different tasks remains\nchallenging. Previous work often neglects the spatial structures and functional\nlayouts between entities, limiting their ability to capture transferable\nknowledge across regions. Further, these methods struggle to adapt effectively\nto specific downstream tasks, as they do not adequately address the unique\nfeatures and relationships required for different downstream tasks. In this\npaper, we propose a $\\textbf{G}$raph-based $\\textbf{U}$rban $\\textbf{R}$egion\n$\\textbf{P}$re-training and $\\textbf{P}$rompting framework ($\\textbf{GURPP}$)\nfor region representation learning. Specifically, we first construct an urban\nregion graph that integrates detailed spatial entity data for more effective\nurban region representation. Then, we develop a subgraph-centric urban region\npre-training model to capture the heterogeneous and transferable patterns of\ninteractions among entities. To further enhance the adaptability of these\nembeddings to different tasks, we design two graph-based prompting methods to\nincorporate explicit/hidden task knowledge. Extensive experiments on various\nurban region prediction tasks and different cities demonstrate the superior\nperformance of our GURPP framework.\n","authors":["Jiahui Jin","Yifan Song","Dong Kan","Haojia Zhu","Xiangguo Sun","Zhicheng Li","Xigang Sun","Jinghui Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.05920v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.02595v3","updated":"2024-08-26T11:35:52Z","published":"2023-04-02T02:19:15Z","title":"Bayesian neural networks via MCMC: a Python-based tutorial","summary":" Bayesian inference provides a methodology for parameter estimation and\nuncertainty quantification in machine learning and deep learning methods.\nVariational inference and Markov Chain Monte-Carlo (MCMC) sampling methods are\nused to implement Bayesian inference. In the past three decades, MCMC sampling\nmethods have faced some challenges in being adapted to larger models (such as\nin deep learning) and big data problems. Advanced proposal distributions that\nincorporate gradients, such as a Langevin proposal distribution, provide a\nmeans to address some of the limitations of MCMC sampling for Bayesian neural\nnetworks. Furthermore, MCMC methods have typically been constrained to\nstatisticians and currently not well-known among deep learning researchers. 
We\npresent a tutorial for MCMC methods that covers simple Bayesian linear and\nlogistic models, and Bayesian neural networks. The aim of this tutorial is to\nbridge the gap between theory and implementation via coding, given a general\nsparsity of libraries and tutorials to this end. This tutorial provides code in\nPython with data and instructions that enable their use and extension. We\nprovide results for some benchmark problems showing the strengths and\nweaknesses of implementing the respective Bayesian models via MCMC. We\nhighlight the challenges in sampling multi-modal posterior distributions for\nthe case of Bayesian neural networks and the need for further improvement of\nconvergence diagnosis methods.\n","authors":["Rohitash Chandra","Joshua Simmons"],"pdf_url":"https://arxiv.org/pdf/2304.02595v3.pdf","comment":"IEEE Access (2024)"},{"id":"http://arxiv.org/abs/2406.12284v2","updated":"2024-08-26T11:33:13Z","published":"2024-06-18T05:23:29Z","title":"Demystifying the Recency Heuristic in Temporal-Difference Learning","summary":" The recency heuristic in reinforcement learning is the assumption that\nstimuli that occurred closer in time to an acquired reward should be more\nheavily reinforced. The recency heuristic is one of the key assumptions made by\nTD($\\lambda$), which reinforces recent experiences according to an\nexponentially decaying weighting. In fact, all other widely used return\nestimators for TD learning, such as $n$-step returns, satisfy a weaker (i.e.,\nnon-monotonic) recency heuristic. Why is the recency heuristic effective for\ntemporal credit assignment? What happens when credit is assigned in a way that\nviolates this heuristic? In this paper, we analyze the specific mathematical\nimplications of adopting the recency heuristic in TD learning. We prove that\nany return estimator satisfying this heuristic: 1) is guaranteed to converge to\nthe correct value function, 2) has a relatively fast contraction rate, and 3)\nhas a long window of effective credit assignment, yet bounded worst-case\nvariance. We also give a counterexample where on-policy, tabular TD methods\nviolating the recency heuristic diverge. Our results offer some of the first\ntheoretical evidence that credit assignment based on the recency heuristic\nfacilitates learning.\n","authors":["Brett Daley","Marlos C. Machado","Martha White"],"pdf_url":"https://arxiv.org/pdf/2406.12284v2.pdf","comment":"RLC 2024. 18 pages, 8 figures, 1 table"},{"id":"http://arxiv.org/abs/2408.14183v1","updated":"2024-08-26T11:16:03Z","published":"2024-08-26T11:16:03Z","title":"Robot Navigation with Entity-Based Collision Avoidance using Deep\n Reinforcement Learning","summary":" Efficient navigation in dynamic environments is crucial for autonomous robots\ninteracting with various environmental entities, including both moving agents\nand static obstacles. In this study, we present a novel methodology that\nenhances the robot's interaction with different types of agents and obstacles\nbased on specific safety requirements. This approach uses information about the\nentity types, improving collision avoidance and ensuring safer navigation. We\nintroduce a new reward function that penalizes the robot for collisions with\ndifferent entities such as adults, bicyclists, children, and static obstacles,\nand additionally encourages the robot's proximity to the goal. It also\npenalizes the robot for being close to entities, and the safe distance also\ndepends on the entity type. 
Additionally, we propose an optimized algorithm for\ntraining and testing, which significantly accelerates train, validation, and\ntest steps and enables training in complex environments. Comprehensive\nexperiments conducted using simulation demonstrate that our approach\nconsistently outperforms conventional navigation and collision avoidance\nmethods, including state-of-the-art techniques. To sum up, this work\ncontributes to enhancing the safety and efficiency of navigation systems for\nautonomous robots in dynamic, crowded environments.\n","authors":["Yury Kolomeytsev","Dmitry Golembiovsky"],"pdf_url":"https://arxiv.org/pdf/2408.14183v1.pdf","comment":"14 pages, 5 figures"},{"id":"http://arxiv.org/abs/2402.11237v2","updated":"2024-08-26T10:39:22Z","published":"2024-02-17T10:02:22Z","title":"Be Persistent: Towards a Unified Solution for Mitigating Shortcuts in\n Deep Learning","summary":" Deep neural networks (DNNs) are vulnerable to shortcut learning: rather than\nlearning the intended task, they tend to draw inconclusive relationships\nbetween their inputs and outputs. Shortcut learning is ubiquitous among many\nfailure cases of neural networks, and traces of this phenomenon can be seen in\ntheir generalizability issues, domain shift, adversarial vulnerability, and\neven bias towards majority groups. In this paper, we argue that this\ncommonality in the cause of various DNN issues creates a significant\nopportunity that should be leveraged to find a unified solution for shortcut\nlearning. To this end, we outline the recent advances in topological data\nanalysis (TDA), and persistent homology (PH) in particular, to sketch a unified\nroadmap for detecting shortcuts in deep learning. We demonstrate our arguments\nby investigating the topological features of computational graphs in DNNs using\ntwo cases of unlearnable examples and bias in decision-making as our test\nstudies. Our analysis of these two failure cases of DNNs reveals that finding a\nunified solution for shortcut learning in DNNs is not out of reach, and TDA can\nplay a significant role in forming such a framework.\n","authors":["Hadi M. Dolatabadi","Sarah M. Erfani","Christopher Leckie"],"pdf_url":"https://arxiv.org/pdf/2402.11237v2.pdf","comment":"Accepted to the 2024 European Conference on Artificial Intelligence\n (ECAI)"},{"id":"http://arxiv.org/abs/2306.08270v2","updated":"2024-08-26T10:26:28Z","published":"2023-06-14T06:13:50Z","title":"Solar Active Regions Detection Via 2D Circular Kernel Time Series\n Transformation, Entropy and Machine Learning Approach","summary":" This study proposes an enhancement to the existing method for detecting Solar\nActive Regions (ARs). Our technique tracks ARs using images from the\nAtmospheric Imaging Assembly (AIA) of NASA's Solar Dynamics Observatory (SDO).\nIt involves a 2D circular kernel time series transformation, combined with\nStatistical and Entropy measures, and a Machine Learning (ML) approach. The\ntechnique transforms the circular area around pixels in the SDO AIA images into\none-dimensional time series (1-DTS). Statistical measures (Median Value, Xmed;\n95th Percentile, X95) and Entropy measures (Distribution Entropy, DisEn; Fuzzy\nEntropy, FuzzyEn) are used as feature selection methods (FSM 1), alongside a\nmethod applying 1-DTS elements directly as features (FSM 2). 
The ML algorithm\nclassifies these series into three categories: no Active Region (nARs type 1,\nclass 1), non-flaring Regions outside active regions with brightness (nARs type\n2, class 2), and flaring Active Regions (ARs, class 3). The ML model achieves a\nclassification accuracy of 0.900 and 0.914 for Entropy and Statistical\nmeasures, respectively. Notably, Fuzzy Entropy shows the highest classification\naccuracy (AKF=0.895), surpassing DisEn (AKF=0.738), X95 (AKF=0.873), and Xmed\n(AKF=0.840). This indicates the high effectiveness of Entropy and Statistical\nmeasures for AR detection in SDO AIA images. FSM 2 captures a similar\ndistribution of flaring AR activities as FSM 1. Additionally, we introduce a\ngeneralizing characteristic of AR activities (GSA), finding a direct agreement\nbetween increased AR activities and higher GSA values. The Python code\nimplementation of the proposed method is available in supplementary material.\n","authors":["Irewola Aaron Oludehinwa","Andrei Velichko","Maksim Belyaev","Olasunkanmi I. Olusola"],"pdf_url":"https://arxiv.org/pdf/2306.08270v2.pdf","comment":"30 pages, 10 figures, 4 tables"},{"id":"http://arxiv.org/abs/2404.03309v2","updated":"2024-08-26T10:21:00Z","published":"2024-04-04T09:08:04Z","title":"Optimistic Online Non-stochastic Control via FTRL","summary":" This paper brings the concept of ``optimism\" to the new and promising\nframework of online Non-stochastic Control (NSC). Namely, we study how NSC can\nbenefit from a prediction oracle of unknown quality responsible for forecasting\nfuture costs. The posed problem is first reduced to an optimistic learning with\ndelayed feedback problem, which is handled through the Optimistic Follow the\nRegularized Leader (OFTRL) algorithmic family. This reduction enables the\ndesign of \\texttt{OptFTRL-C}, the first Disturbance Action Controller (DAC)\nwith optimistic policy regret bounds. These new bounds are commensurate with\nthe oracle's accuracy, ranging from $\\mathcal{O}(1)$ for perfect predictions to\nthe order-optimal $\\mathcal{O}(\\sqrt{T})$ even when all predictions fail. By\naddressing the challenge of incorporating untrusted predictions into online\ncontrol, this work contributes to the advancement of the NSC framework and\npaves the way toward effective and robust learning-based controllers.\n","authors":["Naram Mhaisen","George Iosifidis"],"pdf_url":"https://arxiv.org/pdf/2404.03309v2.pdf","comment":"to appear in the proceedings of IEEE CDC 2024"},{"id":"http://arxiv.org/abs/2312.01878v8","updated":"2024-08-26T10:13:43Z","published":"2023-12-04T13:20:15Z","title":"HGPROMPT: Bridging Homogeneous and Heterogeneous Graphs for Few-shot\n Prompt Learning","summary":" Graph neural networks (GNNs) and heterogeneous graph neural networks (HGNNs)\nare prominent techniques for homogeneous and heterogeneous graph representation\nlearning, yet their performance in an end-to-end supervised framework greatly\ndepends on the availability of task-specific supervision. To reduce the\nlabeling cost, pre-training on self-supervised pretext tasks has become a\npopular paradigm,but there is often a gap between the pre-trained model and\ndownstream tasks, stemming from the divergence in their objectives. 
To bridge\nthe gap, prompt learning has risen as a promising direction especially in\nfew-shot settings, without the need to fully fine-tune the pre-trained model.\nWhile there has been some early exploration of prompt-based learning on graphs,\nthey primarily deal with homogeneous graphs, ignoring the heterogeneous graphs\nthat are prevalent in downstream applications. In this paper, we propose\nHGPROMPT, a novel pre-training and prompting framework to unify not only\npre-training and downstream tasks but also homogeneous and heterogeneous graphs\nvia a dual-template design. Moreover, we propose dual-prompt in HGPROMPT to\nassist a downstream task in locating the most relevant prior to bridge the gaps\ncaused by not only feature variations but also heterogeneity differences across\ntasks. Finally, we thoroughly evaluate and analyze HGPROMPT through extensive\nexperiments on three public datasets.\n","authors":["Xingtong Yu","Yuan Fang","Zemin Liu","Xinming Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.01878v8.pdf","comment":"AAAI2024 main track"},{"id":"http://arxiv.org/abs/2311.15317v5","updated":"2024-08-26T10:12:45Z","published":"2023-11-26T14:35:28Z","title":"Generalized Graph Prompt: Toward a Unification of Pre-Training and\n Downstream Tasks on Graphs","summary":" Graph neural networks have emerged as a powerful tool for graph\nrepresentation learning, but their performance heavily relies on abundant\ntask-specific supervision. To reduce labeling requirement, the \"pre-train,\nprompt\" paradigms have become increasingly common. However, existing study of\nprompting on graphs is limited, lacking a universal treatment to appeal to\ndifferent downstream tasks. In this paper, we propose GraphPrompt, a novel\npre-training and prompting framework on graphs. GraphPrompt not only unifies\npre-training and downstream tasks into a common task template but also employs\na learnable prompt to assist a downstream task in locating the most relevant\nknowledge from the pre-trained model in a task-specific manner. To further\nenhance GraphPrompt in these two stages, we extend it into GraphPrompt+ with\ntwo major enhancements. First, we generalize several popular graph pre-training\ntasks beyond simple link prediction to broaden the compatibility with our task\ntemplate. Second, we propose a more generalized prompt design that incorporates\na series of prompt vectors within every layer of the pre-trained graph encoder,\nin order to capitalize on the hierarchical information across different layers\nbeyond just the readout layer. Finally, we conduct extensive experiments on\nfive public datasets to evaluate and analyze GraphPrompt and GraphPrompt+.\n","authors":["Xingtong Yu","Zhenghao Liu","Yuan Fang","Zemin Liu","Sihong Chen","Xinming Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.15317v5.pdf","comment":"Accepted by IEEE TKDE. Extension of \"GraphPrompt: Unifying\n Pre-Training and Downstream Tasks for Graph Neural Networks\". arXiv admin\n note: substantial text overlap with arXiv:2302.08043"},{"id":"http://arxiv.org/abs/2312.03731v7","updated":"2024-08-26T10:11:45Z","published":"2023-11-28T02:36:53Z","title":"MultiGPrompt for Multi-Task Pre-Training and Prompting on Graphs","summary":" Graphs can inherently model interconnected objects on the Web, thereby\nfacilitating a series of Web applications, such as web analyzing and content\nrecommendation. Recently, Graph Neural Networks (GNNs) have emerged as a\nmainstream technique for graph representation learning. 
However, their efficacy\nwithin an end-to-end supervised framework is significantly tied to the\navailability of task-specific labels. To mitigate labeling costs and enhance\nrobustness in few-shot settings, pre-training on self-supervised tasks has\nemerged as a promising method, while prompting has been proposed to further\nnarrow the objective gap between pretext and downstream tasks. Although there\nhas been some initial exploration of prompt-based learning on graphs, these efforts\nprimarily leverage a single pretext task, resulting in a limited subset of\ngeneral knowledge that could be learned from the pre-training data. Hence, in\nthis paper, we propose MultiGPrompt, a novel multi-task pre-training and\nprompting framework to exploit multiple pretext tasks for more comprehensive\npre-trained knowledge. First, in pre-training, we design a set of pretext\ntokens to synergize multiple pretext tasks. Second, we propose a dual-prompt\nmechanism consisting of composed and open prompts to leverage task-specific and\nglobal pre-training knowledge, to guide downstream tasks in few-shot settings.\nFinally, we conduct extensive experiments on six public datasets to evaluate\nand analyze MultiGPrompt.\n","authors":["Xingtong Yu","Chang Zhou","Yuan Fang","Xinming Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.03731v7.pdf","comment":"WWW2024 research track"},{"id":"http://arxiv.org/abs/2402.03903v3","updated":"2024-08-26T09:59:24Z","published":"2024-02-06T11:13:57Z","title":"Averaging $n$-step Returns Reduces Variance in Reinforcement Learning","summary":" Multistep returns, such as $n$-step returns and $\\lambda$-returns, are\ncommonly used to improve the sample efficiency of reinforcement learning (RL)\nmethods. The variance of the multistep returns becomes the limiting factor in\ntheir length; looking too far into the future increases variance and reverses\nthe benefits of multistep learning. In our work, we demonstrate the ability of\ncompound returns -- weighted averages of $n$-step returns -- to reduce\nvariance. We prove for the first time that any compound return with the same\ncontraction modulus as a given $n$-step return has strictly lower variance. We\nadditionally prove that this variance-reduction property improves the\nfinite-sample complexity of temporal-difference learning under linear function\napproximation. Because general compound returns can be expensive to implement,\nwe introduce two-bootstrap returns which reduce variance while remaining\nefficient, even when using minibatched experience replay. We conduct\nexperiments showing that compound returns often increase the sample efficiency\nof $n$-step deep RL agents like DQN and PPO.\n","authors":["Brett Daley","Martha White","Marlos C. Machado"],"pdf_url":"https://arxiv.org/pdf/2402.03903v3.pdf","comment":"ICML 2024. 27 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2407.13431v2","updated":"2024-08-26T09:58:04Z","published":"2024-07-18T12:00:32Z","title":"Improving Out-of-Distribution Generalization of Trajectory Prediction\n for Autonomous Driving via Polynomial Representations","summary":" Robustness against Out-of-Distribution (OoD) samples is a key performance\nindicator of a trajectory prediction model. However, the development and\nranking of state-of-the-art (SotA) models are driven by their In-Distribution\n(ID) performance on individual competition datasets. We present an OoD testing\nprotocol that homogenizes datasets and prediction tasks across two large-scale\nmotion datasets. 
We introduce a novel prediction algorithm based on polynomial\nrepresentations for agent trajectory and road geometry on both the input and\noutput sides of the model. With a much smaller model size, training effort, and\ninference time, we reach near SotA performance for ID testing and significantly\nimprove robustness in OoD testing. Within our OoD testing protocol, we further\nstudy two augmentation strategies of SotA models and their effects on model\ngeneralization. Highlighting the contrast between ID and OoD performance, we\nsuggest adding OoD testing to the evaluation criteria of trajectory prediction\nmodels.\n","authors":["Yue Yao","Shengchao Yan","Daniel Goehring","Wolfram Burgard","Joerg Reichardt"],"pdf_url":"https://arxiv.org/pdf/2407.13431v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14152v1","updated":"2024-08-26T09:55:32Z","published":"2024-08-26T09:55:32Z","title":"Application of Disentanglement to Map Registration Problem","summary":" Geospatial data come from various sources, such as satellites, aircraft, and\nLiDAR. The variability of the source is not limited to the types of data\nacquisition techniques, as we have maps from different time periods. To\nincorporate these data for a coherent analysis, it is essential to first align\ndifferent \"styles\" of geospatial data to its matching images that point to the\nsame location on the surface of the Earth. In this paper, we approach the image\nregistration as a two-step process of (1) extracting geospatial contents\ninvariant to visual (and any other non-content-related) information, and (2)\nmatching the data based on such (purely) geospatial contents. We hypothesize\nthat a combination of $\\beta$-VAE-like architecture [2] and adversarial\ntraining will achieve both the disentanglement of the geographic information\nand artistic styles and generation of new map tiles by composing the encoded\ngeographic information with any artistic style.\n","authors":["Hae Jin Song","Patrycja Krawczuk","Po-Hsuan Huang"],"pdf_url":"https://arxiv.org/pdf/2408.14152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14146v1","updated":"2024-08-26T09:44:21Z","published":"2024-08-26T09:44:21Z","title":"TSAK: Two-Stage Semantic-Aware Knowledge Distillation for Efficient\n Wearable Modality and Model Optimization in Manufacturing Lines","summary":" Smaller machine learning models, with less complex architectures and sensor\ninputs, can benefit wearable sensor-based human activity recognition (HAR)\nsystems in many ways, from complexity and cost to battery life. In the specific\ncase of smart factories, optimizing human-robot collaboration hinges on the\nimplementation of cutting-edge, human-centric AI systems. To this end, workers'\nactivity recognition enables accurate quantification of performance metrics,\nimproving efficiency holistically. We present a two-stage semantic-aware\nknowledge distillation (KD) approach, TSAK, for efficient, privacy-aware, and\nwearable HAR in manufacturing lines, which reduces the input sensor modalities\nas well as the machine learning model size, while reaching similar recognition\nperformance as a larger multi-modal and multi-positional teacher model. The\nfirst stage incorporates a teacher classifier model encoding attention, causal,\nand combined representations. The second stage encompasses a semantic\nclassifier merging the three representations from the first stage. 
To evaluate\nTSAK, we recorded a multi-modal dataset at a smart factory testbed with\nwearable and privacy-aware sensors (IMU and capacitive) located on both\nworkers' hands. In addition, we evaluated our approach on OpenPack, the only\navailable open dataset mimicking the wearable sensor placements on both hands\nin the manufacturing HAR scenario. We compared several KD strategies with\ndifferent representations to regulate the training process of a smaller student\nmodel. Compared to the larger teacher model, the student model takes fewer\nsensor channels from a single hand, has 79% fewer parameters, runs 8.88 times\nfaster, and requires 96.6% less computing power (FLOPS).\n","authors":["Hymalai Bello","Daniel Geißler","Sungho Suh","Bo Zhou","Paul Lukowicz"],"pdf_url":"https://arxiv.org/pdf/2408.14146v1.pdf","comment":"Accepted in 27th International Conference on Pattern Recognition\n (ICPR)"},{"id":"http://arxiv.org/abs/2407.02112v2","updated":"2024-08-26T09:43:12Z","published":"2024-07-02T09:54:39Z","title":"A Data-Centric Perspective on Evaluating Machine Learning Models for\n Tabular Data","summary":" Tabular data is prevalent in real-world machine learning applications, and\nnew models for supervised learning of tabular data are frequently proposed.\nComparative studies assessing the performance of models typically consist of\nmodel-centric evaluation setups with overly standardized data preprocessing.\nThis paper demonstrates that such model-centric evaluations are biased, as\nreal-world modeling pipelines often require dataset-specific preprocessing and\nfeature engineering. Therefore, we propose a data-centric evaluation framework.\nWe select 10 relevant datasets from Kaggle competitions and implement\nexpert-level preprocessing pipelines for each dataset. We conduct experiments\nwith different preprocessing pipelines and hyperparameter optimization (HPO)\nregimes to quantify the impact of model selection, HPO, feature engineering,\nand test-time adaptation. Our main findings are: 1. After dataset-specific\nfeature engineering, model rankings change considerably, performance\ndifferences decrease, and the importance of model selection reduces. 2. Recent\nmodels, despite their measurable progress, still significantly benefit from\nmanual feature engineering. This holds true for both tree-based models and\nneural networks. 3. While tabular data is typically considered static, samples\nare often collected over time, and adapting to distribution shifts can be\nimportant even in supposedly static data. These insights suggest that research\nefforts should be directed toward a data-centric perspective, acknowledging\nthat tabular data requires feature engineering and often exhibits temporal\ncharacteristics. Our framework is available under:\nhttps://github.com/atschalz/dc_tabeval.\n","authors":["Andrej Tschalzev","Sascha Marton","Stefan Lüdtke","Christian Bartelt","Heiner Stuckenschmidt"],"pdf_url":"https://arxiv.org/pdf/2407.02112v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14144v1","updated":"2024-08-26T09:42:18Z","published":"2024-08-26T09:42:18Z","title":"Neighborhood and Global Perturbations Supported SAM in Federated\n Learning: From Local Tweaks To Global Awareness","summary":" Federated Learning (FL) can be coordinated under the orchestration of a\ncentral server to collaboratively build a privacy-preserving model without the\nneed for data exchange. 
However, participant data heterogeneity leads to local\noptima divergence, subsequently affecting convergence outcomes. Recent research\nhas focused on global sharpness-aware minimization (SAM) and dynamic\nregularization techniques to enhance consistency between global and local\ngeneralization and optimization objectives. Nonetheless, the estimation of\nglobal SAM introduces additional computational and memory overhead, while\ndynamic regularization suffers from bias in the local and global dual variables\ndue to training isolation. In this paper, we propose a novel FL algorithm,\nFedTOGA, designed to consider optimization and generalization objectives while\nmaintaining minimal uplink communication overhead. By linking local\nperturbations to global updates, global generalization consistency is improved.\nAdditionally, global updates are used to correct local dynamic regularizers,\nreducing dual-variable bias and enhancing optimization consistency. Global\nupdates are passively received by clients, reducing overhead. We also propose\nneighborhood perturbation to approximate local perturbation, analyzing its\nstrengths and limitations. Theoretical analysis shows FedTOGA achieves faster\nconvergence $O(1/T)$ under non-convex functions. Empirical studies demonstrate\nthat FedTOGA outperforms state-of-the-art algorithms, with a 1\% accuracy\nincrease and 30\% faster convergence.\n","authors":["Boyuan Li","Zihao Peng","Yafei Li","Mingliang Xu","Shengbo Chen","Baofeng Ji","Cong Shen"],"pdf_url":"https://arxiv.org/pdf/2408.14144v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14143v1","updated":"2024-08-26T09:41:40Z","published":"2024-08-26T09:41:40Z","title":"2D-Malafide: Adversarial Attacks Against Face Deepfake Detection Systems","summary":" We introduce 2D-Malafide, a novel and lightweight adversarial attack designed\nto deceive face deepfake detection systems. Building upon the concept of 1D\nconvolutional perturbations explored in the speech domain, our method leverages\n2D convolutional filters to craft perturbations which significantly degrade the\nperformance of state-of-the-art face deepfake detectors. Unlike traditional\nadditive noise approaches, 2D-Malafide optimises a small number of filter\ncoefficients to generate robust adversarial perturbations which are\ntransferable across different face images. Experiments, conducted using the\nFaceForensics++ dataset, demonstrate that 2D-Malafide substantially degrades\ndetection performance in both white-box and black-box settings, with larger\nfilter sizes having the greatest impact. Additionally, we report an\nexplainability analysis using GradCAM which illustrates how 2D-Malafide\nmisleads detection systems by altering the image areas used most for\nclassification. 
Our findings highlight the vulnerability of current deepfake\ndetection systems to convolutional adversarial attacks as well as the need for\nfuture work to enhance detection robustness through improved image fidelity\nconstraints.\n","authors":["Chiara Galdi","Michele Panariello","Massimiliano Todisco","Nicholas Evans"],"pdf_url":"https://arxiv.org/pdf/2408.14143v1.pdf","comment":"Accepted at BIOSIG 2024"},{"id":"http://arxiv.org/abs/2405.18194v3","updated":"2024-08-26T09:35:54Z","published":"2024-05-28T14:04:09Z","title":"Delving into Differentially Private Transformer","summary":" Deep learning with differential privacy (DP) has garnered significant\nattention over the past years, leading to the development of numerous methods\naimed at enhancing model accuracy and training efficiency. This paper delves\ninto the problem of training Transformer models with differential privacy. Our\ntreatment is modular: the logic is to `reduce' the problem of training DP\nTransformer to the more basic problem of training DP vanilla neural nets. The\nlatter is better understood and amenable to many model-agnostic methods. Such\n`reduction' is done by first identifying the hardness unique to DP Transformer\ntraining: the attention distraction phenomenon and a lack of compatibility with\nexisting techniques for efficient gradient clipping. To deal with these two\nissues, we propose the Re-Attention Mechanism and Phantom Clipping,\nrespectively. We believe that our work not only casts new light on training DP\nTransformers but also promotes a modular treatment to advance research in the\nfield of differentially private deep learning.\n","authors":["Youlong Ding","Xueyang Wu","Yining Meng","Yonggang Luo","Hao Wang","Weike Pan"],"pdf_url":"https://arxiv.org/pdf/2405.18194v3.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2408.14134v1","updated":"2024-08-26T09:29:56Z","published":"2024-08-26T09:29:56Z","title":"Exploring the Potential of Large Language Models for Heterophilic Graphs","summary":" Graph Neural Networks (GNNs) are essential for various graph-based learning\ntasks. Notably, classical GNN architectures operate under the assumption of\nhomophily, which posits that connected nodes are likely to share similar\nfeatures. However, this assumption limits the effectiveness of GNNs in handling\nheterophilic graphs where connected nodes often exhibit dissimilar\ncharacteristics. Existing approaches for homophily graphs such as non-local\nneighbor extension and architectural refinement overlook the rich textual data\nassociated with nodes, which could unlock deeper insights into these\nheterophilic contexts. With advancements in Large Language Models (LLMs), there\nis significant promise to enhance GNNs by leveraging the extensive open-world\nknowledge within LLMs to more effectively interpret and utilize textual data\nfor characterizing heterophilic graphs. In this work, we explore the potential\nof LLMs for modeling heterophilic graphs and propose a novel two-stage\nframework: LLM-enhanced edge discriminator and LLM-guided edge reweighting.\nSpecifically, in the first stage, we fine-tune the LLM to better identify\nhomophilic and heterophilic edges based on the textual information of their\nnodes. In the second stage, we adaptively manage message propagation in GNNs\nfor different edge types based on node features, structures, and heterophilic\nor homophilic characteristics. 
To cope with the computational demands when\ndeploying LLMs in practical scenarios, we further explore model distillation\ntechniques to fine-tune smaller, more efficient models that maintain\ncompetitive performance. Extensive experiments validate the effectiveness of\nour framework, demonstrating the feasibility of using LLMs to enhance GNNs for\nnode classification on heterophilic graphs.\n","authors":["Yuxia Wu","Shujie Li","Yuan Fang","Chuan Shi"],"pdf_url":"https://arxiv.org/pdf/2408.14134v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2408.14130v1","updated":"2024-08-26T09:24:36Z","published":"2024-08-26T09:24:36Z","title":"Theoretical Proportion Label Perturbation for Learning from Label\n Proportions in Large Bags","summary":" Learning from label proportions (LLP) is a kind of weakly supervised learning\nthat trains an instance-level classifier from label proportions of bags, which\nconsist of sets of instances without using instance labels. A challenge in LLP\narises when the number of instances in a bag (bag size) is numerous, making the\ntraditional LLP methods difficult due to GPU memory limitations. This study\naims to develop an LLP method capable of learning from bags with large sizes.\nIn our method, smaller bags (mini-bags) are generated by sampling instances\nfrom large-sized bags (original bags), and these mini-bags are used in place of\nthe original bags. However, the proportion of a mini-bag is unknown and differs\nfrom that of the original bag, leading to overfitting. To address this issue,\nwe propose a perturbation method for the proportion labels of sampled mini-bags\nto mitigate overfitting to noisy label proportions. This perturbation is added\nbased on the multivariate hypergeometric distribution, which is statistically\nmodeled. Additionally, loss weighting is implemented to reduce the negative\nimpact of proportions sampled from the tail of the distribution. Experimental\nresults demonstrate that the proportion label perturbation and loss weighting\nachieve classification accuracy comparable to that obtained without sampling.\nOur codes are available at https://github.com/stainlessnight/LLP-LargeBags.\n","authors":["Shunsuke Kubo","Shinnosuke Matsuo","Daiki Suehiro","Kazuhiro Terada","Hiroaki Ito","Akihiko Yoshizawa","Ryoma Bise"],"pdf_url":"https://arxiv.org/pdf/2408.14130v1.pdf","comment":"Accepted at ECAI2024"},{"id":"http://arxiv.org/abs/2408.14126v1","updated":"2024-08-26T09:19:58Z","published":"2024-08-26T09:19:58Z","title":"Enhancing Fairness through Reweighting: A Path to Attain the Sufficiency\n Rule","summary":" We introduce an innovative approach to enhancing the empirical risk\nminimization (ERM) process in model training through a refined reweighting\nscheme of the training data to enhance fairness. This scheme aims to uphold the\nsufficiency rule in fairness by ensuring that optimal predictors maintain\nconsistency across diverse sub-groups. We employ a bilevel formulation to\naddress this challenge, wherein we explore sample reweighting strategies.\nUnlike conventional methods that hinge on model size, our formulation bases\ngeneralization complexity on the space of sample weights. We discretize the\nweights to improve training speed. 
Empirical validation of our method showcases\nits effectiveness and robustness, revealing a consistent improvement in the\nbalance between prediction performance and fairness metrics across various\nexperiments.\n","authors":["Xuan Zhao","Klaus Broelemann","Salvatore Ruggieri","Gjergji Kasneci"],"pdf_url":"https://arxiv.org/pdf/2408.14126v1.pdf","comment":"accepted at ECAI 2024"},{"id":"http://arxiv.org/abs/2407.05206v4","updated":"2024-08-26T09:15:11Z","published":"2024-07-06T23:16:41Z","title":"Helios: An extremely low power event-based gesture recognition for\n always-on smart eyewear","summary":" This paper introduces Helios, the first extremely low-power, real-time,\nevent-based hand gesture recognition system designed for all-day use on smart\neyewear. As augmented reality (AR) evolves, current smart glasses like the Meta\nRay-Bans prioritize visual and wearable comfort at the expense of\nfunctionality. Existing human-machine interfaces (HMIs) in these devices, such\nas capacitive touch and voice controls, present limitations in ergonomics,\nprivacy and power consumption. Helios addresses these challenges by leveraging\nnatural hand interactions for a more intuitive and comfortable user experience.\nOur system utilizes an extremely low-power and compact 3mmx4mm/20mW event camera\nto perform natural hand-based gesture recognition for always-on smart eyewear.\nThe camera's output is processed by a convolutional neural network (CNN)\nrunning on an NXP Nano UltraLite compute platform, consuming less than 350mW.\nHelios can recognize seven classes of gestures, including subtle microgestures\nlike swipes and pinches, with 91% accuracy. We also demonstrate real-time\nperformance across 20 users at a remarkably low latency of 60ms. Our user\ntesting results align with the positive feedback we received during our recent\nsuccessful demo at AWE-USA-2024.\n","authors":["Prarthana Bhattacharyya","Joshua Mitton","Ryan Page","Owen Morgan","Ben Menzies","Gabriel Homewood","Kemi Jacobs","Paolo Baesso","David Trickett","Chris Mair","Taru Muhonen","Rory Clark","Louis Berridge","Richard Vigars","Iain Wallace"],"pdf_url":"https://arxiv.org/pdf/2407.05206v4.pdf","comment":"Accepted at ECCV-Integrating Computer Vision in Smart Eyewear, 2024.\n 18 pages, 10 figures. First three authors contributed equally to this paper"},{"id":"http://arxiv.org/abs/2408.14118v1","updated":"2024-08-26T09:06:35Z","published":"2024-08-26T09:06:35Z","title":"Towards Lifelong Learning Embeddings: An Algorithmic Approach to\n Dynamically Extend Embeddings","summary":" The rapid evolution of technology has transformed business operations and\ncustomer interactions worldwide, with personalization emerging as a key\nopportunity for e-commerce companies to engage customers more effectively. The\napplication of machine learning, particularly that of deep learning models, has\ngained significant traction due to its ability to rapidly recognize patterns in\nlarge datasets, thereby offering numerous possibilities for personalization.\nThese models use embeddings to map discrete information, such as product IDs,\ninto a latent vector space, a method increasingly popular in recent years.\nHowever, e-commerce's dynamic nature, characterized by frequent new product\nintroductions, poses challenges for these embeddings, which typically require\nfixed dimensions and inputs, leading to the need for periodic retraining from\nscratch. 
This paper introduces a modular algorithm that extends embedding input\nsize while preserving learned knowledge, addressing the challenges posed by\ne-commerce's dynamism. The proposed algorithm also incorporates strategies to\nmitigate the cold start problem associated with new products. The results of\ninitial experiments suggest that this method outperforms traditional\nembeddings.\n","authors":["Miguel Alves Gomes","Philipp Meisen","Tobias Meisen"],"pdf_url":"https://arxiv.org/pdf/2408.14118v1.pdf","comment":"Accepted Extended Abstract for 3rd Workshop on End-End Customer\n Journey Optimization at KDD2024, Barcelona, Spain"},{"id":"http://arxiv.org/abs/2408.14116v1","updated":"2024-08-26T09:05:43Z","published":"2024-08-26T09:05:43Z","title":"Hierarchical Learning and Computing over Space-Ground Integrated\n Networks","summary":" Space-ground integrated networks hold great promise for providing global\nconnectivity, particularly in remote areas where large amounts of valuable data\nare generated by Internet of Things (IoT) devices but terrestrial\ncommunication infrastructure is lacking. The massive data is conventionally transferred to\nthe cloud server for centralized artificial intelligence (AI) model training,\nincurring huge communication overhead and raising privacy concerns. To address this, we\npropose a hierarchical learning and computing framework, which leverages the\nlow-latency characteristic of low-earth-orbit (LEO) satellites and the global\ncoverage of geostationary-earth-orbit (GEO) satellites, to provide global\naggregation services for locally trained models on ground IoT devices. Due to\nthe time-varying nature of satellite network topology and the energy\nconstraints of LEO satellites, efficiently aggregating the received local\nmodels from ground devices on LEO satellites is highly challenging. By\nleveraging the predictability of inter-satellite connectivity, modeling the\nspace network as a directed graph, we formulate a network energy minimization\nproblem for model aggregation, which turns out to be a Directed Steiner Tree\n(DST) problem. We propose a topology-aware energy-efficient routing (TAEER)\nalgorithm to solve the DST problem by finding a minimum spanning arborescence\non a substitute directed graph. Extensive simulations under real-world\nspace-ground integrated network settings demonstrate that the proposed TAEER\nalgorithm significantly reduces energy consumption and outperforms benchmarks.\n","authors":["Jingyang Zhu","Yuanming Shi","Yong Zhou","Chunxiao Jiang","Linling Kuang"],"pdf_url":"https://arxiv.org/pdf/2408.14116v1.pdf","comment":"14 pages, 10 figures"},{"id":"http://arxiv.org/abs/2403.08525v2","updated":"2024-08-26T08:49:48Z","published":"2024-03-13T13:33:35Z","title":"From Weak to Strong Sound Event Labels using Adaptive Change-Point\n Detection and Active Learning","summary":" We propose an adaptive change point detection method (A-CPD) for machine\nguided weak label annotation of audio recording segments. The goal is to\nmaximize the amount of information gained about the temporal activations of the\ntarget sounds. For each unlabeled audio recording, we use a prediction model to\nderive a probability curve used to guide annotation. The prediction model is\ninitially pre-trained on available annotated sound event data with classes that\nare disjoint from the classes in the unlabeled dataset. The prediction model\nthen gradually adapts to the annotations provided by the annotator in an active\nlearning loop. 
We derive query segments to guide the weak label annotator\ntowards strong labels, using change point detection on these probabilities. We\nshow that it is possible to derive strong labels of high quality with a limited\nannotation budget, and show favorable results for A-CPD when compared to two\nbaseline query segment strategies.\n","authors":["John Martinsson","Olof Mogren","Maria Sandsten","Tuomas Virtanen"],"pdf_url":"https://arxiv.org/pdf/2403.08525v2.pdf","comment":"Accepted at EUSIPCO 2024 (nominated best student paper)"},{"id":"http://arxiv.org/abs/2405.08334v2","updated":"2024-08-26T08:24:14Z","published":"2024-05-14T06:09:08Z","title":"Could Chemical LLMs benefit from Message Passing","summary":" Pretrained language models (LMs) showcase significant capabilities in\nprocessing molecular text, while concurrently, message passing neural networks\n(MPNNs) demonstrate resilience and versatility in the domain of molecular\nscience. Despite these advancements, we find there are limited studies\ninvestigating the bidirectional interactions between molecular structures and\ntheir corresponding textual representations. Therefore, in this paper, we\npropose two strategies to evaluate whether an information integration can\nenhance the performance: contrast learning, which involves utilizing an MPNN to\nsupervise the training of the LM, and fusion, which exploits information from\nboth models. Our empirical analysis reveals that the integration approaches\nexhibit superior performance compared to baselines when applied to smaller\nmolecular graphs, while these integration approaches do not yield performance\nenhancements on large scale graphs.\n","authors":["Jiaqing Xie","Ziheng Chi"],"pdf_url":"https://arxiv.org/pdf/2405.08334v2.pdf","comment":"Accepted at ACL @ Languages and Molecules 2024. In Proceedings of ACL\n 2024"},{"id":"http://arxiv.org/abs/2408.14086v1","updated":"2024-08-26T08:12:26Z","published":"2024-08-26T08:12:26Z","title":"ReLExS: Reinforcement Learning Explanations for Stackelberg No-Regret\n Learners","summary":" With the constraint of a no regret follower, will the players in a two-player\nStackelberg game still reach Stackelberg equilibrium? We first show when the\nfollower strategy is either reward-average or transform-reward-average, the two\nplayers can always get the Stackelberg Equilibrium. Then, we extend that the\nplayers can achieve the Stackelberg equilibrium in the two-player game under\nthe no regret constraint. Also, we show a strict upper bound of the follower's\nutility difference between with and without no regret constraint. Moreover, in\nconstant-sum two-player Stackelberg games with non-regret action sequences, we\nensure the total optimal utility of the game remains also bounded.\n","authors":["Xiangge Huang","Jingyuan Li","Jiaqing Xie"],"pdf_url":"https://arxiv.org/pdf/2408.14086v1.pdf","comment":"10 pages, 3 figures. Technical Report"},{"id":"http://arxiv.org/abs/2407.20003v2","updated":"2024-08-26T08:10:56Z","published":"2024-07-29T13:34:34Z","title":"On the Effects of Irrelevant Variables in Treatment Effect Estimation\n with Deep Disentanglement","summary":" Estimating treatment effects from observational data is paramount in\nhealthcare, education, and economics, but current deep disentanglement-based\nmethods to address selection bias are insufficiently handling irrelevant\nvariables. 
We demonstrate in experiments that this leads to prediction errors.\nWe disentangle pre-treatment variables with a deep embedding method and\nexplicitly identify and represent irrelevant variables, additionally to\ninstrumental, confounding and adjustment latent factors. To this end, we\nintroduce a reconstruction objective and create an embedding space for\nirrelevant variables using an attached autoencoder. Instead of relying on\nserendipitous suppression of irrelevant variables as in previous deep\ndisentanglement approaches, we explicitly force irrelevant variables into this\nembedding space and employ orthogonalization to prevent irrelevant information\nfrom leaking into the latent space representations of the other factors. Our\nexperiments with synthetic and real-world benchmark datasets show that we can\nbetter identify irrelevant variables and more precisely predict treatment\neffects than previous methods, while prediction quality degrades less when\nadditional irrelevant variables are introduced.\n","authors":["Ahmad Saeed Khan","Erik Schaffernicht","Johannes Andreas Stork"],"pdf_url":"https://arxiv.org/pdf/2407.20003v2.pdf","comment":"Paper is accepted at ECAI-2024"},{"id":"http://arxiv.org/abs/2408.14080v1","updated":"2024-08-26T08:02:57Z","published":"2024-08-26T08:02:57Z","title":"SONICS: Synthetic Or Not -- Identifying Counterfeit Songs","summary":" The recent surge in AI-generated songs presents exciting possibilities and\nchallenges. While these tools democratize music creation, they also necessitate\nthe ability to distinguish between human-composed and AI-generated songs for\nsafeguarding artistic integrity and content curation. Existing research and\ndatasets in fake song detection only focus on singing voice deepfake detection\n(SVDD), where the vocals are AI-generated but the instrumental music is sourced\nfrom real songs. However, this approach is inadequate for contemporary\nend-to-end AI-generated songs where all components (vocals, lyrics, music, and\nstyle) could be AI-generated. Additionally, existing datasets lack lyrics-music\ndiversity, long-duration songs, and open fake songs. To address these gaps, we\nintroduce SONICS, a novel dataset for end-to-end Synthetic Song Detection\n(SSD), comprising over 97k songs with over 49k synthetic songs from popular\nplatforms like Suno and Udio. Furthermore, we highlight the importance of\nmodeling long-range temporal dependencies in songs for effective authenticity\ndetection, an aspect overlooked in existing methods. To capture these patterns,\nwe propose a novel model, SpecTTTra, that is up to 3 times faster and 6 times\nmore memory efficient compared to popular CNN and Transformer-based models\nwhile maintaining competitive performance. Finally, we offer both AI-based and\nHuman evaluation benchmarks, addressing another deficiency in current research.\n","authors":["Md Awsafur Rahman","Zaber Ibn Abdul Hakim","Najibul Haque Sarker","Bishmoy Paul","Shaikh Anowarul Fattah"],"pdf_url":"https://arxiv.org/pdf/2408.14080v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14073v1","updated":"2024-08-26T07:56:17Z","published":"2024-08-26T07:56:17Z","title":"Score-based change point detection via tracking the best of infinitely\n many experts","summary":" We suggest a novel algorithm for online change point detection based on\nsequential score function estimation and tracking the best expert approach. 
The\ncore of the procedure is a version of the fixed share forecaster for the case\nof infinite number of experts and quadratic loss functions. The algorithm shows\na promising performance in numerical experiments on artificial and real-world\ndata sets. We also derive new upper bounds on the dynamic regret of the fixed\nshare forecaster with varying parameter, which are of independent interest.\n","authors":["Anna Markovich","Nikita Puchkin"],"pdf_url":"https://arxiv.org/pdf/2408.14073v1.pdf","comment":"43 pages, 4 figures"},{"id":"http://arxiv.org/abs/2311.02971v3","updated":"2024-08-26T07:46:53Z","published":"2023-11-06T09:17:18Z","title":"TabRepo: A Large Scale Repository of Tabular Model Evaluations and its\n AutoML Applications","summary":" We introduce TabRepo, a new dataset of tabular model evaluations and\npredictions. TabRepo contains the predictions and metrics of 1310 models\nevaluated on 200 classification and regression datasets. We illustrate the\nbenefit of our dataset in multiple ways. First, we show that it allows to\nperform analysis such as comparing Hyperparameter Optimization against current\nAutoML systems while also considering ensembling at marginal cost by using\nprecomputed model predictions. Second, we show that our dataset can be readily\nleveraged to perform transfer-learning. In particular, we show that applying\nstandard transfer-learning techniques allows to outperform current\nstate-of-the-art tabular systems in accuracy, runtime and latency.\n","authors":["David Salinas","Nick Erickson"],"pdf_url":"https://arxiv.org/pdf/2311.02971v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14063v1","updated":"2024-08-26T07:44:53Z","published":"2024-08-26T07:44:53Z","title":"Bridging the gap between Learning-to-plan, Motion Primitives and Safe\n Reinforcement Learning","summary":" Trajectory planning under kinodynamic constraints is fundamental for advanced\nrobotics applications that require dexterous, reactive, and rapid skills in\ncomplex environments. These constraints, which may represent task, safety, or\nactuator limitations, are essential for ensuring the proper functioning of\nrobotic platforms and preventing unexpected behaviors. Recent advances in\nkinodynamic planning demonstrate that learning-to-plan techniques can generate\ncomplex and reactive motions under intricate constraints. However, these\ntechniques necessitate the analytical modeling of both the robot and the entire\ntask, a limiting assumption when systems are extremely complex or when\nconstructing accurate task models is prohibitive. This paper addresses this\nlimitation by combining learning-to-plan methods with reinforcement learning,\nresulting in a novel integration of black-box learning of motion primitives and\noptimization. We evaluate our approach against state-of-the-art safe\nreinforcement learning methods, showing that our technique, particularly when\nexploiting task structure, outperforms baseline methods in challenging\nscenarios such as planning to hit in robot air hockey. 
This work demonstrates\nthe potential of our integrated approach to enhance the performance and safety\nof robots operating under complex kinodynamic constraints.\n","authors":["Piotr Kicki","Davide Tateo","Puze Liu","Jonas Guenster","Jan Peters","Krzysztof Walas"],"pdf_url":"https://arxiv.org/pdf/2408.14063v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12961v2","updated":"2024-08-26T07:43:03Z","published":"2024-08-23T10:12:08Z","title":"Symplectic Bregman divergences","summary":" We present a generalization of Bregman divergences in symplectic vector\nspaces that we term symplectic Bregman divergences. Symplectic Bregman\ndivergences are derived from a symplectic generalization of the Fenchel-Young\ninequality which relies on the notion of symplectic subdifferentials. The\nsymplectic Fenchel-Young inequality is obtained using the symplectic Fenchel\ntransform which is defined with respect to a linear symplectic form. When the\nsymplectic form is built from an inner product, we show that the corresponding\nsymplectic Bregman divergences amount to ordinary Bregman divergences with\nrespect to composite inner products. Some potential applications of symplectic\ndivergences in geometric mechanics, information geometry, and learning dynamics\nin machine learning are touched upon.\n","authors":["Frank Nielsen"],"pdf_url":"https://arxiv.org/pdf/2408.12961v2.pdf","comment":"12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2404.10635v4","updated":"2024-08-26T07:40:52Z","published":"2024-03-26T15:36:47Z","title":"Compressed Federated Reinforcement Learning with a Generative Model","summary":" Reinforcement learning has recently gained unprecedented popularity, yet it\nstill grapples with sample inefficiency. Addressing this challenge, federated\nreinforcement learning (FedRL) has emerged, wherein agents collaboratively\nlearn a single policy by aggregating local estimations. However, this\naggregation step incurs significant communication costs. In this paper, we\npropose CompFedRL, a communication-efficient FedRL approach incorporating both\n\\textit{periodic aggregation} and (direct/error-feedback) compression\nmechanisms. Specifically, we consider compressed federated $Q$-learning with a\ngenerative model setup, where a central server learns an optimal $Q$-function\nby periodically aggregating compressed $Q$-estimates from local agents. For the\nfirst time, we characterize the impact of these two mechanisms (which have\nremained elusive) by providing a finite-time analysis of our algorithm,\ndemonstrating strong convergence behaviors when utilizing either direct or\nerror-feedback compression. Our bounds indicate improved solution accuracy\nconcerning the number of agents and other federated hyperparameters while\nsimultaneously reducing communication costs. 
To corroborate our theory, we also\nconduct in-depth numerical experiments to verify our findings, considering\nTop-$K$ and Sparsified-$K$ sparsification operators.\n","authors":["Ali Beikmohammadi","Sarit Khirirat","Sindri Magnússon"],"pdf_url":"https://arxiv.org/pdf/2404.10635v4.pdf","comment":"European Conference on Machine Learning and Principles and Practice\n of Knowledge Discovery in Databases (ECML-PKDD 2024)"},{"id":"http://arxiv.org/abs/2408.10174v2","updated":"2024-08-26T07:34:46Z","published":"2024-08-19T17:32:15Z","title":"SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From\n Pre-Trained Foundation Models","summary":" Deep model training on extensive datasets is increasingly becoming\ncost-prohibitive, prompting the widespread adoption of deep model fusion\ntechniques to leverage knowledge from pre-existing models. From simple weight\naveraging to more sophisticated methods like AdaMerging, model fusion\neffectively improves model performance and accelerates the development of new\nmodels. However, potential interference between parameters of individual models\nand the lack of interpretability in the fusion progress remain significant\nchallenges. Existing methods often try to resolve the parameter interference\nissue by evaluating attributes of parameters, such as their magnitude or sign,\nor by parameter pruning. In this study, we begin by examining the fine-tuning\nof linear layers through the lens of subspace analysis and explicitly define\nparameter interference as an optimization problem to shed light on this\nsubject. Subsequently, we introduce an innovative approach to model fusion\ncalled zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, which\nallows for the upscaling of source models into an MoE model without extra data\nor further training. Our approach relies on the observation that fine-tuning\nmostly keeps the important parts from the pre-training, but it uses less\nsignificant or unused areas to adapt to new tasks. Also, the issue of parameter\ninterference, which is intrinsically intractable in the original parameter\nspace, can be managed by expanding the dimensions. We conduct extensive\nexperiments across diverse scenarios, such as image classification and text\ngeneration tasks, using full fine-tuning and LoRA fine-tuning, and we apply our\nmethod to large language models (CLIP models, Flan-T5 models, and Mistral-7B\nmodels), highlighting the adaptability and scalability of SMILE. Code is\navailable at https://github.com/tanganke/fusion_bench\n","authors":["Anke Tang","Li Shen","Yong Luo","Shuai Xie","Han Hu","Lefei Zhang","Bo Du","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2408.10174v2.pdf","comment":"Code is available at https://github.com/tanganke/fusion_bench"},{"id":"http://arxiv.org/abs/2301.12778v3","updated":"2024-08-26T07:19:33Z","published":"2023-01-30T10:48:10Z","title":"Investigating Feature and Model Importance in Android Malware Detection:\n An Implemented Survey and Experimental Comparison of ML-Based Methods","summary":" The popularity of Android means it is a common target for malware. Over the\nyears, various studies have found that machine learning models can effectively\ndiscriminate malware from benign applications. However, as the operating system\nevolves, so does malware, bringing into question the findings of these previous\nstudies, many of which report very high accuracies using small, outdated, and\noften imbalanced datasets. 
In this paper, we reimplement 18 representative past\nworks and reevaluate them using a balanced, relevant, and up-to-date dataset\ncomprising 124,000 applications. We also carry out new experiments designed to\nfill holes in existing knowledge, and use our findings to identify the most\neffective features and models to use for Android malware detection within a\ncontemporary environment. We show that high detection accuracies (up to 96.8%)\ncan be achieved using features extracted through static analysis alone, with\nonly a modest benefit (1%) from adding far more expensive dynamic analysis.\nAPI calls and opcodes are the most productive static features, while TCP\nnetwork traffic provides the most predictive dynamic features. Random forests are generally the\nmost effective model, outperforming more complex deep learning approaches.\nWhilst directly combining static and dynamic features is generally ineffective,\nensembling separate models leads to performance comparable to the best\nmodels while using less brittle features.\n","authors":["Ali Muzaffar","Hani Ragab Hassen","Hind Zantout","Michael A Lones"],"pdf_url":"https://arxiv.org/pdf/2301.12778v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13089v2","updated":"2024-08-26T06:40:07Z","published":"2024-08-23T14:16:10Z","title":"On the good reliability of an interval-based metric to validate\n prediction uncertainty for machine learning regression tasks","summary":" This short study presents an opportunistic approach to a (more) reliable\nvalidation method for prediction uncertainty average calibration. Considering\nthat variance-based calibration metrics (ZMS, NLL, RCE...) are quite sensitive\nto the presence of heavy tails in the uncertainty and error distributions, a\nshift is proposed to an interval-based metric, the Prediction Interval Coverage\nProbability (PICP). It is shown on a large ensemble of molecular properties\ndatasets that (1) sets of z-scores are well represented by Student's-$t(\\nu)$\ndistributions, $\\nu$ being the number of degrees of freedom; (2) accurate\nestimation of 95 $\\%$ prediction intervals can be obtained by the simple\n$2\\sigma$ rule for $\\nu>3$; and (3) the resulting PICPs are more quickly and\nreliably tested than variance-based calibration metrics. Overall, this method\nenables testing of 20 $\\%$ more datasets than ZMS testing. Conditional calibration\nis also assessed using the PICP approach.\n","authors":["Pascal Pernot"],"pdf_url":"https://arxiv.org/pdf/2408.13089v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14042v1","updated":"2024-08-26T06:39:49Z","published":"2024-08-26T06:39:49Z","title":"PAGE: Parametric Generative Explainer for Graph Neural Network","summary":" This article introduces PAGE, a parameterized generative interpretive\nframework. PAGE is capable of providing faithful explanations for any graph\nneural network without necessitating prior knowledge or internal details.\nSpecifically, we train an auto-encoder to generate explanatory substructures\nby designing an appropriate training strategy. Due to the dimensionality reduction\nof features in the latent space of the auto-encoder, it becomes easier to\nextract causal features leading to the model's output, which can be easily\nemployed to generate explanations. To accomplish this, we introduce an\nadditional discriminator to capture the causality between latent causal\nfeatures and the model's output. 
By designing appropriate optimization\nobjectives, the well-trained discriminator can be employed to constrain the\nencoder in generating enhanced causal features. Finally, these features are\nmapped to substructures of the input graph through the decoder to serve as\nexplanations. Compared to existing methods, PAGE operates at the sample scale\nrather than at the level of nodes or edges, eliminating the need for perturbation or encoding\nprocesses as seen in previous methods. Experimental results on both\nartificially synthesized and real-world datasets demonstrate that our approach\nnot only exhibits the highest faithfulness and accuracy but also significantly\noutperforms baseline models in terms of efficiency.\n","authors":["Yang Qiu","Wei Liu","Jun Wang","Ruixuan Li"],"pdf_url":"https://arxiv.org/pdf/2408.14042v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.10737v2","updated":"2024-08-26T06:37:24Z","published":"2024-06-15T20:47:38Z","title":"Dynamic Domains, Dynamic Solutions: DPCore for Continual Test-Time\n Adaptation","summary":" Continual Test-Time Adaptation (CTTA) seeks to adapt a source pre-trained\nmodel to continually changing, unlabeled target domains. Existing TTA methods\nare typically designed for environments where domain changes occur sequentially\nand can struggle in more dynamic scenarios, as illustrated in Figure\n\\ref{fig:settings}. Inspired by the principles of online K-Means, we introduce\na novel approach to CTTA through visual prompting. We propose a \\emph{Dynamic\nPrompt Coreset} that not only preserves knowledge from previously visited\ndomains but also accommodates learning from new potential domains. This is\ncomplemented by a distance-based \\emph{Weight Updating Mechanism} that ensures\nthe coreset remains current and relevant. Our approach employs a fixed model\narchitecture alongside the coreset and an innovative updating system to\neffectively mitigate challenges such as catastrophic forgetting and error\naccumulation. Extensive testing on four widely-used benchmarks demonstrates\nthat our method consistently outperforms state-of-the-art alternatives in both\nclassification and segmentation CTTA tasks across the structured and dynamic\nCTTA settings, with $99\\%$ fewer trainable parameters.\n","authors":["Yunbei Zhang","Akshay Mehra","Jihun Hamm"],"pdf_url":"https://arxiv.org/pdf/2406.10737v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14037v1","updated":"2024-08-26T06:14:25Z","published":"2024-08-26T06:14:25Z","title":"Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning","summary":" Increasingly large imitation learning datasets are being collected with the\ngoal of training foundation models for robotics. However, despite the fact that\ndata selection has been of utmost importance in vision and natural language\nprocessing, little work in robotics has questioned what data such models should\nactually be trained on. In this work we investigate how to weigh different\nsubsets or ``domains'' of robotics datasets for robot foundation model\npre-training. Concretely, we use distributionally robust optimization (DRO) to\nmaximize worst-case performance across all possible downstream domains. Our\nmethod, Re-Mix, addresses the wide range of challenges that arise when applying\nDRO to robotics datasets, including variability in action spaces and dynamics\nacross different datasets. Re-Mix employs early stopping, action normalization,\nand discretization to counteract these issues. 
Through extensive\nexperimentation on the largest open-source robot manipulation dataset, the Open\nX-Embodiment dataset, we demonstrate that data curation can have an outsized\nimpact on downstream performance. Specifically, domain weights learned by\nRe-Mix outperform uniform weights by 38\\% on average and outperform\nhuman-selected weights by 32\\% on datasets used to train existing generalist\nrobot policies, specifically the RT-X models.\n","authors":["Joey Hejna","Chethan Bhateja","Yichen Jian","Karl Pertsch","Dorsa Sadigh"],"pdf_url":"https://arxiv.org/pdf/2408.14037v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19757v3","updated":"2024-08-26T05:54:22Z","published":"2024-05-30T07:06:02Z","title":"Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise\n Filtering","summary":" Recent advances in a generative neural network model extend the development\nof data augmentation methods. However, the augmentation methods based on the\nmodern generative models fail to achieve notable performance for class\nimbalance data compared to the conventional model, Synthetic Minority\nOversampling Technique (SMOTE). We investigate the problem of the generative\nmodel for imbalanced classification and introduce a framework to enhance the\nSMOTE algorithm using Variational Autoencoders (VAE). Our approach\nsystematically quantifies the density of data points in a low-dimensional\nlatent space using the VAE, simultaneously incorporating information on class\nlabels and classification difficulty. Then, the data points potentially\ndegrading the augmentation are systematically excluded, and the neighboring\nobservations are directly augmented on the data space. Empirical studies on\nseveral imbalanced datasets represent that this simple process innovatively\nimproves the conventional SMOTE algorithm over the deep learning models.\nConsequently, we conclude that the selection of minority data and the\ninterpolation in the data space are beneficial for imbalanced classification\nproblems with a relatively small number of data points.\n","authors":["Sungchul Hong","Seunghwan An","Jong-June Jeon"],"pdf_url":"https://arxiv.org/pdf/2405.19757v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.18878v2","updated":"2024-08-26T05:54:21Z","published":"2024-03-27T10:46:24Z","title":"Teaching AI the Anatomy Behind the Scan: Addressing Anatomical Flaws in\n Medical Image Segmentation with Learnable Prior","summary":" Imposing key anatomical features, such as the number of organs, their shapes\nand relative positions, is crucial for building a robust multi-organ\nsegmentation model. Current attempts to incorporate anatomical features include\nbroadening the effective receptive field (ERF) size with data-intensive\nmodules, or introducing anatomical constraints that scales poorly to\nmulti-organ segmentation. We introduce a novel architecture called the\nAnatomy-Informed Cascaded Segmentation Network (AIC-Net). AIC-Net incorporates\na learnable input termed \"Anatomical Prior\", which can be adapted to\npatient-specific anatomy using a differentiable spatial deformation. The\ndeformed prior later guides decoder layers towards more anatomy-informed\npredictions. We repeat this process at a local patch level to enhance the\nrepresentation of intricate objects, resulting in a cascaded network structure.\nAIC-Net is a general method that enhances any existing segmentation models to\nbe more anatomy-aware. 
We have validated the performance of AIC-Net, with\nvarious backbones, on two multi-organ segmentation tasks: abdominal organs and\nvertebrae. For each respective task, our benchmarks demonstrate improved dice\nscore and Hausdorff distance.\n","authors":["Young Seok Jeon","Hongfei Yang","Huazhu Fu","Mengling Feng"],"pdf_url":"https://arxiv.org/pdf/2403.18878v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14028v1","updated":"2024-08-26T05:38:27Z","published":"2024-08-26T05:38:27Z","title":"SurGen: Text-Guided Diffusion Model for Surgical Video Generation","summary":" Diffusion-based video generation models have made significant strides,\nproducing outputs with improved visual fidelity, temporal coherence, and user\ncontrol. These advancements hold great promise for improving surgical education\nby enabling more realistic, diverse, and interactive simulation environments.\nIn this study, we introduce SurGen, a text-guided diffusion model tailored for\nsurgical video synthesis, producing the highest resolution and longest duration\nvideos among existing surgical video generation models. We validate the visual\nand temporal quality of the outputs using standard image and video generation\nmetrics. Additionally, we assess their alignment to the corresponding text\nprompts through a deep learning classifier trained on surgical data. Our\nresults demonstrate the potential of diffusion models to serve as valuable\neducational tools for surgical trainees.\n","authors":["Joseph Cho","Samuel Schmidgall","Cyril Zakka","Mrudang Mathur","Rohan Shad","William Hiesinger"],"pdf_url":"https://arxiv.org/pdf/2408.14028v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14025v1","updated":"2024-08-26T05:31:46Z","published":"2024-08-26T05:31:46Z","title":"An Item Response Theory-based R Module for Algorithm Portfolio Analysis","summary":" Experimental evaluation is crucial in AI research, especially for assessing\nalgorithms across diverse tasks. Many studies often evaluate a limited set of\nalgorithms, failing to fully understand their strengths and weaknesses within a\ncomprehensive portfolio. This paper introduces an Item Response Theory (IRT)\nbased analysis tool for algorithm portfolio evaluation called AIRT-Module.\nTraditionally used in educational psychometrics, IRT models test question\ndifficulty and student ability using responses to test questions. Adapting IRT\nto algorithm evaluation, the AIRT-Module contains a Shiny web application and\nthe R package airt. AIRT-Module uses algorithm performance measures to compute\nanomalousness, consistency, and difficulty limits for an algorithm and the\ndifficulty of test instances. The strengths and weaknesses of algorithms are\nvisualised using the difficulty spectrum of the test instances. AIRT-Module\noffers a detailed understanding of algorithm capabilities across varied test\ninstances, thus enhancing comprehensive AI method assessment. It is available\nat https://sevvandi.shinyapps.io/AIRT/ .\n","authors":["Brodie Oldfield","Sevvandi Kandanaarachchi","Ziqi Xu","Mario Andrés Muñoz"],"pdf_url":"https://arxiv.org/pdf/2408.14025v1.pdf","comment":"10 Pages, 6 Figures. Submitted to SoftwareX"},{"id":"http://arxiv.org/abs/2408.10566v2","updated":"2024-08-26T05:08:29Z","published":"2024-08-20T06:05:52Z","title":"SparseGrow: Addressing Growth-Induced Forgetting in Task-Agnostic\n Continual Learning","summary":" In continual learning (CL), model growth enhances adaptability over new data,\nimproving knowledge retention for more tasks. 
However, improper model growth\ncan lead to severe degradation of previously learned knowledge, an issue we\nname as growth-induced forgetting (GIFt), especially in task-agnostic CL using\nentire grown model for inference. Existing works, despite adopting model growth\nand random initialization for better adaptability, often fail to recognize the\npresence of GIFt caused by improper model growth. This oversight limits\ncomprehensive control of forgetting and hinders full utilization of model\ngrowth. We are the first in CL to identify this issue and conduct an in-depth\nstudy on root cause of GIFt, where layer expansion stands out among model\ngrowth strategies, widening layers without affecting model functionality. Yet,\ndirect adoption of layer expansion presents challenges. It lacks data-driven\ncontrol and initialization of expanded parameters to balance adaptability and\nknowledge retention. This paper presents a novel SparseGrow approach to\novercome the issue of GIFt while enhancing adaptability over new data.\nSparseGrow employs data-driven sparse layer expansion to control efficient\nparameter usage during growth, reducing GIFt from excessive growth and\nfunctionality changes. It also combines sparse growth with on-data\ninitialization at training late-stage to create partially 0-valued expansions\nthat fit learned distribution, enhancing retention and adaptability. To further\nminimize forgetting, freezing is applied by calculating the sparse mask,\nallowing data-driven preservation of important parameters. Through experiments\nacross datasets with various settings, cases and task numbers, we demonstrate\nthe necessity of layer expansion and showcase the effectiveness of SparseGrow\nin overcoming GIFt, highlighting its adaptability and knowledge retention for\nincremental tasks.\n","authors":["Yuqing Zhao","Divya Saxena","Jiannong Cao","Xiaoyun Liu","Changlin Song"],"pdf_url":"https://arxiv.org/pdf/2408.10566v2.pdf","comment":"This paper has been submitted to the AAAI conference. If accepted,\n the final version will be updated to reflect the conference proceedings"},{"id":"http://arxiv.org/abs/2407.10784v3","updated":"2024-08-26T04:58:15Z","published":"2024-07-15T15:02:53Z","title":"AdapTable: Test-Time Adaptation for Tabular Data via Shift-Aware\n Uncertainty Calibrator and Label Distribution Handler","summary":" In real-world scenarios, tabular data often suffer from distribution shifts\nthat threaten the performance of machine learning models. Despite its\nprevalence and importance, handling distribution shifts in the tabular domain\nremains underexplored due to the inherent challenges within the tabular data\nitself. In this sense, test-time adaptation (TTA) offers a promising solution\nby adapting models to target data without accessing source data, crucial for\nprivacy-sensitive tabular domains. However, existing TTA methods either 1)\noverlook the nature of tabular distribution shifts, often involving label\ndistribution shifts, or 2) impose architectural constraints on the model,\nleading to a lack of applicability. To this end, we propose AdapTable, a novel\nTTA framework for tabular data. AdapTable operates in two stages: 1)\ncalibrating model predictions using a shift-aware uncertainty calibrator, and\n2) adjusting these predictions to match the target label distribution with a\nlabel distribution handler. We validate the effectiveness of AdapTable through\ntheoretical analysis and extensive experiments on various distribution shift\nscenarios. 
Our results demonstrate AdapTable's ability to handle various\nreal-world distribution shifts, achieving up to a 16% improvement on the HELOC\ndataset.\n","authors":["Changhun Kim","Taewon Kim","Seungyeon Woo","June Yong Yang","Eunho Yang"],"pdf_url":"https://arxiv.org/pdf/2407.10784v3.pdf","comment":"Under Review at AAAI 2025"},{"id":"http://arxiv.org/abs/2408.14014v1","updated":"2024-08-26T04:39:33Z","published":"2024-08-26T04:39:33Z","title":"Category-Theoretical and Topos-Theoretical Frameworks in Machine\n Learning: A Survey","summary":" In this survey, we provide an overview of category theory-derived machine\nlearning from four mainstream perspectives: gradient-based learning,\nprobability-based learning, invariance and equivalence-based learning, and\ntopos-based learning. For the first three topics, we primarily review research\nin the past five years, updating and expanding on the previous survey by\nShiebler et al.. The fourth topic, which delves into higher category theory,\nparticularly topos theory, is surveyed for the first time in this paper. In\ncertain machine learning methods, the compositionality of functors plays a\nvital role, prompting the development of specific categorical frameworks.\nHowever, when considering how the global properties of a network reflect in\nlocal structures and how geometric properties are expressed with logic, the\ntopos structure becomes particularly significant and profound.\n","authors":["Yiyang Jia","Guohong Peng","Zheng Yang","Tianhao Chen"],"pdf_url":"https://arxiv.org/pdf/2408.14014v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14010v1","updated":"2024-08-26T04:31:55Z","published":"2024-08-26T04:31:55Z","title":"Improving Water Quality Time-Series Prediction in Hong Kong using\n Sentinel-2 MSI Data and Google Earth Engine Cloud Computing","summary":" Effective water quality monitoring in coastal regions is crucial due to the\nprogressive deterioration caused by pollution and human activities. To address\nthis, this study develops time-series models to predict chlorophyll-a (Chl-a),\nsuspended solids (SS), and turbidity using Sentinel-2 satellite data and Google\nEarth Engine (GEE) in the coastal regions of Hong Kong. Leveraging Long\nShort-Term Memory (LSTM) Recurrent Neural Networks, the study incorporates\nextensive temporal datasets to enhance prediction accuracy. The models utilize\nspectral data from Sentinel-2, focusing on optically active components, and\ndemonstrate that selected variables closely align with the spectral\ncharacteristics of Chl-a and SS. The results indicate improved predictive\nperformance over previous methods, highlighting the potential for remote\nsensing technology in continuous and comprehensive water quality assessment.\n","authors":["Rohin Sood","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.14010v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14001v1","updated":"2024-08-26T03:58:20Z","published":"2024-08-26T03:58:20Z","title":"Decentralized Federated Learning with Model Caching on Mobile Agents","summary":" Federated Learning (FL) aims to train a shared model using data and\ncomputation power on distributed agents coordinated by a central server.\nDecentralized FL (DFL) utilizes local model exchange and aggregation between\nagents to reduce the communication and computation overheads on the central\nserver. 
However, when agents are mobile, the communication opportunity between\nagents can be sporadic, largely hindering the convergence and accuracy of DFL.\nIn this paper, we study delay-tolerant model spreading and aggregation enabled\nby model caching on mobile agents. Each agent stores not only its own model,\nbut also models of agents encountered in the recent past. When two agents meet,\nthey exchange their own models as well as the cached models. Local model\naggregation works on all models in the cache. We theoretically analyze the\nconvergence of DFL with cached models, explicitly taking into account the model\nstaleness introduced by caching. We design and compare different model caching\nalgorithms for different DFL and mobility scenarios. We conduct detailed case\nstudies in a vehicular network to systematically investigate the interplay\nbetween agent mobility, cache staleness, and model convergence. In our\nexperiments, cached DFL converges quickly, and significantly outperforms DFL\nwithout caching.\n","authors":["Xiaoyu Wang","Guojun Xiong","Houwei Cao","Jian Li","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2408.14001v1.pdf","comment":"27 pages"},{"id":"http://arxiv.org/abs/2407.06886v7","updated":"2024-08-26T03:25:12Z","published":"2024-07-09T14:14:47Z","title":"Aligning Cyber Space with Physical World: A Comprehensive Survey on\n Embodied AI","summary":" Embodied Artificial Intelligence (Embodied AI) is crucial for achieving\nArtificial General Intelligence (AGI) and serves as a foundation for various\napplications that bridge cyberspace and the physical world. Recently, the\nemergence of Multi-modal Large Models (MLMs) and World Models (WMs) have\nattracted significant attention due to their remarkable perception,\ninteraction, and reasoning capabilities, making them a promising architecture\nfor the brain of embodied agents. However, there is no comprehensive survey for\nEmbodied AI in the era of MLMs. In this survey, we give a comprehensive\nexploration of the latest advancements in Embodied AI. Our analysis firstly\nnavigates through the forefront of representative works of embodied robots and\nsimulators, to fully understand the research focuses and their limitations.\nThen, we analyze four main research targets: 1) embodied perception, 2)\nembodied interaction, 3) embodied agent, and 4) sim-to-real adaptation,\ncovering the state-of-the-art methods, essential paradigms, and comprehensive\ndatasets. Additionally, we explore the complexities of MLMs in virtual and real\nembodied agents, highlighting their significance in facilitating interactions\nin dynamic digital and physical environments. Finally, we summarize the\nchallenges and limitations of embodied AI and discuss their potential future\ndirections. We hope this survey will serve as a foundational reference for the\nresearch community and inspire continued innovation. The associated project can\nbe found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.\n","authors":["Yang Liu","Weixing Chen","Yongjie Bai","Xiaodan Liang","Guanbin Li","Wen Gao","Liang Lin"],"pdf_url":"https://arxiv.org/pdf/2407.06886v7.pdf","comment":"The first comprehensive review of Embodied AI in the era of MLMs, 39\n pages. 
We also provide the paper list for Embodied AI:\n https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List"},{"id":"http://arxiv.org/abs/2408.13991v1","updated":"2024-08-26T03:19:52Z","published":"2024-08-26T03:19:52Z","title":"Dual-CBA: Improving Online Continual Learning via Dual Continual Bias\n Adaptors from a Bi-level Optimization Perspective","summary":" In online continual learning (CL), models trained on changing distributions\neasily forget previously learned knowledge and bias toward newly received\ntasks. To address this issue, we present Continual Bias Adaptor (CBA), a\nbi-level framework that augments the classification network to adapt to\ncatastrophic distribution shifts during training, enabling the network to\nachieve a stable consolidation of all seen tasks. However, the CBA module\nadjusts distribution shifts in a class-specific manner, exacerbating the\nstability gap issue and, to some extent, fails to meet the need for continual\ntesting in online CL. To mitigate this challenge, we further propose a novel\nclass-agnostic CBA module that separately aggregates the posterior\nprobabilities of classes from new and old tasks, and applies a stable\nadjustment to the resulting posterior probabilities. We combine the two kinds\nof CBA modules into a unified Dual-CBA module, which thus is capable of\nadapting to catastrophic distribution shifts and simultaneously meets the\nreal-time testing requirements of online CL. Besides, we propose Incremental\nBatch Normalization (IBN), a tailored BN module to re-estimate its population\nstatistics for alleviating the feature bias arising from the inner loop\noptimization problem of our bi-level framework. To validate the effectiveness\nof the proposed method, we theoretically provide some insights into how it\nmitigates catastrophic distribution shifts, and empirically demonstrate its\nsuperiority through extensive experiments based on four rehearsal-based\nbaselines and three public continual learning benchmarks.\n","authors":["Quanziang Wang","Renzhen Wang","Yichen Wu","Xixi Jia","Minghao Zhou","Deyu Meng"],"pdf_url":"https://arxiv.org/pdf/2408.13991v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.15951v4","updated":"2024-08-26T03:09:46Z","published":"2023-06-28T06:21:22Z","title":"Reduce Computational Complexity for Convolutional Layers by Skipping\n Zeros","summary":" Convolutional neural networks necessitate good algorithms to reduce\ncomplexity, and sufficient utilization of parallel processors for acceleration.\nWithin convolutional layers, there are three types of operators: convolution\nused in forward propagation, deconvolution and dilated-convolution utilized in\nbackward propagation. During the execution of these operators, zeros are\ntypically added to tensors, leading to redundant calculations and unnecessary\nstrain on hardware. To circumvent these inefficiencies, we propose the C-K-S\nalgorithm, accompanied by efficient GPU implementations. C-K-S trims filters to\nexclude zero-padding. For deconvolution and dilated-convolution, C-K-S\ntransforms sparse tensors into dense tensors, and standardizes the local\ncomputational rules to simplify the hardware control. 
The experimental results\ndemonstrate that C-K-S offers good performance in terms of speed and\nconvergence, surpassing the capabilities of PyTorch and cuDNN in certain\nscenarios.\n","authors":["Zhiyi Zhang","Pengfei Zhang","Zhuopin Xu","Qi Wang"],"pdf_url":"https://arxiv.org/pdf/2306.15951v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08713v2","updated":"2024-08-26T03:03:47Z","published":"2024-08-16T12:51:52Z","title":"Beyond KAN: Introducing KarSein for Adaptive High-Order Feature\n Interaction Modeling in CTR Prediction","summary":" Modeling feature interactions is crucial for click-through rate (CTR)\nprediction, particularly when it comes to high-order explicit interactions.\nTraditional methods struggle with this task because they often predefine a\nmaximum interaction order, which relies heavily on prior knowledge and can\nlimit the model's effectiveness. Additionally, modeling high-order interactions\ntypically leads to increased computational costs. Therefore, the challenge lies\nin adaptively modeling high-order feature interactions while maintaining\nefficiency. To address this issue, we introduce Kolmogorov-Arnold Represented\nSparse Efficient Interaction Network (KarSein), designed to optimize both\npredictive accuracy and computational efficiency. We firstly identify\nlimitations of directly applying Kolmogorov-Arnold Networks (KAN) to CTR and\nthen introduce KarSein to overcome these issues. It features a novel\narchitecture that reduces the computational costs of KAN and supports embedding\nvectors as feature inputs. Additionally, KarSein employs guided symbolic\nregression to address the challenge of KAN in spontaneously learning\nmultiplicative relationships. Extensive experiments demonstrate KarSein's\nsuperior performance, achieving significant predictive accuracy with minimal\ncomputational overhead. Furthermore, KarSein maintains strong global\nexplainability while enabling the removal of redundant features, resulting in a\nsparse network structure. These advantages also position KarSein as a promising\nmethod for efficient inference.\n","authors":["Yunxiao Shi","Wujiang Xu","Mingyu Jin","Haimin Zhang","Qiang Wu","Yongfeng Zhang","Min Xu"],"pdf_url":"https://arxiv.org/pdf/2408.08713v2.pdf","comment":"KarSein for CTR"},{"id":"http://arxiv.org/abs/2304.06879v2","updated":"2024-08-26T02:59:10Z","published":"2023-04-14T01:12:48Z","title":"Performative Prediction with Neural Networks","summary":" Performative prediction is a framework for learning models that influence the\ndata they intend to predict. We focus on finding classifiers that are\nperformatively stable, i.e. optimal for the data distribution they induce.\nStandard convergence results for finding a performatively stable classifier\nwith the method of repeated risk minimization assume that the data distribution\nis Lipschitz continuous to the model's parameters. Under this assumption, the\nloss must be strongly convex and smooth in these parameters; otherwise, the\nmethod will diverge for some problems. In this work, we instead assume that the\ndata distribution is Lipschitz continuous with respect to the model's\npredictions, a more natural assumption for performative systems. As a result,\nwe are able to significantly relax the assumptions on the loss function. In\nparticular, we do not need to assume convexity with respect to the model's\nparameters. As an illustration, we introduce a resampling procedure that models\nrealistic distribution shifts and show that it satisfies our assumptions. 
We\nsupport our theory by showing that one can learn performatively stable\nclassifiers with neural networks making predictions about real data that shift\naccording to our proposed procedure.\n","authors":["Mehrnaz Mofakhami","Ioannis Mitliagkas","Gauthier Gidel"],"pdf_url":"https://arxiv.org/pdf/2304.06879v2.pdf","comment":"Published at AISTATS 2023; Theoretical results extended"},{"id":"http://arxiv.org/abs/2408.13282v1","updated":"2024-08-26T02:53:55Z","published":"2024-08-26T02:53:55Z","title":"Question answering system of bridge design specification based on large\n language model","summary":" This paper constructs question answering system for bridge design\nspecification based on large language model. Three implementation schemes are\ntried: full fine-tuning of the Bert pretrained model, parameter-efficient\nfine-tuning of the Bert pretrained model, and self-built language model from\nscratch. Through the self-built question and answer task dataset, based on the\ntensorflow and keras deep learning platform framework, the model is constructed\nand trained to predict the start position and end position of the answer in the\nbridge design specification given by the user. The experimental results show\nthat full fine-tuning of the Bert pretrained model achieves 100% accuracy in\nthe training-dataset, validation-dataset and test-dataset, and the system can\nextract the answers from the bridge design specification given by the user to\nanswer various questions of the user; While parameter-efficient fine-tuning of\nthe Bert pretrained model and self-built language model from scratch perform\nwell in the training-dataset, their generalization ability in the test-dataset\nneeds to be improved. The research of this paper provides a useful reference\nfor the development of question answering system in professional field.\n","authors":["Leye Zhang","Xiangxiang Tian","Hongjun Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.13282v1.pdf","comment":"10 pages, 7 figures"},{"id":"http://arxiv.org/abs/2408.13986v1","updated":"2024-08-26T02:36:55Z","published":"2024-08-26T02:36:55Z","title":"AgentMove: Predicting Human Mobility Anywhere Using Large Language Model\n based Agentic Framework","summary":" Human mobility prediction plays a crucial role in various real-world\napplications. Although deep learning based models have shown promising results\nover the past decade, their reliance on extensive private mobility data for\ntraining and their inability to perform zero-shot predictions, have hindered\nfurther advancements. Recently, attempts have been made to apply large language\nmodels (LLMs) to mobility prediction task. However, their performance has been\nconstrained by the absence of a systematic design of workflow. They directly\ngenerate the final output using LLMs, which limits the potential of LLMs to\nuncover complex mobility patterns and underestimates their extensive reserve of\nglobal geospatial knowledge. In this paper, we introduce AgentMove, a\nsystematic agentic prediction framework to achieve generalized mobility\nprediction for any cities worldwide. In AgentMove, we first decompose the\nmobility prediction task into three sub-tasks and then design corresponding\nmodules to complete these subtasks, including spatial-temporal memory for\nindividual mobility pattern mining, world knowledge generator for modeling the\neffects of urban structure and collective knowledge extractor for capturing the\nshared patterns among population. 
Finally, we combine the results of three\nmodules and conduct a reasoning step to generate the final predictions.\nExtensive experiments on mobility data from two sources in 12 cities\ndemonstrate that AgentMove outperforms the best baseline more than 8% in\nvarious metrics and it shows robust predictions with various LLMs as base and\nalso less geographical bias across cities. Codes and data can be found in\nhttps://github.com/tsinghua-fib-lab/AgentMove.\n","authors":["Jie Feng","Yuwei Du","Jie Zhao","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2408.13986v1.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2209.11691v4","updated":"2024-08-26T02:33:01Z","published":"2022-09-23T16:11:09Z","title":"Linear multidimensional regression with interactive fixed-effects","summary":" This paper studies a linear and additively separable model for\nmultidimensional panel data of three or more dimensions with unobserved\ninteractive fixed effects. Two approaches are considered to account for these\nunobserved interactive fixed-effects when estimating coefficients on the\nobserved covariates. First, the model is embedded within the standard two\ndimensional panel framework and restrictions are formed under which the factor\nstructure methods in Bai (2009) lead to consistent estimation of model\nparameters, but at slow rates of convergence. The second approach develops a\nkernel weighted fixed-effects method that is more robust to the\nmultidimensional nature of the problem and can achieve the parametric rate of\nconsistency under certain conditions. Theoretical results and simulations show\nsome benefits to standard two-dimensional panel methods when the structure of\nthe interactive fixed-effect term is known, but also highlight how the kernel\nweighted method performs well without knowledge of this structure. The methods\nare implemented to estimate the demand elasticity for beer.\n","authors":["Hugo Freeman"],"pdf_url":"https://arxiv.org/pdf/2209.11691v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12095v2","updated":"2024-08-26T02:26:31Z","published":"2024-08-22T03:08:49Z","title":"uMedSum: A Unified Framework for Advancing Medical Abstractive\n Summarization","summary":" Medical abstractive summarization faces the challenge of balancing\nfaithfulness and informativeness. Current methods often sacrifice key\ninformation for faithfulness or introduce confabulations when prioritizing\ninformativeness. While recent advancements in techniques like in-context\nlearning (ICL) and fine-tuning have improved medical summarization, they often\noverlook crucial aspects such as faithfulness and informativeness without\nconsidering advanced methods like model reasoning and self-improvement.\nMoreover, the field lacks a unified benchmark, hindering systematic evaluation\ndue to varied metrics and datasets. This paper addresses these gaps by\npresenting a comprehensive benchmark of six advanced abstractive summarization\nmethods across three diverse datasets using five standardized metrics. Building\non these findings, we propose uMedSum, a modular hybrid summarization framework\nthat introduces novel approaches for sequential confabulation removal followed\nby key missing information addition, ensuring both faithfulness and\ninformativeness. Our work improves upon previous GPT-4-based state-of-the-art\n(SOTA) medical summarization methods, significantly outperforming them in both\nquantitative metrics and qualitative domain expert evaluations. 
Notably, we\nachieve an average relative performance improvement of 11.8% in reference-free\nmetrics over the previous SOTA. Doctors prefer uMedSum's summaries 6 times more\nthan previous SOTA in difficult cases where there are chances of confabulations\nor missing information. These results highlight uMedSum's effectiveness and\ngeneralizability across various datasets and metrics, marking a significant\nadvancement in medical summarization.\n","authors":["Aishik Nagar","Yutong Liu","Andy T. Liu","Viktor Schlegel","Vijay Prakash Dwivedi","Arun-Kumar Kaliya-Perumal","Guna Pratheep Kalanchiam","Yili Tang","Robby T. Tan"],"pdf_url":"https://arxiv.org/pdf/2408.12095v2.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2403.10650v2","updated":"2024-08-26T02:19:11Z","published":"2024-03-15T19:35:10Z","title":"PALM: Pushing Adaptive Learning Rate Mechanisms for Continual Test-Time\n Adaptation","summary":" Real-world vision models in dynamic environments face rapid shifts in domain\ndistributions, leading to decreased recognition performance. Using unlabeled\ntest data, continual test-time adaptation (CTTA) directly adjusts a pre-trained\nsource discriminative model to these changing domains. A highly effective CTTA\nmethod involves applying layer-wise adaptive learning rates for selectively\nadapting pre-trained layers. However, it suffers from the poor estimation of\ndomain shift and the inaccuracies arising from the pseudo-labels. This work\naims to overcome these limitations by identifying layers for adaptation via\nquantifying model prediction uncertainty without relying on pseudo-labels. We\nutilize the magnitude of gradients as a metric, calculated by backpropagating\nthe KL divergence between the softmax output and a uniform distribution, to\nselect layers for further adaptation. Subsequently, for the parameters\nexclusively belonging to these selected layers, with the remaining ones frozen,\nwe evaluate their sensitivity to approximate the domain shift and adjust their\nlearning rates accordingly. We conduct extensive image classification\nexperiments on CIFAR-10C, CIFAR-100C, and ImageNet-C, demonstrating the\nsuperior efficacy of our method compared to prior approaches.\n","authors":["Sarthak Kumar Maharana","Baoming Zhang","Yunhui Guo"],"pdf_url":"https://arxiv.org/pdf/2403.10650v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13979v1","updated":"2024-08-26T02:09:05Z","published":"2024-08-26T02:09:05Z","title":"Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models","summary":" With the prevalence of large-scale pretrained vision-language models (VLMs),\nsuch as CLIP, soft-prompt tuning has become a popular method for adapting these\nmodels to various downstream tasks. However, few works delve into the inherent\nproperties of learnable soft-prompt vectors, specifically the impact of their\nnorms to the performance of VLMs. This motivates us to pose an unexplored\nresearch question: ``Do we need to normalize the soft prompts in VLMs?'' To\nfill this research gap, we first uncover a phenomenon, called the\n\\textbf{Low-Norm Effect} by performing extensive corruption experiments,\nsuggesting that reducing the norms of certain learned prompts occasionally\nenhances the performance of VLMs, while increasing them often degrades it. 
To\nharness this effect, we propose a novel method named \\textbf{N}ormalizing\nth\\textbf{e} soft-pro\\textbf{m}pt v\\textbf{e}ctors of vi\\textbf{si}on-language\nmodel\\textbf{s} (\\textbf{Nemesis}) to normalize soft-prompt vectors in VLMs. To\nthe best of our knowledge, our work is the first to systematically investigate\nthe role of norms of soft-prompt vector in VLMs, offering valuable insights for\nfuture research in soft-prompt tuning. The code is available at\n\\texttt{\\href{https://github.com/ShyFoo/Nemesis}{https://github.com/ShyFoo/Nemesis}}.\n","authors":["Shuai Fu","Xiequn Wang","Qiushi Huang","Yu Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.13979v1.pdf","comment":"Accepted at ICLR 2024 (Spotlight)"},{"id":"http://arxiv.org/abs/2408.02679v2","updated":"2024-08-26T01:17:43Z","published":"2024-07-31T08:44:34Z","title":"Visual Analysis of Multi-outcome Causal Graphs","summary":" We introduce a visual analysis method for multiple causal graphs with\ndifferent outcome variables, namely, multi-outcome causal graphs. Multi-outcome\ncausal graphs are important in healthcare for understanding multimorbidity and\ncomorbidity. To support the visual analysis, we collaborated with medical\nexperts to devise two comparative visualization techniques at different stages\nof the analysis process. First, a progressive visualization method is proposed\nfor comparing multiple state-of-the-art causal discovery algorithms. The method\ncan handle mixed-type datasets comprising both continuous and categorical\nvariables and assist in the creation of a fine-tuned causal graph of a single\noutcome. Second, a comparative graph layout technique and specialized visual\nencodings are devised for the quick comparison of multiple causal graphs. In\nour visual analysis approach, analysts start by building individual causal\ngraphs for each outcome variable, and then, multi-outcome causal graphs are\ngenerated and visualized with our comparative technique for analyzing\ndifferences and commonalities of these causal graphs. Evaluation includes\nquantitative measurements on benchmark datasets, a case study with a medical\nexpert, and expert user studies with real-world health research data.\n","authors":["Mengjie Fan","Jinlu Yu","Daniel Weiskopf","Nan Cao","Huai-Yu Wang","Liang Zhou"],"pdf_url":"https://arxiv.org/pdf/2408.02679v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.10148v2","updated":"2024-08-26T01:08:49Z","published":"2024-06-14T15:59:36Z","title":"A Primal-Dual-Assisted Penalty Approach to Bilevel Optimization with\n Coupled Constraints","summary":" Interest in bilevel optimization has grown in recent years, partially due to\nits applications to tackle challenging machine-learning problems. Several\nexciting recent works have been centered around developing efficient\ngradient-based algorithms that can solve bilevel optimization problems with\nprovable guarantees. However, the existing literature mainly focuses on bilevel\nproblems either without constraints, or featuring only simple constraints that\ndo not couple variables across the upper and lower levels, excluding a range of\ncomplex applications. Our paper studies this challenging but less explored\nscenario and develops a (fully) first-order algorithm, which we term BLOCC, to\ntackle BiLevel Optimization problems with Coupled Constraints. 
We establish\nrigorous convergence theory for the proposed algorithm and demonstrate its\neffectiveness on two well-known real-world applications - hyperparameter\nselection in support vector machine (SVM) and infrastructure planning in\ntransportation networks using the real data from the city of Seville.\n","authors":["Liuyuan Jiang","Quan Xiao","Victor M. Tenorio","Fernando Real-Rojas","Antonio G. Marques","Tianyi Chen"],"pdf_url":"https://arxiv.org/pdf/2406.10148v2.pdf","comment":"In this version, we have made the following updates: (1) Added a\n sensitivity analysis of the algorithm's hyperparameters (stepsize and penalty\n constant) in Appendix G. (2) Included a computational complexity analysis and\n comparison in Appendix H. (3) Explicitly stated the inner-loop stepsizes in\n Remarks 2 and 3"},{"id":"http://arxiv.org/abs/2406.18747v2","updated":"2024-08-26T01:07:11Z","published":"2024-06-26T20:25:53Z","title":"A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond\n Four Stems","summary":" Despite significant recent progress across multiple subtasks of audio source\nseparation, few music source separation systems support separation beyond the\nfour-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current\nsystems that support source separation beyond this setup, most continue to rely\non an inflexible decoder setup that can only support a fixed pre-defined set of\nstems. Increasing stem support in these inflexible systems correspondingly\nrequires increasing computational complexity, rendering extensions of these\nsystems computationally infeasible for long-tail instruments. In this work, we\npropose Banquet, a system that allows source separation of multiple stems using\njust one decoder. A bandsplit source separation model is extended to work in a\nquery-based setup in tandem with a music instrument recognition PaSST model. On\nthe MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached\nthe performance level of the significantly more complex 6-stem Hybrid\nTransformer Demucs on VDBO stems and outperformed it on guitar and piano. The\nquery-based setup allows for the separation of narrow instrument classes such\nas clean acoustic guitars, and can be successfully applied to the extraction of\nless common stems such as reeds and organs. Implementation is available at\nhttps://github.com/kwatcharasupat/query-bandit.\n","authors":["Karn N. Watcharasupat","Alexander Lerch"],"pdf_url":"https://arxiv.org/pdf/2406.18747v2.pdf","comment":"Accepted to the 25th International Society for Music Information\n Retrieval Conference (ISMIR 2024). Camera-ready version"},{"id":"http://arxiv.org/abs/2407.07275v2","updated":"2024-08-26T00:55:01Z","published":"2024-07-09T23:39:37Z","title":"Remastering Divide and Remaster: A Cinematic Audio Source Separation\n Dataset with Multilingual Support","summary":" Cinematic audio source separation (CASS), as a problem of extracting the\ndialogue, music, and effects stems from their mixture, is a relatively new\nsubtask of audio source separation. To date, only one publicly available\ndataset exists for CASS, that is, the Divide and Remaster (DnR) dataset, which\nis currently at version 2. While DnR v2 has been an incredibly useful resource\nfor CASS, several areas of improvement have been identified, particularly\nthrough its use in the 2023 Sound Demixing Challenge. 
In this work, we develop\nversion 3 of the DnR dataset, addressing issues relating to vocal content in\nnon-dialogue stems, loudness distributions, mastering process, and linguistic\ndiversity. In particular, the dialogue stem of DnR v3 includes speech content\nfrom more than 30 languages from multiple families including but not limited to\nthe Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu\nfamilies. Benchmark results using the Bandit model indicated that training on\nmultilingual data yields significant generalizability to the model even in\nlanguages with low data availability. Even in languages with high data\navailability, the multilingual model often performs on par or better than\ndedicated models trained on monolingual CASS datasets. Dataset and model\nimplementation will be made available at\nhttps://github.com/kwatcharasupat/source-separation-landing.\n","authors":["Karn N. Watcharasupat","Chih-Wei Wu","Iroro Orife"],"pdf_url":"https://arxiv.org/pdf/2407.07275v2.pdf","comment":"Accepted to the 5th IEEE International Symposium on the Internet of\n Sounds. Camera-ready version"},{"id":"http://arxiv.org/abs/2405.00697v2","updated":"2024-08-26T00:53:17Z","published":"2024-04-10T11:20:52Z","title":"Unveiling Nonlinear Dynamics in Catastrophe Bond Pricing: A Machine\n Learning Perspective","summary":" This paper explores the implications of using machine learning models in the\npricing of catastrophe (CAT) bonds. By integrating advanced machine learning\ntechniques, our approach uncovers nonlinear relationships and complex\ninteractions between key risk factors and CAT bond spreads -- dynamics that are\noften overlooked by traditional linear regression models. Using primary market\nCAT bond transaction records between January 1999 and March 2021, our findings\ndemonstrate that machine learning models not only enhance the accuracy of CAT\nbond pricing but also provide a deeper understanding of how various risk\nfactors interact and influence bond prices in a nonlinear way. These findings\nsuggest that investors and issuers can benefit from incorporating machine\nlearning to better capture the intricate interplay between risk factors when\npricing CAT bonds. The results also highlight the potential for machine\nlearning models to refine our understanding of asset pricing in markets\ncharacterized by complex risk structures.\n","authors":["Xiaowei Chen","Hong Li","Yufan Lu","Rui Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.00697v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.03588v2","updated":"2024-08-26T00:52:40Z","published":"2024-08-07T07:04:29Z","title":"Facing the Music: Tackling Singing Voice Separation in Cinematic Audio\n Source Separation","summary":" Cinematic audio source separation (CASS), as a standalone problem of\nextracting individual stems from their mixture, is a fairly new subtask of\naudio source separation. A typical setup of CASS is a three-stem problem, with\nthe aim of separating the mixture into the dialogue (DX), music (MX), and\neffects (FX) stems. Given the creative nature of cinematic sound production,\nhowever, several edge cases exist; some sound sources do not fit neatly in any\nof these three stems, necessitating the use of additional auxiliary stems in\nproduction. One very common edge case is the singing voice in film audio, which\nmay belong in either the DX or MX or neither, depending heavily on the\ncinematic context. 
In this work, we demonstrate a very straightforward\nextension of the dedicated-decoder Bandit and query-based single-decoder\nBanquet models to a four-stem problem, treating non-musical dialogue,\ninstrumental music, singing voice, and effects as separate stems.\nInterestingly, the query-based Banquet model outperformed the dedicated-decoder\nBandit model. We hypothesized that this is due to a better feature alignment at\nthe bottleneck as enforced by the band-agnostic FiLM layer. Dataset and model\nimplementation will be made available at\nhttps://github.com/kwatcharasupat/source-separation-landing.\n","authors":["Karn N. Watcharasupat","Chih-Wei Wu","Iroro Orife"],"pdf_url":"https://arxiv.org/pdf/2408.03588v2.pdf","comment":"Submitted to the Late-Breaking Demo Session of the 25th International\n Society for Music Information Retrieval (ISMIR) Conference, 2024"},{"id":"http://arxiv.org/abs/2402.17131v2","updated":"2024-08-26T23:59:43Z","published":"2024-02-27T01:53:02Z","title":"Predicting O-GlcNAcylation Sites in Mammalian Proteins with Transformers\n and RNNs Trained with a New Loss Function","summary":" Glycosylation, a protein modification, has multiple essential functional and\nstructural roles. O-GlcNAcylation, a subtype of glycosylation, has the\npotential to be an important target for therapeutics, but methods to reliably\npredict O-GlcNAcylation sites had not been available until 2023; a 2021 review\ncorrectly noted that published models were insufficient and failed to\ngeneralize. Moreover, many are no longer usable. In 2023, a considerably better\nRNN model with an F$_1$ score of 36.17% and an MCC of 34.57% on a large dataset\nwas published. This article first sought to improve these metrics using\ntransformer encoders. While transformers displayed high performance on this\ndataset, their performance was inferior to that of the previously published\nRNN. We then created a new loss function, which we call the weighted focal\ndifferentiable MCC, to improve the performance of classification models. RNN\nmodels trained with this new function display superior performance to models\ntrained using the weighted cross-entropy loss; this new function can also be\nused to fine-tune trained models. A two-cell RNN trained with this loss\nachieves state-of-the-art performance in O-GlcNAcylation site prediction with\nan F$_1$ score of 38.88% and an MCC of 38.20% on that large dataset.\n","authors":["Pedro Seber"],"pdf_url":"https://arxiv.org/pdf/2402.17131v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2202.08658v2","updated":"2024-08-26T23:24:52Z","published":"2022-02-17T13:43:06Z","title":"The merged-staircase property: a necessary and nearly sufficient\n condition for SGD learning of sparse functions on two-layer neural networks","summary":" It is currently known how to characterize functions that neural networks can\nlearn with SGD for two extremal parameterizations: neural networks in the\nlinear regime, and neural networks with no structural constraints. However, for\nthe main parametrization of interest (non-linear but regular networks) no tight\ncharacterization has yet been achieved, despite significant developments.\n We take a step in this direction by considering depth-2 neural networks\ntrained by SGD in the mean-field regime. We consider functions on binary inputs\nthat depend on a latent low-dimensional subspace (i.e., small number of\ncoordinates). 
This regime is of interest since it is poorly understood how\nneural networks routinely tackle high-dimensional datasets and adapt to latent\nlow-dimensional structure without suffering from the curse of dimensionality.\nAccordingly, we study SGD-learnability with $O(d)$ sample complexity in a large\nambient dimension $d$.\n Our main results characterize a hierarchical property, the \"merged-staircase\nproperty\", that is both necessary and nearly sufficient for learning in this\nsetting.\n We further show that non-linear training is necessary: for this class of\nfunctions, linear methods on any feature map (e.g., the NTK) are not capable of\nlearning efficiently. The key tools are a new \"dimension-free\" dynamics\napproximation result that applies to functions defined on a latent space of\nlow-dimension, a proof of global convergence based on polynomial identity\ntesting, and an improvement of lower bounds against linear methods for\nnon-almost orthogonal functions.\n","authors":["Emmanuel Abbe","Enric Boix-Adsera","Theodor Misiakiewicz"],"pdf_url":"https://arxiv.org/pdf/2202.08658v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14687v1","updated":"2024-08-26T23:24:31Z","published":"2024-08-26T23:24:31Z","title":"A Synthetic Benchmark to Explore Limitations of Localized Drift\n Detections","summary":" Concept drift is a common phenomenon in data streams where the statistical\nproperties of the target variable change over time. Traditionally, drift is\nassumed to occur globally, affecting the entire dataset uniformly. However,\nthis assumption does not always hold true in real-world scenarios where only\nspecific subpopulations within the data may experience drift. This paper\nexplores the concept of localized drift and evaluates the performance of\nseveral drift detection techniques in identifying such localized changes. We\nintroduce a synthetic dataset based on the Agrawal generator, where drift is\ninduced in a randomly chosen subgroup. Our experiments demonstrate that\ncommonly adopted drift detection methods may fail to detect drift when it is\nconfined to a small subpopulation. We propose and test various drift detection\napproaches to quantify their effectiveness in this localized drift scenario. We\nmake the source code for the generation of the synthetic benchmark available at\nhttps://github.com/fgiobergia/subgroup-agrawal-drift.\n","authors":["Flavio Giobergia","Eliana Pastor","Luca de Alfaro","Elena Baralis"],"pdf_url":"https://arxiv.org/pdf/2408.14687v1.pdf","comment":"Paper accepted at DELTA Workshop @ KDD 2024"},{"id":"http://arxiv.org/abs/2408.14685v1","updated":"2024-08-26T23:21:44Z","published":"2024-08-26T23:21:44Z","title":"Model-Based Reinforcement Learning for Control of Strongly-Disturbed\n Unsteady Aerodynamic Flows","summary":" The intrinsic high dimension of fluid dynamics is an inherent challenge to\ncontrol of aerodynamic flows, and this is further complicated by a flow's\nnonlinear response to strong disturbances. Deep reinforcement learning, which\ntakes advantage of the exploratory aspects of reinforcement learning (RL) and\nthe rich nonlinearity of a deep neural network, provides a promising approach\nto discover feasible control strategies. However, the typical model-free\napproach to reinforcement learning requires a significant amount of interaction\nbetween the flow environment and the RL agent during training, and this high\ntraining cost impedes its development and application. 
In this work, we propose\na model-based reinforcement learning (MBRL) approach by incorporating a novel\nreduced-order model as a surrogate for the full environment. The model consists\nof a physics-augmented autoencoder, which compresses high-dimensional CFD flow\nfield snaphsots into a three-dimensional latent space, and a latent dynamics\nmodel that is trained to accurately predict the long-time dynamics of\ntrajectories in the latent space in response to action sequences. The\nrobustness and generalizability of the model is demonstrated in two distinct\nflow environments, a pitching airfoil in a highly disturbed environment and a\nvertical-axis wind turbine in a disturbance-free environment. Based on the\ntrained model in the first problem, we realize an MBRL strategy to mitigate\nlift variation during gust-airfoil encounters. We demonstrate that the policy\nlearned in the reduced-order environment translates to an effective control\nstrategy in the full CFD environment.\n","authors":["Zhecheng Liu","Diederik Beckers","Jeff D. Eldredge"],"pdf_url":"https://arxiv.org/pdf/2408.14685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14682v1","updated":"2024-08-26T23:13:38Z","published":"2024-08-26T23:13:38Z","title":"Detecting Interpretable Subgroup Drifts","summary":" The ability to detect and adapt to changes in data distributions is crucial\nto maintain the accuracy and reliability of machine learning models. Detection\nis generally approached by observing the drift of model performance from a\nglobal point of view. However, drifts occurring in (fine-grained) data\nsubgroups may go unnoticed when monitoring global drift. We take a different\nperspective, and introduce methods for observing drift at the finer granularity\nof subgroups. Relevant data subgroups are identified during training and\nmonitored efficiently throughout the model's life. Performance drifts in any\nsubgroup are detected, quantified and characterized so as to provide an\ninterpretable summary of the model behavior over time. Experimental results\nconfirm that our subgroup-level drift analysis identifies drifts that do not\nshow at the (coarser) global dataset level. The proposed approach provides a\nvaluable tool for monitoring model performance in dynamic real-world\napplications, offering insights into the evolving nature of data and ultimately\ncontributing to more robust and adaptive models.\n","authors":["Flavio Giobergia","Eliana Pastor","Luca de Alfaro","Elena Baralis"],"pdf_url":"https://arxiv.org/pdf/2408.14682v1.pdf","comment":"Currently under submission"},{"id":"http://arxiv.org/abs/2401.10393v3","updated":"2024-08-26T23:10:59Z","published":"2024-01-18T22:06:38Z","title":"Natural Mitigation of Catastrophic Interference: Continual Learning in\n Power-Law Learning Environments","summary":" Neural networks often suffer from catastrophic interference (CI): performance\non previously learned tasks drops off significantly when learning a new task.\nThis contrasts strongly with humans, who can continually learn new tasks\nwithout appreciably forgetting previous tasks. Prior work has explored various\ntechniques for mitigating CI and promoting continual learning such as\nregularization, rehearsal, generative replay, and context-specific components.\nThis paper takes a different approach, one guided by cognitive science research\nshowing that in naturalistic environments, the probability of encountering a\ntask decreases as a power-law of the time since it was last performed. 
We argue\nthat techniques for mitigating CI should be compared against the intrinsic\nmitigation in simulated naturalistic learning environments. Thus, we evaluate\nthe extent of the natural mitigation of CI when training models in power-law\nenvironments, similar to those humans face. Our results show that natural\nrehearsal environments are better at mitigating CI than existing methods,\ncalling for the need for better evaluation processes. The benefits of this\nenvironment include simplicity, rehearsal that is agnostic to both tasks and\nmodels, and the lack of a need for extra neural circuitry. In addition, we\nexplore popular mitigation techniques in power-law environments to create new\nbaselines for continual learning research.\n","authors":["Atith Gandhi","Raj Sanjay Shah","Vijay Marupudi","Sashank Varma"],"pdf_url":"https://arxiv.org/pdf/2401.10393v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14681v1","updated":"2024-08-26T23:10:42Z","published":"2024-08-26T23:10:42Z","title":"Enhancing Neural Network Interpretability Through Conductance-Based\n Information Plane Analysis","summary":" The Information Plane is a conceptual framework used to analyze the flow of\ninformation in neural networks, but traditional methods based on activations\nmay not fully capture the dynamics of information processing. This paper\nintroduces a new approach that uses layer conductance, a measure of sensitivity\nto input features, to enhance the Information Plane analysis. By incorporating\ngradient-based contributions, we provide a more precise characterization of\ninformation dynamics within the network. The proposed conductance-based\nInformation Plane and a new Information Transformation Efficiency (ITE) metric\nare evaluated on pretrained ResNet50 and VGG16 models using the ImageNet\ndataset. Our results demonstrate the ability to identify critical hidden layers\nthat contribute significantly to model performance and interpretability, giving\ninsights into information compression, preservation, and utilization across\nlayers. The conductance-based approach offers a granular perspective on feature\nattribution, enhancing our understanding of the decision-making processes\nwithin neural networks. Furthermore, our empirical findings challenge certain\ntheoretical predictions of the Information Bottleneck theory, highlighting the\ncomplexities of information dynamics in real-world data scenarios. The proposed\nmethod not only advances our understanding of information dynamics in neural\nnetworks but also has the potential to significantly impact the broader field\nof Artificial Intelligence by enabling the development of more interpretable,\nefficient, and robust models.\n","authors":["Jaouad Dabounou","Amine Baazzouz"],"pdf_url":"https://arxiv.org/pdf/2408.14681v1.pdf","comment":"16 pages, 10 figures"},{"id":"http://arxiv.org/abs/2408.14680v1","updated":"2024-08-26T23:10:01Z","published":"2024-08-26T23:10:01Z","title":"On-Chip Learning with Memristor-Based Neural Networks: Assessing\n Accuracy and Efficiency Under Device Variations, Conductance Errors, and\n Input Noise","summary":" This paper presents a memristor-based compute-in-memory hardware accelerator\nfor on-chip training and inference, focusing on its accuracy and efficiency\nagainst device variations, conductance errors, and input noise. 
Utilizing\nrealistic SPICE models of commercially available silver-based metal\nself-directed channel (M-SDC) memristors, the study incorporates inherent\ndevice non-idealities into the circuit simulations. The hardware, consisting of\n30 memristors and 4 neurons, utilizes three different M-SDC structures with\ntungsten, chromium, and carbon media to perform binary image classification\ntasks. An on-chip training algorithm precisely tunes memristor conductance to\nachieve target weights. Results show that incorporating moderate noise (<15%)\nduring training enhances robustness to device variations and noisy input data,\nachieving up to 97% accuracy despite conductance variations and input noises.\nThe network tolerates a 10% conductance error without significant accuracy\nloss. Notably, omitting the initial memristor reset pulse during training\nconsiderably reduces training time and energy consumption. The hardware\ndesigned with chromium-based memristors exhibits superior performance,\nachieving a training time of 2.4 seconds and an energy consumption of 18.9 mJ.\nThis research provides insights for developing robust and energy-efficient\nmemristor-based neural networks for on-chip learning in edge applications.\n","authors":["M. Reza Eslami","Dhiman Biswas","Soheib Takhtardeshir","Sarah S. Sharif","Yaser M. Banad"],"pdf_url":"https://arxiv.org/pdf/2408.14680v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12320v2","updated":"2024-08-26T23:05:51Z","published":"2024-08-22T11:57:07Z","title":"PolyRouter: A Multi-LLM Querying System","summary":" With the rapid growth of Large Language Models (LLMs) across various domains,\nnumerous new LLMs have emerged, each possessing domain-specific expertise. This\nproliferation has highlighted the need for quick, high-quality, and\ncost-effective LLM query response methods. Yet, no single LLM exists to\nefficiently balance this trilemma. Some models are powerful but extremely\ncostly, while others are fast and inexpensive but qualitatively inferior. To\naddress this challenge, we present PolyRouter, a non-monolithic LLM querying\nsystem that seamlessly integrates various LLM experts into a single query\ninterface and dynamically routes incoming queries to the most high-performant\nexpert based on query's requirements. Through extensive experiments, we\ndemonstrate that when compared to standalone expert models, PolyRouter improves\nquery efficiency by up to 40%, and leads to significant cost reductions of up\nto 30%, while maintaining or enhancing model performance by up to 10%.\n","authors":["Dimitris Stripelis","Zijian Hu","Jipeng Zhang","Zhaozhuo Xu","Alay Dilipbhai Shah","Han Jin","Yuhang Yao","Salman Avestimehr","Chaoyang He"],"pdf_url":"https://arxiv.org/pdf/2408.12320v2.pdf","comment":"14 pages, 7 figures, 2 tables"},{"id":"http://arxiv.org/abs/2408.14678v1","updated":"2024-08-26T23:01:48Z","published":"2024-08-26T23:01:48Z","title":"Bridging the Gap: Unpacking the Hidden Challenges in Knowledge\n Distillation for Online Ranking Systems","summary":" Knowledge Distillation (KD) is a powerful approach for compressing a large\nmodel into a smaller, more efficient model, particularly beneficial for\nlatency-sensitive applications like recommender systems. However, current KD\nresearch predominantly focuses on Computer Vision (CV) and NLP tasks,\noverlooking unique data characteristics and challenges inherent to recommender\nsystems. 
This paper addresses these overlooked challenges, specifically: (1)\nmitigating data distribution shifts between teacher and student models, (2)\nefficiently identifying optimal teacher configurations within time and\nbudgetary constraints, and (3) enabling computationally efficient and rapid\nsharing of teacher labels to support multiple students. We present a robust KD\nsystem developed and rigorously evaluated on multiple large-scale personalized\nvideo recommendation systems within Google. Our live experiment results\ndemonstrate significant improvements in student model performance while\nensuring consistent and reliable generation of high quality teacher labels from\na continuous data stream of data.\n","authors":["Nikhil Khani","Shuo Yang","Aniruddh Nath","Yang Liu","Pendo Abbo","Li Wei","Shawn Andrews","Maciej Kula","Jarrod Kahn","Zhe Zhao","Lichan Hong","Ed Chi"],"pdf_url":"https://arxiv.org/pdf/2408.14678v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14677v1","updated":"2024-08-26T22:57:01Z","published":"2024-08-26T22:57:01Z","title":"Can Optimization Trajectories Explain Multi-Task Transfer?","summary":" Despite the widespread adoption of multi-task training in deep learning,\nlittle is understood about how multi-task learning (MTL) affects\ngeneralization. Prior work has conjectured that the negative effects of MTL are\ndue to optimization challenges that arise during training, and many\noptimization methods have been proposed to improve multi-task performance.\nHowever, recent work has shown that these methods fail to consistently improve\nmulti-task generalization. In this work, we seek to improve our understanding\nof these failures by empirically studying how MTL impacts the optimization of\ntasks, and whether this impact can explain the effects of MTL on\ngeneralization. We show that MTL results in a generalization gap-a gap in\ngeneralization at comparable training loss-between single-task and multi-task\ntrajectories early into training. However, we find that factors of the\noptimization trajectory previously proposed to explain generalization gaps in\nsingle-task settings cannot explain the generalization gaps between single-task\nand multi-task models. Moreover, we show that the amount of gradient conflict\nbetween tasks is correlated with negative effects to task optimization, but is\nnot predictive of generalization. Our work sheds light on the underlying causes\nfor failures in MTL and, importantly, raises questions about the role of\ngeneral purpose multi-task optimization algorithms.\n","authors":["David Mueller","Mark Dredze","Nicholas Andrews"],"pdf_url":"https://arxiv.org/pdf/2408.14677v1.pdf","comment":"Pre-print"}],"Multimedia":[{"id":"http://arxiv.org/abs/2408.14155v1","updated":"2024-08-26T09:59:45Z","published":"2024-08-26T09:59:45Z","title":"Digital Fingerprinting on Multimedia: A Survey","summary":" The explosive growth of multimedia content in the digital economy era has\nbrought challenges in content recognition, copyright protection, and data\nmanagement. As an emerging content management technology, perceptual hash-based\ndigital fingerprints, serving as compact summaries of multimedia content, have\nbeen widely adopted for efficient multimedia content identification and\nretrieval across different modalities (e.g., text, image, video, audio),\nattracting significant attention from both academia and industry. 
Despite the\nincreasing applications of digital fingerprints, there is a lack of systematic\nand comprehensive literature review on multimedia digital fingerprints. This\nsurvey aims to fill this gap and provide an important resource for researchers\nstudying the details and related advancements of multimedia digital\nfingerprints. The survey first introduces the definition, characteristics, and\nrelated concepts (including hash functions, granularity, similarity measures,\netc.) of digital fingerprints. It then focuses on analyzing and summarizing the\nalgorithms for extracting unimodal fingerprints of different types of digital\ncontent, including text fingerprints, image fingerprints, video fingerprints,\nand audio fingerprints. Particularly, it provides an in-depth review and\nsummary of deep learning-based fingerprints. Additionally, the survey\nelaborates on the various practical applications of digital fingerprints and\noutlines the challenges and potential future research directions. The goal is\nto promote the continued development of multimedia digital fingerprint\nresearch.\n","authors":["Wendi Chen","Wensheng Gan","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2408.14155v1.pdf","comment":"Preprint. 5 figures, 7 tables"},{"id":"http://arxiv.org/abs/2407.04284v2","updated":"2024-08-26T08:19:03Z","published":"2024-07-05T06:32:52Z","title":"TSC-PCAC: Voxel Transformer and Sparse Convolution Based Point Cloud\n Attribute Compression for 3D Broadcasting","summary":" Point cloud has been the mainstream representation for advanced 3D\napplications, such as virtual reality and augmented reality. However, the\nmassive data amounts of point clouds is one of the most challenging issues for\ntransmission and storage. In this paper, we propose an end-to-end voxel\nTransformer and Sparse Convolution based Point Cloud Attribute Compression\n(TSC-PCAC) for 3D broadcasting. Firstly, we present a framework of the\nTSC-PCAC, which include Transformer and Sparse Convolutional Module (TSCM)\nbased variational autoencoder and channel context module. Secondly, we propose\na two-stage TSCM, where the first stage focuses on modeling local dependencies\nand feature representations of the point clouds, and the second stage captures\nglobal features through spatial and channel pooling encompassing larger\nreceptive fields. This module effectively extracts global and local interpoint\nrelevance to reduce informational redundancy. Thirdly, we design a TSCM based\nchannel context module to exploit interchannel correlations, which improves the\npredicted probability distribution of quantized latent representations and thus\nreduces the bitrate. Experimental results indicate that the proposed TSC-PCAC\nmethod achieves an average of 38.53%, 21.30%, and 11.19% Bjontegaard Delta\nbitrate reductions compared to the Sparse-PCAC, NF-PCAC, and G-PCC v23 methods,\nrespectively. The encoding/decoding time costs are reduced up to 97.68%/98.78%\non average compared to the Sparse-PCAC. 
The source code and the trained models\nof the TSC-PCAC are available at https://github.com/igizuxo/TSC-PCAC.\n","authors":["Zixi Guo","Yun Zhang","Linwei Zhu","Hanli Wang","Gangyi Jiang"],"pdf_url":"https://arxiv.org/pdf/2407.04284v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14084v1","updated":"2024-08-26T08:11:35Z","published":"2024-08-26T08:11:35Z","title":"HABD: a houma alliance book ancient handwritten character recognition\n database","summary":" The Houma Alliance Book, one of history's earliest calligraphic examples, was\nunearthed in the 1970s. These artifacts were meticulously organized,\nreproduced, and copied by the Shanxi Provincial Institute of Cultural Relics.\nHowever, because of their ancient origins and severe ink erosion, identifying\ncharacters in the Houma Alliance Book is challenging, necessitating the use of\ndigital technology. In this paper, we propose a new ancient handwritten\ncharacter recognition database for the Houma alliance book, along with a novel\nbenchmark based on deep learning architectures. More specifically, a collection\nof 26,732 characters samples from the Houma Alliance Book were gathered,\nencompassing 327 different types of ancient characters through iterative\nannotation. Furthermore, benchmark algorithms were proposed by combining four\ndeep neural network classifiers with two data augmentation methods. This\nresearch provides valuable resources and technical support for further studies\non the Houma Alliance Book and other ancient characters. This contributes to\nour understanding of ancient culture and history, as well as the preservation\nand inheritance of humanity's cultural heritage.\n","authors":["Xiaoyu Yuan","Xiaohua Huang","Zibo Zhang","Yabo Sun"],"pdf_url":"https://arxiv.org/pdf/2408.14084v1.pdf","comment":"8 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.12321v2","updated":"2024-08-26T04:27:54Z","published":"2024-08-22T11:57:16Z","title":"MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework\n for Multimodal Large Language Model","summary":" This paper presents MaVEn, an innovative Multi-granularity Visual Encoding\nframework designed to enhance the capabilities of Multimodal Large Language\nModels (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on\nsingle-image visual understanding, limiting their ability to interpret and\nintegrate information across multiple images. MaVEn addresses this limitation\nby combining discrete visual symbol sequences, which abstract coarse-grained\nsemantic concepts, with traditional continuous representation sequences that\nmodel fine-grained features. This dual approach bridges the semantic gap\nbetween visual and textual data, thereby improving the model's ability to\nprocess and interpret information from multiple images effectively.\nAdditionally, we design a dynamic reduction mechanism by for long-sequence\ncontinuous features to enhance multi-image processing efficiency. 
Experimental\nresults demonstrate that MaVEn significantly enhances MLLMs' understanding in\ncomplex multi-image scenarios, while also improving performance in single-image\ncontexts.\n","authors":["Chaoya Jiang","Jia Hongrui","Haiyang Xu","Wei Ye","Mengfan Dong","Ming Yan","Ji Zhang","Fei Huang","Shikun Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.12321v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14547v1","updated":"2024-08-26T18:00:33Z","published":"2024-08-26T18:00:33Z","title":"Revisiting Image Captioning Training Paradigm via Direct CLIP-based\n Optimization","summary":" The conventional training approach for image captioning involves pre-training\na network using teacher forcing and subsequent fine-tuning with Self-Critical\nSequence Training to maximize hand-crafted captioning metrics. However, when\nattempting to optimize modern and higher-quality metrics like CLIP-Score and\nPAC-Score, this training method often encounters instability and fails to\nacquire the genuine descriptive capabilities needed to produce fluent and\ninformative captions. In this paper, we propose a new training paradigm termed\nDirect CLIP-Based Optimization (DiCO). Our approach jointly learns and\noptimizes a reward model that is distilled from a learnable captioning\nevaluator with high human correlation. This is done by solving a weighted\nclassification problem directly inside the captioner. At the same time, DiCO\nprevents divergence from the original model, ensuring that fluency is\nmaintained. DiCO not only exhibits improved stability and enhanced quality in\nthe generated captions but also aligns more closely with human preferences\ncompared to existing methods, especially in modern metrics. Additionally, it\nmaintains competitive performance in traditional metrics. Our source code and\ntrained models are publicly available at https://github.com/aimagelab/DiCO.\n","authors":["Nicholas Moratelli","Davide Caffagni","Marcella Cornia","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2408.14547v1.pdf","comment":"BMVC 2024"}]},"2024-08-25T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2408.12574v2","updated":"2024-08-25T23:58:25Z","published":"2024-08-22T17:41:45Z","title":"MuMA-ToM: Multi-modal Multi-Agent Theory of Mind","summary":" Understanding people's social interactions in complex real-world scenarios\noften relies on intricate mental reasoning. To truly understand how and why\npeople interact with one another, we must infer the underlying mental states\nthat give rise to the social interactions, i.e., Theory of Mind reasoning in\nmulti-agent interactions. Additionally, social interactions are often\nmulti-modal -- we can watch people's actions, hear their conversations, and/or\nread about their past behaviors. For AI systems to successfully and safely\ninteract with people in real-world environments, they also need to understand\npeople's mental states as well as their inferences about each other's mental\nstates based on multi-modal information about their interactions. For this, we\nintroduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark.\nMuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates\nmental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide\nvideo and text descriptions of people's multi-modal behavior in realistic\nhousehold environments. Based on the context, we then ask questions about\npeople's goals, beliefs, and beliefs about others' goals. 
We validated MuMA-ToM\nin a human experiment and provided a human baseline. We also proposed a novel\nmulti-modal, multi-agent ToM model, LIMP (Language model-based Inverse\nMulti-agent Planning). Our experimental results show that LIMP significantly\noutperforms state-of-the-art methods, including large multi-modal models (e.g.,\nGPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.\n","authors":["Haojun Shi","Suyu Ye","Xinyu Fang","Chuanyang Jin","Leyla Isik","Yen-Ling Kuo","Tianmin Shu"],"pdf_url":"https://arxiv.org/pdf/2408.12574v2.pdf","comment":"Project website: https://scai.cs.jhu.edu/projects/MuMA-ToM/ Code:\n https://github.com/SCAI-JHU/MuMA-ToM"},{"id":"http://arxiv.org/abs/2408.13959v1","updated":"2024-08-25T23:46:35Z","published":"2024-08-25T23:46:35Z","title":"Bidirectional Awareness Induction in Autoregressive Seq2Seq Models","summary":" Autoregressive Sequence-To-Sequence models are the foundation of many Deep\nLearning achievements in major research fields such as Vision and Natural\nLanguage Processing. Despite that, they still present significant limitations.\nFor instance, when errors occur in the early steps of the prediction, the whole\noutput is severely affected. Such reliance on previously predicted tokens and\nthe inherent computational unfriendliness of sequential algorithms, motivated\nresearchers to explore different architectures and methods in the search for\nbidirectional approaches. In this work, we introduce the Bidirectional\nAwareness Induction (BAI), a training method that leverages a subset of\nelements in the network, the Pivots, to perform bidirectional learning without\nbreaking the autoregressive constraints. To showcase its flexibility, we apply\nthe method to three architectures, the Transformer, ExpansionNet v2 and GPT,\nthen perform experiments over three tasks. Experimental results showcase BAI's\neffectiveness on all selected tasks and architectures. In particular, we\nobserved an increase of up to 2.4 CIDEr in Image-Captioning, 4.96 BLEU in\nNeural Machine Translation, and 1.16 ROUGE in Text Summarization compared to\nthe respective baselines. Notably, BAI not only has a positive impact on models\ntrained from scratch but on pre-trained models as well. Such an aspect,\ncombined with the absence of architectural requirements synergizes well with\nthe current trend of LLMs.\n","authors":["Jia Cheng Hu","Roberto Cavicchioli","Alessandro Capotondi"],"pdf_url":"https://arxiv.org/pdf/2408.13959v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13958v1","updated":"2024-08-25T23:41:39Z","published":"2024-08-25T23:41:39Z","title":"Prediction of COPD Using Machine Learning, Clinical Summary Notes, and\n Vital Signs","summary":" Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung\ndisease that causes obstructed airflow from the lungs. In the United States,\nmore than 15.7 million Americans have been diagnosed with COPD, with 96% of\nindividuals living with at least one other chronic health condition. It is the\n4th leading cause of death in the country. Over 2.2 million patients are\nadmitted to hospitals annually due to COPD exacerbations. Monitoring and\npredicting patient exacerbations on-time could save their life. This paper\npresents two different predictive models to predict COPD exacerbation using AI\nand natural language processing (NLP) approaches. These models use respiration\nsummary notes, symptoms, and vital signs. 
To train and test these models, data\nrecords containing physiologic signals and vital signs time series were used.\nThese records were captured from patient monitors and comprehensive clinical\ndata obtained from hospital medical information systems for tens of thousands\nof Intensive Care Unit (ICU) patients. We achieved an area under the Receiver\noperating characteristic (ROC) curve of 0.82 in detection and prediction of\nCOPD exacerbation.\n","authors":["Negar Orangi-Fard"],"pdf_url":"https://arxiv.org/pdf/2408.13958v1.pdf","comment":"11 pages, 5 figures"},{"id":"http://arxiv.org/abs/2406.06878v2","updated":"2024-08-25T23:25:08Z","published":"2024-06-11T01:43:23Z","title":"Modeling language contact with the Iterated Learning Model","summary":" Contact between languages has the potential to transmit vocabulary and other\nlanguage features; however, this does not always happen. Here, an iterated\nlearning model is used to examine, in a simple way, the resistance of languages\nto change during language contact. Iterated learning models are agent-based\nmodels of language change, they demonstrate that languages that are expressive\nand compositional arise spontaneously as a consequence of a language\ntransmission bottleneck. A recently introduced type of iterated learning model,\nthe Semi-Supervised ILM is used to simulate language contact. These simulations\ndo not include many of the complex factors involved in language contact and do\nnot model a population of speakers; nonetheless the model demonstrates that the\ndynamics which lead languages in the model to spontaneously become expressive\nand compositional, also cause a language to maintain its core traits even after\nmixing with another language.\n","authors":["Seth Bullock","Conor Houghton"],"pdf_url":"https://arxiv.org/pdf/2406.06878v2.pdf","comment":"to appear ALIFE24"},{"id":"http://arxiv.org/abs/2408.13940v1","updated":"2024-08-25T21:20:17Z","published":"2024-08-25T21:20:17Z","title":"CoT Rerailer: Enhancing the Reliability of Large Language Models in\n Complex Reasoning Tasks through Error Detection and Correction","summary":" Chain-of-Thought (CoT) prompting enhances Large Language Models (LLMs)\ncomplex reasoning abilities by generating intermediate steps. However, these\nsteps can introduce hallucinations and accumulate errors. We propose the CoT\nRerailer to address these challenges, employing self-consistency and\nmulti-agent debate systems to identify and rectify errors in the reasoning\nprocess. The CoT Rerailer first selects the most logically correct Reasoning\nPath (RP) using consistency checks and critical evaluation by automated agents.\nIt then engages a multi-agent debate system to propose and validate corrections\nto ensure the generation of an error-free intermediate logical path. The\ncorrected steps are then used to generate a revised reasoning chain to further\nreduce hallucinations and enhance answer quality. We demonstrate the\neffectiveness of our approach across diverse question-answering datasets in\nvarious knowledge domains. 
The CoT Rerailer enhances the reliability of\nLLM-generated reasoning, contributing to more trustworthy AI driven\ndecision-making processes.\n","authors":["Guangya Wan","Yuqi Wu","Jie Chen","Sheng Li"],"pdf_url":"https://arxiv.org/pdf/2408.13940v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13933v1","updated":"2024-08-25T20:41:22Z","published":"2024-08-25T20:41:22Z","title":"MobileQuant: Mobile-friendly Quantization for On-device Language Models","summary":" Large language models (LLMs) have revolutionized language processing,\ndelivering outstanding results across multiple applications. However, deploying\nLLMs on edge devices poses several challenges with respect to memory, energy,\nand compute costs, limiting their widespread use in devices such as mobile\nphones. A promising solution is to reduce the number of bits used to represent\nweights and activations. While existing works have found partial success at\nquantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations\nbeyond 16 bits often leads to large computational overheads due to poor\non-device quantization support, or a considerable accuracy drop. Yet, 8-bit\nactivations are very attractive for on-device deployment as they would enable\nLLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units\n(NPUs). In this work, we make a first attempt to facilitate the on-device\ndeployment of LLMs using integer-only quantization. We first investigate the\nlimitations of existing quantization methods for on-device deployment, with a\nspecial focus on activation quantization. We then address these limitations by\nintroducing a simple post-training quantization method, named MobileQuant, that\nextends previous weight equivalent transformation works by jointly optimizing\nthe weight transformation and activation range parameters in an end-to-end\nmanner. MobileQuant demonstrates superior capabilities over existing methods by\n1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2)\nreducing latency and energy consumption by 20\\%-50\\% compared to current\non-device quantization strategies, 3) requiring limited compute budget, 4)\nbeing compatible with mobile-friendly compute units, e.g. NPU.\n","authors":["Fuwen Tan","Royson Lee","Łukasz Dudziak","Shell Xu Hu","Sourav Bhattacharya","Timothy Hospedales","Georgios Tzimiropoulos","Brais Martinez"],"pdf_url":"https://arxiv.org/pdf/2408.13933v1.pdf","comment":"Code and models available: https://github.com/saic-fi/MobileQuant"},{"id":"http://arxiv.org/abs/2408.13915v1","updated":"2024-08-25T18:47:55Z","published":"2024-08-25T18:47:55Z","title":"LLMs are Superior Feedback Providers: Bootstrapping Reasoning for Lie\n Detection with Self-Generated Feedback","summary":" Large Language Models (LLMs) excel at generating human-like dialogues and\ncomprehending text. However, understanding the subtleties of complex exchanges\nin language remains a challenge. We propose a bootstrapping framework that\nleverages self-generated feedback to enhance LLM reasoning capabilities for lie\ndetection. The framework consists of three stages: suggestion, feedback\ncollection, and modification. In the suggestion stage, a cost-effective\nlanguage model generates initial predictions based on game state and dialogue.\nThe feedback-collection stage involves a language model providing feedback on\nthese predictions. In the modification stage, a more advanced language model\nrefines the initial predictions using the auto-generated feedback. 
We\ninvestigate the application of the proposed framework for detecting betrayal\nand deception in Diplomacy games, and compare it with feedback from\nprofessional human players. The LLM-generated feedback exhibits superior\nquality and significantly enhances the performance of the model. Our approach\nachieves a 39% improvement over the zero-shot baseline in lying-F1 without the\nneed for any training data, rivaling state-of-the-art supervised learning\nresults.\n","authors":["Tanushree Banerjee","Richard Zhu","Runzhe Yang","Karthik Narasimhan"],"pdf_url":"https://arxiv.org/pdf/2408.13915v1.pdf","comment":"19 pages, 18 figures"},{"id":"http://arxiv.org/abs/2408.13909v1","updated":"2024-08-25T18:10:16Z","published":"2024-08-25T18:10:16Z","title":"LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages\n in Multimodal Image Retrieval Task","summary":" This research explores the development of multimodal vision-language models\nfor image retrieval in low-resource languages, specifically Azerbaijani.\nExisting vision-language models primarily support high-resource languages, and\nfine-tuning them remains computationally demanding. To address challenges in\nvision-language retrieval for low-resource languages, we integrated the CLIP\nmodel architecture and employed several techniques to balance computational\nefficiency with performance. These techniques include synthetic data generation\nthrough machine translation, image augmentation, and further training the\nattention mechanisms of transformer-based models with domain-specific data. We\nintegrated Multilingual BERT as a text encoder with image encoders like\nResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer.\nOur study found that models like EfficientNet0 and Tiny Swin Transformer\nperform best on the datasets they were trained on, such as COCO, Flickr30k, and\nFlickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from\n0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a\nnew state of the art in vision-language retrieval. We share our configurations\nand results to support further research. Code and pre-trained models are\navailable at https://github.com/aliasgerovs/azclip.\n","authors":["Ali Asgarov","Samir Rustamov"],"pdf_url":"https://arxiv.org/pdf/2408.13909v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.05109v2","updated":"2024-08-25T17:22:29Z","published":"2024-05-08T15:05:55Z","title":"QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs","summary":" Table summarization is a crucial task aimed at condensing information from\ntabular data into concise and comprehensible textual summaries. However,\nexisting approaches often fall short of adequately meeting users' information\nand quality requirements and tend to overlook the complexities of real-world\nqueries. In this paper, we propose a novel method to address these limitations\nby introducing query-focused multi-table summarization. Our approach, which\ncomprises a table serialization module, a summarization controller, and a large\nlanguage model (LLM), utilizes textual queries and multiple tables to generate\nquery-dependent table summaries tailored to users' information needs. To\nfacilitate research in this area, we present a comprehensive dataset\nspecifically tailored for this task, consisting of 4909 query-summary pairs,\neach associated with multiple tables. 
Through extensive experiments using our\ncurated dataset, we demonstrate the effectiveness of our proposed method\ncompared to baseline approaches. Our findings offer insights into the\nchallenges of complex table reasoning for precise summarization, contributing\nto the advancement of research in query-focused multi-table summarization.\n","authors":["Weijia Zhang","Vaishali Pal","Jia-Hong Huang","Evangelos Kanoulas","Maarten de Rijke"],"pdf_url":"https://arxiv.org/pdf/2405.05109v2.pdf","comment":"Accepted by the 27th European Conference on Artificial Intelligence\n (ECAI-2024)"},{"id":"http://arxiv.org/abs/2408.13891v1","updated":"2024-08-25T17:05:26Z","published":"2024-08-25T17:05:26Z","title":"SpeechCaps: Advancing Instruction-Based Universal Speech Models with\n Multi-Talker Speaking Style Captioning","summary":" Instruction-based speech processing is becoming popular. Studies show that\ntraining with multiple tasks boosts performance, but collecting diverse,\nlarge-scale tasks and datasets is expensive. Thus, it is highly desirable to\ndesign a fundamental task that benefits other downstream tasks. This paper\nintroduces a multi-talker speaking style captioning task to enhance the\nunderstanding of speaker and prosodic information. We used large language\nmodels to generate descriptions for multi-talker speech. Then, we trained our\nmodel with pre-training on this captioning task followed by instruction tuning.\nEvaluation on Dynamic-SUPERB shows our model outperforming the baseline\npre-trained only on single-talker tasks, particularly in speaker and emotion\nrecognition. Additionally, tests on a multi-talker QA task reveal that current\nmodels struggle with attributes such as gender, pitch, and speaking rate. The\ncode and dataset are available at https://github.com/cyhuang-tw/speechcaps.\n","authors":["Chien-yu Huang","Min-Han Shih","Ke-Han Lu","Chi-Yuan Hsiao","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2408.13891v1.pdf","comment":"SynData4GenAI 2024"},{"id":"http://arxiv.org/abs/2408.13889v1","updated":"2024-08-25T16:43:19Z","published":"2024-08-25T16:43:19Z","title":"LLM with Relation Classifier for Document-Level Relation Extraction","summary":" Large language models (LLMs) create a new paradigm for natural language\nprocessing. Despite their advancement, LLM-based methods still lag behind\ntraditional approaches in document-level relation extraction (DocRE), a\ncritical task for understanding complex entity relations. This paper\ninvestigates the causes of this performance gap, identifying the dispersion of\nattention by LLMs due to entity pairs without relations as a primary factor. We\nthen introduce a novel classifier-LLM approach to DocRE. The proposed approach\nbegins with a classifier specifically designed to select entity pair candidates\nexhibiting potential relations and thereby feeds them to LLM for the final\nrelation extraction. This method ensures that during inference, the LLM's focus\nis directed primarily at entity pairs with relations. 
Experiments on DocRE\nbenchmarks reveal that our method significantly outperforms recent LLM-based\nDocRE models and achieves competitive performance with several leading\ntraditional DocRE models.\n","authors":["Xingzuo Li","Kehai Chen","Yunfei Long","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.13889v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13863v1","updated":"2024-08-25T15:27:21Z","published":"2024-08-25T15:27:21Z","title":"CodeGraph: Enhancing Graph Reasoning of LLMs with Code","summary":" With the increasing popularity of large language models (LLMs), reasoning on\nbasic graph algorithm problems is an essential intermediate step in assessing\ntheir abilities to process and infer complex graph reasoning tasks. Existing\nmethods usually convert graph-structured data to textual descriptions and then\nuse LLMs for reasoning and computation. However, LLMs often produce computation\nerrors on arithmetic parts in basic graph algorithm problems, such as counting\nnumber of edges. In addition, they struggle to control or understand the output\nof the reasoning process, raising concerns about whether LLMs are simply\nguessing. In this paper, we introduce CodeGraph, a method that encodes graph\nproblem solutions as code. The methods solve new graph problems by learning\nfrom exemplars, generating programs, and executing them via a program\ninterpreter. Using the few-shot setting, we evaluate CodeGraph with the base\nLLM being GPT-3.5 Turbo, Llama3-70B Instruct, Mixtral-8x22B Instruct, and\nMixtral-8x7B Instruct. Experimental results on six tasks with six graph\nencoding methods in the GraphQA dataset demonstrate that CodeGraph can boost\nperformance on graph reasoning tasks inside LLMs by 1.3% to 58.6%, depending on\nthe task. Compared to the existing methods, CodeGraph demonstrates strong\nperformance on arithmetic problems in graph tasks and offers a more\ncontrollable and interpretable approach to the reasoning process.\n","authors":["Qiaolong Cai","Zhaowei Wang","Shizhe Diao","James Kwok","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2408.13863v1.pdf","comment":"In Progress"},{"id":"http://arxiv.org/abs/2408.13860v1","updated":"2024-08-25T15:17:43Z","published":"2024-08-25T15:17:43Z","title":"Knowledge-Aware Reasoning over Multimodal Semi-structured Tables","summary":" Existing datasets for tabular question answering typically focus exclusively\non text within cells. However, real-world data is inherently multimodal, often\nblending images such as symbols, faces, icons, patterns, and charts with\ntextual content in tables. With the evolution of AI models capable of\nmultimodal reasoning, it is pertinent to assess their efficacy in handling such\nstructured data. This study investigates whether current AI models can perform\nknowledge-aware reasoning on multimodal structured data. We explore their\nability to reason on tables that integrate both images and text, introducing\nMMTabQA, a new dataset designed for this purpose. Our experiments highlight\nsubstantial challenges for current AI models in effectively integrating and\ninterpreting multiple text and image inputs, understanding visual context, and\ncomparing visual content across images. 
These findings establish our dataset as\na robust benchmark for advancing AI's comprehension and capabilities in\nanalyzing multimodal structured data.\n","authors":["Suyash Vardhan Mathur","Jainit Sushil Bafna","Kunal Kartik","Harshita Khandelwal","Manish Shrivastava","Vivek Gupta","Mohit Bansal","Dan Roth"],"pdf_url":"https://arxiv.org/pdf/2408.13860v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.02267v2","updated":"2024-08-25T14:41:32Z","published":"2024-05-03T17:34:57Z","title":"Structural Pruning of Pre-trained Language Models via Neural\n Architecture Search","summary":" Pre-trained language models (PLM), for example BERT or RoBERTa, mark the\nstate-of-the-art for natural language understanding task when fine-tuned on\nlabeled data. However, their large size poses challenges in deploying them for\ninference in real-world applications, due to significant GPU memory\nrequirements and high inference latency. This paper explores neural\narchitecture search (NAS) for structural pruning to find sub-parts of the\nfine-tuned network that optimally trade-off efficiency, for example in terms of\nmodel size or latency, and generalization performance. We also show how we can\nutilize more recently developed two-stage weight-sharing NAS approaches in this\nsetting to accelerate the search process. Unlike traditional pruning methods\nwith fixed thresholds, we propose to adopt a multi-objective approach that\nidentifies the Pareto optimal set of sub-networks, allowing for a more flexible\nand automated compression process.\n","authors":["Aaron Klein","Jacek Golebiowski","Xingchen Ma","Valerio Perrone","Cedric Archambeau"],"pdf_url":"https://arxiv.org/pdf/2405.02267v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02481v3","updated":"2024-08-25T14:21:29Z","published":"2024-06-04T16:49:06Z","title":"Large Language Models as Carriers of Hidden Messages","summary":" With the help of simple fine-tuning, one can artificially embed hidden text\ninto large language models (LLMs). This text is revealed only when triggered by\na specific query to the LLM. Two primary applications are LLM fingerprinting\nand steganography. In the context of LLM fingerprinting, a unique text\nidentifier (fingerprint) is embedded within the model to verify licensing\ncompliance. In the context of steganography, the LLM serves as a carrier for\nhidden messages that can be disclosed through a chosen trigger question.\n Our work demonstrates that embedding hidden text in the LLM via fine-tuning,\nthough seemingly secure due to the vast number of potential triggers (any\nsequence of characters or tokens could serve as a trigger), is susceptible to\nextraction through analysis of the LLM's output decoding process. We propose an\nextraction attack called Unconditional Token Forcing (UTF). It is premised on\nthe hypothesis that iteratively feeding each token from the LLM's vocabulary\ninto the model should reveal output sequences with abnormally high token\nprobabilities, indicating potential hidden text candidates. We also present a\ndefense method to hide text in such a way that it is resistant to both UTF and\nattacks based on sampling decoding methods, which we named Unconditional Token\nForcing Confusion (UTFC). To the best of our knowledge, there is no attack\nmethod that can extract text hidden with UTFC. 
UTFC has both benign\napplications (improving LLM fingerprinting) and malign applications (using LLMs\nto create covert communication channels).\n","authors":["Jakub Hoscilowicz","Pawel Popiolek","Jan Rudkowski","Jedrzej Bieniasz","Artur Janicki"],"pdf_url":"https://arxiv.org/pdf/2406.02481v3.pdf","comment":"Work in progress. Code is available at\n https://github.com/j-hoscilowic/zurek-stegano"},{"id":"http://arxiv.org/abs/2408.13833v1","updated":"2024-08-25T13:36:22Z","published":"2024-08-25T13:36:22Z","title":"Biomedical Large Languages Models Seem not to be Superior to Generalist\n Models on Unseen Medical Data","summary":" Large language models (LLMs) have shown potential in biomedical applications,\nleading to efforts to fine-tune them on domain-specific data. However, the\neffectiveness of this approach remains unclear. This study evaluates the\nperformance of biomedically fine-tuned LLMs against their general-purpose\ncounterparts on a variety of clinical tasks. We evaluated their performance on\nclinical case challenges from the New England Journal of Medicine (NEJM) and\nthe Journal of the American Medical Association (JAMA) and on several clinical\ntasks (e.g., information extraction, document summarization, and clinical\ncoding). Using benchmarks specifically chosen to be likely outside the\nfine-tuning datasets of biomedical models, we found that biomedical LLMs mostly\nperform inferior to their general-purpose counterparts, especially on tasks not\nfocused on medical knowledge. While larger models showed similar performance on\ncase tasks (e.g., OpenBioLLM-70B: 66.4% vs. Llama-3-70B-Instruct: 65% on JAMA\ncases), smaller biomedical models showed more pronounced underperformance\n(e.g., OpenBioLLM-8B: 30% vs. Llama-3-8B-Instruct: 64.3% on NEJM cases).\nSimilar trends were observed across the CLUE (Clinical Language Understanding\nEvaluation) benchmark tasks, with general-purpose models often performing\nbetter on text generation, question answering, and coding tasks. Our results\nsuggest that fine-tuning LLMs to biomedical data may not provide the expected\nbenefits and may potentially lead to reduced performance, challenging\nprevailing assumptions about domain-specific adaptation of LLMs and\nhighlighting the need for more rigorous evaluation frameworks in healthcare AI.\nAlternative approaches, such as retrieval-augmented generation, may be more\neffective in enhancing the biomedical capabilities of LLMs without compromising\ntheir general knowledge.\n","authors":["Felix J. Dorfner","Amin Dada","Felix Busch","Marcus R. Makowski","Tianyu Han","Daniel Truhn","Jens Kleesiek","Madhumita Sushil","Jacqueline Lammert","Lisa C. Adams","Keno K. Bressem"],"pdf_url":"https://arxiv.org/pdf/2408.13833v1.pdf","comment":"10 pages, 3 tables, 1 figure"},{"id":"http://arxiv.org/abs/2408.13831v1","updated":"2024-08-25T13:29:34Z","published":"2024-08-25T13:29:34Z","title":"Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics\n Fall In!","summary":" Annually, at the Conference of Machine Translation (WMT), the Metrics Shared\nTask organizers conduct the meta-evaluation of Machine Translation (MT)\nmetrics, ranking them according to their correlation with human judgments.\nTheir results guide researchers toward enhancing the next generation of metrics\nand MT systems. With the recent introduction of neural metrics, the field has\nwitnessed notable advancements. Nevertheless, the inherent opacity of these\nmetrics has posed substantial challenges to the meta-evaluation process. 
This\nwork highlights two issues with the meta-evaluation framework currently\nemployed in WMT, and assesses their impact on the metrics rankings. To do this,\nwe introduce the concept of sentinel metrics, which are designed explicitly to\nscrutinize the meta-evaluation process's accuracy, robustness, and fairness. By\nemploying sentinel metrics, we aim to validate our findings, and shed light on\nand monitor the potential biases or inconsistencies in the rankings. We\ndiscover that the present meta-evaluation framework favors two categories of\nmetrics: i) those explicitly trained to mimic human quality assessments, and\nii) continuous metrics. Finally, we raise concerns regarding the evaluation\ncapabilities of state-of-the-art metrics, emphasizing that they might be basing\ntheir assessments on spurious correlations found in their training data.\n","authors":["Stefano Perrella","Lorenzo Proietti","Alessandro Scirè","Edoardo Barba","Roberto Navigli"],"pdf_url":"https://arxiv.org/pdf/2408.13831v1.pdf","comment":"Presented at ACL 2024 Main Conference. 29 pages"},{"id":"http://arxiv.org/abs/2402.13546v2","updated":"2024-08-25T11:23:50Z","published":"2024-02-21T05:56:52Z","title":"LLMs Meet Long Video: Advancing Long Video Question Answering with An\n Interactive Visual Adapter in LLMs","summary":" Long video understanding is a significant and ongoing challenge in the\nintersection of multimedia and artificial intelligence. Employing large\nlanguage models (LLMs) for comprehending video becomes an emerging and\npromising method. However, this approach incurs high computational costs due to\nthe extensive array of video tokens, experiences reduced visual clarity as a\nconsequence of token aggregation, and confronts challenges arising from\nirrelevant visual tokens while answering video-related questions. To alleviate\nthese issues, we present an Interactive Visual Adapter (IVA) within LLMs,\ndesigned to enhance interaction with fine-grained visual elements.\nSpecifically, we first transform long videos into temporal video tokens via\nleveraging a visual encoder alongside a pretrained causal transformer, then\nfeed them into LLMs with the video instructions. Subsequently, we integrated\nIVA, which contains a lightweight temporal frame selector and a spatial feature\ninteractor, within the internal blocks of LLMs to capture instruction-aware and\nfine-grained visual signals. Consequently, the proposed video-LLM facilitates a\ncomprehensive understanding of long video content through appropriate long\nvideo modeling and precise visual interactions. We conducted extensive\nexperiments on nine video understanding benchmarks and experimental results\nshow that our interactive visual adapter significantly improves the performance\nof video LLMs on long video QA tasks. Ablation studies further verify the\neffectiveness of IVA in understanding long and short video.\n","authors":["Yunxin Li","Xinyu Chen","Baotain Hu","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2402.13546v2.pdf","comment":"12 pages; working in progress"},{"id":"http://arxiv.org/abs/2310.05746v4","updated":"2024-08-25T11:19:33Z","published":"2023-10-09T14:22:09Z","title":"Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and\n Execution of LLM Agents in an Auction Arena","summary":" Recent advancements in Large Language Models (LLMs) showcase advanced\nreasoning, yet NLP evaluations often depend on static benchmarks. 
Evaluating\nthis necessitates environments that test strategic reasoning in dynamic,\ncompetitive scenarios requiring long-term planning. We introduce AucArena, a\nnovel evaluation suite that simulates auctions, a setting chosen for being\nhighly unpredictable and involving many skills related to resource and risk\nmanagement, while also being easy to evaluate. We conduct controlled\nexperiments using state-of-the-art LLMs to power bidding agents to benchmark\ntheir planning and execution skills. Our research demonstrates that LLMs, such\nas GPT-4, possess key skills for auction participation, such as budget\nmanagement and goal adherence, which improve with adaptive strategies. This\nhighlights LLMs' potential in modeling complex social interactions in\ncompetitive contexts. However, variability in LLM performance and occasional\noutperformance by simpler methods indicate opportunities for further\nadvancements in LLM design and the value of our simulation environment for\nongoing testing and refinement.\n","authors":["Jiangjie Chen","Siyu Yuan","Rong Ye","Bodhisattwa Prasad Majumder","Kyle Richardson"],"pdf_url":"https://arxiv.org/pdf/2310.05746v4.pdf","comment":"Project page: https://auction-arena.github.io"},{"id":"http://arxiv.org/abs/2408.13810v1","updated":"2024-08-25T11:13:29Z","published":"2024-08-25T11:13:29Z","title":"Revisiting the Exit from Nuclear Energy in Germany with NLP","summary":" Annotation of political discourse is resource-intensive, but recent\ndevelopments in NLP promise to automate complex annotation tasks. Fine-tuned\ntransformer-based models outperform human annotators in some annotation tasks,\nbut they require large manually annotated training datasets. In our\ncontribution, we explore to which degree a manually annotated dataset can be\nautomatically replicated with today's NLP methods, using unsupervised machine\nlearning and zero- and few-shot learning.\n","authors":["Sebastian Haunss","André Blessing"],"pdf_url":"https://arxiv.org/pdf/2408.13810v1.pdf","comment":"23 pages, 8 figures, Accepted for publication in Zeitschrift f\\\"ur\n Diskursforschung/Journal for Discourse Studies, ISSN: 2195-867X"},{"id":"http://arxiv.org/abs/2408.13808v1","updated":"2024-08-25T11:09:15Z","published":"2024-08-25T11:09:15Z","title":"Towards Reliable Medical Question Answering: Techniques and Challenges\n in Mitigating Hallucinations in Language Models","summary":" The rapid advancement of large language models (LLMs) has significantly\nimpacted various domains, including healthcare and biomedicine. However, the\nphenomenon of hallucination, where LLMs generate outputs that deviate from\nfactual accuracy or context, poses a critical challenge, especially in\nhigh-stakes domains. This paper conducts a scoping study of existing techniques\nfor mitigating hallucinations in knowledge-based task in general and especially\nfor medical domains. Key methods covered in the paper include\nRetrieval-Augmented Generation (RAG)-based techniques, iterative feedback\nloops, supervised fine-tuning, and prompt engineering. These techniques, while\npromising in general contexts, require further adaptation and optimization for\nthe medical domain due to its unique demands for up-to-date, specialized\nknowledge and strict adherence to medical guidelines. 
Addressing these\nchallenges is crucial for developing trustworthy AI systems that enhance\nclinical decision-making and patient safety as well as accuracy of biomedical\nscientific research.\n","authors":["Duy Khoa Pham","Bao Quoc Vo"],"pdf_url":"https://arxiv.org/pdf/2408.13808v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2311.18743v4","updated":"2024-08-25T09:58:57Z","published":"2023-11-30T17:41:30Z","title":"AlignBench: Benchmarking Chinese Alignment of Large Language Models","summary":" Alignment has become a critical step for instruction-tuned Large Language\nModels (LLMs) to become helpful assistants. However, the effective evaluation\nof alignment for emerging Chinese LLMs is still largely unexplored. To fill in\nthis gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark\nfor evaluating LLMs' alignment in Chinese. We design a human-in-the-loop data\ncuration pipeline, containing eight main categories, 683 real-scenario rooted\nqueries and corresponding human verified references. To ensure the correctness\nof references, each knowledge-intensive query is accompanied with evidences\ncollected from reliable web sources (including URLs and quotations) by our\nannotators. For automatic evaluation, our benchmark employs a rule-calibrated\nmulti-dimensional LLM-as-Judge~\\cite{zheng2023judging} approach with\nChain-of-Thought to generate explanations and final ratings, ensuring high\nreliability and interpretability. All evaluation code, data, and LLM\ngenerations are available at \\url{https://github.com/THUDM/AlignBench}. Since\nits release, AlignBench has been adopted by top (Chinese) LLMs for evaluating\ntheir alignment capabilities in Chinese, including ChatGLM, Qwen, DeepSeek, Yi,\nBaichuan, and Abab.\n","authors":["Xiao Liu","Xuanyu Lei","Shengyuan Wang","Yue Huang","Zhuoer Feng","Bosi Wen","Jiale Cheng","Pei Ke","Yifan Xu","Weng Lam Tam","Xiaohan Zhang","Lichao Sun","Xiaotao Gu","Hongning Wang","Jing Zhang","Minlie Huang","Yuxiao Dong","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2311.18743v4.pdf","comment":"Accepted to ACL 2024"},{"id":"http://arxiv.org/abs/2407.01411v3","updated":"2024-08-25T09:39:23Z","published":"2024-07-01T16:00:53Z","title":"HyperLoader: Integrating Hypernetwork-Based LoRA and Adapter Layers into\n Multi-Task Transformers for Sequence Labelling","summary":" We present HyperLoader, a simple approach that combines different\nparameter-efficient fine-tuning methods in a multi-task setting. To achieve\nthis goal, our model uses a hypernetwork to generate the weights of these\nmodules based on the task, the transformer layer, and its position within this\nlayer. Our method combines the benefits of multi-task learning by capturing the\nstructure of all tasks while reducing the task interference problem by\nencapsulating the task-specific knowledge in the generated weights and the\nbenefits of combining different parameter-efficient methods to outperform\nfull-fine tuning. 
We provide empirical evidence that HyperLoader outperforms\nprevious approaches in most datasets and obtains the best average performance\nacross tasks in high-resource and low-resource scenarios.\n","authors":["Jesus-German Ortiz-Barajas","Helena Gomez-Adorno","Thamar Solorio"],"pdf_url":"https://arxiv.org/pdf/2407.01411v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13745v1","updated":"2024-08-25T07:10:36Z","published":"2024-08-25T07:10:36Z","title":"DOCE: Finding the Sweet Spot for Execution-Based Code Generation","summary":" Recently, a diverse set of decoding and reranking procedures have been shown\neffective for LLM-based code generation. However, a comprehensive framework\nthat links and experimentally compares these methods is missing. We address\nthis by proposing Decoding Objectives for Code Execution, a comprehensive\nframework that includes candidate generation, $n$-best reranking, minimum Bayes\nrisk (MBR) decoding, and self-debugging as the core components. We then study\nthe contributions of these components through execution-based evaluation\nmetrics. Our findings highlight the importance of execution-based methods and\nthe difference gap between execution-based and execution-free methods.\nFurthermore, we assess the impact of filtering based on trial unit tests, a\nsimple and effective strategy that has been often overlooked in prior works. We\nalso propose self-debugging on multiple candidates, obtaining state-of-the-art\nperformance on reranking for code generation. We expect our framework to\nprovide a solid guideline for future research on code generation.\n","authors":["Haau-Sing Li","Patrick Fernandes","Iryna Gurevych","André F. T. Martins"],"pdf_url":"https://arxiv.org/pdf/2408.13745v1.pdf","comment":"10 pages (32 including appendix), 5 figures, 25 tables. arXiv admin\n note: text overlap with arXiv:2304.05128 by other authors"},{"id":"http://arxiv.org/abs/2408.13739v1","updated":"2024-08-25T06:52:48Z","published":"2024-08-25T06:52:48Z","title":"Literary and Colloquial Tamil Dialect Identification","summary":" Culture and language evolve together. The old literary form of Tamil is used\ncommonly for writing and the contemporary colloquial Tamil is used for\nspeaking. Human-computer interaction applications require Colloquial Tamil (CT)\nto make it more accessible and easy for the everyday user and, it requires\nLiterary Tamil (LT) when information is needed in a formal written format.\nContinuing the use of LT alongside CT in computer aided language learning\napplications will both preserve LT, and provide ease of use via CT, at the same\ntime. Hence there is a need for the conversion between LT and CT dialects,\nwhich demands as a first step, dialect identification. Dialect Identification\n(DID) of LT and CT is an unexplored area of research. In the current work,\nkeeping the nuances of both these dialects in mind, five methods are explored\nwhich include two implicit methods - Gaussian Mixture Model (GMM) and\nConvolutional Neural Network (CNN); two explicit methods - Parallel Phone\nRecognition (PPR) and Parallel Large Vocabulary Continuous Speech Recognition\n(P-LVCSR); two versions of the proposed explicit Unified Phone Recognition\nmethod (UPR-1 and UPR-2). These methods vary based on: the need for annotated\ndata, the size of the unit, the way in which modelling is carried out, and the\nway in which the final decision is made. 
Even though the average duration of\nthe test utterances is less - 4.9s for LT and 2.5s for CT - the systems\nperformed well, offering the following identification accuracies: 87.72% (GMM),\n93.97% (CNN), 89.24% (PPR), 94.21% (P-LVCSR), 88.57% (UPR-1), 93.53% (UPR-1\nwith P-LVCSR), 94.55% (UPR-2), and 95.61% (UPR-2 with P-LVCSR).\n","authors":["M. Nanmalar","P. Vijayalakshmi","T. Nagarajan"],"pdf_url":"https://arxiv.org/pdf/2408.13739v1.pdf","comment":"18 pages, 6 figures, submitted to \"Circuits, Systems, and Signal\n Processing\""},{"id":"http://arxiv.org/abs/2408.13738v1","updated":"2024-08-25T06:49:03Z","published":"2024-08-25T06:49:03Z","title":"Poor-Supervised Evaluation for SuperLLM via Mutual Consistency","summary":" The guidance from capability evaluations has greatly propelled the progress\nof both human society and Artificial Intelligence. However, as LLMs evolve, it\nbecomes challenging to construct evaluation benchmarks for them with accurate\nlabels on hard tasks that approach the boundaries of human capabilities. To\ncredibly conduct evaluation without accurate labels (denoted as poor-supervised\nevaluation), we propose the PoEM framework. We first prove that the capability\nof a model can be equivalently assessed by the consistency between it and\ncertain reference model, when their prediction distributions are independent\nand the sample size is infinite. To alleviate the insufficiencies of the\nconditions in reality, we further introduce an algorithm that treats humans\n(when available) and the models under evaluation as reference models,\nalternately conducting model weights calibration and filtering during E-step\nand M-step. Comprehensive experiments across 3 types of tasks with 16\nmainstream LLMs have shown that PoEM under poor supervision can achieve an\naverage of 0.98 Pearson correlation coefficient with supervised evaluation\nresults, demonstrating good effectiveness, efficiency and generalizability.\nMore generally, PoEM has advanced the evaluation paradigm evolution from\nhuman-centric to human&model-centric by treating both of them as reference\nmodels, mitigating the limitations of human evaluation in the era of LLMs.\n","authors":["Peiwen Yuan","Shaoxiong Feng","Yiwei Li","Xinglin Wang","Boyuan Pan","Heda Wang","Yao Hu","Kan Li"],"pdf_url":"https://arxiv.org/pdf/2408.13738v1.pdf","comment":"ACL findings"},{"id":"http://arxiv.org/abs/2305.04928v5","updated":"2024-08-25T06:22:00Z","published":"2023-05-05T12:14:22Z","title":"From Zero to Hero: Harnessing Transformers for Biomedical Named Entity\n Recognition in Zero- and Few-shot Contexts","summary":" Supervised named entity recognition (NER) in the biomedical domain depends on\nlarge sets of annotated texts with the given named entities. The creation of\nsuch datasets can be time-consuming and expensive, while extraction of new\nentities requires additional annotation tasks and retraining the model. To\naddress these challenges, this paper proposes a method for zero- and few-shot\nNER in the biomedical domain. The method is based on transforming the task of\nmulti-class token classification into binary token classification and\npre-training on a large amount of datasets and biomedical entities, which allow\nthe model to learn semantic relations between the given and potentially novel\nnamed entity labels. 
We have achieved average F1 scores of 35.44% for zero-shot\nNER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot\nNER on 9 diverse evaluated biomedical entities with fine-tuned PubMedBERT-based\nmodel. The results demonstrate the effectiveness of the proposed method for\nrecognizing new biomedical entities with no or limited number of examples,\noutperforming previous transformer-based methods, and being comparable to\nGPT3-based models using models with over 1000 times fewer parameters. We make\nmodels and developed code publicly available.\n","authors":["Miloš Košprdić","Nikola Prodanović","Adela Ljajić","Bojana Bašaragin","Nikola Milošević"],"pdf_url":"https://arxiv.org/pdf/2305.04928v5.pdf","comment":"Collaboration between Bayer Pharma R&D and Serbian Institute for\n Artificial Intelligence Research and Development. Artificial Intelligence in\n Medicine (2024)"},{"id":"http://arxiv.org/abs/2408.13704v1","updated":"2024-08-25T02:01:38Z","published":"2024-08-25T02:01:38Z","title":"DHP Benchmark: Are LLMs Good NLG Evaluators?","summary":" Large Language Models (LLMs) are increasingly serving as evaluators in\nNatural Language Generation (NLG) tasks. However, the capabilities of LLMs in\nscoring NLG quality remain inadequately explored. Current studies depend on\nhuman assessments and simple metrics that fail to capture the discernment of\nLLMs across diverse NLG tasks. To address this gap, we propose the Discernment\nof Hierarchical Perturbation (DHP) benchmarking framework, which provides\nquantitative discernment scores for LLMs utilizing hierarchically perturbed\ntext data and statistical tests to measure the NLG evaluation capabilities of\nLLMs systematically. We have re-established six evaluation datasets for this\nbenchmark, covering four NLG tasks: Summarization, Story Completion, Question\nAnswering, and Translation. Our comprehensive benchmarking of five major LLM\nseries provides critical insight into their strengths and limitations as NLG\nevaluators.\n","authors":["Yicheng Wang","Jiayi Yuan","Yu-Neng Chuang","Zhuoer Wang","Yingchi Liu","Mark Cusick","Param Kulkarni","Zhengping Ji","Yasser Ibrahim","Xia Hu"],"pdf_url":"https://arxiv.org/pdf/2408.13704v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.03827v2","updated":"2024-08-25T01:38:45Z","published":"2024-06-06T08:03:05Z","title":"Chaos with Keywords: Exposing Large Language Models Sycophantic\n Hallucination to Misleading Keywords and Evaluating Defense Strategies","summary":" This study explores the sycophantic tendencies of Large Language Models\n(LLMs), where these models tend to provide answers that match what users want\nto hear, even if they are not entirely correct. The motivation behind this\nexploration stems from the common behavior observed in individuals searching\nthe internet for facts with partial or misleading knowledge. Similar to using\nweb search engines, users may recall fragments of misleading keywords and\nsubmit them to an LLM, hoping for a comprehensive response. Our empirical\nanalysis of several LLMs shows the potential danger of these models amplifying\nmisinformation when presented with misleading keywords. Additionally, we\nthoroughly assess four existing hallucination mitigation strategies to reduce\nLLMs sycophantic behavior. Our experiments demonstrate the effectiveness of\nthese strategies for generating factually correct statements. 
Furthermore, our\nanalyses delve into knowledge-probing experiments on factual keywords and\ndifferent categories of sycophancy mitigation.\n","authors":["Aswin RRV","Nemika Tyagi","Md Nayem Uddin","Neeraj Varshney","Chitta Baral"],"pdf_url":"https://arxiv.org/pdf/2406.03827v2.pdf","comment":"Findings of ACL 2024"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2408.12574v2","updated":"2024-08-25T23:58:25Z","published":"2024-08-22T17:41:45Z","title":"MuMA-ToM: Multi-modal Multi-Agent Theory of Mind","summary":" Understanding people's social interactions in complex real-world scenarios\noften relies on intricate mental reasoning. To truly understand how and why\npeople interact with one another, we must infer the underlying mental states\nthat give rise to the social interactions, i.e., Theory of Mind reasoning in\nmulti-agent interactions. Additionally, social interactions are often\nmulti-modal -- we can watch people's actions, hear their conversations, and/or\nread about their past behaviors. For AI systems to successfully and safely\ninteract with people in real-world environments, they also need to understand\npeople's mental states as well as their inferences about each other's mental\nstates based on multi-modal information about their interactions. For this, we\nintroduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark.\nMuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates\nmental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide\nvideo and text descriptions of people's multi-modal behavior in realistic\nhousehold environments. Based on the context, we then ask questions about\npeople's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM\nin a human experiment and provided a human baseline. We also proposed a novel\nmulti-modal, multi-agent ToM model, LIMP (Language model-based Inverse\nMulti-agent Planning). Our experimental results show that LIMP significantly\noutperforms state-of-the-art methods, including large multi-modal models (e.g.,\nGPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.\n","authors":["Haojun Shi","Suyu Ye","Xinyu Fang","Chuanyang Jin","Leyla Isik","Yen-Ling Kuo","Tianmin Shu"],"pdf_url":"https://arxiv.org/pdf/2408.12574v2.pdf","comment":"Project website: https://scai.cs.jhu.edu/projects/MuMA-ToM/ Code:\n https://github.com/SCAI-JHU/MuMA-ToM"},{"id":"http://arxiv.org/abs/2408.13963v1","updated":"2024-08-25T23:57:07Z","published":"2024-08-25T23:57:07Z","title":"Shifted Window Fourier Transform And Retention For Image Captioning","summary":" Image Captioning is an important Language and Vision task that finds\napplication in a variety of contexts, ranging from healthcare to autonomous\nvehicles. As many real-world applications rely on devices with limited\nresources, much effort in the field was put into the development of lighter and\nfaster models. However, much of the current optimizations focus on the\nTransformer architecture in contrast to the existence of more efficient\nmethods. In this work, we introduce SwiFTeR, an architecture almost entirely\nbased on Fourier Transform and Retention, to tackle the main efficiency\nbottlenecks of current light image captioning models, being the visual\nbackbone's onerosity, and the decoder's quadratic cost. 
SwiFTeR is made of only\n20M parameters, and requires 3.1 GFLOPs for a single forward pass.\nAdditionally, it showcases superior scalability to the caption length and its\nsmall memory requirements enable more images to be processed in parallel,\ncompared to the traditional transformer-based architectures. For instance, it\ncan generate 400 captions in one second. Although, for the time being, the\ncaption quality is lower (110.2 CIDEr-D), most of the decrease is not\nattributed to the architecture but rather an incomplete training practice which\ncurrently leaves much room for improvements. Overall, SwiFTeR points toward a\npromising direction to new efficient architectural design. The implementation\ncode will be released in the future.\n","authors":["Jia Cheng Hu","Roberto Cavicchioli","Alessandro Capotondi"],"pdf_url":"https://arxiv.org/pdf/2408.13963v1.pdf","comment":"Pre-print version of paper accepted for ICONIP 2024"},{"id":"http://arxiv.org/abs/2408.13953v1","updated":"2024-08-25T22:26:46Z","published":"2024-08-25T22:26:46Z","title":"InterTrack: Tracking Human Object Interaction without Object Templates","summary":" Tracking human object interaction from videos is important to understand\nhuman behavior from the rapidly growing stream of video data. Previous\nvideo-based methods require predefined object templates while\nsingle-image-based methods are template-free but lack temporal consistency. In\nthis paper, we present a method to track human object interaction without any\nobject shape templates. We decompose the 4D tracking problem into per-frame\npose tracking and canonical shape optimization. We first apply a single-view\nreconstruction method to obtain temporally-inconsistent per-frame interaction\nreconstructions. Then, for the human, we propose an efficient autoencoder to\npredict SMPL vertices directly from the per-frame reconstructions, introducing\ntemporally consistent correspondence. For the object, we introduce a pose\nestimator that leverages temporal information to predict smooth object\nrotations under occlusions. To train our model, we propose a method to generate\nsynthetic interaction videos and synthesize in total 10 hour videos of 8.5k\nsequences with full 3D ground truth. Experiments on BEHAVE and InterCap show\nthat our method significantly outperforms previous template-based video\ntracking and single-frame reconstruction methods. Our proposed synthetic video\ndataset also allows training video-based methods that generalize to real-world\nvideos. Our code and dataset will be publicly released.\n","authors":["Xianghui Xie","Jan Eric Lenssen","Gerard Pons-Moll"],"pdf_url":"https://arxiv.org/pdf/2408.13953v1.pdf","comment":"17 pages, 13 figures and 6 tables. Project page:\n https://virtualhumans.mpi-inf.mpg.de/InterTrack/"},{"id":"http://arxiv.org/abs/2408.13945v1","updated":"2024-08-25T21:49:10Z","published":"2024-08-25T21:49:10Z","title":"Personalized Topology-Informed 12-Lead ECG Electrode Localization from\n Incomplete Cardiac MRIs for Efficient Cardiac Digital Twins","summary":" Cardiac digital twins (CDTs) offer personalized \\textit{in-silico} cardiac\nrepresentations for the inference of multi-scale properties tied to cardiac\nmechanisms. The creation of CDTs requires precise information about the\nelectrode position on the torso, especially for the personalized\nelectrocardiogram (ECG) calibration. However, current studies commonly rely on\nadditional acquisition of torso imaging and manual/semi-automatic methods for\nECG electrode localization. 
In this study, we propose a novel and efficient\ntopology-informed model to fully automatically extract personalized ECG\nelectrode locations from 2D clinically standard cardiac MRIs. Specifically, we\nobtain the sparse torso contours from the cardiac MRIs and then localize the\nelectrodes from the contours. Cardiac MRIs aim at imaging of the heart instead\nof the torso, leading to incomplete torso geometry within the imaging. To\ntackle the missing topology, we incorporate the electrodes as a subset of the\nkeypoints, which can be explicitly aligned with the 3D torso topology. The\nexperimental results demonstrate that the proposed model outperforms the\ntime-consuming conventional method in terms of accuracy (Euclidean distance:\n$1.24 \\pm 0.293$ cm vs. $1.48 \\pm 0.362$ cm) and efficiency ($2$~s vs.\n$30$-$35$~min). We further demonstrate the effectiveness of using the detected\nelectrodes for \\textit{in-silico} ECG simulation, highlighting their potential\nfor creating accurate and efficient CDT models. The code will be released\npublicly after the manuscript is accepted for publication.\n","authors":["Lei Li","Hannah Smith","Yilin Lyu","Julia Camps","Blanca Rodriguez","Abhirup Banerjee","Vicente Grau"],"pdf_url":"https://arxiv.org/pdf/2408.13945v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2301.03796v2","updated":"2024-08-25T21:45:24Z","published":"2023-01-10T05:40:28Z","title":"Enhancing Evaluation Methods for Infrared Small-Target Detection in\n Real-world Scenarios","summary":" Infrared small target detection (IRSTD) poses a significant challenge in the\nfield of computer vision. While substantial efforts have been made over the\npast two decades to improve the detection capabilities of IRSTD algorithms,\nthere has been a lack of extensive investigation into the evaluation metrics\nused for assessing their performance. In this paper, we employ a systematic\napproach to address this issue by first evaluating the effectiveness of\nexisting metrics and then proposing new metrics to overcome the limitations of\nconventional ones. To achieve this, we carefully analyze the necessary\nconditions for successful detection and identify the shortcomings of current\nevaluation metrics, including both pre-thresholding and post-thresholding\nmetrics. We then introduce new metrics that are designed to align with the\nrequirements of real-world systems. Furthermore, we utilize these newly\nproposed metrics to compare and evaluate the performance of five widely\nrecognized small infrared target detection algorithms. The results demonstrate\nthat the new metrics provide consistent and meaningful quantitative\nassessments, aligning with qualitative observations.\n","authors":["Saed Moradi","Alireza Memarmoghadam","Payman Moallem","Mohamad Farzan Sabahi"],"pdf_url":"https://arxiv.org/pdf/2301.03796v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13936v1","updated":"2024-08-25T20:53:53Z","published":"2024-08-25T20:53:53Z","title":"OpenNav: Efficient Open Vocabulary 3D Object Detection for Smart\n Wheelchair Navigation","summary":" Open vocabulary 3D object detection (OV3D) allows precise and extensible\nobject recognition crucial for adapting to diverse environments encountered in\nassistive robotics. This paper presents OpenNav, a zero-shot 3D object\ndetection pipeline based on RGB-D images for smart wheelchairs. 
Our pipeline\nintegrates an open-vocabulary 2D object detector with a mask generator for\nsemantic segmentation, followed by depth isolation and point cloud construction\nto create 3D bounding boxes. The smart wheelchair exploits these 3D bounding\nboxes to identify potential targets and navigate safely. We demonstrate\nOpenNav's performance through experiments on the Replica dataset and we report\npreliminary results with a real wheelchair. OpenNav improves state-of-the-art\nsignificantly on the Replica dataset at mAP25 (+9pts) and mAP50 (+5pts) with\nmarginal improvement at mAP. The code is publicly available at this link:\nhttps://github.com/EasyWalk-PRIN/OpenNav.\n","authors":["Muhammad Rameez ur Rahman","Piero Simonetto","Anna Polato","Francesco Pasti","Luca Tonin","Sebastiano Vascon"],"pdf_url":"https://arxiv.org/pdf/2408.13936v1.pdf","comment":"ECCVW"},{"id":"http://arxiv.org/abs/2408.13928v1","updated":"2024-08-25T20:09:46Z","published":"2024-08-25T20:09:46Z","title":"GeoPlant: Spatial Plant Species Prediction Dataset","summary":" The difficulty of monitoring biodiversity at fine scales and over large areas\nlimits ecological knowledge and conservation efforts. To fill this gap, Species\nDistribution Models (SDMs) predict species across space from spatially explicit\nfeatures. Yet, they face the challenge of integrating the rich but\nheterogeneous data made available over the past decade, notably millions of\nopportunistic species observations and standardized surveys, as well as\nmulti-modal remote sensing data. In light of that, we have designed and\ndeveloped a new European-scale dataset for SDMs at high spatial resolution\n(10-50 m), including more than 10k species (i.e., most of the European flora).\nThe dataset comprises 5M heterogeneous Presence-Only records and 90k exhaustive\nPresence-Absence survey records, all accompanied by diverse environmental\nrasters (e.g., elevation, human footprint, and soil) that are traditionally\nused in SDMs. In addition, it provides Sentinel-2 RGB and NIR satellite images\nwith 10 m resolution, a 20-year time-series of climatic variables, and\nsatellite time-series from the Landsat program. In addition to the data, we\nprovide an openly accessible SDM benchmark (hosted on Kaggle), which has\nalready attracted an active community and a set of strong baselines for single\npredictor/modality and multimodal approaches. All resources, e.g., the dataset,\npre-trained models, and baseline methods (in the form of notebooks), are\navailable on Kaggle, allowing one to start with our dataset literally with two\nmouse clicks.\n","authors":["Lukas Picek","Christophe Botella","Maximilien Servajean","César Leblanc","Rémi Palard","Théo Larcher","Benjamin Deneu","Diego Marcos","Pierre Bonnet","Alexis Joly"],"pdf_url":"https://arxiv.org/pdf/2408.13928v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13925v1","updated":"2024-08-25T19:47:40Z","published":"2024-08-25T19:47:40Z","title":"Infrared Domain Adaptation with Zero-Shot Quantization","summary":" Quantization is one of the most popular techniques for reducing computation\ntime and shrinking model size. However, ensuring the accuracy of quantized\nmodels typically involves calibration using training data, which may be\ninaccessible due to privacy concerns. In such cases, zero-shot quantization, a\ntechnique that relies on pretrained models and statistical information without\nthe need for specific training data, becomes valuable. 
Exploring zero-shot\nquantization in the infrared domain is important due to the prevalence of\ninfrared imaging in sensitive fields like medical and security applications. In\nthis work, we demonstrate how to apply zero-shot quantization to an object\ndetection model retrained with thermal imagery. We use batch normalization\nstatistics of the model to distill data for calibration. RGB image-trained\nmodels and thermal image-trained models are compared in the context of\nzero-shot quantization. Our investigation focuses on the contributions of mean\nand standard deviation statistics to zero-shot quantization performance.\nAdditionally, we compare zero-shot quantization with post-training quantization\non a thermal dataset. We demonstrated that zero-shot quantization successfully\ngenerates data that represents the training dataset for the quantization of\nobject detection models. Our results indicate that our zero-shot quantization\nframework is effective in the absence of training data and is well-suited for\nthe infrared domain.\n","authors":["Burak Sevsay","Erdem Akagündüz"],"pdf_url":"https://arxiv.org/pdf/2408.13925v1.pdf","comment":"ICMV 2024"},{"id":"http://arxiv.org/abs/2408.13922v1","updated":"2024-08-25T19:18:18Z","published":"2024-08-25T19:18:18Z","title":"COMPOSE: Comprehensive Portrait Shadow Editing","summary":" Existing portrait relighting methods struggle with precise control over\nfacial shadows, particularly when faced with challenges such as handling hard\nshadows from directional light sources or adjusting shadows while remaining in\nharmony with existing lighting conditions. In many situations, completely\naltering input lighting is undesirable for portrait retouching applications:\none may want to preserve some authenticity in the captured environment.\nExisting shadow editing methods typically restrict their application to just\nthe facial region and often offer limited lighting control options, such as\nshadow softening or rotation. In this paper, we introduce COMPOSE: a novel\nshadow editing pipeline for human portraits, offering precise control over\nshadow attributes such as shape, intensity, and position, all while preserving\nthe original environmental illumination of the portrait. This level of\ndisentanglement and controllability is obtained thanks to a novel decomposition\nof the environment map representation into ambient light and an editable\ngaussian dominant light source. COMPOSE is a four-stage pipeline that consists\nof light estimation and editing, light diffusion, shadow synthesis, and finally\nshadow editing. We define facial shadows as the result of a dominant light\nsource, encoded using our novel gaussian environment map representation.\nUtilizing an OLAT dataset, we have trained models to: (1) predict this light\nsource representation from images, and (2) generate realistic shadows using\nthis representation. We also demonstrate comprehensive and intuitive shadow\nediting with our pipeline. 
Through extensive quantitative and qualitative\nevaluations, we have demonstrated the robust capability of our system in shadow\nediting.\n","authors":["Andrew Hou","Zhixin Shu","Xuaner Zhang","He Zhang","Yannick Hold-Geoffroy","Jae Shin Yoon","Xiaoming Liu"],"pdf_url":"https://arxiv.org/pdf/2408.13922v1.pdf","comment":"Accepted at ECCV 2024"},{"id":"http://arxiv.org/abs/2408.13912v1","updated":"2024-08-25T18:27:20Z","published":"2024-08-25T18:27:20Z","title":"Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs","summary":" In this paper, we introduce Splatt3R, a pose-free, feed-forward method for\nin-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given\nuncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without\nrequiring any camera parameters or depth information. For generalizability, we\nstart from a 'foundation' 3D geometry reconstruction method, MASt3R, and extend\nit to be a full 3D structure and appearance reconstructor. Specifically, unlike\nthe original MASt3R which reconstructs only 3D point clouds, we predict the\nadditional Gaussian attributes required to construct a Gaussian primitive for\neach point. Hence, unlike other novel view synthesis methods, Splatt3R is first\ntrained by optimizing the 3D point cloud's geometry loss, and then a novel view\nsynthesis objective. By doing this, we avoid the local minima present in\ntraining 3D Gaussian Splats from stereo views. We also propose a novel loss\nmasking strategy that we empirically find is critical for strong performance on\nextrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and\ndemonstrate excellent generalisation to uncalibrated, in-the-wild images.\nSplatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the\nresultant splats can be rendered in real-time.\n","authors":["Brandon Smart","Chuanxia Zheng","Iro Laina","Victor Adrian Prisacariu"],"pdf_url":"https://arxiv.org/pdf/2408.13912v1.pdf","comment":"Our project page can be found at: https://splatt3r.active.vision/"},{"id":"http://arxiv.org/abs/2408.13909v1","updated":"2024-08-25T18:10:16Z","published":"2024-08-25T18:10:16Z","title":"LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages\n in Multimodal Image Retrieval Task","summary":" This research explores the development of multimodal vision-language models\nfor image retrieval in low-resource languages, specifically Azerbaijani.\nExisting vision-language models primarily support high-resource languages, and\nfine-tuning them remains computationally demanding. To address challenges in\nvision-language retrieval for low-resource languages, we integrated the CLIP\nmodel architecture and employed several techniques to balance computational\nefficiency with performance. These techniques include synthetic data generation\nthrough machine translation, image augmentation, and further training the\nattention mechanisms of transformer-based models with domain-specific data. We\nintegrated Multilingual BERT as a text encoder with image encoders like\nResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer.\nOur study found that models like EfficientNet0 and Tiny Swin Transformer\nperform best on the datasets they were trained on, such as COCO, Flickr30k, and\nFlickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from\n0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a\nnew state of the art in vision-language retrieval. 
We share our configurations\nand results to support further research. Code and pre-trained models are\navailable at https://github.com/aliasgerovs/azclip.\n","authors":["Ali Asgarov","Samir Rustamov"],"pdf_url":"https://arxiv.org/pdf/2408.13909v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13906v1","updated":"2024-08-25T18:02:36Z","published":"2024-08-25T18:02:36Z","title":"ConVis: Contrastive Decoding with Hallucination Visualization for\n Mitigating Hallucinations in Multimodal Large Language Models","summary":" Hallucinations in Multimodal Large Language Models (MLLMs) where generated\nresponses fail to accurately reflect the given image pose a significant\nchallenge to their reliability. To address this, we introduce ConVis, a novel\ntraining-free contrastive decoding method. ConVis leverages a text-to-image\n(T2I) generation model to semantically reconstruct the given image from\nhallucinated captions. By comparing the contrasting probability distributions\nproduced by the original and reconstructed images, ConVis enables MLLMs to\ncapture visual contrastive signals that penalize hallucination generation.\nNotably, this method operates purely within the decoding process, eliminating\nthe need for additional data or model updates. Our extensive experiments on\nfive popular benchmarks demonstrate that ConVis effectively reduces\nhallucinations across various MLLMs, highlighting its potential to enhance\nmodel reliability.\n","authors":["Yeji Park","Deokyeong Lee","Junsuk Choe","Buru Chang"],"pdf_url":"https://arxiv.org/pdf/2408.13906v1.pdf","comment":"First two authors contributed equally. Source code is available at\n https://github.com/yejipark-m/ConVis"},{"id":"http://arxiv.org/abs/2407.10159v2","updated":"2024-08-25T17:59:22Z","published":"2024-07-14T10:59:34Z","title":"RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D\n LiDAR Segmentation","summary":" 3D point clouds play a pivotal role in outdoor scene perception, especially\nin the context of autonomous driving. Recent advancements in 3D LiDAR\nsegmentation often focus intensely on the spatial positioning and distribution\nof points for accurate segmentation. However, these methods, while robust in\nvariable conditions, encounter challenges due to sole reliance on coordinates\nand point intensity, leading to poor isometric invariance and suboptimal\nsegmentation. To tackle this challenge, our work introduces Range-Aware\nPointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg\narchitecture. Our RAPiD features exhibit rigid transformation invariance and\neffectively adapt to variations in point density, with a design focus on\ncapturing the localized geometry of neighboring structures. They utilize\ninherent LiDAR isotropic radiation and semantic categorization for enhanced\nlocal representation and computational efficiency, while incorporating a 4D\ndistance metric that integrates geometric and surface material reflectivity for\nimproved semantic segmentation. To effectively embed high-dimensional RAPiD\nfeatures, we propose a double-nested autoencoder structure with a novel\nclass-aware embedding objective to encode high-dimensional features into\nmanageable voxel-wise embeddings. Additionally, we propose RAPiD-Seg which\nincorporates a channel-wise attention fusion and two effective RAPiD-Seg\nvariants, further optimizing the embedding for enhanced performance and\ngeneralization. 
Our method outperforms contemporary LiDAR segmentation work in\nterms of mIoU on SemanticKITTI (76.1) and nuScenes (83.6) datasets.\n","authors":["Li Li","Hubert P. H. Shum","Toby P. Breckon"],"pdf_url":"https://arxiv.org/pdf/2407.10159v2.pdf","comment":"ECCV 2024 (Oral); 18 pages, 6 figures, 7 tables; Code at\n https://github.com/l1997i/rapid_seg"},{"id":"http://arxiv.org/abs/2408.13902v1","updated":"2024-08-25T17:59:17Z","published":"2024-08-25T17:59:17Z","title":"TraIL-Det: Transformation-Invariant Local Feature Networks for 3D LiDAR\n Object Detection with Unsupervised Pre-Training","summary":" 3D point clouds are essential for perceiving outdoor scenes, especially\nwithin the realm of autonomous driving. Recent advances in 3D LiDAR Object\nDetection focus primarily on the spatial positioning and distribution of points\nto ensure accurate detection. However, despite their robust performance in\nvariable conditions, these methods are hindered by their sole reliance on\ncoordinates and point intensity, resulting in inadequate isometric invariance\nand suboptimal detection outcomes. To tackle this challenge, our work\nintroduces Transformation-Invariant Local (TraIL) features and the associated\nTraIL-Det architecture. Our TraIL features exhibit rigid transformation\ninvariance and effectively adapt to variations in point density, with a design\nfocus on capturing the localized geometry of neighboring structures. They\nutilize the inherent isotropic radiation of LiDAR to enhance local\nrepresentation, improve computational efficiency, and boost detection\nperformance. To effectively process the geometric relations among points within\neach proposal, we propose a Multi-head self-Attention Encoder (MAE) with\nasymmetric geometric features to encode high-dimensional TraIL features into\nmanageable representations. Our method outperforms contemporary self-supervised\n3D object detection approaches in terms of mAP on KITTI (67.8, 20% label,\nmoderate) and Waymo (68.9, 20% label, moderate) datasets under various label\nratios (20%, 50%, and 100%).\n","authors":["Li Li","Tanqiu Qiao","Hubert P. H. Shum","Toby P. Breckon"],"pdf_url":"https://arxiv.org/pdf/2408.13902v1.pdf","comment":"BMVC 2024; 15 pages, 3 figures, 3 tables; Code at\n https://github.com/l1997i/rapid_seg"},{"id":"http://arxiv.org/abs/2408.13898v1","updated":"2024-08-25T17:42:05Z","published":"2024-08-25T17:42:05Z","title":"Evaluating Attribute Comprehension in Large Vision-Language Models","summary":" Currently, large vision-language models have gained promising progress on\nmany downstream tasks. However, they still suffer many challenges in\nfine-grained visual understanding tasks, such as object attribute\ncomprehension. Besides, there have been growing efforts on the evaluations of\nlarge vision-language models, but lack of in-depth study of attribute\ncomprehension and the visual language fine-tuning process. In this paper, we\npropose to evaluate the attribute comprehension ability of large\nvision-language models from two perspectives: attribute recognition and\nattribute hierarchy understanding. We evaluate three vision-language\ninteractions, including visual question answering, image-text matching, and\nimage-text cosine similarity. Furthermore, we explore the factors affecting\nattribute comprehension during fine-tuning. 
Through a series of quantitative\nand qualitative experiments, we introduce three main findings: (1) Large\nvision-language models possess good attribute recognition ability, but their\nhierarchical understanding ability is relatively limited. (2) Compared to ITC,\nITM exhibits superior capability in capturing finer details, making it more\nsuitable for attribute understanding tasks. (3) The attribute information in\nthe captions used for fine-tuning plays a crucial role in attribute\nunderstanding. We hope this work can help guide future progress in fine-grained\nvisual understanding of large vision-language models.\n","authors":["Haiwen Zhang","Zixi Yang","Yuanzhi Liu","Xinran Wang","Zheqi He","Kongming Liang","Zhanyu Ma"],"pdf_url":"https://arxiv.org/pdf/2408.13898v1.pdf","comment":"15 pages, 4 figures"},{"id":"http://arxiv.org/abs/2408.13896v1","updated":"2024-08-25T17:33:40Z","published":"2024-08-25T17:33:40Z","title":"RT-Attack: Jailbreaking Text-to-Image Models via Random Token","summary":" Recently, Text-to-Image(T2I) models have achieved remarkable success in image\ngeneration and editing, yet these models still have many potential issues,\nparticularly in generating inappropriate or Not-Safe-For-Work(NSFW) content.\nStrengthening attacks and uncovering such vulnerabilities can advance the\ndevelopment of reliable and practical T2I models. Most of the previous works\ntreat T2I models as white-box systems, using gradient optimization to generate\nadversarial prompts. However, accessing the model's gradient is often\nimpossible in real-world scenarios. Moreover, existing defense methods, those\nusing gradient masking, are designed to prevent attackers from obtaining\naccurate gradient information. While some black-box jailbreak attacks have been\nexplored, these typically rely on simply replacing sensitive words, leading to\nsuboptimal attack performance. To address this issue, we introduce a two-stage\nquery-based black-box attack method utilizing random search. In the first\nstage, we establish a preliminary prompt by maximizing the semantic similarity\nbetween the adversarial and target harmful prompts. In the second stage, we use\nthis initial prompt to refine our approach, creating a detailed adversarial\nprompt aimed at jailbreaking and maximizing the similarity in image features\nbetween the images generated from this prompt and those produced by the target\nharmful prompt. Extensive experiments validate the effectiveness of our method\nin attacking the latest prompt checkers, post-hoc image checkers, securely\ntrained T2I models, and online commercial models.\n","authors":["Sensen Gao","Xiaojun Jia","Yihao Huang","Ranjie Duan","Jindong Gu","Yang Liu","Qing Guo"],"pdf_url":"https://arxiv.org/pdf/2408.13896v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.20891v2","updated":"2024-08-25T17:07:49Z","published":"2024-07-30T15:07:13Z","title":"Bayesian Low-Rank LeArning (Bella): A Practical Approach to Bayesian\n Neural Networks","summary":" Computational complexity of Bayesian learning is impeding its adoption in\npractical, large-scale tasks. Despite demonstrations of significant merits such\nas improved robustness and resilience to unseen or out-of-distribution inputs\nover their non- Bayesian counterparts, their practical use has faded to near\ninsignificance. In this study, we introduce an innovative framework to mitigate\nthe computational burden of Bayesian neural networks (BNNs). 
Our approach\nfollows the principle of Bayesian techniques based on deep ensembles, but\nsignificantly reduces their cost via multiple low-rank perturbations of\nparameters arising from a pre-trained neural network. Both vanilla version of\nensembles as well as more sophisticated schemes such as Bayesian learning with\nStein Variational Gradient Descent (SVGD), previously deemed impractical for\nlarge models, can be seamlessly implemented within the proposed framework,\ncalled Bayesian Low-Rank LeArning (Bella). In a nutshell, i) Bella achieves a\ndramatic reduction in the number of trainable parameters required to\napproximate a Bayesian posterior; and ii) it not only maintains, but in some\ninstances, surpasses the performance of conventional Bayesian learning methods\nand non-Bayesian baselines. Our results with large-scale tasks such as\nImageNet, CAMELYON17, DomainNet, VQA with CLIP, LLaVA demonstrate the\neffectiveness and versatility of Bella in building highly scalable and\npractical Bayesian deep models for real-world applications.\n","authors":["Bao Gia Doan","Afshar Shamsi","Xiao-Yu Guo","Arash Mohammadi","Hamid Alinejad-Rokny","Dino Sejdinovic","Damith C. Ranasinghe","Ehsan Abbasnejad"],"pdf_url":"https://arxiv.org/pdf/2407.20891v2.pdf","comment":"17 pages, 14 figures, 11 tables"},{"id":"http://arxiv.org/abs/2408.13890v1","updated":"2024-08-25T16:43:47Z","published":"2024-08-25T16:43:47Z","title":"Making Large Language Models Better Planners with Reasoning-Decision\n Alignment","summary":" Data-driven approaches for autonomous driving (AD) have been widely adopted\nin the past decade but are confronted with dataset bias and uninterpretability.\nInspired by the knowledge-driven nature of human driving, recent approaches\nexplore the potential of large language models (LLMs) to improve understanding\nand decision-making in traffic scenarios. They find that the pretrain-finetune\nparadigm of LLMs on downstream data with the Chain-of-Thought (CoT) reasoning\nprocess can enhance explainability and scene understanding. However, such a\npopular strategy proves to suffer from the notorious problems of misalignment\nbetween the crafted CoTs against the consequent decision-making, which remains\nuntouched by previous LLM-based AD methods. To address this problem, we\nmotivate an end-to-end decision-making model based on multimodality-augmented\nLLM, which simultaneously executes CoT reasoning and carries out planning\nresults. Furthermore, we propose a reasoning-decision alignment constraint\nbetween the paired CoTs and planning results, imposing the correspondence\nbetween reasoning and decision-making. Moreover, we redesign the CoTs to enable\nthe model to comprehend complex scenarios and enhance decision-making\nperformance. We dub our proposed large language planners with\nreasoning-decision alignment as RDA-Driver. Experimental evaluations on the\nnuScenes and DriveLM-nuScenes benchmarks demonstrate the effectiveness of our\nRDA-Driver in enhancing the performance of end-to-end AD systems. 
Specifically,\nour RDA-Driver achieves state-of-the-art planning performance on the nuScenes\ndataset with 0.80 L2 error and 0.32 collision rate, and also achieves leading\nresults on challenging DriveLM-nuScenes benchmarks with 0.82 L2 error and 0.38\ncollision rate.\n","authors":["Zhijian Huang","Tao Tang","Shaoxiang Chen","Sihao Lin","Zequn Jie","Lin Ma","Guangrun Wang","Xiaodan Liang"],"pdf_url":"https://arxiv.org/pdf/2408.13890v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14773v2","updated":"2024-08-25T16:11:03Z","published":"2023-12-22T15:39:37Z","title":"Cross-Age and Cross-Site Domain Shift Impacts on Deep Learning-Based\n White Matter Fiber Estimation in Newborn and Baby Brains","summary":" Deep learning models have shown great promise in estimating tissue\nmicrostructure from limited diffusion magnetic resonance imaging data. However,\nthese models face domain shift challenges when test and train data are from\ndifferent scanners and protocols, or when the models are applied to data with\ninherent variations such as the developing brains of infants and children\nscanned at various ages. Several techniques have been proposed to address some\nof these challenges, such as data harmonization or domain adaptation in the\nadult brain. However, those techniques remain unexplored for the estimation of\nfiber orientation distribution functions in the rapidly developing brains of\ninfants. In this work, we extensively investigate the age effect and domain\nshift within and across two different cohorts of 201 newborns and 165 babies\nusing the Method of Moments and fine-tuning strategies. Our results show that\nreduced variations in the microstructural development of babies in comparison\nto newborns directly impact the deep learning models' cross-age performance. We\nalso demonstrate that a small number of target domain samples can significantly\nmitigate domain shift problems.\n","authors":["Rizhong Lin","Ali Gholipour","Jean-Philippe Thiran","Davood Karimi","Hamza Kebiri","Meritxell Bach Cuadra"],"pdf_url":"https://arxiv.org/pdf/2312.14773v2.pdf","comment":"5 pages, 5 figures; accepted as an Oral Presentation at the 2024 IEEE\n International Symposium on Biomedical Imaging (ISBI) in Athens, Greece"},{"id":"http://arxiv.org/abs/2408.13877v1","updated":"2024-08-25T15:56:33Z","published":"2024-08-25T15:56:33Z","title":"Camouflaged Object Tracking: A Benchmark","summary":" Visual tracking has seen remarkable advancements, largely driven by the\navailability of large-scale training datasets that have enabled the development\nof highly accurate and robust algorithms. While significant progress has been\nmade in tracking general objects, research on more challenging scenarios, such\nas tracking camouflaged objects, remains limited. Camouflaged objects, which\nblend seamlessly with their surroundings or other objects, present unique\nchallenges for detection and tracking in complex environments. This challenge\nis particularly critical in applications such as military, security,\nagriculture, and marine monitoring, where precise tracking of camouflaged\nobjects is essential. To address this gap, we introduce the Camouflaged Object\nTracking Dataset (COTD), a specialized benchmark designed specifically for\nevaluating camouflaged object tracking methods. The COTD dataset comprises 200\nsequences and approximately 80,000 frames, each annotated with detailed\nbounding boxes. 
Our evaluation of 20 existing tracking algorithms reveals\nsignificant deficiencies in their performance with camouflaged objects. To\naddress these issues, we propose a novel tracking framework, HiPTrack-MLS,\nwhich demonstrates promising results in improving tracking performance for\ncamouflaged objects. COTD and code are available at\nhttps://github.com/openat25/HIPTrack-MLS.\n","authors":["Xiaoyu Guo","Pengzhi Zhong","Hao Zhang","Ling Huang","Defeng Huang","Shuiwang Li"],"pdf_url":"https://arxiv.org/pdf/2408.13877v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12528v2","updated":"2024-08-25T15:46:51Z","published":"2024-08-22T16:32:32Z","title":"Show-o: One Single Transformer to Unify Multimodal Understanding and\n Generation","summary":" We present a unified transformer, i.e., Show-o, that unifies multimodal\nunderstanding and generation. Unlike fully autoregressive models, Show-o\nunifies autoregressive and (discrete) diffusion modeling to adaptively handle\ninputs and outputs of various and mixed modalities. The unified model flexibly\nsupports a wide range of vision-language tasks including visual\nquestion-answering, text-to-image generation, text-guided\ninpainting/extrapolation, and mixed-modality generation. Across various\nbenchmarks, it demonstrates comparable or superior performance to existing\nindividual models with an equivalent or larger number of parameters tailored\nfor understanding or generation. This significantly highlights its potential as\na next-generation foundation model. Code and models are released at\nhttps://github.com/showlab/Show-o.\n","authors":["Jinheng Xie","Weijia Mao","Zechen Bai","David Junhao Zhang","Weihao Wang","Kevin Qinghong Lin","Yuchao Gu","Zhijie Chen","Zhenheng Yang","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2408.12528v2.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2408.13868v1","updated":"2024-08-25T15:36:28Z","published":"2024-08-25T15:36:28Z","title":"Particle-Filtering-based Latent Diffusion for Inverse Problems","summary":" Current strategies for solving image-based inverse problems apply latent\ndiffusion models to perform posterior sampling. However, almost all approaches\nmake no explicit attempt to explore the solution space, instead drawing only a\nsingle sample from a Gaussian distribution from which to generate their\nsolution. In this paper, we introduce a particle-filtering-based framework for\na nonlinear exploration of the solution space in the initial stages of reverse\nSDE methods. Our proposed particle-filtering-based latent diffusion (PFLD)\nmethod and proposed problem formulation and framework can be applied to any\ndiffusion-based solution for linear or nonlinear inverse problems. Our\nexperimental results show that PFLD outperforms the SoTA solver PSLD on the\nFFHQ-1K and ImageNet-1K datasets on inverse problem tasks of super resolution,\nGaussian deblurring and inpainting.\n","authors":["Amir Nazemi","Mohammad Hadi Sepanj","Nicholas Pellegrino","Chris Czarnecki","Paul Fieguth"],"pdf_url":"https://arxiv.org/pdf/2408.13868v1.pdf","comment":"Mohammad Hadi Sepanj, Nicholas Pellegrino, and Chris Czarnecki\n contributed equally"},{"id":"http://arxiv.org/abs/2408.13860v1","updated":"2024-08-25T15:17:43Z","published":"2024-08-25T15:17:43Z","title":"Knowledge-Aware Reasoning over Multimodal Semi-structured Tables","summary":" Existing datasets for tabular question answering typically focus exclusively\non text within cells. 
However, real-world data is inherently multimodal, often\nblending images such as symbols, faces, icons, patterns, and charts with\ntextual content in tables. With the evolution of AI models capable of\nmultimodal reasoning, it is pertinent to assess their efficacy in handling such\nstructured data. This study investigates whether current AI models can perform\nknowledge-aware reasoning on multimodal structured data. We explore their\nability to reason on tables that integrate both images and text, introducing\nMMTabQA, a new dataset designed for this purpose. Our experiments highlight\nsubstantial challenges for current AI models in effectively integrating and\ninterpreting multiple text and image inputs, understanding visual context, and\ncomparing visual content across images. These findings establish our dataset as\na robust benchmark for advancing AI's comprehension and capabilities in\nanalyzing multimodal structured data.\n","authors":["Suyash Vardhan Mathur","Jainit Sushil Bafna","Kunal Kartik","Harshita Khandelwal","Manish Shrivastava","Vivek Gupta","Mohit Bansal","Dan Roth"],"pdf_url":"https://arxiv.org/pdf/2408.13860v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13858v1","updated":"2024-08-25T15:05:32Z","published":"2024-08-25T15:05:32Z","title":"Draw Like an Artist: Complex Scene Generation with Diffusion Model via\n Composition, Painting, and Retouching","summary":" Recent advances in text-to-image diffusion models have demonstrated\nimpressive capabilities in image quality. However, complex scene generation\nremains relatively unexplored, and even the definition of `complex scene'\nitself remains unclear. In this paper, we address this gap by providing a\nprecise definition of complex scenes and introducing a set of Complex\nDecomposition Criteria (CDC) based on this definition. Inspired by the artists\npainting process, we propose a training-free diffusion framework called Complex\nDiffusion (CxD), which divides the process into three stages: composition,\npainting, and retouching. Our method leverages the powerful chain-of-thought\ncapabilities of large language models (LLMs) to decompose complex prompts based\non CDC and to manage composition and layout. We then develop an attention\nmodulation method that guides simple prompts to specific regions to complete\nthe complex scene painting. Finally, we inject the detailed output of the LLM\ninto a retouching model to enhance the image details, thus implementing the\nretouching stage. Extensive experiments demonstrate that our method outperforms\nprevious SOTA approaches, significantly improving the generation of\nhigh-quality, semantically consistent, and visually diverse images for complex\nscenes, even with intricate prompts.\n","authors":["Minghao Liu","Le Zhang","Yingjie Tian","Xiaochao Qu","Luoqi Liu","Ting Liu"],"pdf_url":"https://arxiv.org/pdf/2408.13858v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13854v1","updated":"2024-08-25T14:47:25Z","published":"2024-08-25T14:47:25Z","title":"Tangram: A Challenging Benchmark for Geometric Element Recognizing","summary":" Significant advancements in Large Multimodal Models (LMMs) have enabled them\nto tackle complex problems involving visual-mathematical reasoning. However,\ntheir ability to identify geometric elements remains understudied. To bridge\nthis gap, we introduce Tangram, a novel benchmark designed to evaluate the\nperformance of LMMs on geometric element recognition. 
Tangram includes 1,080\ndiverse geometric diagrams sourced from primary and secondary school exams,\ncompetitions, and textbooks, covering from simple basic geometric shapes to\ncomplex combinations. Each diagram is associated with four questions, resulting\nin a total of 4,320 visual-question-answer pairs. Unlike existing benchmarks\nthat seek higher-level cognition and reasoning, Tangram focuses on the\nunderstanding of geometric elements, requiring models to perform a \"simple but\ninteresting\" counting task. Systematic evaluation of 10 prominent LMMs, such as\nGPT-4o and Claude 3.5 Sonnet, shows that even in the seemingly simple task,\nthese models still face significant challenges. Notably, the overall accuracy\nof the top performer across all tested models is only 56.8%, marking a\nsignificant gap when compared to human performance. These findings highlight\nthe limitations of current multimodal artificial intelligence systems in\nhandling basic perception tasks, and will inspire the development of the next\ngeneration of expert-level multimodal foundational models. The Tangram and\nevaluation code will be available soon.\n","authors":["Jiamin Tang","Chao Zhang","Xudong Zhu","Mengchi Liu"],"pdf_url":"https://arxiv.org/pdf/2408.13854v1.pdf","comment":"12 pages, 7 figures"},{"id":"http://arxiv.org/abs/2408.13852v1","updated":"2024-08-25T14:46:29Z","published":"2024-08-25T14:46:29Z","title":"LaneTCA: Enhancing Video Lane Detection with Temporal Context\n Aggregation","summary":" In video lane detection, there are rich temporal contexts among successive\nframes, which is under-explored in existing lane detectors. In this work, we\npropose LaneTCA to bridge the individual video frames and explore how to\neffectively aggregate the temporal context. Technically, we develop an\naccumulative attention module and an adjacent attention module to abstract the\nlong-term and short-term temporal context, respectively. The accumulative\nattention module continuously accumulates visual information during the journey\nof a vehicle, while the adjacent attention module propagates this lane\ninformation from the previous frame to the current frame. The two modules are\nmeticulously designed based on the transformer architecture. Finally, these\nlong-short context features are fused with the current frame features to\npredict the lane lines in the current frame. Extensive quantitative and\nqualitative experiments are conducted on two prevalent benchmark datasets. The\nresults demonstrate the effectiveness of our method, achieving several new\nstate-of-the-art records. The codes and models are available at\nhttps://github.com/Alex-1337/LaneTCA\n","authors":["Keyi Zhou","Li Li","Wengang Zhou","Yonghui Wang","Hao Feng","Houqiang Li"],"pdf_url":"https://arxiv.org/pdf/2408.13852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13845v1","updated":"2024-08-25T14:28:49Z","published":"2024-08-25T14:28:49Z","title":"Bring the Power of Diffusion Model to Defect Detection","summary":" Due to the high complexity and technical requirements of industrial\nproduction processes, surface defects will inevitably appear, which seriously\naffects the quality of products. Although existing lightweight detection\nnetworks are highly efficient, they are susceptible to false or missed\ndetection of non-salient defects due to the lack of semantic information. In\ncontrast, the diffusion model can generate higher-order semantic\nrepresentations in the denoising process. 
Therefore, the aim of this paper is\nto incorporate the higher-order modelling capability of the diffusion model\ninto the detection model, so as to better assist in the classification and\nlocalization of difficult targets. First, the denoising diffusion probabilistic\nmodel (DDPM) is pre-trained to extract the features of denoising process to\nconstruct as a feature repository. In particular, to avoid the potential\nbottleneck of memory caused by the dataloader loading high-dimensional\nfeatures, a residual convolutional variational auto-encoder (ResVAE) is\ndesigned to further compress the feature repository. The image is fed into both\nimage backbone and feature repository for feature extraction and querying\nrespectively. The queried latent features are reconstructed and filtered to\nobtain high-dimensional DDPM features. A dynamic cross-fusion method is\nproposed to fully refine the contextual features of DDPM to optimize the\ndetection model. Finally, we employ knowledge distillation to migrate the\nhigher-order modelling capabilities back into the lightweight baseline model\nwithout additional efficiency cost. Experiment results demonstrate that our\nmethod achieves competitive results on several industrial datasets.\n","authors":["Xuyi Yu"],"pdf_url":"https://arxiv.org/pdf/2408.13845v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12253v2","updated":"2024-08-25T14:13:40Z","published":"2024-08-22T09:45:24Z","title":"Epsilon: Exploring Comprehensive Visual-Semantic Projection for\n Multi-Label Zero-Shot Learning","summary":" This paper investigates a challenging problem of zero-shot learning in the\nmulti-label scenario (MLZSL), wherein the model is trained to recognize\nmultiple unseen classes within a sample (e.g., an image) based on seen classes\nand auxiliary knowledge, e.g., semantic information. Existing methods usually\nresort to analyzing the relationship of various seen classes residing in a\nsample from the dimension of spatial or semantic characteristics and\ntransferring the learned model to unseen ones. However, they neglect the\nintegrity of local and global features. Although the use of the attention\nstructure will accurately locate local features, especially objects, it will\nsignificantly lose its integrity, and the relationship between classes will\nalso be affected. Rough processing of global features will also directly affect\ncomprehensiveness. This neglect will make the model lose its grasp of the main\ncomponents of the image. Relying only on the local existence of seen classes\nduring the inference stage introduces unavoidable bias. In this paper, we\npropose a novel and comprehensive visual-semantic framework for MLZSL, dubbed\nEpsilon, to fully make use of such properties and enable a more accurate and\nrobust visual-semantic projection. In terms of spatial information, we achieve\neffective refinement by group aggregating image features into several semantic\nprompts. It can aggregate semantic information rather than class information,\npreserving the correlation between semantics. In terms of global semantics, we\nuse global forward propagation to collect as much information as possible to\nensure that semantics are not omitted. 
Experiments on large-scale MLZSL\nbenchmark datasets NUS-Wide and Open-Images-v4 demonstrate that the proposed\nEpsilon outperforms other state-of-the-art methods with large margins.\n","authors":["Ziming Liu","Jingcai Guo","Song Guo","Xiaocheng Lu"],"pdf_url":"https://arxiv.org/pdf/2408.12253v2.pdf","comment":"11 pages, 6 figures. arXiv admin note: substantial text overlap with\n arXiv:2309.00923"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2305.04928v5","updated":"2024-08-25T06:22:00Z","published":"2023-05-05T12:14:22Z","title":"From Zero to Hero: Harnessing Transformers for Biomedical Named Entity\n Recognition in Zero- and Few-shot Contexts","summary":" Supervised named entity recognition (NER) in the biomedical domain depends on\nlarge sets of annotated texts with the given named entities. The creation of\nsuch datasets can be time-consuming and expensive, while extraction of new\nentities requires additional annotation tasks and retraining the model. To\naddress these challenges, this paper proposes a method for zero- and few-shot\nNER in the biomedical domain. The method is based on transforming the task of\nmulti-class token classification into binary token classification and\npre-training on a large amount of datasets and biomedical entities, which allow\nthe model to learn semantic relations between the given and potentially novel\nnamed entity labels. We have achieved average F1 scores of 35.44% for zero-shot\nNER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot\nNER on 9 diverse evaluated biomedical entities with fine-tuned PubMedBERT-based\nmodel. The results demonstrate the effectiveness of the proposed method for\nrecognizing new biomedical entities with no or limited number of examples,\noutperforming previous transformer-based methods, and being comparable to\nGPT3-based models using models with over 1000 times fewer parameters. We make\nmodels and developed code publicly available.\n","authors":["Miloš Košprdić","Nikola Prodanović","Adela Ljajić","Bojana Bašaragin","Nikola Milošević"],"pdf_url":"https://arxiv.org/pdf/2305.04928v5.pdf","comment":"Collaboration between Bayer Pharma R&D and Serbian Institute for\n Artificial Intelligence Research and Development. Artificial Intelligence in\n Medicine (2024)"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2408.12574v2","updated":"2024-08-25T23:58:25Z","published":"2024-08-22T17:41:45Z","title":"MuMA-ToM: Multi-modal Multi-Agent Theory of Mind","summary":" Understanding people's social interactions in complex real-world scenarios\noften relies on intricate mental reasoning. To truly understand how and why\npeople interact with one another, we must infer the underlying mental states\nthat give rise to the social interactions, i.e., Theory of Mind reasoning in\nmulti-agent interactions. Additionally, social interactions are often\nmulti-modal -- we can watch people's actions, hear their conversations, and/or\nread about their past behaviors. For AI systems to successfully and safely\ninteract with people in real-world environments, they also need to understand\npeople's mental states as well as their inferences about each other's mental\nstates based on multi-modal information about their interactions. For this, we\nintroduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark.\nMuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates\nmental reasoning in embodied multi-agent interactions. 
In MuMA-ToM, we provide\nvideo and text descriptions of people's multi-modal behavior in realistic\nhousehold environments. Based on the context, we then ask questions about\npeople's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM\nin a human experiment and provided a human baseline. We also proposed a novel\nmulti-modal, multi-agent ToM model, LIMP (Language model-based Inverse\nMulti-agent Planning). Our experimental results show that LIMP significantly\noutperforms state-of-the-art methods, including large multi-modal models (e.g.,\nGPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.\n","authors":["Haojun Shi","Suyu Ye","Xinyu Fang","Chuanyang Jin","Leyla Isik","Yen-Ling Kuo","Tianmin Shu"],"pdf_url":"https://arxiv.org/pdf/2408.12574v2.pdf","comment":"Project website: https://scai.cs.jhu.edu/projects/MuMA-ToM/ Code:\n https://github.com/SCAI-JHU/MuMA-ToM"},{"id":"http://arxiv.org/abs/2408.13961v1","updated":"2024-08-25T23:49:35Z","published":"2024-08-25T23:49:35Z","title":"Optimizing Luxury Vehicle Dealership Networks: A Graph Neural Network\n Approach to Site Selection","summary":" This study presents a novel application of Graph Neural Networks (GNNs) to\noptimize dealership network planning for a luxury car manufacturer in the U.S.\nBy conducting a comprehensive literature review on dealership location\ndeterminants, the study identifies 65 county-level explanatory variables,\naugmented by two additional measures of regional interconnectedness derived\nfrom social and mobility data. An ablation study involving 34 variable\ncombinations and ten state-of-the-art GNN operators reveals key insights into\nthe predictive power of various variables, particularly highlighting the\nsignificance of competition, demographic factors, and mobility patterns in\ninfluencing dealership location decisions. The analysis pinpoints seven\nspecific counties as promising targets for network expansion. This research not\nonly illustrates the effectiveness of GNNs in solving complex geospatial\ndecision-making problems but also provides actionable recommendations and\nvaluable methodological insights for industry practitioners.\n","authors":["Luca Silvano Carocci","Qiwei Han"],"pdf_url":"https://arxiv.org/pdf/2408.13961v1.pdf","comment":"10 pages, 4 figures, 6 tables"},{"id":"http://arxiv.org/abs/2408.13960v1","updated":"2024-08-25T23:48:11Z","published":"2024-08-25T23:48:11Z","title":"Time Series Analysis for Education: Methods, Applications, and Future\n Directions","summary":" Recent advancements in the collection and analysis of sequential educational\ndata have brought time series analysis to a pivotal position in educational\nresearch, highlighting its essential role in facilitating data-driven\ndecision-making. However, there is a lack of comprehensive summaries that\nconsolidate these advancements. To the best of our knowledge, this paper is the\nfirst to provide a comprehensive review of time series analysis techniques\nspecifically within the educational context. We begin by exploring the\nlandscape of educational data analytics, categorizing various data sources and\ntypes relevant to education. We then review four prominent time series\nmethods-forecasting, classification, clustering, and anomaly\ndetection-illustrating their specific application points in educational\nsettings. 
Subsequently, we present a range of educational scenarios and\napplications, focusing on how these methods are employed to address diverse\neducational tasks, which highlights the practical integration of multiple time\nseries methods to solve complex educational problems. Finally, we conclude with\na discussion on future directions, including personalized learning analytics,\nmultimodal data fusion, and the role of large language models (LLMs) in\neducational time series. The contributions of this paper include a detailed\ntaxonomy of educational data, a synthesis of time series techniques with\nspecific educational applications, and a forward-looking perspective on\nemerging trends and future research opportunities in educational analysis. The\nrelated papers and resources are available and regularly updated at the project\npage.\n","authors":["Shengzhong Mao","Chaoli Zhang","Yichi Song","Jindong Wang","Xiao-Jun Zeng","Zenglin Xu","Qingsong Wen"],"pdf_url":"https://arxiv.org/pdf/2408.13960v1.pdf","comment":"24 pages, 3 figures, 6 tables, project page: see\n https://github.com/ai-for-edu/time-series-analysis-for-education"},{"id":"http://arxiv.org/abs/2408.13958v1","updated":"2024-08-25T23:41:39Z","published":"2024-08-25T23:41:39Z","title":"Prediction of COPD Using Machine Learning, Clinical Summary Notes, and\n Vital Signs","summary":" Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung\ndisease that causes obstructed airflow from the lungs. In the United States,\nmore than 15.7 million Americans have been diagnosed with COPD, with 96% of\nindividuals living with at least one other chronic health condition. It is the\n4th leading cause of death in the country. Over 2.2 million patients are\nadmitted to hospitals annually due to COPD exacerbations. Monitoring and\npredicting patient exacerbations on-time could save their life. This paper\npresents two different predictive models to predict COPD exacerbation using AI\nand natural language processing (NLP) approaches. These models use respiration\nsummary notes, symptoms, and vital signs. To train and test these models, data\nrecords containing physiologic signals and vital signs time series were used.\nThese records were captured from patient monitors and comprehensive clinical\ndata obtained from hospital medical information systems for tens of thousands\nof Intensive Care Unit (ICU) patients. We achieved an area under the Receiver\noperating characteristic (ROC) curve of 0.82 in detection and prediction of\nCOPD exacerbation.\n","authors":["Negar Orangi-Fard"],"pdf_url":"https://arxiv.org/pdf/2408.13958v1.pdf","comment":"11 pages, 5 figures"},{"id":"http://arxiv.org/abs/2404.16168v3","updated":"2024-08-25T23:06:51Z","published":"2024-04-24T19:55:50Z","title":"The Over-Certainty Phenomenon in Modern UDA Algorithms","summary":" When neural networks are confronted with unfamiliar data that deviate from\ntheir training set, this signifies a domain shift. While these networks output\npredictions on their inputs, they typically fail to account for their level of\nfamiliarity with these novel observations. While prevailing works navigate\nunsupervised domain adaptation with the goal of curtailing model entropy, they\nunintentionally birth models that grapple with sub-optimal calibration - a\ndilemma we term the over-certainty phenomenon. 
In this paper, we uncover a\nconcerning trend in unsupervised domain adaptation and propose a solution that\nnot only maintains accuracy but also addresses calibration.\n","authors":["Fin Amin","Jung-Eun Kim"],"pdf_url":"https://arxiv.org/pdf/2404.16168v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.04895v2","updated":"2024-08-25T22:30:42Z","published":"2024-08-09T06:46:06Z","title":"Better Not to Propagate: Understanding Edge Uncertainty and\n Over-smoothing in Signed Graph Neural Networks","summary":" Traditional Graph Neural Networks (GNNs) rely on network homophily, which can\nlead to performance degradation due to over-smoothing in many real-world\nheterophily scenarios. Recent studies analyze the smoothing effect\n(separability) after message-passing (MP), depending on the expectation of node\nfeatures. Regarding separability gain, they provided theoretical backgrounds on\nover-smoothing caused by various propagation schemes, including positive,\nsigned, and blocked MPs. More recently, by extending these theorems, some works\nhave suggested improvements in signed propagation under multiple classes.\nHowever, prior works assume that the error ratio of all propagation schemes is\nfixed, failing to investigate this phenomenon correctly. To solve this problem,\nwe propose a novel method for estimating homophily and edge error ratio,\nintegrated with dynamic selection between blocked and signed propagation during\ntraining. Our theoretical analysis, supported by extensive experiments,\ndemonstrates that blocking MP can be more effective than signed propagation\nunder high edge error ratios, improving the performance in both homophilic and\nheterophilic graphs.\n","authors":["Yoonhyuk Choi","Jiho Choi","Taewook Ko","Chong-Kwon Kim"],"pdf_url":"https://arxiv.org/pdf/2408.04895v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.09505v3","updated":"2024-08-25T21:59:43Z","published":"2023-09-18T06:33:28Z","title":"Outlier-Insensitive Kalman Filtering: Theory and Applications","summary":" State estimation of dynamical systems from noisy observations is a\nfundamental task in many applications. It is commonly addressed using the\nlinear Kalman filter (KF), whose performance can significantly degrade in the\npresence of outliers in the observations, due to the sensitivity of its convex\nquadratic objective function. To mitigate such behavior, outlier detection\nalgorithms can be applied. In this work, we propose a parameter-free algorithm\nwhich mitigates the harmful effect of outliers while requiring only a short\niterative process of the standard update step of the KF. To that end, we model\neach potential outlier as a normal process with unknown variance and apply\nonline estimation through either expectation maximization or alternating\nmaximization algorithms. Simulations and field experiment evaluations\ndemonstrate competitive performance of our method, showcasing its robustness to\noutliers in filtering scenarios compared to alternative algorithms.\n","authors":["Shunit Truzman","Guy Revach","Nir Shlezinger","Itzik Klein"],"pdf_url":"https://arxiv.org/pdf/2309.09505v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.12856v3","updated":"2024-08-25T20:59:10Z","published":"2024-03-19T16:01:25Z","title":"Equivariant Ensembles and Regularization for Reinforcement Learning in\n Map-based Path Planning","summary":" In reinforcement learning (RL), exploiting environmental symmetries can\nsignificantly enhance efficiency, robustness, and performance. 
However,\nensuring that the deep RL policy and value networks are respectively\nequivariant and invariant to exploit these symmetries is a substantial\nchallenge. Related works try to design networks that are equivariant and\ninvariant by construction, limiting them to a very restricted library of\ncomponents, which in turn hampers the expressiveness of the networks. This\npaper proposes a method to construct equivariant policies and invariant value\nfunctions without specialized neural network components, which we term\nequivariant ensembles. We further add a regularization term for adding\ninductive bias during training. In a map-based path planning case study, we\nshow how equivariant ensembles and regularization benefit sample efficiency and\nperformance.\n","authors":["Mirco Theile","Hongpeng Cao","Marco Caccamo","Alberto L. Sangiovanni-Vincentelli"],"pdf_url":"https://arxiv.org/pdf/2403.12856v3.pdf","comment":"Accepted at IROS 2024. A video can be found here:\n https://youtu.be/L6NOdvU7n7s. The code is available at\n https://github.com/theilem/uavSim"},{"id":"http://arxiv.org/abs/2408.13934v1","updated":"2024-08-25T20:43:34Z","published":"2024-08-25T20:43:34Z","title":"Learning to Move Like Professional Counter-Strike Players","summary":" In multiplayer, first-person shooter games like Counter-Strike: Global\nOffensive (CS:GO), coordinated movement is a critical component of high-level\nstrategic play. However, the complexity of team coordination and the variety of\nconditions present in popular game maps make it impractical to author\nhand-crafted movement policies for every scenario. We show that it is possible\nto take a data-driven approach to creating human-like movement controllers for\nCS:GO. We curate a team movement dataset comprising 123 hours of professional\ngame play traces, and use this dataset to train a transformer-based movement\nmodel that generates human-like team movement for all players in a \"Retakes\"\nround of the game. Importantly, the movement prediction model is efficient.\nPerforming inference for all players takes less than 0.5 ms per game step\n(amortized cost) on a single CPU core, making it plausible for use in\ncommercial games today. Human evaluators assess that our model behaves more\nlike humans than both commercially-available bots and procedural movement\ncontrollers scripted by experts (16% to 59% higher by TrueSkill rating of\n\"human-like\"). Using experiments involving in-game bot vs. bot self-play, we\ndemonstrate that our model performs simple forms of teamwork, makes fewer\ncommon movement mistakes, and yields movement distributions, player lifetimes,\nand kill locations similar to those observed in professional CS:GO match play.\n","authors":["David Durst","Feng Xie","Vishnu Sarukkai","Brennan Shacklett","Iuri Frosio","Chen Tessler","Joohwan Kim","Carly Taylor","Gilbert Bernstein","Sanjiban Choudhury","Pat Hanrahan","Kayvon Fatahalian"],"pdf_url":"https://arxiv.org/pdf/2408.13934v1.pdf","comment":"The project website is at https://davidbdurst.com/mlmove/"},{"id":"http://arxiv.org/abs/2201.05760v4","updated":"2024-08-25T20:43:12Z","published":"2022-01-15T05:25:03Z","title":"Network Level Spatial Temporal Traffic State Forecasting with\n Hierarchical Attention LSTM (HierAttnLSTM)","summary":" Traffic state data, such as speed, volume and travel time collected from\nubiquitous traffic monitoring sensors require advanced network level analytics\nfor forecasting and identifying significant traffic patterns. 
This paper\nleverages diverse traffic state datasets from the Caltrans Performance\nMeasurement System (PeMS) hosted on the open benchmark and achieved promising\nperformance compared to well recognized spatial-temporal models. Drawing\ninspiration from the success of hierarchical architectures in various\nArtificial Intelligence (AI) tasks, we integrate cell and hidden states from\nlow-level to high-level Long Short-Term Memory (LSTM) networks with an\nattention pooling mechanism, similar to human perception systems. The developed\nhierarchical structure is designed to account for dependencies across different\ntime scales, capturing the spatial-temporal correlations of network-level\ntraffic states, enabling the prediction of traffic states for all corridors\nrather than a single link or route. The efficiency of designed attention-based\nLSTM is analyzed by ablation study. Comparative results with baseline LSTM\nmodels demonstrate that the Hierarchical Attention LSTM (HierAttnLSTM) model\nnot only provides higher prediction accuracy but also effectively forecasts\nunusual congestion patterns. Data and code are made publicly available to\nsupport reproducible scientific research.\n","authors":["Tianya Terry Zhang"],"pdf_url":"https://arxiv.org/pdf/2201.05760v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.19911v2","updated":"2024-08-25T20:16:51Z","published":"2024-07-29T11:39:22Z","title":"Efficient Shield Synthesis via State-Space Transformation","summary":" We consider the problem of synthesizing safety strategies for control\nsystems, also known as shields. Since the state space is infinite, shields are\ntypically computed over a finite-state abstraction, with the most common\nabstraction being a rectangular grid. However, for many systems, such a grid\ndoes not align well with the safety property or the system dynamics. That is\nwhy a coarse grid is rarely sufficient, but a fine grid is typically\ncomputationally infeasible to obtain. In this paper, we show that appropriate\nstate-space transformations can still allow to use a coarse grid at almost no\ncomputational overhead. We demonstrate in three case studies that our\ntransformation-based synthesis outperforms a standard synthesis by several\norders of magnitude. In the first two case studies, we use domain knowledge to\nselect a suitable transformation. In the third case study, we instead report on\nresults in engineering a transformation without domain knowledge.\n","authors":["Asger Horn Brorholt","Andreas Holck Høeg-Petersen","Kim Guldstrand Larsen","Christian Schilling"],"pdf_url":"https://arxiv.org/pdf/2407.19911v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13926v1","updated":"2024-08-25T19:51:27Z","published":"2024-08-25T19:51:27Z","title":"FedGlu: A personalized federated learning-based glucose forecasting\n algorithm for improved performance in glycemic excursion regions","summary":" Continuous glucose monitoring (CGM) devices provide real-time glucose\nmonitoring and timely alerts for glycemic excursions, improving glycemic\ncontrol among patients with diabetes. However, identifying rare events like\nhypoglycemia and hyperglycemia remain challenging due to their infrequency.\nMoreover, limited access to sensitive patient data hampers the development of\nrobust machine learning models. Our objective is to accurately predict glycemic\nexcursions while addressing data privacy concerns. 
To tackle excursion\nprediction, we propose a novel Hypo-Hyper (HH) loss function, which\nsignificantly improves performance in the glycemic excursion regions. The HH\nloss function demonstrates a 46% improvement over mean-squared error (MSE) loss\nacross 125 patients. To address privacy concerns, we propose FedGlu, a machine\nlearning model trained in a federated learning (FL) framework. FL allows\ncollaborative learning without sharing sensitive data by training models\nlocally and sharing only model parameters across other patients. FedGlu\nachieves a 35% superior glycemic excursion detection rate compared to local\nmodels. This improvement translates to enhanced performance in predicting both,\nhypoglycemia and hyperglycemia, for 105 out of 125 patients. These results\nunderscore the effectiveness of the proposed HH loss function in augmenting the\npredictive capabilities of glucose predictions. Moreover, implementing models\nwithin a federated learning framework not only ensures better predictive\ncapabilities but also safeguards sensitive data concurrently.\n","authors":["Darpit Dave","Kathan Vyas","Jagadish Kumaran Jayagopal","Alfredo Garcia","Madhav Erraguntla","Mark Lawley"],"pdf_url":"https://arxiv.org/pdf/2408.13926v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.05282v3","updated":"2024-08-25T19:02:37Z","published":"2024-05-07T12:34:18Z","title":"The Detection of KIC 1718360, A Rotating Variable with a Possible\n Companion, Using Machine Learning","summary":" This paper presents the detection of a periodic dimming event in the\nlightcurve of the G1.5IV-V type star KIC 1718360. This is based on\nvisible-light observations conducted by both the TESS and Kepler space\ntelescopes. Analysis of the data seems to point toward a high rotation rate in\nthe star, with a rotational period of 2.938 days. The high variability seen\nwithin the star's lightcurve points toward classification as a rotating\nvariable. The initial observation was made in Kepler Quarter 16 data using the\nOne-Class SVM machine learning method. Subsequent observations by the TESS\nspace telescope corroborated these findings. It appears that KIC 1718360 is a\nnearby rotating variable that appears in little to no major catalogs as such. A\nsecondary, additional periodic dip is also present, indicating a possible\nexoplanetary companion.\n","authors":["Jakob Roche"],"pdf_url":"https://arxiv.org/pdf/2405.05282v3.pdf","comment":"6 pages, 6 figures Revised to correct errors, update and add data"},{"id":"http://arxiv.org/abs/2408.13912v1","updated":"2024-08-25T18:27:20Z","published":"2024-08-25T18:27:20Z","title":"Splatt3R: Zero-shot Gaussian Splatting from Uncalibarated Image Pairs","summary":" In this paper, we introduce Splatt3R, a pose-free, feed-forward method for\nin-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given\nuncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without\nrequiring any camera parameters or depth information. For generalizability, we\nstart from a 'foundation' 3D geometry reconstruction method, MASt3R, and extend\nit to be a full 3D structure and appearance reconstructor. Specifically, unlike\nthe original MASt3R which reconstructs only 3D point clouds, we predict the\nadditional Gaussian attributes required to construct a Gaussian primitive for\neach point. Hence, unlike other novel view synthesis methods, Splatt3R is first\ntrained by optimizing the 3D point cloud's geometry loss, and then a novel view\nsynthesis objective. 
By doing this, we avoid the local minima present in\ntraining 3D Gaussian Splats from stereo views. We also propose a novel loss\nmasking strategy that we empirically find is critical for strong performance on\nextrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and\ndemonstrate excellent generalisation to uncalibrated, in-the-wild images.\nSplatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the\nresultant splats can be rendered in real-time.\n","authors":["Brandon Smart","Chuanxia Zheng","Iro Laina","Victor Adrian Prisacariu"],"pdf_url":"https://arxiv.org/pdf/2408.13912v1.pdf","comment":"Our project page can be found at: https://splatt3r.active.vision/"},{"id":"http://arxiv.org/abs/2308.06375v2","updated":"2024-08-25T18:04:21Z","published":"2023-08-11T20:17:22Z","title":"UAMM: Price-oracle based Automated Market Maker","summary":" Automated market makers (AMMs) are pricing mechanisms utilized by\ndecentralized exchanges (DEX). Traditional AMM approaches are constrained by\npricing solely based on their own liquidity pool, without consideration of\nexternal markets or risk management for liquidity providers. In this paper, we\npropose a new approach known as UBET AMM (UAMM), which calculates prices by\nconsidering external market prices and the impermanent loss of the liquidity\npool. Despite relying on external market prices, our method maintains the\ndesired properties of a constant product curve when computing slippages. The\nkey element of UAMM is determining the appropriate slippage amount based on the\ndesired target balance, which encourages the liquidity pool to minimize\nimpermanent loss. We demonstrate that our approach eliminates arbitrage\nopportunities when external market prices are efficient.\n","authors":["Daniel Jiwoong Im","Alexander Kondratskiy","Vincent Harvey","Hsuan-Wei Fu"],"pdf_url":"https://arxiv.org/pdf/2308.06375v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13906v1","updated":"2024-08-25T18:02:36Z","published":"2024-08-25T18:02:36Z","title":"ConVis: Contrastive Decoding with Hallucination Visualization for\n Mitigating Hallucinations in Multimodal Large Language Models","summary":" Hallucinations in Multimodal Large Language Models (MLLMs) where generated\nresponses fail to accurately reflect the given image pose a significant\nchallenge to their reliability. To address this, we introduce ConVis, a novel\ntraining-free contrastive decoding method. ConVis leverages a text-to-image\n(T2I) generation model to semantically reconstruct the given image from\nhallucinated captions. By comparing the contrasting probability distributions\nproduced by the original and reconstructed images, ConVis enables MLLMs to\ncapture visual contrastive signals that penalize hallucination generation.\nNotably, this method operates purely within the decoding process, eliminating\nthe need for additional data or model updates. Our extensive experiments on\nfive popular benchmarks demonstrate that ConVis effectively reduces\nhallucinations across various MLLMs, highlighting its potential to enhance\nmodel reliability.\n","authors":["Yeji Park","Deokyeong Lee","Junsuk Choe","Buru Chang"],"pdf_url":"https://arxiv.org/pdf/2408.13906v1.pdf","comment":"First two authors contributed equally. 
Source code is available at\n https://github.com/yejipark-m/ConVis"},{"id":"http://arxiv.org/abs/2407.10159v2","updated":"2024-08-25T17:59:22Z","published":"2024-07-14T10:59:34Z","title":"RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D\n LiDAR Segmentation","summary":" 3D point clouds play a pivotal role in outdoor scene perception, especially\nin the context of autonomous driving. Recent advancements in 3D LiDAR\nsegmentation often focus intensely on the spatial positioning and distribution\nof points for accurate segmentation. However, these methods, while robust in\nvariable conditions, encounter challenges due to sole reliance on coordinates\nand point intensity, leading to poor isometric invariance and suboptimal\nsegmentation. To tackle this challenge, our work introduces Range-Aware\nPointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg\narchitecture. Our RAPiD features exhibit rigid transformation invariance and\neffectively adapt to variations in point density, with a design focus on\ncapturing the localized geometry of neighboring structures. They utilize\ninherent LiDAR isotropic radiation and semantic categorization for enhanced\nlocal representation and computational efficiency, while incorporating a 4D\ndistance metric that integrates geometric and surface material reflectivity for\nimproved semantic segmentation. To effectively embed high-dimensional RAPiD\nfeatures, we propose a double-nested autoencoder structure with a novel\nclass-aware embedding objective to encode high-dimensional features into\nmanageable voxel-wise embeddings. Additionally, we propose RAPiD-Seg which\nincorporates a channel-wise attention fusion and two effective RAPiD-Seg\nvariants, further optimizing the embedding for enhanced performance and\ngeneralization. Our method outperforms contemporary LiDAR segmentation work in\nterms of mIoU on SemanticKITTI (76.1) and nuScenes (83.6) datasets.\n","authors":["Li Li","Hubert P. H. Shum","Toby P. Breckon"],"pdf_url":"https://arxiv.org/pdf/2407.10159v2.pdf","comment":"ECCV 2024 (Oral); 18 pages, 6 figures, 7 tables; Code at\n https://github.com/l1997i/rapid_seg"},{"id":"http://arxiv.org/abs/2408.13902v1","updated":"2024-08-25T17:59:17Z","published":"2024-08-25T17:59:17Z","title":"TraIL-Det: Transformation-Invariant Local Feature Networks for 3D LiDAR\n Object Detection with Unsupervised Pre-Training","summary":" 3D point clouds are essential for perceiving outdoor scenes, especially\nwithin the realm of autonomous driving. Recent advances in 3D LiDAR Object\nDetection focus primarily on the spatial positioning and distribution of points\nto ensure accurate detection. However, despite their robust performance in\nvariable conditions, these methods are hindered by their sole reliance on\ncoordinates and point intensity, resulting in inadequate isometric invariance\nand suboptimal detection outcomes. To tackle this challenge, our work\nintroduces Transformation-Invariant Local (TraIL) features and the associated\nTraIL-Det architecture. Our TraIL features exhibit rigid transformation\ninvariance and effectively adapt to variations in point density, with a design\nfocus on capturing the localized geometry of neighboring structures. They\nutilize the inherent isotropic radiation of LiDAR to enhance local\nrepresentation, improve computational efficiency, and boost detection\nperformance. 
To effectively process the geometric relations among points within\neach proposal, we propose a Multi-head self-Attention Encoder (MAE) with\nasymmetric geometric features to encode high-dimensional TraIL features into\nmanageable representations. Our method outperforms contemporary self-supervised\n3D object detection approaches in terms of mAP on KITTI (67.8, 20% label,\nmoderate) and Waymo (68.9, 20% label, moderate) datasets under various label\nratios (20%, 50%, and 100%).\n","authors":["Li Li","Tanqiu Qiao","Hubert P. H. Shum","Toby P. Breckon"],"pdf_url":"https://arxiv.org/pdf/2408.13902v1.pdf","comment":"BMVC 2024; 15 pages, 3 figures, 3 tables; Code at\n https://github.com/l1997i/rapid_seg"},{"id":"http://arxiv.org/abs/2405.14099v2","updated":"2024-08-25T17:35:33Z","published":"2024-05-23T02:01:05Z","title":"Automatic Differentiation is Essential in Training Neural Networks for\n Solving Differential Equations","summary":" Neural network-based approaches have recently shown significant promise in\nsolving partial differential equations (PDEs) in science and engineering,\nespecially in scenarios featuring complex domains or the incorporation of\nempirical data. One advantage of the neural network method for PDEs lies in its\nautomatic differentiation (AD), which necessitates only the sample points\nthemselves, unlike traditional finite difference (FD) approximations that\nrequire nearby local points to compute derivatives. In this paper, we\nquantitatively demonstrate the advantage of AD in training neural networks. The\nconcept of truncated entropy is introduced to characterize the training\nproperty. Specifically, through comprehensive experimental and theoretical\nanalyses conducted on random feature models and two-layer neural networks, we\ndiscover that the defined truncated entropy serves as a reliable metric for\nquantifying the residual loss of random feature models and the training speed\nof neural networks for both AD and FD methods. Our experimental and theoretical\nanalyses demonstrate that, from a training perspective, AD outperforms FD in\nsolving partial differential equations.\n","authors":["Chuqi Chen","Yahong Yang","Yang Xiang","Wenrui Hao"],"pdf_url":"https://arxiv.org/pdf/2405.14099v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.20891v2","updated":"2024-08-25T17:07:49Z","published":"2024-07-30T15:07:13Z","title":"Bayesian Low-Rank LeArning (Bella): A Practical Approach to Bayesian\n Neural Networks","summary":" Computational complexity of Bayesian learning is impeding its adoption in\npractical, large-scale tasks. Despite demonstrations of significant merits such\nas improved robustness and resilience to unseen or out-of-distribution inputs\nover their non- Bayesian counterparts, their practical use has faded to near\ninsignificance. In this study, we introduce an innovative framework to mitigate\nthe computational burden of Bayesian neural networks (BNNs). Our approach\nfollows the principle of Bayesian techniques based on deep ensembles, but\nsignificantly reduces their cost via multiple low-rank perturbations of\nparameters arising from a pre-trained neural network. Both vanilla version of\nensembles as well as more sophisticated schemes such as Bayesian learning with\nStein Variational Gradient Descent (SVGD), previously deemed impractical for\nlarge models, can be seamlessly implemented within the proposed framework,\ncalled Bayesian Low-Rank LeArning (Bella). 
In a nutshell, i) Bella achieves a\ndramatic reduction in the number of trainable parameters required to\napproximate a Bayesian posterior; and ii) it not only maintains, but in some\ninstances, surpasses the performance of conventional Bayesian learning methods\nand non-Bayesian baselines. Our results with large-scale tasks such as\nImageNet, CAMELYON17, DomainNet, VQA with CLIP, LLaVA demonstrate the\neffectiveness and versatility of Bella in building highly scalable and\npractical Bayesian deep models for real-world applications.\n","authors":["Bao Gia Doan","Afshar Shamsi","Xiao-Yu Guo","Arash Mohammadi","Hamid Alinejad-Rokny","Dino Sejdinovic","Damith C. Ranasinghe","Ehsan Abbasnejad"],"pdf_url":"https://arxiv.org/pdf/2407.20891v2.pdf","comment":"17 pages, 14 figures, 11 tables"}],"Multimedia":[{"id":"http://arxiv.org/abs/2408.13786v1","updated":"2024-08-25T09:29:20Z","published":"2024-08-25T09:29:20Z","title":"Localization of Synthetic Manipulations in Western Blot Images","summary":" Recent breakthroughs in deep learning and generative systems have\nsignificantly fostered the creation of synthetic media, as well as the local\nalteration of real content via the insertion of highly realistic synthetic\nmanipulations. Local image manipulation, in particular, poses serious\nchallenges to the integrity of digital content and societal trust. This problem\nis not only confined to multimedia data, but also extends to biological images\nincluded in scientific publications, like images depicting Western blots. In\nthis work, we address the task of localizing synthetic manipulations in Western\nblot images. To discriminate between pristine and synthetic pixels of an\nanalyzed image, we propose a synthetic detector that operates on small patches\nextracted from the image. We aggregate patch contributions to estimate a\ntampering heatmap, highlighting synthetic pixels out of pristine ones. Our\nmethodology proves effective when tested over two manipulated Western blot\nimage datasets, one altered automatically and the other manually by exploiting\nadvanced AI-based image manipulation tools that are unknown at our training\nstage. We also explore the robustness of our method over an external dataset of\nother scientific images depicting different semantics, manipulated through\nunseen generation techniques.\n","authors":["Anmol Manjunath","Viola Negroni","Sara Mandelli","Daniel Moreira","Paolo Bestagini"],"pdf_url":"https://arxiv.org/pdf/2408.13786v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13784v1","updated":"2024-08-25T09:28:04Z","published":"2024-08-25T09:28:04Z","title":"Analyzing the Impact of Splicing Artifacts in Partially Fake Speech\n Signals","summary":" Speech deepfake detection has recently gained significant attention within\nthe multimedia forensics community. Related issues have also been explored,\nsuch as the identification of partially fake signals, i.e., tracks that include\nboth real and fake speech segments. However, generating high-quality spliced\naudio is not as straightforward as it may appear. Spliced signals are typically\ncreated through basic signal concatenation. This process could introduce\nnoticeable artifacts that can make the generated data easier to detect. We\nanalyze spliced audio tracks resulting from signal concatenation, investigate\ntheir artifacts and assess whether such artifacts introduce any bias in\nexisting datasets. 
Our findings reveal that by analyzing splicing artifacts, we\ncan achieve a detection EER of 6.16% and 7.36% on PartialSpoof and HAD\ndatasets, respectively, without needing to train any detector. These results\nunderscore the complexities of generating reliable spliced audio data and lead\nto discussions that can help improve future research in this area.\n","authors":["Viola Negroni","Davide Salvi","Paolo Bestagini","Stefano Tubaro"],"pdf_url":"https://arxiv.org/pdf/2408.13784v1.pdf","comment":"Accepted at ASVspoof 5 Workshop (Interspeech2024 Satellite)"},{"id":"http://arxiv.org/abs/2404.13621v4","updated":"2024-08-25T06:13:24Z","published":"2024-04-21T11:21:27Z","title":"Attack on Scene Flow using Point Clouds","summary":" Deep neural networks have made significant advancements in accurately\nestimating scene flow using point clouds, which is vital for many applications\nlike video analysis, action recognition, and navigation. The robustness of\nthese techniques, however, remains a concern, particularly in the face of\nadversarial attacks that have been proven to deceive state-of-the-art deep\nneural networks in many domains. Surprisingly, the robustness of scene flow\nnetworks against such attacks has not been thoroughly investigated. To address\nthis problem, the proposed approach aims to bridge this gap by introducing\nadversarial white-box attacks specifically tailored for scene flow networks.\nExperimental results show that the generated adversarial examples obtain up to\n33.7 relative degradation in average end-point error on the KITTI and\nFlyingThings3D datasets. The study also reveals the significant impact that\nattacks targeting point clouds in only one dimension or color channel have on\naverage end-point error. Analyzing the success and failure of these attacks on\nthe scene flow networks and their 2D optical flow network variants shows a\nhigher vulnerability for the optical flow networks. Code is available at\nhttps://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git.\n","authors":["Haniyeh Ehsani Oskouie","Mohammad-Shahram Moin","Shohreh Kasaei"],"pdf_url":"https://arxiv.org/pdf/2404.13621v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13712v1","updated":"2024-08-25T03:21:48Z","published":"2024-08-25T03:21:48Z","title":"Riemann-based Multi-scale Attention Reasoning Network for Text-3D\n Retrieval","summary":" Due to the challenges in acquiring paired Text-3D data and the inherent\nirregularity of 3D data structures, combined representation learning of 3D\npoint clouds and text remains unexplored. In this paper, we propose a novel\nRiemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D\nretrieval. Specifically, the extracted text and point cloud features are\nrefined by their respective Adaptive Feature Refiner (AFR). Furthermore, we\nintroduce the innovative Riemann Local Similarity (RLS) module and the Global\nPooling Similarity (GPS) module. However, as 3D point cloud data and text data\noften possess complex geometric structures in high-dimensional space, the\nproposed RLS employs a novel Riemann Attention Mechanism to reflect the\nintrinsic geometric relationships of the data. Without explicitly defining the\nmanifold, RMARN learns the manifold parameters to better represent the\ndistances between text-point cloud samples. To address the challenges of\nlacking paired text-3D data, we have created the large-scale Text-3D Retrieval\ndataset T3DR-HIT, which comprises over 3,380 pairs of text and point cloud\ndata. 
T3DR-HIT contains coarse-grained indoor 3D scenes and fine-grained\nChinese artifact scenes, consisting of 1,380 and over 2,000 text-3D pairs,\nrespectively. Experiments on our custom datasets demonstrate the superior\nperformance of the proposed method. Our code and proposed datasets are\navailable at \\url{https://github.com/liwrui/RMARN}.\n","authors":["Wenrui Li","Wei Han","Yandu Chen","Yeyu Chai","Yidan Lu","Xingtao Wang","Xiaopeng Fan"],"pdf_url":"https://arxiv.org/pdf/2408.13712v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13711v1","updated":"2024-08-25T02:56:26Z","published":"2024-08-25T02:56:26Z","title":"SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with\n Panoramic Gaussian Splatting","summary":" Text-driven 3D scene generation has seen significant advancements recently.\nHowever, most existing methods generate single-view images using generative\nmodels and then stitch them together in 3D space. This independent generation\nfor each view often results in spatial inconsistency and implausibility in the\n3D scenes. To address this challenge, we proposed a novel text-driven\n3D-consistent scene generation model: SceneDreamer360. Our proposed method\nleverages a text-driven panoramic image generation model as a prior for 3D\nscene generation and employs 3D Gaussian Splatting (3DGS) to ensure consistency\nacross multi-view panoramic images. Specifically, SceneDreamer360 enhances the\nfine-tuned Panfusion generator with a three-stage panoramic enhancement,\nenabling the generation of high-resolution, detail-rich panoramic images.\nDuring the 3D scene construction, a novel point cloud fusion initialization\nmethod is used, producing higher quality and spatially consistent point clouds.\nOur extensive experiments demonstrate that compared to other methods,\nSceneDreamer360 with its panoramic image generation and 3DGS can produce higher\nquality, spatially consistent, and visually appealing 3D scenes from any text\nprompt. Our codes are available at\n\\url{https://github.com/liwrui/SceneDreamer360}.\n","authors":["Wenrui Li","Yapeng Mi","Fucheng Cai","Zhe Yang","Wangmeng Zuo","Xingtao Wang","Xiaopeng Fan"],"pdf_url":"https://arxiv.org/pdf/2408.13711v1.pdf","comment":null}]},"2024-08-24T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2408.13678v1","updated":"2024-08-24T22:03:40Z","published":"2024-08-24T22:03:40Z","title":"A layer-wise analysis of Mandarin and English suprasegmentals in SSL\n speech models","summary":" This study asks how self-supervised speech models represent suprasegmental\ncategories like Mandarin lexical tone, English lexical stress, and English\nphrasal accents. Through a series of probing tasks, we make layer-wise\ncomparisons of English and Mandarin 12 layer monolingual models. Our findings\nsuggest that 1) English and Mandarin wav2vec 2.0 models learn contextual\nrepresentations of abstract suprasegmental categories which are strongest in\nthe middle third of the network. 2) Models are better at representing features\nthat exist in the language of their training data, and this difference is\ndriven by enriched context in transformer blocks, not local acoustic\nrepresentation. 3) Fine-tuned wav2vec 2.0 improves performance in later layers\ncompared to pre-trained models mainly for lexically contrastive features like\ntone and stress, 4) HuBERT and WavLM learn similar representations to wav2vec\n2.0, differing mainly in later layer performance. 
Our results extend previous\nunderstanding of how models represent suprasegmentals and offer new insights\ninto the language-specificity and contextual nature of these representations.\n","authors":["Antón de la Fuente","Dan Jurafsky"],"pdf_url":"https://arxiv.org/pdf/2408.13678v1.pdf","comment":"4 pages, 3 figures, to be published in Interspeech 2024 proceedings"},{"id":"http://arxiv.org/abs/2310.06830v2","updated":"2024-08-24T21:30:00Z","published":"2023-10-10T17:57:45Z","title":"Lemur: Harmonizing Natural Language and Code for Language Agents","summary":" We introduce Lemur and Lemur-Chat, openly accessible language models\noptimized for both natural language and coding capabilities to serve as the\nbackbone of versatile language agents. The evolution from language chat models\nto functional language agents demands that models not only master human\ninteraction, reasoning, and planning but also ensure grounding in the relevant\nenvironments. This calls for a harmonious blend of language and coding\ncapabilities in the models. Lemur and Lemur-Chat are proposed to address this\nnecessity, demonstrating balanced proficiencies in both domains, unlike\nexisting open-source models that tend to specialize in either. Through\nmeticulous pre-training using a code-intensive corpus and instruction\nfine-tuning on text and code data, our models achieve state-of-the-art averaged\nperformance across diverse text and coding benchmarks among open-source models.\nComprehensive experiments demonstrate Lemur's superiority over existing\nopen-source models and its proficiency across various agent tasks involving\nhuman communication, tool usage, and interaction under fully- and partially-\nobservable environments. The harmonization between natural and programming\nlanguages enables Lemur-Chat to significantly narrow the gap with proprietary\nmodels on agent abilities, providing key insights into developing advanced\nopen-source agents adept at reasoning, planning, and operating seamlessly\nacross environments. https://github.com/OpenLemur/Lemur\n","authors":["Yiheng Xu","Hongjin Su","Chen Xing","Boyu Mi","Qian Liu","Weijia Shi","Binyuan Hui","Fan Zhou","Yitao Liu","Tianbao Xie","Zhoujun Cheng","Siheng Zhao","Lingpeng Kong","Bailin Wang","Caiming Xiong","Tao Yu"],"pdf_url":"https://arxiv.org/pdf/2310.06830v2.pdf","comment":"ICLR 2024 Spotlight; https://github.com/OpenLemur/Lemur"},{"id":"http://arxiv.org/abs/2401.13463v3","updated":"2024-08-24T20:28:38Z","published":"2024-01-24T14:08:38Z","title":"SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken\n Question Answering","summary":" Spoken Question Answering (SQA) is essential for machines to reply to user's\nquestion by finding the answer span within a given spoken passage. SQA has been\npreviously achieved without ASR to avoid recognition errors and\nOut-of-Vocabulary (OOV) problems. However, the real-world problem of\nOpen-domain SQA (openSQA), in which the machine needs to first retrieve\npassages that possibly contain the answer from a spoken archive in addition,\nwas never considered. This paper proposes the first known end-to-end framework,\nSpeech Dense Passage Retriever (SpeechDPR), for the retrieval component of the\nopenSQA problem. SpeechDPR learns a sentence-level semantic representation by\ndistilling knowledge from the cascading model of unsupervised ASR (UASR) and\ntext dense retriever (TDR). 
No manually transcribed speech data is needed.\nInitial experiments showed performance comparable to the cascading model of\nUASR and TDR, and significantly better when UASR was poor, verifying this\napproach is more robust to speech recognition errors.\n","authors":["Chyi-Jiunn Lin","Guan-Ting Lin","Yung-Sung Chuang","Wei-Lun Wu","Shang-Wen Li","Abdelrahman Mohamed","Hung-yi Lee","Lin-shan Lee"],"pdf_url":"https://arxiv.org/pdf/2401.13463v3.pdf","comment":"Accepted at ICASSP 2024"},{"id":"http://arxiv.org/abs/2408.09172v3","updated":"2024-08-24T20:26:43Z","published":"2024-08-17T11:33:23Z","title":"Unc-TTP: A Method for Classifying LLM Uncertainty to Improve In-Context\n Example Selection","summary":" Nowadays, Large Language Models (LLMs) have demonstrated exceptional\nperformance across various downstream tasks. However, it is challenging for\nusers to discern whether the responses are generated with certainty or are\nfabricated to meet user expectations. Estimating the uncertainty of LLMs is\nparticularly challenging due to their vast scale and the lack of white-box\naccess. In this work, we propose a novel Uncertainty Tripartite Testing\nParadigm (Unc-TTP) to classify LLM uncertainty, via evaluating the consistency\nof LLM outputs when incorporating label interference into the sampling-based\napproach. Based on Unc-TTP outputs, we aggregate instances into certain and\nuncertain categories. Further, we conduct a detailed analysis of the\nuncertainty properties of LLMs and show Unc-TTP's superiority over the existing\nsampling-based methods. In addition, we leverage the obtained uncertainty\ninformation to guide in-context example selection, demonstrating that Unc-TTP\nobviously outperforms retrieval-based and sampling-based approaches in\nselecting more informative examples. Our work paves a new way to classify the\nuncertainty of both open- and closed-source LLMs, and introduces a practical\napproach to exploit this uncertainty to improve LLMs performance.\n","authors":["Hsiu-Yuan Huang","Zichen Wu","Yutong Yang","Junzhao Zhang","Yunfang Wu"],"pdf_url":"https://arxiv.org/pdf/2408.09172v3.pdf","comment":"The model diagram in Figure 1 on page 3 of the paper has significant\n ambiguities. It may lead readers to mistakenly believe that the experiments\n were conducted in a multi-turn dialogue format. Therefore, we request the\n withdrawal of this submission"},{"id":"http://arxiv.org/abs/2408.13654v1","updated":"2024-08-24T19:11:54Z","published":"2024-08-24T19:11:54Z","title":"Symbolic Working Memory Enhances Language Models for Complex Rule\n Application","summary":" Large Language Models (LLMs) have shown remarkable reasoning performance but\nstruggle with multi-step deductive reasoning involving a series of rule\napplication steps, especially when rules are presented non-sequentially. Our\npreliminary analysis shows that while LLMs excel in single-step rule\napplication, their performance drops significantly in multi-step scenarios due\nto the challenge in rule grounding. It requires anchoring the applicable rule\nand supporting facts at each step, amidst multiple input rules, facts, and\ninferred facts. To address this, we propose augmenting LLMs with external\nworking memory and introduce a neurosymbolic framework for rule application.\nThe memory stores facts and rules in both natural language and symbolic forms,\nenabling precise tracking. Utilizing this memory, our framework iteratively\nperforms symbolic rule grounding and LLM-based rule implementation. 
The former\nmatches predicates and variables of symbolic rules and facts to ground\napplicable rules at each step. Experiments indicate our framework's\neffectiveness in rule application and its robustness across various steps and\nsettings~\\footnote{Code and data are available at\n\\url{https://github.com/SiyuanWangw/RuleApplication}.}.\n","authors":["Siyuan Wang","Zhongyu Wei","Yejin Choi","Xiang Ren"],"pdf_url":"https://arxiv.org/pdf/2408.13654v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13651v1","updated":"2024-08-24T18:51:47Z","published":"2024-08-24T18:51:47Z","title":"Narratives at Conflict: Computational Analysis of News Framing in\n Multilingual Disinformation Campaigns","summary":" Any report frames issues to favor a particular interpretation by highlighting\nor excluding certain aspects of a story. Despite the widespread use of framing\nin disinformation, framing properties and detection methods remain\nunderexplored outside the English-speaking world. We explore how multilingual\nframing of the same issue differs systematically. We use eight years of\nRussia-backed disinformation campaigns, spanning 8k news articles in 4\nlanguages targeting 15 countries. We find that disinformation campaigns\nconsistently and intentionally favor specific framing, depending on the target\nlanguage of the audience. We further discover how Russian-language articles\nconsistently highlight selected frames depending on the region of the media\ncoverage. We find that the two most prominent models for automatic frame\nanalysis underperform and show high disagreement, highlighting the need for\nfurther research.\n","authors":["Antonina Sinelnik","Dirk Hovy"],"pdf_url":"https://arxiv.org/pdf/2408.13651v1.pdf","comment":"Published in ACL SRW 2024 Proceedings, see\n https://aclanthology.org/2024.acl-srw.21/"},{"id":"http://arxiv.org/abs/2408.13631v1","updated":"2024-08-24T17:17:46Z","published":"2024-08-24T17:17:46Z","title":"Ancient but Digitized: Developing Handwritten Optical Character\n Recognition for East Syriac Script Through Creating KHAMIS Dataset","summary":" Many languages have vast amounts of handwritten texts, such as ancient\nscripts about folktale stories and historical narratives or contemporary\ndocuments and letters. Digitization of those texts has various applications,\nsuch as daily tasks, cultural studies, and historical research. Syriac is an\nancient, endangered, and low-resourced language that has not received the\nattention it requires and deserves. This paper reports on a research project\naimed at developing a optical character recognition (OCR) model based on the\nhandwritten Syriac texts as a starting point to build more digital services for\nthis endangered language. A dataset was created, KHAMIS (inspired by the East\nSyriac poet, Khamis bar Qardahe), which consists of handwritten sentences in\nthe East Syriac script. We used it to fine-tune the Tesseract-OCR engine's\npretrained Syriac model on handwritten data. The data was collected from\nvolunteers capable of reading and writing in the language to create KHAMIS.\nKHAMIS currently consists of 624 handwritten Syriac sentences collected from 31\nuniversity students and one professor, and it will be partially available\nonline and the whole dataset available in the near future for development and\nresearch purposes. 
As a result, the handwritten OCR model was able to achieve a\ncharacter error rate of 1.097-1.610% and 8.963-10.490% on both training and\nevaluation sets, respectively, and both a character error rate of 18.89-19.71%\nand a word error rate of 62.83-65.42% when evaluated on the test set, which is\ntwice as better than the default Syriac model of Tesseract.\n","authors":["Ameer Majeed","Hossein Hassani"],"pdf_url":"https://arxiv.org/pdf/2408.13631v1.pdf","comment":"15 pages, 12 figures, 5 tables"},{"id":"http://arxiv.org/abs/2407.09817v2","updated":"2024-08-24T17:01:19Z","published":"2024-07-13T09:28:24Z","title":"Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech\n Recognition System","summary":" Multi-talker speech recognition and target-talker speech recognition, both\ninvolve transcription in multi-talker contexts, remain significant challenges.\nHowever, existing methods rarely attempt to simultaneously address both tasks.\nIn this study, we propose a pioneering approach to empower Whisper, which is a\nspeech foundation model, to tackle joint multi-talker and target-talker speech\nrecognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar\nseparator into its encoder to separate mixed embedding for multiple talkers;\n(ii) a Target Talker Identifier is introduced to identify the embedding flow of\nthe target talker on the fly, requiring only three-second enrollment speech as\na cue; (iii) soft prompt tuning for decoder is explored for better task\nadaptation. Our method outperforms previous methods on two- and three-talker\nLibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable\nzero-shot performance on multi-talker ASR on AishellMix Mandarin dataset.\n","authors":["Lingwei Meng","Jiawen Kang","Yuejiao Wang","Zengrui Jin","Xixin Wu","Xunying Liu","Helen Meng"],"pdf_url":"https://arxiv.org/pdf/2407.09817v2.pdf","comment":"Accepted to INTERSPEECH 2024"},{"id":"http://arxiv.org/abs/2408.13624v1","updated":"2024-08-24T16:35:00Z","published":"2024-08-24T16:35:00Z","title":"No Dataset Needed for Downstream Knowledge Benchmarking: Response\n Dispersion Inversely Correlates with Accuracy on Domain-specific QA","summary":" This research seeks to obviate the need for creating QA datasets and grading\n(chatbot) LLM responses when comparing LLMs' knowledge in specific topic\ndomains. This is done in an entirely end-user centric way without need for\naccess to any inner workings of the LLM, so long as it can be prompted and\ngiven a random seed to create different generations to the same prompt. The\npaper does this by, for a given topic domain, defining the \"response\ndispersion\" of an LLM by repeatedly asking an LLM the same opinion question\nabout that topic domain. Namely, the response dispersion is the count of\nsingular values needed to explain 95% of the variance in the embedding matrix\nof the LLM's responses. It is found that the response dispersion is inversely\ncorrelated with accuracy on relevant QA evaluations (average spearman rank\ncorrelation stronger than -.59). A use-case analysis shows that when comparing\ntwo different LLMs on the same topic domain, comparing their response\ndispersion is a suitable replacement for comparing their QA accuracy between\n74% and 89% of the time, the range depending on certain reasonable\naccuracy-difference tolerances that may be acceptable to an end-user in\nexchange for the labor being saved using response dispersion instead of QA\naccuracy for comparison. 
Two response embeddings are studied for creating the\nembedding matrix in this study, one is from OpenAI's APIs and one is a novel\nembedding, here named reference sentence similarity embeddings, that can be\ncomputed locally and performs very nearly as well in calculating response\ndispersion. Also in this research, a pre-existing dataset called the IRC-Wiki\nTrivia dataset, originally developed for trivia games, has been re-purposed,\ncurated, and the curation, called IRC-WikiTriviaQA, is made available for the\npurpose of this research.\n","authors":["Robert L Simione II"],"pdf_url":"https://arxiv.org/pdf/2408.13624v1.pdf","comment":"16 pages, 3 tables, 1 figure"},{"id":"http://arxiv.org/abs/2402.02563v4","updated":"2024-08-24T14:46:55Z","published":"2024-02-04T16:45:01Z","title":"Synergy-of-Thoughts: Eliciting Efficient Reasoning in Hybrid Language\n Models","summary":" Large language models (LLMs) have shown impressive emergent abilities in a\nwide range of tasks, but the associated expensive API cost greatly limits the\nreal application. Previous works like chain-of-thought (CoT) and\ntree-of-thoughts (ToT) have predominately focused on enhancing accuracy, but\noverlook the rapidly increasing API cost, which could be particularly\nproblematic for open-ended real-world tasks with huge solution spaces.\nMotivated by the dual process theory of human cognition, we propose \"Synergy of\nThoughts\"(SoT) to unleash the synergistic potential of hybrid LLMs with\ndifferent scales for efficient reasoning. By default, SoT uses smaller-scale\nlanguage models to generate multiple low-cost intuitive thoughts, which\nresembles the parallel intuitions produced by System 1. We then design a\nconfidence evaluator where the intuitive thoughts are cross-evaluated and\nintroduce a controllable threshold mechanism to decide their mutual conflict.\nIf these intuitive thoughts exhibit conflicts, SoT will invoke the reflective\nreasoning of scaled-up language models to emulate the intervention of System 2,\nwhich will override the intuitive thoughts and rectify the reasoning results.\nThis framework is model-agnostic and training-free, which can be flexibly\nimplemented with various off-the-shelf LLMs. Experiments on six representative\nreasoning tasks show that SoT substantially reduces the API cost by\n38.3%-75.1%, and simultaneously achieves state-of-the-art reasoning accuracy\nand solution diversity. Notably, the average token cost reduction on open-ended\ntasks reaches up to 69.1%.\n","authors":["Yu Shang","Yu Li","Fengli Xu","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2402.02563v4.pdf","comment":"19 pages, 16 figures, 12 tables"},{"id":"http://arxiv.org/abs/2407.12725v2","updated":"2024-08-24T14:44:11Z","published":"2024-07-17T16:42:03Z","title":"Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language\n Models?","summary":" Elaborating a series of intermediate reasoning steps significantly improves\nthe ability of large language models (LLMs) to solve complex problems, as such\nsteps would evoke LLMs to think sequentially. However, human sarcasm\nunderstanding is often considered an intuitive and holistic cognitive process,\nin which various linguistic, contextual, and emotional cues are integrated to\nform a comprehensive understanding, in a way that does not necessarily follow a\nstep-by-step fashion. 
To verify the validity of this argument, we introduce a\nnew prompting framework (called SarcasmCue) containing four sub-methods, viz.\nchain of contradiction (CoC), graph of cues (GoC), bagging of cues (BoC) and\ntensor of cues (ToC), which elicits LLMs to detect human sarcasm by considering\nsequential and non-sequential prompting methods. Through a comprehensive\nempirical comparison on four benchmarks, we highlight three key findings: (1)\nCoC and GoC show superior performance with more advanced models like GPT-4 and\nClaude 3.5, with an improvement of 3.5%. (2) ToC significantly outperforms\nother methods when smaller LLMs are evaluated, boosting the F1 score by 29.7%\nover the best baseline. (3) Our proposed framework consistently pushes the\nstate-of-the-art (i.e., ToT) by 4.2%, 2.0%, 29.7%, and 58.2% in F1 scores\nacross four datasets. This demonstrates the effectiveness and stability of the\nproposed framework.\n","authors":["Ben Yao","Yazhou Zhang","Qiuchi Li","Jing Qin"],"pdf_url":"https://arxiv.org/pdf/2407.12725v2.pdf","comment":"9 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.13586v1","updated":"2024-08-24T14:14:32Z","published":"2024-08-24T14:14:32Z","title":"Balancing Diversity and Risk in LLM Sampling: How to Select Your Method\n and Parameter for Open-Ended Text Generation","summary":" Sampling-based decoding strategies have been widely adopted for Large\nLanguage Models (LLMs) in numerous applications, which target a balance between\ndiversity and quality via temperature tuning and tail truncation (e.g., top-k\nand top-p sampling). Considering the high dynamic range of the candidate\nnext-token given different prefixes, recent studies propose to adaptively\ntruncate the tail of LLM's predicted distribution. Although improved results\nhaven been reported with these methods on open-ended text generation tasks, the\nresults are highly dependent on the curated truncation parameters and exemplar\ntext. In this paper, we propose a systematic way to estimate the intrinsic\ncapacity of a truncation sampling method by considering the trade-off between\ndiversity and risk at each decoding step, based on our collected prefix tree\nwhich preserves the context of a full sentence. Our work provides a\ncomprehensive comparison between existing truncation sampling methods, as well\nas their recommended parameters as a guideline for users.\n","authors":["Yuxuan Zhou","Margret Keuper","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2408.13586v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13585v1","updated":"2024-08-24T13:59:41Z","published":"2024-08-24T13:59:41Z","title":"FLEURS-ASL: Including American Sign Language in Massively Multilingual\n Multitask Evaluation","summary":" Sign language translation has historically been peripheral to mainstream\nmachine translation research. In order to help converge the fields, we\nintroduce FLEURS-ASL, an extension of the multiway parallel benchmarks FLORES\n(for text) and FLEURS (for speech) to support their first sign language (as\nvideo), American Sign Language, translated by 5 Certified Deaf Interpreters.\nFLEURS-ASL can be used to evaluate a variety of tasks -- primarily sentence-\nand discourse-level translation -- between ASL and 200 other languages as text,\nor 102 languages as speech. We provide baselines for tasks from ASL to English\ntext using a unified modeling approach that incorporates timestamp tokens and\nprevious text tokens in a 34-second context window, trained on random video\nclips from YouTube-ASL. 
This model meets or exceeds the performance of\nphrase-level baselines while supporting a multitude of new tasks. We also use\nFLEURS-ASL to show that multimodal frontier models have virtually no\nunderstanding of ASL, underscoring the importance of including sign languages\nin standard evaluation suites.\n","authors":["Garrett Tanzer"],"pdf_url":"https://arxiv.org/pdf/2408.13585v1.pdf","comment":"Access FLEURS-ASL at\n https://www.kaggle.com/datasets/googleai/fleurs-asl. arXiv admin note: text\n overlap with arXiv:2408.07065"},{"id":"http://arxiv.org/abs/2408.08688v2","updated":"2024-08-24T12:34:01Z","published":"2024-08-16T12:01:55Z","title":"The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic\n Preference Optimization Dataset Generation","summary":" This paper presents synthetic Preference Optimization (PO) datasets generated\nusing multi-agent workflows and evaluates the effectiveness and potential of\nthese workflows in the dataset generation process. PO dataset generation\nrequires two modules: (1) response evaluation, and (2) response generation. In\nthe response evaluation module, the responses from Large Language Models (LLMs)\nare evaluated and ranked - a task typically carried out by human annotators\nthat we automate using LLMs. We assess the response evaluation module in a 2\nstep process. In step 1, we assess LLMs as evaluators using three distinct\nprompting strategies. In step 2, we apply the winning prompting strategy to\ncompare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. In\neach step, we use inter-rater agreement using Cohen's Kappa between human\nannotators and LLMs. For the response generation module, we compare different\nconfigurations for the LLM Feedback Loop using the identified LLM evaluator\nconfiguration. We use the win rate (the fraction of times a generation\nframework is selected as the best by an LLM evaluator) to determine the best\nmulti-agent configuration for generation. After identifying the best\nconfigurations for both modules, we use models from the GPT, Gemma, and Llama\nfamilies to generate our PO datasets using the above pipeline. We generate two\ntypes of PO datasets, one to improve the generation capabilities of individual\nLLM and the other to improve the multi-agent workflow. Our evaluation shows\nthat GPT-4o-as-a-Judge is more consistent across datasets when the candidate\nresponses do not include responses from the GPT family. Additionally, we find\nthat the LLM Feedback Loop, with Llama as the generator and Gemma as the\nreviewer, achieves a notable 71.8% and 73.8% win rate over single-agent Llama\nand Gemma, respectively.\n","authors":["Samee Arif","Sualeha Farid","Abdul Hameed Azeemi","Awais Athar","Agha Ali Raza"],"pdf_url":"https://arxiv.org/pdf/2408.08688v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15806v4","updated":"2024-08-24T12:01:30Z","published":"2023-09-27T17:29:41Z","title":"Lyra: Orchestrating Dual Correction in Automated Theorem Proving","summary":" Large Language Models (LLMs) present an intriguing avenue for exploration in\nthe field of formal theorem proving. Nevertheless, their full potential,\nparticularly concerning the mitigation of hallucinations and refinement through\nprover error messages, remains an area that has yet to be thoroughly\ninvestigated. To enhance the effectiveness of LLMs in the field, we introduce\nthe Lyra, a new framework that employs two distinct correction mechanisms: Tool\nCorrection (TC) and Conjecture Correction (CC). 
To implement Tool Correction in\nthe post-processing of formal proofs, we leverage prior knowledge to utilize\npredefined prover tools (e.g., Sledgehammer) for guiding the replacement of\nincorrect tools. Tool Correction significantly contributes to mitigating\nhallucinations, thereby improving the overall accuracy of the proof. In\naddition, we introduce Conjecture Correction, an error feedback mechanism\ndesigned to interact with prover to refine formal proof conjectures with prover\nerror messages. Compared to the previous refinement framework, the proposed\nConjecture Correction refines generation with instruction but does not collect\npaired (generation, error & refinement) prompts. Our method has achieved\nstate-of-the-art (SOTA) performance on both miniF2F validation (48.0% -> 55.3%)\nand test (45.5% -> 51.2%). We also present 3 IMO problems solved by Lyra. We\nbelieve Tool Correction (post-process for hallucination mitigation) and\nConjecture Correction (subgoal adjustment from interaction with environment)\ncould provide a promising avenue for future research in this field.\n","authors":["Chuanyang Zheng","Haiming Wang","Enze Xie","Zhengying Liu","Jiankai Sun","Huajian Xin","Jianhao Shen","Zhenguo Li","Yu Li"],"pdf_url":"https://arxiv.org/pdf/2309.15806v4.pdf","comment":"Accepted to TMLR: https://openreview.net/forum?id=9Z0yB8rmQ2"},{"id":"http://arxiv.org/abs/2408.13545v1","updated":"2024-08-24T10:34:20Z","published":"2024-08-24T10:34:20Z","title":"IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question\n Answering","summary":" To evaluate Large Language Models (LLMs) for question answering (QA),\ntraditional methods typically focus on directly assessing the immediate\nresponses generated by the models based on the given question and context. In\nthe common use case of humans seeking AI assistant's help in finding\ninformation, these non-interactive evaluations do not account for the dynamic\nnature of human-model conversations, and interaction-aware evaluations have\nshown that accurate QA models are preferred by humans (Lee et al., 2023).\nRecent works in human-computer interaction (HCI) have employed human evaluators\nto conduct interactions and evaluations, but they are often prohibitively\nexpensive and time-consuming to scale. In this work, we introduce an automatic\nevaluation framework IQA-EVAL to Interactive Question Answering Evaluation.\nMore specifically, we introduce LLM-based Evaluation Agent (LEA) that can: (1)\nsimulate human behaviors to generate interactions with IQA models; (2)\nautomatically evaluate the generated interactions. Moreover, we propose\nassigning personas to LEAs to better simulate groups of real human evaluators.\nWe show that: (1) our evaluation framework with GPT-4 (or Claude) as the\nbackbone model achieves a high correlation with human evaluations on the IQA\ntask; (2) assigning personas to LEA to better represent the crowd further\nsignificantly improves correlations. 
Finally, we use our automatic metric to\nevaluate five recent representative LLMs with over 1000 questions from complex\nand ambiguous question answering tasks, which comes with a substantial cost of\n$5k if evaluated by humans.\n","authors":["Ruosen Li","Barry Wang","Ruochen Li","Xinya Du"],"pdf_url":"https://arxiv.org/pdf/2408.13545v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.07536v3","updated":"2024-08-24T09:59:31Z","published":"2023-11-13T18:22:32Z","title":"A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual\n Question Answering","summary":" The emergence of multimodal large models (MLMs) has significantly advanced\nthe field of visual understanding, offering remarkable capabilities in the\nrealm of visual question answering (VQA). Yet, the true challenge lies in the\ndomain of knowledge-intensive VQA tasks, which necessitate not just recognition\nof visual elements, but also a deep comprehension of the visual information in\nconjunction with a vast repository of learned knowledge. To uncover such\ncapabilities of MLMs, particularly the newly introduced GPT-4V and Gemini, we\nprovide an in-depth evaluation from three perspectives: 1) Commonsense\nKnowledge, which assesses how well models can understand visual cues and\nconnect to general knowledge; 2) Fine-grained World Knowledge, which tests the\nmodel's skill in reasoning out specific knowledge from images, showcasing their\nproficiency across various specialized fields; 3) Comprehensive Knowledge with\nDecision-making Rationales, which examines model's capability to provide\nlogical explanations for its inference, facilitating a deeper analysis from the\ninterpretability perspective. Additionally, we utilize a visual\nknowledge-enhanced training strategy and multimodal retrieval-augmented\ngeneration approach to enhance MLMs, highlighting the future need for\nadvancements in this research direction. Extensive experiments indicate that:\na) GPT-4V demonstrates enhanced explanation generation when using composite\nimages as few-shots; b) GPT-4V and other MLMs produce severe hallucinations\nwhen dealing with world knowledge; c) Visual knowledge enhanced training and\nprompting technicals present potential to improve performance. Codes:\nhttps://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper\n","authors":["Yunxin Li","Longyue Wang","Baotian Hu","Xinyu Chen","Wanqi Zhong","Chenyang Lyu","Wei Wang","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.07536v3.pdf","comment":"20 pages, 15 pages; technical paper"},{"id":"http://arxiv.org/abs/2408.13534v1","updated":"2024-08-24T09:25:18Z","published":"2024-08-24T09:25:18Z","title":"Cultural Adaptation of Menus: A Fine-Grained Approach","summary":" Machine Translation of Culture-Specific Items (CSIs) poses significant\nchallenges. Recent work on CSI translation has shown some success using Large\nLanguage Models (LLMs) to adapt to different languages and cultures; however, a\ndeeper analysis is needed to examine the benefits and pitfalls of each method.\nIn this paper, we introduce the ChineseMenuCSI dataset, the largest for\nChinese-English menu corpora, annotated with CSI vs Non-CSI labels and a\nfine-grained test set. 
We define three levels of CSI figurativeness for a more\nnuanced analysis and develop a novel methodology for automatic CSI\nidentification, which outperforms GPT-based prompts in most categories.\nImportantly, we are the first to integrate human translation theories into\nLLM-driven translation processes, significantly improving translation accuracy,\nwith COMET scores increasing by up to 7 points.\n","authors":["Zhonghe Zhang","Xiaoyu He","Vivek Iyer","Alexandra Birch"],"pdf_url":"https://arxiv.org/pdf/2408.13534v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13533v1","updated":"2024-08-24T09:23:01Z","published":"2024-08-24T09:23:01Z","title":"Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the\n Role of RAG Noise in Large Language Models","summary":" Retrieval-Augmented Generation (RAG) has emerged as a crucial method for\naddressing hallucinations in large language models (LLMs). While recent\nresearch has extended RAG models to complex noisy scenarios, these explorations\noften confine themselves to limited noise types and presuppose that noise is\ninherently detrimental to LLMs, potentially deviating from real-world retrieval\nenvironments and restricting practical applicability. In this paper, we define\nseven distinct noise types from a linguistic perspective and establish a Noise\nRAG Benchmark (NoiserBench), a comprehensive evaluation framework encompassing\nmultiple datasets and reasoning tasks. Through empirical evaluation of eight\nrepresentative LLMs with diverse architectures and scales, we reveal that these\nnoises can be further categorized into two practical groups: noise that is\nbeneficial to LLMs (aka beneficial noise) and noise that is harmful to LLMs\n(aka harmful noise). While harmful noise generally impairs performance,\nbeneficial noise may enhance several aspects of model capabilities and overall\nperformance. Our analysis offers insights for developing more robust, adaptable\nRAG solutions and mitigating hallucinations across diverse retrieval scenarios.\n","authors":["Jinyang Wu","Feihu Che","Chuyuan Zhang","Jianhua Tao","Shuai Zhang","Pengpeng Shao"],"pdf_url":"https://arxiv.org/pdf/2408.13533v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12881v2","updated":"2024-08-24T09:11:13Z","published":"2023-12-20T09:45:44Z","title":"Big Tech influence over AI research revisited: memetic analysis of\n attribution of ideas to affiliation","summary":" There exists a growing discourse around the domination of Big Tech on the\nlandscape of artificial intelligence (AI) research, yet our comprehension of\nthis phenomenon remains cursory. This paper aims to broaden and deepen our\nunderstanding of Big Tech's reach and power within AI research. It highlights\nthe dominance not merely in terms of sheer publication volume but rather in the\npropagation of new ideas or memes. Current studies often oversimplify the\nconcept of influence to the share of affiliations in academic papers, typically\nsourced from limited databases such as arXiv or specific academic conferences.\n The main goal of this paper is to unravel the specific nuances of such\ninfluence, determining which AI ideas are predominantly driven by Big Tech\nentities. By employing network and memetic analysis on AI-oriented paper\nabstracts and their citation network, we are able to grasp a deeper insight\ninto this phenomenon. 
By utilizing two databases: OpenAlex and S2ORC, we are\nable to perform such analysis on a much bigger scale than previous attempts.\n Our findings suggest that while Big Tech-affiliated papers are\ndisproportionately more cited in some areas, the most cited papers are those\naffiliated with both Big Tech and Academia. Focusing on the most contagious\nmemes, their attribution to specific affiliation groups (Big Tech, Academia,\nmixed affiliation) seems equally distributed between those three groups. This\nsuggests that the notion of Big Tech domination over AI research is\noversimplified in the discourse.\n","authors":["Stanisław Giziński","Paulina Kaczyńska","Hubert Ruczyński","Emilia Wiśnios","Bartosz Pieliński","Przemysław Biecek","Julian Sienkiewicz"],"pdf_url":"https://arxiv.org/pdf/2312.12881v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13521v1","updated":"2024-08-24T08:50:25Z","published":"2024-08-24T08:50:25Z","title":"HRGraph: Leveraging LLMs for HR Data Knowledge Graphs with Information\n Propagation-based Job Recommendation","summary":" Knowledge Graphs (KGs) serving as semantic networks, prove highly effective\nin managing complex interconnected data in different domains, by offering a\nunified, contextualized, and structured representation with flexibility that\nallows for easy adaptation to evolving knowledge. Processing complex Human\nResources (HR) data, KGs can help in different HR functions like recruitment,\njob matching, identifying learning gaps, and enhancing employee retention.\nDespite their potential, limited efforts have been made to implement practical\nHR knowledge graphs. This study addresses this gap by presenting a framework\nfor effectively developing HR knowledge graphs from documents using Large\nLanguage Models. The resulting KG can be used for a variety of downstream\ntasks, including job matching, identifying employee skill gaps, and many more.\nIn this work, we showcase instances where HR KGs prove instrumental in precise\njob matching, yielding advantages for both employers and employees. Empirical\nevidence from experiments with information propagation in KGs and Graph Neural\nNets, along with case studies underscores the effectiveness of KGs in tasks\nsuch as job and employee recommendations and job area classification. Code and\ndata are available at : https://github.com/azminewasi/HRGraph\n","authors":["Azmine Toushik Wasi"],"pdf_url":"https://arxiv.org/pdf/2408.13521v1.pdf","comment":"7 Pages, 4 Figures. View in ACL Anthology:\n https://aclanthology.org/2024.kallm-1.6/"},{"id":"http://arxiv.org/abs/2408.13518v1","updated":"2024-08-24T08:44:04Z","published":"2024-08-24T08:44:04Z","title":"Selective Preference Optimization via Token-Level Reward Function\n Estimation","summary":" Recent advancements in large language model alignment leverage token-level\nsupervisions to perform fine-grained preference optimization. However, existing\ntoken-level alignment methods either optimize on all available tokens, which\ncan be noisy and inefficient, or perform selective training with complex and\nexpensive key token selection strategies. In this work, we propose Selective\nPreference Optimization (SePO), a novel selective alignment strategy that\ncenters on efficient key token selection. SePO proposes the first token\nselection method based on Direct Preference Optimization (DPO), which trains an\noracle model to estimate a token-level reward function on the target data. 
This\nmethod applies to any existing alignment datasets with response-level\nannotations and enables cost-efficient token selection with small-scale oracle\nmodels and training data. The estimated reward function is then utilized to\nscore all tokens within the target dataset, where only the key tokens are\nselected to supervise the target policy model with a reference model-free\ncontrastive objective function. Extensive experiments on three public\nevaluation benchmarks show that SePO significantly outperforms competitive\nbaseline methods by only optimizing 30% key tokens on the target dataset. SePO\napplications on weak-to-strong generalization show that weak oracle models\neffectively supervise strong policy models with up to 16.8x more parameters.\nSePO also effectively selects key tokens from out-of-distribution data to\nenhance strong policy models and alleviate the over-optimization problem.\n","authors":["Kailai Yang","Zhiwei Liu","Qianqian Xie","Jimin Huang","Erxue Min","Sophia Ananiadou"],"pdf_url":"https://arxiv.org/pdf/2408.13518v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2408.13501v1","updated":"2024-08-24T06:59:55Z","published":"2024-08-24T06:59:55Z","title":"Utilizing Large Language Models for Named Entity Recognition in\n Traditional Chinese Medicine against COVID-19 Literature: Comparative Study","summary":" Objective: To explore and compare the performance of ChatGPT and other\nstate-of-the-art LLMs on domain-specific NER tasks covering different entity\ntypes and domains in TCM against COVID-19 literature. Methods: We established a\ndataset of 389 articles on TCM against COVID-19, and manually annotated 48 of\nthem with 6 types of entities belonging to 3 domains as the ground truth,\nagainst which the NER performance of LLMs can be assessed. We then performed\nNER tasks for the 6 entity types using ChatGPT (GPT-3.5 and GPT-4) and 4\nstate-of-the-art BERT-based question-answering (QA) models (RoBERTa, MiniLM,\nPubMedBERT and SciBERT) without prior training on the specific task. A domain\nfine-tuned model (GSAP-NER) was also applied for a comprehensive comparison.\nResults: The overall performance of LLMs varied significantly in exact match\nand fuzzy match. In the fuzzy match, ChatGPT surpassed BERT-based QA models in\n5 out of 6 tasks, while in exact match, BERT-based QA models outperformed\nChatGPT in 5 out of 6 tasks but with a smaller F-1 difference. GPT-4 showed a\nsignificant advantage over other models in fuzzy match, especially on the\nentity type of TCM formula and the Chinese patent drug (TFD) and ingredient\n(IG). Although GPT-4 outperformed BERT-based models on entity type of herb,\ntarget, and research method, none of the F-1 scores exceeded 0.5. GSAP-NER,\noutperformed GPT-4 in terms of F-1 by a slight margin on RM. ChatGPT achieved\nconsiderably higher recalls than precisions, particularly in the fuzzy match.\nConclusions: The NER performance of LLMs is highly dependent on the entity\ntype, and their performance varies across application scenarios. ChatGPT could\nbe a good choice for scenarios where high recall is favored. However, for\nknowledge acquisition in rigorous scenarios, neither ChatGPT nor BERT-based QA\nmodels are off-the-shelf tools for professional practitioners.\n","authors":["Xu Tong","Nina Smirnova","Sharmila Upadhyaya","Ran Yu","Jack H. 
Culbert","Chao Sun","Wolfgang Otto","Philipp Mayr"],"pdf_url":"https://arxiv.org/pdf/2408.13501v1.pdf","comment":"22 pages with 2 figures"},{"id":"http://arxiv.org/abs/2408.13473v1","updated":"2024-08-24T05:15:15Z","published":"2024-08-24T05:15:15Z","title":"Why Antiwork: A RoBERTa-Based System for Work-Related Stress\n Identification and Leading Factor Analysis","summary":" Harsh working environments and work-related stress have been known to\ncontribute to mental health problems such as anxiety, depression, and suicidal\nideation. As such, it is paramount to create solutions that can both detect\nemployee unhappiness and find the root cause of the problem. While prior works\nhave examined causes of mental health using machine learning, they typically\nfocus on general mental health analysis, with few of them focusing on\nexplainable solutions or looking at the workplace-specific setting. r/antiwork\nis a subreddit for the antiwork movement, which is the desire to stop working\naltogether. Using this subreddit as a proxy for work environment\ndissatisfaction, we create a new dataset for antiwork sentiment detection and\nsubsequently train a model that highlights the words with antiwork sentiments.\nFollowing this, we performed a qualitative and quantitative analysis to uncover\nsome of the key insights into the mindset of individuals who identify with the\nantiwork movement and how their working environments influenced them. We find\nthat working environments that do not give employees authority or\nresponsibility, frustrating recruiting experiences, and unfair compensation,\nare some of the leading causes of the antiwork sentiment, resulting in a lack\nof self-confidence and motivation among their employees.\n","authors":["Tao Lu","Muzhe Wu","Xinyi Lu","Siyuan Xu","Shuyu Zhan","Anuj Tambwekar","Emily Mower Provost"],"pdf_url":"https://arxiv.org/pdf/2408.13473v1.pdf","comment":"13 pages, 8 figures"},{"id":"http://arxiv.org/abs/2408.13457v1","updated":"2024-08-24T04:03:35Z","published":"2024-08-24T04:03:35Z","title":"Make Every Penny Count: Difficulty-Adaptive Self-Consistency for\n Cost-Efficient Reasoning","summary":" Self-consistency (SC), a widely used decoding strategy for chain-of-thought\nreasoning, shows significant gains across various multi-step reasoning tasks\nbut comes with a high cost due to multiple sampling with the preset size. Its\nvariants, Adaptive self-consistency (ASC) and Early-stopping self-consistency\n(ESC), dynamically adjust the number of samples based on the posterior\ndistribution of a set of pre-samples, reducing the cost of SC with minimal\nimpact on performance. Both methods, however, do not exploit the prior\ninformation about question difficulty. It often results in unnecessary repeated\nsampling for easy questions that could be accurately answered with just one\nattempt, wasting resources. To tackle this problem, we propose\nDifficulty-Adaptive Self-Consistency (DSC), which leverages the difficulty\ninformation from both prior and posterior perspectives to adaptively allocate\ninference resources, further reducing the cost of SC. To demonstrate the\neffectiveness of DSC, we conduct extensive experiments on three popular\ncategories of reasoning tasks: arithmetic, commonsense and symbolic reasoning\non six benchmarks. 
The empirical results show that DSC consistently surpasses\nthe strong baseline ASC and ESC in terms of costs by a significant margin,\nwhile attaining comparable performances.\n","authors":["Xinglin Wang","Shaoxiong Feng","Yiwei Li","Peiwen Yuan","Yueqi Zhang","Boyuan Pan","Heda Wang","Yao Hu","Kan Li"],"pdf_url":"https://arxiv.org/pdf/2408.13457v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2408.11319v2","updated":"2024-08-24T03:58:40Z","published":"2024-08-21T03:59:51Z","title":"SarcasmBench: Towards Evaluating Large Language Models on Sarcasm\n Understanding","summary":" In the era of large language models (LLMs), the task of ``System I''~-~the\nfast, unconscious, and intuitive tasks, e.g., sentiment analysis, text\nclassification, etc., have been argued to be successfully solved. However,\nsarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices\nlike hyperbole and figuration to convey true sentiments and intentions,\ninvolving a higher level of abstraction than sentiment analysis. There is\ngrowing concern that the argument about LLMs' success may not be fully tenable\nwhen considering sarcasm understanding. To address this question, we select\neleven SOTA LLMs and eight SOTA pre-trained language models (PLMs) and present\ncomprehensive evaluations on six widely used benchmark datasets through\ndifferent prompting approaches, i.e., zero-shot input/output (IO) prompting,\nfew-shot IO prompting, chain of thought (CoT) prompting. Our results highlight\nthree key findings: (1) current LLMs underperform supervised PLMs based sarcasm\ndetection baselines across six sarcasm benchmarks. This suggests that\nsignificant efforts are still required to improve LLMs' understanding of human\nsarcasm. (2) GPT-4 consistently and significantly outperforms other LLMs across\nvarious prompting methods, with an average improvement of 14.0\\%$\\uparrow$.\nClaude 3 and ChatGPT demonstrate the next best performance after GPT-4. (3)\nFew-shot IO prompting method outperforms the other two methods: zero-shot IO\nand few-shot CoT. The reason is that sarcasm detection, being a holistic,\nintuitive, and non-rational cognitive process, is argued not to adhere to\nstep-by-step logical reasoning, making CoT less effective in understanding\nsarcasm compared to its effectiveness in mathematical reasoning tasks.\n","authors":["Yazhou Zhang","Chunwang Zou","Zheng Lian","Prayag Tiwari","Jing Qin"],"pdf_url":"https://arxiv.org/pdf/2408.11319v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.15228v2","updated":"2024-08-24T03:51:44Z","published":"2024-04-23T16:59:02Z","title":"Re-Thinking Inverse Graphics With Large Language Models","summary":" Inverse graphics -- the task of inverting an image into physical variables\nthat, when rendered, enable reproduction of the observed scene -- is a\nfundamental challenge in computer vision and graphics. Successfully\ndisentangling an image into its constituent elements, such as the shape, color,\nand material properties of the objects of the 3D scene that produced it,\nrequires a comprehensive understanding of the environment. This complexity\nlimits the ability of existing carefully engineered approaches to generalize\nacross domains. Inspired by the zero-shot ability of large language models\n(LLMs) to generalize to novel contexts, we investigate the possibility of\nleveraging the broad world knowledge encoded in such models to solve\ninverse-graphics problems. 
To this end, we propose the Inverse-Graphics Large\nLanguage Model (IG-LLM), an inverse-graphics framework centered around an LLM,\nthat autoregressively decodes a visual embedding into a structured,\ncompositional 3D-scene representation. We incorporate a frozen pre-trained\nvisual encoder and a continuous numeric head to enable end-to-end training.\nThrough our investigation, we demonstrate the potential of LLMs to facilitate\ninverse graphics through next-token prediction, without the application of\nimage-space supervision. Our analysis enables new possibilities for precise\nspatial reasoning about images that exploit the visual knowledge of LLMs. We\nrelease our code and data at https://ig-llm.is.tue.mpg.de/ to ensure the\nreproducibility of our investigation and to facilitate future research.\n","authors":["Peter Kulits","Haiwen Feng","Weiyang Liu","Victoria Abrevaya","Michael J. Black"],"pdf_url":"https://arxiv.org/pdf/2404.15228v2.pdf","comment":"TMLR camera-ready; 31 pages; project page:\n https://ig-llm.is.tue.mpg.de/"},{"id":"http://arxiv.org/abs/2408.10923v3","updated":"2024-08-24T03:22:09Z","published":"2024-08-20T15:05:02Z","title":"LBC: Language-Based-Classifier for Out-Of-Variable Generalization","summary":" Large Language Models (LLMs) have great success in natural language\nprocessing tasks such as response generation. However, their use in tabular\ndata has been limited due to their inferior performance compared to traditional\nmachine learning models (TMLs) such as XGBoost. We find that the pre-trained\nknowledge of LLMs enables them to interpret new variables that appear in a test\nwithout additional training, a capability central to the concept of\nOut-of-Variable (OOV). From the findings, we propose a\nLanguage-Based-Classifier (LBC), a classifier that maximizes the benefits of\nLLMs to outperform TMLs on OOV tasks. LBC employs three key methodological\nstrategies: 1) Categorical changes to adjust data to better fit the model's\nunderstanding, 2) Advanced order and indicator to enhance data representation\nto the model, and 3) Using verbalizer to map logit scores to classes during\ninference to generate model predictions. These strategies, combined with the\npre-trained knowledge of LBC, emphasize the model's ability to effectively\nhandle OOV tasks. We empirically and theoretically validate the superiority of\nLBC. LBC is the first study to apply an LLM-based model to OOV tasks. The\nsource code is at https://github.com/sksmssh/LBCforOOVGen\n","authors":["Kangjun Noh","Baekryun Seong","Hoyoon Byun","Youngjun Choi","Sungjin Song","Kyungwoo Song"],"pdf_url":"https://arxiv.org/pdf/2408.10923v3.pdf","comment":"16 pages, 7 figures, 4 tables"},{"id":"http://arxiv.org/abs/2408.06266v2","updated":"2024-08-24T03:19:13Z","published":"2024-08-12T16:24:51Z","title":"Anchored Preference Optimization and Contrastive Revisions: Addressing\n Underspecification in Alignment","summary":" Large Language Models (LLMs) are often aligned using contrastive alignment\nobjectives and preference pair datasets. The interaction between model, paired\ndata, and objective makes alignment a complicated procedure, sometimes\nproducing subpar results. We study this and find that (i) preference data gives\na better learning signal when the underlying responses are contrastive, and\n(ii) alignment objectives lead to better performance when they specify more\ncontrol over the model during training. 
Based on these insights, we introduce\nContrastive Learning from AI Revisions (CLAIR), a data-creation method which\nleads to more contrastive preference pairs, and Anchored Preference\nOptimization (APO), a controllable and more stable alignment objective. We\nalign Llama-3-8B-Instruct using various comparable datasets and alignment\nobjectives and measure MixEval-Hard scores, which correlate highly with human\njudgments. The CLAIR preferences lead to the strongest performance out of all\ndatasets, and APO consistently outperforms less controllable objectives. Our\nbest model, trained on 32K CLAIR preferences with APO, improves\nLlama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code\nis available at https://github.com/ContextualAI/CLAIR_and_APO.\n","authors":["Karel D'Oosterlinck","Winnie Xu","Chris Develder","Thomas Demeester","Amanpreet Singh","Christopher Potts","Douwe Kiela","Shikib Mehri"],"pdf_url":"https://arxiv.org/pdf/2408.06266v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13442v1","updated":"2024-08-24T02:48:40Z","published":"2024-08-24T02:48:40Z","title":"A Law of Next-Token Prediction in Large Language Models","summary":" Large language models (LLMs) have been widely employed across various\napplication domains, yet their black-box nature poses significant challenges to\nunderstanding how these models process input data internally to make\npredictions. In this paper, we introduce a precise and quantitative law that\ngoverns the learning of contextualized token embeddings through intermediate\nlayers in pre-trained LLMs for next-token prediction. Our findings reveal that\neach layer contributes equally to enhancing prediction accuracy, from the\nlowest to the highest layer -- a universal phenomenon observed across a diverse\narray of open-source LLMs, built on architectures such as Transformer, RWKV,\nand Mamba. We demonstrate that this law offers new perspectives and insights to\ninform and guide practices in LLM development and applications, including model\nscaling, pre-training tasks, and information flow. Overall, our law enables\nmore fine-grained approaches to the design, training, and interpretation of\nLLMs through scrutinizing their internal data processing mechanisms.\n","authors":["Hangfeng He","Weijie J. Su"],"pdf_url":"https://arxiv.org/pdf/2408.13442v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13440v1","updated":"2024-08-24T02:40:28Z","published":"2024-08-24T02:40:28Z","title":"Knowledge-Aware Conversation Derailment Forecasting Using Graph\n Convolutional Networks","summary":" Online conversations are particularly susceptible to derailment, which can\nmanifest itself in the form of toxic communication patterns including\ndisrespectful comments and abuse. Forecasting conversation derailment predicts\nsigns of derailment in advance enabling proactive moderation of conversations.\nState-of-the-art approaches to conversation derailment forecasting sequentially\nencode conversations and use graph neural networks to model dialogue user\ndynamics. However, existing graph models are not able to capture complex\nconversational characteristics such as context propagation and emotional\nshifts. The use of common sense knowledge enables a model to capture such\ncharacteristics, thus improving performance. Following this approach, here we\nderive commonsense statements from a knowledge base of dialogue contextual\ninformation to enrich a graph neural network classification architecture. 
We\nfuse the multi-source information on utterance into capsules, which are used by\na transformer-based forecaster to predict conversation derailment. Our model\ncaptures conversation dynamics and context propagation, outperforming the\nstate-of-the-art models on the CGA and CMV benchmark datasets\n","authors":["Enas Altarawneh","Ameeta Agrawal","Michael Jenkin","Manos Papagelis"],"pdf_url":"https://arxiv.org/pdf/2408.13440v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2306.12982;\n text overlap with arXiv:2106.01071 by other authors"},{"id":"http://arxiv.org/abs/2408.13432v1","updated":"2024-08-24T01:58:28Z","published":"2024-08-24T01:58:28Z","title":"Integrating Multi-Head Convolutional Encoders with Cross-Attention for\n Improved SPARQL Query Translation","summary":" The main task of the KGQA system (Knowledge Graph Question Answering) is to\nconvert user input questions into query syntax (such as SPARQL). With the rise\nof modern popular encoders and decoders like Transformer and ConvS2S, many\nscholars have shifted the research direction of SPARQL generation to the Neural\nMachine Translation (NMT) architecture or the generative AI field of\nText-to-SPARQL. In NMT-based QA systems, the system treats knowledge base query\nsyntax as a language. It uses NMT-based translation models to translate natural\nlanguage questions into query syntax. Scholars use popular architectures\nequipped with cross-attention, such as Transformer, ConvS2S, and BiLSTM, to\ntrain translation models for query syntax. To achieve better query results,\nthis paper improved the ConvS2S encoder and added multi-head attention from the\nTransformer, proposing a Multi-Head Conv encoder (MHC encoder) based on the\nn-gram language model. The principle is to use convolutional layers to capture\nlocal hidden features in the input sequence with different receptive fields,\nusing multi-head attention to calculate dependencies between them. Ultimately,\nwe found that the translation model based on the Multi-Head Conv encoder\nachieved better performance than other encoders, obtaining 76.52\\% and 83.37\\%\nBLEU-1 (BiLingual Evaluation Understudy) on the QALD-9 and LC-QuAD-1.0\ndatasets, respectively. Additionally, in the end-to-end system experiments on\nthe QALD-9 and LC-QuAD-1.0 datasets, we achieved leading results over other\nKGQA systems, with Macro F1-measures reaching 52\\% and 66\\%, respectively.\nMoreover, the experimental results show that with limited computational\nresources, if one possesses an excellent encoder-decoder architecture and\ncross-attention, experts and scholars can achieve outstanding performance\nequivalent to large pre-trained models using only general embeddings.\n","authors":["Yi-Hui Chen","Eric Jui-Lin Lu","Kwan-Ho Cheng"],"pdf_url":"https://arxiv.org/pdf/2408.13432v1.pdf","comment":"24 pages, 20 figures, using the engrXiv template; the full version\n has been submitted to ACM Transactions on Information Systems and is\n currently under review. (2024)"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2408.13672v1","updated":"2024-08-24T21:22:15Z","published":"2024-08-24T21:22:15Z","title":"ColBERT's [MASK]-based Query Augmentation: Effects of Quadrupling the\n Query Input Length","summary":" A unique aspect of ColBERT is its use of [MASK] tokens in queries to score\ndocuments (query augmentation). 
Prior work shows [MASK] tokens weighting\nnon-[MASK] query terms, emphasizing certain tokens over others , rather than\nintroducing whole new terms as initially proposed. We begin by demonstrating\nthat a term weighting behavior previously reported for [MASK] tokens in\nColBERTv1 holds for ColBERTv2. We then examine the effect of changing the\nnumber of [MASK] tokens from zero to up to four times past the query input\nlength used in training, both for first stage retrieval, and for scoring\ncandidates, observing an initial decrease in performance with few [MASK]s, a\nlarge increase when enough [MASK]s are added to pad queries to an average\nlength of 32, then a plateau in performance afterwards. Additionally, we\ncompare baseline performance to performance when the query length is extended\nto 128 tokens, and find that differences are small (e.g., within 1% on various\nmetrics) and generally statistically insignificant, indicating performance does\nnot collapse if ColBERT is presented with more [MASK] tokens than expected.\n","authors":["Ben Giacalone","Richard Zanibbi"],"pdf_url":"https://arxiv.org/pdf/2408.13672v1.pdf","comment":"5 pages, 3 figures, two tables"},{"id":"http://arxiv.org/abs/2401.13463v3","updated":"2024-08-24T20:28:38Z","published":"2024-01-24T14:08:38Z","title":"SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken\n Question Answering","summary":" Spoken Question Answering (SQA) is essential for machines to reply to user's\nquestion by finding the answer span within a given spoken passage. SQA has been\npreviously achieved without ASR to avoid recognition errors and\nOut-of-Vocabulary (OOV) problems. However, the real-world problem of\nOpen-domain SQA (openSQA), in which the machine needs to first retrieve\npassages that possibly contain the answer from a spoken archive in addition,\nwas never considered. This paper proposes the first known end-to-end framework,\nSpeech Dense Passage Retriever (SpeechDPR), for the retrieval component of the\nopenSQA problem. SpeechDPR learns a sentence-level semantic representation by\ndistilling knowledge from the cascading model of unsupervised ASR (UASR) and\ntext dense retriever (TDR). No manually transcribed speech data is needed.\nInitial experiments showed performance comparable to the cascading model of\nUASR and TDR, and significantly better when UASR was poor, verifying this\napproach is more robust to speech recognition errors.\n","authors":["Chyi-Jiunn Lin","Guan-Ting Lin","Yung-Sung Chuang","Wei-Lun Wu","Shang-Wen Li","Abdelrahman Mohamed","Hung-yi Lee","Lin-shan Lee"],"pdf_url":"https://arxiv.org/pdf/2401.13463v3.pdf","comment":"Accepted at ICASSP 2024"},{"id":"http://arxiv.org/abs/2408.11623v2","updated":"2024-08-24T20:24:42Z","published":"2024-08-21T13:48:00Z","title":"End-to-End Cost-Effective Incentive Recommendation under Budget\n Constraint with Uplift Modeling","summary":" In modern online platforms, incentives are essential factors that enhance\nuser engagement and increase platform revenue. Over recent years, uplift\nmodeling has been introduced as a strategic approach to assign incentives to\nindividual customers. Especially in many real-world applications, online\nplatforms can only incentivize customers with specific budget constraints. This\nproblem can be reformulated as the multi-choice knapsack problem. This\noptimization aims to select the optimal incentive for each customer to maximize\nthe return on investment. 
Recent works in this field frequently tackle the\nbudget allocation problem using a two-stage approach. However, this solution is\nconfronted with the following challenges: (1) The causal inference methods\noften ignore the domain knowledge in online marketing, where the expected\nresponse curve of a customer should be monotonic and smooth as the incentive\nincreases. (2) An optimality gap between the two stages results in inferior\nsub-optimal allocation performance due to the loss of the incentive\nrecommendation information for the uplift prediction under the limited budget\nconstraint. To address these challenges, we propose a novel End-to-End\nCost-Effective Incentive Recommendation (E3IR) model under budget constraints.\nSpecifically, our methods consist of two modules, i.e., the uplift prediction\nmodule and the differentiable allocation module. In the uplift prediction\nmodule, we construct prediction heads to capture the incremental improvement\nbetween adjacent treatments with the marketing domain constraints (i.e.,\nmonotonic and smooth). We incorporate integer linear programming (ILP) as a\ndifferentiable layer input in the allocation module. Furthermore, we conduct\nextensive experiments on public and real product datasets, demonstrating that\nour E3IR improves allocation performance compared to existing two-stage\napproaches.\n","authors":["Zexu Sun","Hao Yang","Dugang Liu","Yunpeng Weng","Xing Tang","Xiuqiang He"],"pdf_url":"https://arxiv.org/pdf/2408.11623v2.pdf","comment":"Accepted by RecSys 2024"},{"id":"http://arxiv.org/abs/2408.13521v1","updated":"2024-08-24T08:50:25Z","published":"2024-08-24T08:50:25Z","title":"HRGraph: Leveraging LLMs for HR Data Knowledge Graphs with Information\n Propagation-based Job Recommendation","summary":" Knowledge Graphs (KGs) serving as semantic networks, prove highly effective\nin managing complex interconnected data in different domains, by offering a\nunified, contextualized, and structured representation with flexibility that\nallows for easy adaptation to evolving knowledge. Processing complex Human\nResources (HR) data, KGs can help in different HR functions like recruitment,\njob matching, identifying learning gaps, and enhancing employee retention.\nDespite their potential, limited efforts have been made to implement practical\nHR knowledge graphs. This study addresses this gap by presenting a framework\nfor effectively developing HR knowledge graphs from documents using Large\nLanguage Models. The resulting KG can be used for a variety of downstream\ntasks, including job matching, identifying employee skill gaps, and many more.\nIn this work, we showcase instances where HR KGs prove instrumental in precise\njob matching, yielding advantages for both employers and employees. Empirical\nevidence from experiments with information propagation in KGs and Graph Neural\nNets, along with case studies underscores the effectiveness of KGs in tasks\nsuch as job and employee recommendations and job area classification. Code and\ndata are available at : https://github.com/azminewasi/HRGraph\n","authors":["Azmine Toushik Wasi"],"pdf_url":"https://arxiv.org/pdf/2408.13521v1.pdf","comment":"7 Pages, 4 Figures. 
View in ACL Anthology:\n https://aclanthology.org/2024.kallm-1.6/"},{"id":"http://arxiv.org/abs/2408.13501v1","updated":"2024-08-24T06:59:55Z","published":"2024-08-24T06:59:55Z","title":"Utilizing Large Language Models for Named Entity Recognition in\n Traditional Chinese Medicine against COVID-19 Literature: Comparative Study","summary":" Objective: To explore and compare the performance of ChatGPT and other\nstate-of-the-art LLMs on domain-specific NER tasks covering different entity\ntypes and domains in TCM against COVID-19 literature. Methods: We established a\ndataset of 389 articles on TCM against COVID-19, and manually annotated 48 of\nthem with 6 types of entities belonging to 3 domains as the ground truth,\nagainst which the NER performance of LLMs can be assessed. We then performed\nNER tasks for the 6 entity types using ChatGPT (GPT-3.5 and GPT-4) and 4\nstate-of-the-art BERT-based question-answering (QA) models (RoBERTa, MiniLM,\nPubMedBERT and SciBERT) without prior training on the specific task. A domain\nfine-tuned model (GSAP-NER) was also applied for a comprehensive comparison.\nResults: The overall performance of LLMs varied significantly in exact match\nand fuzzy match. In the fuzzy match, ChatGPT surpassed BERT-based QA models in\n5 out of 6 tasks, while in exact match, BERT-based QA models outperformed\nChatGPT in 5 out of 6 tasks but with a smaller F-1 difference. GPT-4 showed a\nsignificant advantage over other models in fuzzy match, especially on the\nentity type of TCM formula and the Chinese patent drug (TFD) and ingredient\n(IG). Although GPT-4 outperformed BERT-based models on entity type of herb,\ntarget, and research method, none of the F-1 scores exceeded 0.5. GSAP-NER,\noutperformed GPT-4 in terms of F-1 by a slight margin on RM. ChatGPT achieved\nconsiderably higher recalls than precisions, particularly in the fuzzy match.\nConclusions: The NER performance of LLMs is highly dependent on the entity\ntype, and their performance varies across application scenarios. ChatGPT could\nbe a good choice for scenarios where high recall is favored. However, for\nknowledge acquisition in rigorous scenarios, neither ChatGPT nor BERT-based QA\nmodels are off-the-shelf tools for professional practitioners.\n","authors":["Xu Tong","Nina Smirnova","Sharmila Upadhyaya","Ran Yu","Jack H. Culbert","Chao Sun","Wolfgang Otto","Philipp Mayr"],"pdf_url":"https://arxiv.org/pdf/2408.13501v1.pdf","comment":"22 pages with 2 figures"},{"id":"http://arxiv.org/abs/2302.07335v2","updated":"2024-08-24T02:22:16Z","published":"2023-02-14T20:44:12Z","title":"Intelligent Model Update Strategy for Sequential Recommendation","summary":" Modern online platforms are increasingly employing recommendation systems to\naddress information overload and improve user engagement. There is an evolving\nparadigm in this research field that recommendation network learning occurs\nboth on the cloud and on edges with knowledge transfer in between (i.e.,\nedge-cloud collaboration). Recent works push this field further by enabling\nedge-specific context-aware adaptivity, where model parameters are updated in\nreal-time based on incoming on-edge data. However, we argue that frequent data\nexchanges between the cloud and edges often lead to inefficiency and waste of\ncommunication/computation resources, as considerable parameter updates might be\nredundant. 
To investigate this problem, we introduce Intelligent Edge-Cloud\nParameter Request Model, abbreviated as IntellectReq.\n IntellectReq is designed to operate on edge, evaluating the cost-benefit\nlandscape of parameter requests with minimal computation and communication\noverhead. We formulate this as a novel learning task, aimed at the detection of\nout-of-distribution data, thereby fine-tuning adaptive communication\nstrategies. Further, we employ statistical mapping techniques to convert\nreal-time user behavior into a normal distribution, thereby employing\nmulti-sample outputs to quantify the model's uncertainty and thus its\ngeneralization capabilities. Rigorous empirical validation on four\nwidely-adopted benchmarks evaluates our approach, evidencing a marked\nimprovement in the efficiency and generalizability of edge-cloud collaborative\nand dynamic recommendation systems.\n","authors":["Zheqi Lv","Wenqiao Zhang","Zhengyu Chen","Shengyu Zhang","Kun Kuang"],"pdf_url":"https://arxiv.org/pdf/2302.07335v2.pdf","comment":"Published on WWW'24(Oral): Proceedings of the ACM on Web Conference\n 2024 (pp. 3117-3128)"},{"id":"http://arxiv.org/abs/2408.13484v1","updated":"2024-08-24T06:07:25Z","published":"2024-08-24T06:07:25Z","title":"IntOPE: Off-Policy Evaluation in the Presence of Interference","summary":" Off-Policy Evaluation (OPE) is employed to assess the potential impact of a\nhypothetical policy using logged contextual bandit feedback, which is crucial\nin areas such as personalized medicine and recommender systems, where online\ninteractions are associated with significant risks and costs. Traditionally,\nOPE methods rely on the Stable Unit Treatment Value Assumption (SUTVA), which\nassumes that the reward for any given individual is unaffected by the actions\nof others. However, this assumption often fails in real-world scenarios due to\nthe presence of interference, where an individual's reward is affected not just\nby their own actions but also by the actions of their peers. This realization\nreveals significant limitations of existing OPE methods in real-world\napplications. To address this limitation, we propose IntIPW, an IPW-style\nestimator that extends the Inverse Probability Weighting (IPW) framework by\nintegrating marginalized importance weights to account for both individual\nactions and the influence of adjacent entities. Extensive experiments are\nconducted on both synthetic and real-world data to demonstrate the\neffectiveness of the proposed IntIPW method.\n","authors":["Yuqi Bai","Ziyu Zhao","Minqin Zhu","Kun Kuang"],"pdf_url":"https://arxiv.org/pdf/2408.13484v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2408.13608v1","updated":"2024-08-24T15:36:08Z","published":"2024-08-24T15:36:08Z","title":"SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural\n Language Description","summary":" Speech-language multi-modal learning presents a significant challenge due to\nthe fine nuanced information inherent in speech styles. Therefore, a\nlarge-scale dataset providing elaborate comprehension of speech style is\nurgently needed to facilitate insightful interplay between speech audio and\nnatural language. However, constructing such datasets presents a major\ntrade-off between large-scale data collection and high-quality annotation. To\ntackle this challenge, we propose an automatic speech annotation system for\nexpressiveness interpretation that annotates in-the-wild speech clips with\nexpressive and vivid human language descriptions. 
Initially, speech audios are\nprocessed by a series of expert classifiers and captioning models to capture\ndiverse speech characteristics, followed by a fine-tuned LLaMA for customized\nannotation generation. Unlike previous tag/templet-based annotation frameworks\nwith limited information and diversity, our system provides in-depth\nunderstandings of speech style through tailored natural language descriptions,\nthereby enabling accurate and voluminous data generation for large model\ntraining. With this system, we create SpeechCraft, a fine-grained bilingual\nexpressive speech dataset. It is distinguished by highly descriptive natural\nlanguage style prompts, containing approximately 2,000 hours of audio data and\nencompassing over two million speech clips. Extensive experiments demonstrate\nthat the proposed dataset significantly boosts speech-language task performance\nin stylist speech synthesis and speech style understanding.\n","authors":["Zeyu Jin","Jia Jia","Qixin Wang","Kehan Li","Shuoyi Zhou","Songtao Zhou","Xiaoyu Qin","Zhiyong Wu"],"pdf_url":"https://arxiv.org/pdf/2408.13608v1.pdf","comment":"Accepted by ACM Multimedia 2024"},{"id":"http://arxiv.org/abs/2408.13520v1","updated":"2024-08-24T08:47:09Z","published":"2024-08-24T08:47:09Z","title":"An Open, Cross-Platform, Web-Based Metaverse Using WebXR and A-Frame","summary":" The metaverse has received much attention in the literature and industry in\nthe last few years, but the lack of an open and cross-platform architecture has\nled to many distinct metaverses that cannot communicate with each other. This\nwork proposes a WebXR-based cross-platform architecture for developing spatial\nweb apps using the A-Frame and Networked-Aframe frameworks with a view to an\nopen and interoperable metaverse, accessible from both the web and extended\nreality devices. A prototype was implemented and evaluated, supporting the\ncapability of the technology stack to enable immersive experiences across\ndifferent platforms and devices. Positive feedback on ease of use of the\nimmersive environment further corroborates the proposed approach, underscoring\nits effectiveness in facilitating engaging and interactive virtual spaces. By\nadhering to principles of interoperability and inclusivity, it lives up to Tim\nBerners-Lee's vision of the World Wide Web as an open platform that transcends\ngeographical and technical boundaries.\n","authors":["Giuseppe Macario"],"pdf_url":"https://arxiv.org/pdf/2408.13520v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2404.05317"},{"id":"http://arxiv.org/abs/2403.02905v3","updated":"2024-08-24T00:29:50Z","published":"2024-03-05T12:13:18Z","title":"MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model","summary":" The body movements accompanying speech aid speakers in expressing their\nideas. Co-speech motion generation is one of the important approaches for\nsynthesizing realistic avatars. Due to the intricate correspondence between\nspeech and motion, generating realistic and diverse motion is a challenging\ntask. In this paper, we propose MMoFusion, a Multi-modal co-speech Motion\ngeneration framework based on the diffusion model to ensure both the\nauthenticity and diversity of generated motion. We propose a progressive fusion\nstrategy to enhance the interaction of inter-modal and intra-modal, efficiently\nintegrating multi-modal information. 
Specifically, we employ a masked style\nmatrix based on emotion and identity information to control the generation of\ndifferent motion styles. Temporal modeling of speech and motion is partitioned\ninto style-guided specific feature encoding and shared feature encoding, aiming\nto learn both inter-modal and intra-modal features. Besides, we propose a\ngeometric loss to enforce the joints' velocity and acceleration coherence among\nframes. Our framework generates vivid, diverse, and style-controllable motion\nof arbitrary length through inputting speech and editing identity and emotion.\nExtensive experiments demonstrate that our method outperforms current co-speech\nmotion generation methods including upper body and challenging full body.\n","authors":["Sen Wang","Jiangning Zhang","Xin Tan","Zhifeng Xie","Chengjie Wang","Lizhuang Ma"],"pdf_url":"https://arxiv.org/pdf/2403.02905v3.pdf","comment":null}]},"2024-08-27T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2401.16553v7","updated":"2024-08-27T17:57:07Z","published":"2024-01-29T20:44:10Z","title":"SelectLLM: Can LLMs Select Important Instructions to Annotate?","summary":" Instruction tuning benefits from large and diverse datasets; however,\ncreating such datasets involves a high cost of human labeling. While synthetic\ndatasets generated by large language models (LLMs) have partly solved this\nissue, they often contain low-quality data. One effective solution is\nselectively annotating unlabelled instructions, especially given the relative\nease of acquiring unlabeled instructions or texts from various sources.\nHowever, how to select unlabelled instructions is not well-explored, especially\nin the context of LLMs. Therefore, we introduce SelectLLM, an alternative\nframework that leverages the capabilities of LLMs to select unlabeled\ninstructions more effectively. Specifically, SelectLLM consists of two key\nsteps: Coreset-based clustering of unlabelled instructions for enlarging\ndiversity and prompting of LLM to identify the most beneficial instructions\nwithin each cluster. We evaluate SelectLLM on AlpacaEval2 and MT-Bench,\ndemonstrating its ability to outperform state-of-the-art methods like\nAlpagasus. In addition, we compare the performance and compatibility of\nSelectLLM with various LLMs, such as ChatGPT, LLaMA-3.1-70B, and Gemma-2-27b.\nSelectLLM's adaptability and robustness are further evidenced by its ability to\nmaintain high performance across both human and synthetic datasets. All code\nand data are publicly available (https://github.com/minnesotanlp/select-llm).\n","authors":["Ritik Sachin Parkar","Jaehyung Kim","Jong Inn Park","Dongyeop Kang"],"pdf_url":"https://arxiv.org/pdf/2401.16553v7.pdf","comment":"First Authors: Ritik Sachin Parkar and Jaehyung Kim | Second Author:\n Jong Inn Park | PI: Dongyeop Kang"},{"id":"http://arxiv.org/abs/2408.15232v1","updated":"2024-08-27T17:50:03Z","published":"2024-08-27T17:50:03Z","title":"Into the Unknown Unknowns: Engaged Human Learning through Participation\n in Language Model Agent Conversations","summary":" While language model (LM)-powered chatbots and generative search engines\nexcel at answering concrete queries, discovering information in the terrain of\nunknown unknowns remains challenging for users. To emulate the common\neducational scenario where children/students learn by listening to and\nparticipating in conversations of their parents/teachers, we create\nCollaborative STORM (Co-STORM). 
Unlike QA systems that require users to ask all\nthe questions, Co-STORM lets users observe and occasionally steer the discourse\namong several LM agents. The agents ask questions on the user's behalf,\nallowing the user to discover unknown unknowns serendipitously. To facilitate\nuser interaction, Co-STORM assists users in tracking the discourse by\norganizing the uncovered information into a dynamic mind map, ultimately\ngenerating a comprehensive report as takeaways. For automatic evaluation, we\nconstruct the WildSeek dataset by collecting real information-seeking records\nwith user goals. Co-STORM outperforms baseline methods on both discourse trace\nand report quality. In a further human evaluation, 70% of participants prefer\nCo-STORM over a search engine, and 78% favor it over a RAG chatbot.\n","authors":["Yucheng Jiang","Yijia Shao","Dekun Ma","Sina J. Semnani","Monica S. Lam"],"pdf_url":"https://arxiv.org/pdf/2408.15232v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03334v2","updated":"2024-08-27T17:46:31Z","published":"2024-03-05T21:36:23Z","title":"DIVERSE: A Dataset of YouTube Video Comment Stances with a Data\n Programming Model","summary":" Stance detection of social media text is a key component of many real-world\napplications like evaluating marketing campaigns, evaluating political policies\nor candidates, or evaluating information environments. However, creating\nautomatic stance labeling systems requires the manual annotation of stances,\nwhich is both tedious and resource-intensive. This paper introduces a stance\nlabeling method that makes use of weak signals of sentence tone, then\nconsolidating these signals with a Data Programmingmodel for the final stance\nlabel. In a time of international conflict, understanding the public opinion\ntowards the country's military is crucial for recruitment. We present DIVERSE,\na dataset involve stances towards YouTube videos of the US military (Dataset\navailable at https://doi.org/10.5281/zenodo.10493803). On average, the videos\nhave 200 comments each, and the stances skew slightly towards the \"against\"\ncharacterization for both the US army and the video.\n","authors":["Iain J. Cruickshank","Amir Soofi","Lynnette Hui Xian Ng"],"pdf_url":"https://arxiv.org/pdf/2403.03334v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15221v1","updated":"2024-08-27T17:33:30Z","published":"2024-08-27T17:33:30Z","title":"LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet","summary":" Recent large language model (LLM) defenses have greatly improved models'\nability to refuse harmful queries, even when adversarially attacked. However,\nLLM defenses are primarily evaluated against automated adversarial attacks in a\nsingle turn of conversation, an insufficient threat model for real-world\nmalicious use. We demonstrate that multi-turn human jailbreaks uncover\nsignificant vulnerabilities, exceeding 70% attack success rate (ASR) on\nHarmBench against defenses that report single-digit ASRs with automated\nsingle-turn attacks. Human jailbreaks also reveal vulnerabilities in machine\nunlearning defenses, successfully recovering dual-use biosecurity knowledge\nfrom unlearned models. 
We compile these results into Multi-Turn Human\nJailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks.\nWe publicly release MHJ alongside a compendium of jailbreak tactics developed\nacross dozens of commercial red teaming engagements, supporting research\ntowards stronger LLM defenses.\n","authors":["Nathaniel Li","Ziwen Han","Ian Steneker","Willow Primack","Riley Goodside","Hugh Zhang","Zifan Wang","Cristina Menghini","Summer Yue"],"pdf_url":"https://arxiv.org/pdf/2408.15221v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15213v1","updated":"2024-08-27T17:19:57Z","published":"2024-08-27T17:19:57Z","title":"Classifying populist language in American presidential and governor\n speeches using automatic text analysis","summary":" Populism is a concept that is often used but notoriously difficult to\nmeasure. Common qualitative measurements like holistic grading or content\nanalysis require great amounts of time and labour, making it difficult to\nquickly scope out which politicians should be classified as populist and which\nshould not, while quantitative methods show mixed results when it comes to\nclassifying populist rhetoric. In this paper, we develop a pipeline to train\nand validate an automated classification model to estimate the use of populist\nlanguage. We train models based on sentences that were identified as populist\nand pluralist in 300 US governors' speeches from 2010 to 2018 and in 45\nspeeches of presidential candidates in 2016. We find that these models classify\nmost speeches correctly, including 84% of governor speeches and 89% of\npresidential speeches. These results extend to different time periods (with 92%\naccuracy on more recent American governors), different amounts of data (with as\nfew as 70 training sentences per category achieving similar results), and when\nclassifying politicians instead of individual speeches. This pipeline is thus\nan effective tool that can optimise the systematic and swift classification of\nthe use of populist language in politicians' speeches.\n","authors":["Olaf van der Veen","Semir Dzebo","Levi Littvay","Kirk Hawkins","Oren Dar"],"pdf_url":"https://arxiv.org/pdf/2408.15213v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15204v1","updated":"2024-08-27T17:03:18Z","published":"2024-08-27T17:03:18Z","title":"Can Unconfident LLM Annotations Be Used for Confident Conclusions?","summary":" Large language models (LLMs) have shown high agreement with human raters\nacross a variety of tasks, demonstrating potential to ease the challenges of\nhuman data collection. In computational social science (CSS), researchers are\nincreasingly leveraging LLM annotations to complement slow and expensive human\nannotations. Still, guidelines for collecting and using LLM annotations,\nwithout compromising the validity of downstream conclusions, remain limited. We\nintroduce Confidence-Driven Inference: a method that combines LLM annotations\nand LLM confidence indicators to strategically select which human annotations\nshould be collected, with the goal of producing accurate statistical estimates\nand provably valid confidence intervals while reducing the number of human\nannotations needed. Our approach comes with safeguards against LLM annotations\nof poor quality, guaranteeing that the conclusions will be both valid and no\nless accurate than if we only relied on human annotations. 
We demonstrate the\neffectiveness of Confidence-Driven Inference over baselines in statistical\nestimation tasks across three CSS settings--text politeness, stance, and\nbias--reducing the needed number of human annotations by over 25% in each.\nAlthough we use CSS settings for demonstration, Confidence-Driven Inference can\nbe used to estimate most standard quantities across a broad range of NLP\nproblems.\n","authors":["Kristina Gligorić","Tijana Zrnic","Cinoo Lee","Emmanuel J. Candès","Dan Jurafsky"],"pdf_url":"https://arxiv.org/pdf/2408.15204v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15188v1","updated":"2024-08-27T16:44:41Z","published":"2024-08-27T16:44:41Z","title":"Infusing Acoustic Pause Context into Text-Based Dementia Assessment","summary":" Speech pauses, alongside content and structure, offer a valuable and\nnon-invasive biomarker for detecting dementia. This work investigates the use\nof pause-enriched transcripts in transformer-based language models to\ndifferentiate the cognitive states of subjects with no cognitive impairment,\nmild cognitive impairment, and Alzheimer's dementia based on their speech from\na clinical assessment. We address three binary classification tasks: Onset,\nmonitoring, and dementia exclusion. The performance is evaluated through\nexperiments on a German Verbal Fluency Test and a Picture Description Test,\ncomparing the model's effectiveness across different speech production\ncontexts. Starting from a textual baseline, we investigate the effect of\nincorporation of pause information and acoustic context. We show the test\nshould be chosen depending on the task, and similarly, lexical pause\ninformation and acoustic cross-attention contribute differently.\n","authors":["Franziska Braun","Sebastian P. Bayerl","Florian Hönig","Hartmut Lehfeld","Thomas Hillemacher","Tobias Bocklet","Korbinian Riedhammer"],"pdf_url":"https://arxiv.org/pdf/2408.15188v1.pdf","comment":"Accepted at INTERSPEECH 2024"},{"id":"http://arxiv.org/abs/2408.15176v1","updated":"2024-08-27T16:18:51Z","published":"2024-08-27T16:18:51Z","title":"Unlocking Potential in Pre-Trained Music Language Models for Versatile\n Multi-Track Music Arrangement","summary":" Large language models have shown significant capabilities across various\ndomains, including symbolic music generation. However, leveraging these\npre-trained models for controllable music arrangement tasks, each requiring\ndifferent forms of musical information as control, remains a novel challenge.\nIn this paper, we propose a unified sequence-to-sequence framework that enables\nthe fine-tuning of a symbolic music language model for multiple multi-track\narrangement tasks, including band arrangement, piano reduction, drum\narrangement, and voice separation. Our experiments demonstrate that the\nproposed approach consistently achieves higher musical quality compared to\ntask-specific baselines across all four tasks. 
Furthermore, through additional\nexperiments on probing analysis, we show the pre-training phase equips the\nmodel with essential knowledge to understand musical conditions, which is hard\nto acquired solely through task-specific fine-tuning.\n","authors":["Longshen Ou","Jingwei Zhao","Ziyu Wang","Gus Xia","Ye Wang"],"pdf_url":"https://arxiv.org/pdf/2408.15176v1.pdf","comment":"Submitted to AAAI 2025"},{"id":"http://arxiv.org/abs/2408.15172v1","updated":"2024-08-27T16:10:21Z","published":"2024-08-27T16:10:21Z","title":"X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation","summary":" Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been\nshown to enhance the effectiveness of enriching item descriptions, thereby\nimproving the accuracy of recommendation systems. However, most existing\napproaches either rely on text-only prompting or employ basic multimodal\nstrategies that do not fully exploit the complementary information available\nfrom both textual and visual modalities. This paper introduces a novel\nframework, Cross-Reflection Prompting, termed X-Reflect, designed to address\nthese limitations by prompting LMMs to explicitly identify and reconcile\nsupportive and conflicting information between text and images. By capturing\nnuanced insights from both modalities, this approach generates more\ncomprehensive and contextually richer item representations. Extensive\nexperiments conducted on two widely used benchmarks demonstrate that our method\noutperforms existing prompting baselines in downstream recommendation accuracy.\nAdditionally, we evaluate the generalizability of our framework across\ndifferent LMM backbones and the robustness of the prompting strategies,\noffering insights for optimization. This work underscores the importance of\nintegrating multimodal information and presents a novel solution for improving\nitem understanding in multimodal recommendation systems.\n","authors":["Hanjia Lyu","Ryan Rossi","Xiang Chen","Md Mehrab Tanjim","Stefano Petrangeli","Somdeb Sarkhel","Jiebo Luo"],"pdf_url":"https://arxiv.org/pdf/2408.15172v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15171v1","updated":"2024-08-27T16:09:56Z","published":"2024-08-27T16:09:56Z","title":"Measuring text summarization factuality using atomic facts entailment\n metrics in the context of retrieval augmented generation","summary":" The use of large language models (LLMs) has significantly increased since the\nintroduction of ChatGPT in 2022, demonstrating their value across various\napplications. However, a major challenge for enterprise and commercial adoption\nof LLMs is their tendency to generate inaccurate information, a phenomenon\nknown as \"hallucination.\" This project proposes a method for estimating the\nfactuality of a summary generated by LLMs when compared to a source text. Our\napproach utilizes Naive Bayes classification to assess the accuracy of the\ncontent produced.\n","authors":["N. E. Kriman"],"pdf_url":"https://arxiv.org/pdf/2408.15171v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2408.15138v1","updated":"2024-08-27T15:23:09Z","published":"2024-08-27T15:23:09Z","title":"How transformers learn structured data: insights from hierarchical\n filtering","summary":" We introduce a hierarchical filtering procedure for generative models of\nsequences on trees, enabling control over the range of positional correlations\nin the data. 
Leveraging this controlled setting, we provide evidence that\nvanilla encoder-only transformer architectures can implement the optimal Belief\nPropagation algorithm on both root classification and masked language modeling\ntasks. Correlations at larger distances corresponding to increasing layers of\nthe hierarchy are sequentially included as the network is trained. We analyze\nhow the transformer layers succeed by focusing on attention maps from models\ntrained with varying degrees of filtering. These attention maps show clear\nevidence for iterative hierarchical reconstruction of correlations, and we can\nrelate these observations to a plausible implementation of the exact inference\nalgorithm for the network sizes considered.\n","authors":["Jerome Garnier-Brun","Marc Mézard","Emanuele Moscato","Luca Saglietti"],"pdf_url":"https://arxiv.org/pdf/2408.15138v1.pdf","comment":"18 pages, 9 figures"},{"id":"http://arxiv.org/abs/2408.07531v2","updated":"2024-08-27T15:16:06Z","published":"2024-08-14T13:03:41Z","title":"Development of a Large Language Model-based Multi-Agent Clinical\n Decision Support System for Korean Triage and Acuity Scale (KTAS)-Based\n Triage and Treatment Planning in Emergency Departments","summary":" Emergency department (ED) overcrowding and the complexity of rapid\ndecision-making in critical care settings pose significant challenges to\nhealthcare systems worldwide. While clinical decision support systems (CDSS)\nhave shown promise, the integration of large language models (LLMs) offers new\npossibilities for enhancing triage accuracy and clinical decision-making. This\nstudy presents an LLM-driven CDSS designed to assist ED physicians and nurses\nin patient triage, treatment planning, and overall emergency care management.\n We developed a multi-agent CDSS utilizing Llama-3-70b as the base LLM,\norchestrated by CrewAI and Langchain. The system comprises four AI agents\nemulating key ED roles: Triage Nurse, Emergency Physician, Pharmacist, and ED\nCoordinator. It incorporates the Korean Triage and Acuity Scale (KTAS) for\ntriage assessment and integrates with the RxNorm API for medication management.\n The model was evaluated using the Asclepius dataset, with performance\nassessed by a clinical emergency medicine specialist. The CDSS demonstrated\nhigh accuracy in triage decision-making compared to the baseline of a\nsingle-agent system. Furthermore, the system exhibited strong performance in\ncritical areas, including primary diagnosis, critical findings identification,\ndisposition decision-making, treatment planning, and resource allocation.\n Our multi-agent CDSS demonstrates significant potential for supporting\ncomprehensive emergency care management. By leveraging state-of-the-art AI\ntechnologies, this system offers a scalable and adaptable tool that could\nenhance emergency medical care delivery, potentially alleviating ED\novercrowding and improving patient outcomes. 
This work contributes to the\ngrowing field of AI applications in emergency medicine and offers a promising\ndirection for future research and clinical implementation.\n","authors":["Seungjun Han","Wongyung Choi"],"pdf_url":"https://arxiv.org/pdf/2408.07531v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15091v1","updated":"2024-08-27T14:22:02Z","published":"2024-08-27T14:22:02Z","title":"Relation Also Knows: Rethinking the Recall and Editing of Factual\n Associations in Auto-Regressive Transformer Language Models","summary":" The storage and recall of factual associations in auto-regressive transformer\nlanguage models (LMs) have drawn a great deal of attention, inspiring knowledge\nediting by directly modifying the located model weights. Most editing works\nachieve knowledge editing under the guidance of existing interpretations of\nknowledge recall that mainly focus on subject knowledge. However, these\ninterpretations are seriously flawed, neglecting relation information and\nleading to the over-generalizing problem for editing. In this work, we discover\na novel relation-focused perspective to interpret the knowledge recall of\ntransformer LMs during inference and apply it on knowledge editing to avoid\nover-generalizing. Experimental results on the dataset supplemented with a new\nR-Specificity criterion demonstrate that our editing approach significantly\nalleviates over-generalizing while remaining competitive on other criteria,\nbreaking the domination of subject-focused editing for future research.\n","authors":["Xiyu Liu","Zhengxiao Liu","Naibin Gu","Zheng Lin","Wanli Ma","Ji Xiang","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2408.15091v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.05195v2","updated":"2024-08-27T14:20:57Z","published":"2023-11-09T08:19:34Z","title":"PRODIGy: a PROfile-based DIalogue Generation dataset","summary":" Providing dialogue agents with a profile representation can improve their\nconsistency and coherence, leading to better conversations. However, current\nprofile-based dialogue datasets for training such agents contain either\nexplicit profile representations that are simple and dialogue-specific, or\nimplicit representations that are difficult to collect. In this work, we\npropose a unified framework in which we bring together both standard and more\nsophisticated profile representations by creating a new resource where each\ndialogue is aligned with all possible speaker representations such as\ncommunication style, biographies, and personality. This framework allows to\ntest several baselines built using generative language models with several\nprofile configurations. The automatic evaluation shows that profile-based\nmodels have better generalisation capabilities than models trained on dialogues\nonly, both in-domain and cross-domain settings. These results are consistent\nfor fine-tuned models and instruction-based LLMs. Additionally, human\nevaluation demonstrates a clear preference for generations consistent with both\nprofile and context. Finally, to account for possible privacy concerns, all\nexperiments are done under two configurations: inter-character and\nintra-character. 
In the former, the LM stores the information about the\ncharacter in its internal representation, while in the latter, the LM does not\nretain any personal information but uses it only at inference time.\n","authors":["Daniela Occhipinti","Serra Sinem Tekiroglu","Marco Guerini"],"pdf_url":"https://arxiv.org/pdf/2311.05195v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14340v2","updated":"2024-08-27T14:09:44Z","published":"2024-08-26T15:13:14Z","title":"Foundation Models for Music: A Survey","summary":" In recent years, foundation models (FMs) such as large language models (LLMs)\nand latent diffusion models (LDMs) have profoundly impacted diverse sectors,\nincluding music. This comprehensive review examines state-of-the-art (SOTA)\npre-trained models and foundation models in music, spanning from representation\nlearning, generative learning and multimodal learning. We first contextualise\nthe significance of music in various industries and trace the evolution of AI\nin music. By delineating the modalities targeted by foundation models, we\ndiscover many of the music representations are underexplored in FM development.\nThen, emphasis is placed on the lack of versatility of previous methods on\ndiverse music applications, along with the potential of FMs in music\nunderstanding, generation and medical application. By comprehensively exploring\nthe details of the model pre-training paradigm, architectural choices,\ntokenisation, finetuning methodologies and controllability, we emphasise the\nimportant topics that should have been well explored, like instruction tuning\nand in-context learning, scaling law and emergent ability, as well as\nlong-sequence modelling etc. A dedicated section presents insights into music\nagents, accompanied by a thorough analysis of datasets and evaluations\nessential for pre-training and downstream tasks. Finally, by underscoring the\nvital importance of ethical considerations, we advocate that following research\non FM for music should focus more on such issues as interpretability,\ntransparency, human responsibility, and copyright issues. The paper offers\ninsights into future challenges and trends on FMs for music, aiming to shape\nthe trajectory of human-AI collaboration in the music realm.\n","authors":["Yinghao Ma","Anders Øland","Anton Ragni","Bleiz MacSen Del Sette","Charalampos Saitis","Chris Donahue","Chenghua Lin","Christos Plachouras","Emmanouil Benetos","Elio Quinton","Elona Shatri","Fabio Morreale","Ge Zhang","György Fazekas","Gus Xia","Huan Zhang","Ilaria Manco","Jiawen Huang","Julien Guinot","Liwei Lin","Luca Marinelli","Max W. Y. Lam","Megha Sharma","Qiuqiang Kong","Roger B. Dannenberg","Ruibin Yuan","Shangda Wu","Shih-Lun Wu","Shuqi Dai","Shun Lei","Shiyin Kang","Simon Dixon","Wenhu Chen","Wenhao Huang","Xingjian Du","Xingwei Qu","Xu Tan","Yizhi Li","Zeyue Tian","Zhiyong Wu","Zhizheng Wu","Ziyang Ma","Ziyu Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14340v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15079v1","updated":"2024-08-27T14:08:23Z","published":"2024-08-27T14:08:23Z","title":"BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and\n Deduplication by Introducing a Competitive Large Language Model Baseline","summary":" The general capabilities of Large Language Models (LLM) highly rely on the\ncomposition and selection on extensive pretraining datasets, treated as\ncommercial secrets by several institutions. 
To mitigate this issue, we\nopen-source the details of a universally applicable data processing pipeline\nand validate its effectiveness and potential by introducing a competitive LLM\nbaseline. Specifically, the data processing pipeline consists of broad\ncollection to scale up and reweighting to improve quality. We then pretrain a\n7B model BaichuanSEED with 3T tokens processed by our pipeline without any\ndeliberate downstream task-related optimization, followed by an easy but\neffective supervised fine-tuning stage. BaichuanSEED demonstrates consistency\nand predictability throughout training and achieves comparable performance on\ncomprehensive benchmarks with several commercial advanced large language\nmodels, such as Qwen1.5 and Llama3. We also conduct several heuristic\nexperiments to discuss the potential for further optimization of downstream\ntasks, such as mathematics and coding.\n","authors":["Guosheng Dong","Da Pan","Yiding Sun","Shusen Zhang","Zheng Liang","Xin Wu","Yanjun Shen","Fan Yang","Haoze Sun","Tianpeng Li","Mingan Lin","Jianhua Xu","Yufan Zhang","Xiaonan Nie","Lei Su","Bingning Wang","Wentao Zhang","Jiaxin Mao","Zenan Zhou","Weipeng Chen"],"pdf_url":"https://arxiv.org/pdf/2408.15079v1.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2408.15050v1","updated":"2024-08-27T13:19:32Z","published":"2024-08-27T13:19:32Z","title":"Self-supervised Topic Taxonomy Discovery in the Box Embedding Space","summary":" Topic taxonomy discovery aims at uncovering topics of different abstraction\nlevels and constructing hierarchical relations between them. Unfortunately,\nmost of prior work can hardly model semantic scopes of words and topics by\nholding the Euclidean embedding space assumption. What's worse, they infer\nasymmetric hierarchical relations by symmetric distances between topic\nembeddings. As a result, existing methods suffer from problems of low-quality\ntopics at high abstraction levels and inaccurate hierarchical relations. To\nalleviate these problems, this paper develops a Box embedding-based Topic Model\n(BoxTM) that maps words and topics into the box embedding space, where the\nasymmetric metric is defined to properly infer hierarchical relations among\ntopics. Additionally, our BoxTM explicitly infers upper-level topics based on\ncorrelation between specific topics through recursive clustering on topic\nboxes. Finally, extensive experiments validate high-quality of the topic\ntaxonomy learned by BoxTM.\n","authors":["Yuyin Lu","Hegang Chen","Pengbo Mao","Yanghui Rao","Haoran Xie","Fu Lee Wang","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2408.15050v1.pdf","comment":"to be published in TACL"},{"id":"http://arxiv.org/abs/2408.15040v1","updated":"2024-08-27T13:10:05Z","published":"2024-08-27T13:10:05Z","title":"A Survey of Large Language Models for European Languages","summary":" Large Language Models (LLMs) have gained significant attention due to their\nhigh performance on a wide range of natural language tasks since the release of\nChatGPT. The LLMs learn to understand and generate language by training\nbillions of model parameters on vast volumes of text data. Despite being a\nrelatively new field, LLM research is rapidly advancing in various directions.\nIn this paper, we present an overview of LLM families, including LLaMA, PaLM,\nGPT, and MoE, and the methods developed to create and enhance LLMs for official\nEuropean Union (EU) languages. 
We provide a comprehensive summary of common\nmonolingual and multilingual datasets used for pretraining LLMs.\n","authors":["Wazir Ali","Sampo Pyysalo"],"pdf_url":"https://arxiv.org/pdf/2408.15040v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15037v1","updated":"2024-08-27T13:07:07Z","published":"2024-08-27T13:07:07Z","title":"Evidence-Enhanced Triplet Generation Framework for Hallucination\n Alleviation in Generative Question Answering","summary":" To address the hallucination in generative question answering (GQA) where the\nanswer can not be derived from the document, we propose a novel\nevidence-enhanced triplet generation framework, EATQA, encouraging the model to\npredict all the combinations of (Question, Evidence, Answer) triplet by\nflipping the source pair and the target label to understand their logical\nrelationships, i.e., predict Answer(A), Question(Q), and Evidence(E) given a\nQE, EA, and QA pairs, respectively. Furthermore, we bridge the distribution gap\nto distill the knowledge from evidence in inference stage. Our framework\nensures the model to learn the logical relation between query, evidence and\nanswer, which simultaneously improves the evidence generation and query\nanswering. In this paper, we apply EATQA to LLama and it outperforms other\nLLMs-based methods and hallucination mitigation approaches on two challenging\nGQA benchmarks. Further analysis shows that our method not only keeps prior\nknowledge within LLM, but also mitigates hallucination and generates faithful\nanswers.\n","authors":["Haowei Du","Huishuai Zhang","Dongyan Zhao"],"pdf_url":"https://arxiv.org/pdf/2408.15037v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14991v1","updated":"2024-08-27T12:15:43Z","published":"2024-08-27T12:15:43Z","title":"Speech Recognition Transformers: Topological-lingualism Perspective","summary":" Transformers have evolved with great success in various artificial\nintelligence tasks. Thanks to our recent prevalence of self-attention\nmechanisms, which capture long-term dependency, phenomenal outcomes in speech\nprocessing and recognition tasks have been produced. The paper presents a\ncomprehensive survey of transformer techniques oriented in speech modality. The\nmain contents of this survey include (1) background of traditional ASR,\nend-to-end transformer ecosystem, and speech transformers (2) foundational\nmodels in a speech via lingualism paradigm, i.e., monolingual, bilingual,\nmultilingual, and cross-lingual (3) dataset and languages, acoustic features,\narchitecture, decoding, and evaluation metric from a specific topological\nlingualism perspective (4) popular speech transformer toolkit for building\nend-to-end ASR systems. Finally, highlight the discussion of open challenges\nand potential research directions for the community to conduct further research\nin this domain.\n","authors":["Shruti Singh","Muskaan Singh","Virender Kadyan"],"pdf_url":"https://arxiv.org/pdf/2408.14991v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14972v1","updated":"2024-08-27T11:24:38Z","published":"2024-08-27T11:24:38Z","title":"AgentMonitor: A Plug-and-Play Framework for Predictive and Secure\n Multi-Agent Systems","summary":" The rapid advancement of large language models (LLMs) has led to the rise of\nLLM-based agents. Recent research shows that multi-agent systems (MAS), where\neach agent plays a specific role, can outperform individual LLMs. However,\nconfiguring an MAS for a task remains challenging, with performance only\nobservable post-execution. 
Inspired by scaling laws in LLM development, we\ninvestigate whether MAS performance can be predicted beforehand. We introduce\nAgentMonitor, a framework that integrates at the agent level to capture inputs\nand outputs, transforming them into statistics for training a regression model\nto predict task performance. Additionally, it can further apply real-time\ncorrections to address security risks posed by malicious agents, mitigating\nnegative impacts and enhancing MAS security. Experiments demonstrate that an\nXGBoost model achieves a Spearman correlation of 0.89 in-domain and 0.58 in\nmore challenging scenarios. Furthermore, using AgentMonitor reduces harmful\ncontent by 6.2% and increases helpful content by 1.8% on average, enhancing\nsafety and reliability. Code is available at\n\\url{https://github.com/chanchimin/AgentMonitor}.\n","authors":["Chi-Min Chan","Jianxuan Yu","Weize Chen","Chunyang Jiang","Xinyu Liu","Weijie Shi","Zhiyuan Liu","Wei Xue","Yike Guo"],"pdf_url":"https://arxiv.org/pdf/2408.14972v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14968v1","updated":"2024-08-27T11:21:19Z","published":"2024-08-27T11:21:19Z","title":"MRSE: An Efficient Multi-modality Retrieval System for Large Scale\n E-commerce","summary":" Providing high-quality item recall for text queries is crucial in large-scale\ne-commerce search systems. Current Embedding-based Retrieval Systems (ERS)\nembed queries and items into a shared low-dimensional space, but uni-modality\nERS rely too heavily on textual features, making them unreliable in complex\ncontexts. While multi-modality ERS incorporate various data sources, they often\noverlook individual preferences for different modalities, leading to suboptimal\nresults. To address these issues, we propose MRSE, a Multi-modality Retrieval\nSystem that integrates text, item images, and user preferences through\nlightweight mixture-of-expert (LMoE) modules to better align features across\nand within modalities. MRSE also builds user profiles at a multi-modality level\nand introduces a novel hybrid loss function that enhances consistency and\nrobustness using hard negative sampling. Experiments on a large-scale dataset\nfrom Shopee and online A/B testing show that MRSE achieves an 18.9% improvement\nin offline relevance and a 3.7% gain in online core metrics compared to\nShopee's state-of-the-art uni-modality system.\n","authors":["Hao Jiang","Haoxiang Zhang","Qingshan Hou","Chaofeng Chen","Weisi Lin","Jingchang Zhang","Annan Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14968v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14960v1","updated":"2024-08-27T11:07:15Z","published":"2024-08-27T11:07:15Z","title":"Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual\n Progress","summary":" The use of synthetic data has played a critical role in recent state-of-art\nbreakthroughs. However, overly relying on a single oracle teacher model to\ngenerate data has been shown to lead to model collapse and invite propagation\nof biases. These limitations are particularly evident in multilingual settings,\nwhere the absence of a universally effective teacher model that excels across\nall languages presents significant challenges. In this work, we address these\nextreme difference by introducing \"multilingual arbitrage\", which capitalizes\non performance variations between multiple models for a given language. To do\nso, we strategically route samples through a diverse pool of models, each with\nunique strengths in different languages. 
Across exhaustive experiments on\nstate-of-art models, our work suggests that arbitrage techniques allow for\nspectacular gains in performance that far outperform relying on a single\nteacher. In particular, compared to the best single teacher, we observe gains\nof up to 56.5% improvement in win rates averaged across all languages when\nswitching to multilingual arbitrage. We observe the most significant gains for\nthe least resourced languages in our pool.\n","authors":["Ayomide Odumakinde","Daniel D'souza","Pat Verga","Beyza Ermis","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2408.14960v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.15504v2","updated":"2024-08-27T10:07:27Z","published":"2024-06-19T16:43:56Z","title":"Dr.E Bridges Graphs with Large Language Models through Words","summary":" Significant efforts have been dedicated to integrating the powerful Large\nLanguage Models (LLMs) with diverse modalities, particularly focusing on the\nfusion of language, vision and audio data. However, the graph-structured data,\nwhich is inherently rich in structural and domain-specific knowledge, has not\nyet been gracefully adapted to LLMs. Existing methods either describe the graph\nwith raw text, suffering the loss of graph structural information, or feed\nGraph Neural Network (GNN) embeddings into LLMs at the cost of losing\nexplainable prompt semantics. To bridge this gap, we introduce an end-to-end\nmodality-aligning framework for LLM-graph alignment: Dual-Residual Vector\nQuantized-Variational AutoEncoder, namely Dr.E. Our approach is purposefully\ndesigned to facilitate token-level alignment with LLMs, enabling an effective\ntranslation of the intrinsic `language' of graphs into comprehensible natural\nlanguage. We also manage to enhance LLMs' more robust structural understanding\nof graphs by incorporating multiple views of the central nodes based on their\nsurrounding nodes at various distances. Our experimental evaluations on\nstandard graph tasks demonstrate competitive performance against other\nstate-of-the-art (SOTA) approaches. Additionally, our framework ensures certain\nvisual interpretability, efficiency, and robustness, marking the promising\nsuccessful endeavor to achieve token-level alignment between LLMs and GNNs. Our\ncode is available at: https://anonymous.4open.science/r/dre-817.\n","authors":["Zipeng Liu","Likang Wu","Ming He","Zhong Guan","Hongke Zhao","Nan Feng"],"pdf_url":"https://arxiv.org/pdf/2406.15504v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14909v1","updated":"2024-08-27T09:35:49Z","published":"2024-08-27T09:35:49Z","title":"SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking\n State Space Models","summary":" Known as low energy consumption networks, spiking neural networks (SNNs) have\ngained a lot of attention within the past decades. While SNNs are increasing\ncompetitive with artificial neural networks (ANNs) for vision tasks, they are\nrarely used for long sequence tasks, despite their intrinsic temporal dynamics.\nIn this work, we develop spiking state space models (SpikingSSMs) for long\nsequence learning by leveraging on the sequence learning abilities of state\nspace models (SSMs). Inspired by dendritic neuron structure, we hierarchically\nintegrate neuronal dynamics with the original SSM block, meanwhile realizing\nsparse synaptic computation. 
Furthermore, to solve the conflict of event-driven\nneuronal dynamics with parallel computing, we propose a light-weight surrogate\ndynamic network which accurately predicts the after-reset membrane potential\nand compatible to learnable thresholds, enabling orders of acceleration in\ntraining speed compared with conventional iterative methods. On the long range\narena benchmark task, SpikingSSM achieves competitive performance to\nstate-of-the-art SSMs meanwhile realizing on average 90\\% of network sparsity.\nOn language modeling, our network significantly surpasses existing spiking\nlarge language models (spikingLLMs) on the WikiText-103 dataset with only a\nthird of the model size, demonstrating its potential as backbone architecture\nfor low computation cost LLMs.\n","authors":["Shuaijie Shen","Chao Wang","Renzhuo Huang","Yan Zhong","Qinghai Guo","Zhichao Lu","Jianguo Zhang","Luziwei Leng"],"pdf_url":"https://arxiv.org/pdf/2408.14909v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14908v1","updated":"2024-08-27T09:35:13Z","published":"2024-08-27T09:35:13Z","title":"Triplètoile: Extraction of Knowledge from Microblogging Text","summary":" Numerous methods and pipelines have recently emerged for the automatic\nextraction of knowledge graphs from documents such as scientific publications\nand patents. However, adapting these methods to incorporate alternative text\nsources like micro-blogging posts and news has proven challenging as they\nstruggle to model open-domain entities and relations, typically found in these\nsources. In this paper, we propose an enhanced information extraction pipeline\ntailored to the extraction of a knowledge graph comprising open-domain entities\nfrom micro-blogging posts on social media platforms. Our pipeline leverages\ndependency parsing and classifies entity relations in an unsupervised manner\nthrough hierarchical clustering over word embeddings. We provide a use case on\nextracting semantic triples from a corpus of 100 thousand tweets about digital\ntransformation and publicly release the generated knowledge graph. On the same\ndataset, we conduct two experimental evaluations, showing that the system\nproduces triples with precision over 95% and outperforms similar pipelines of\naround 5% in terms of precision, while generating a comparatively higher number\nof triples.\n","authors":["Vanni Zavarella","Sergio Consoli","Diego Reforgiato Recupero","Gianni Fenu","Simone Angioni","Davide Buscaldi","Danilo Dessì","Francesco Osborne"],"pdf_url":"https://arxiv.org/pdf/2408.14908v1.pdf","comment":"42 pages, 6 figures"},{"id":"http://arxiv.org/abs/2408.14906v1","updated":"2024-08-27T09:34:38Z","published":"2024-08-27T09:34:38Z","title":"Writing in the Margins: Better Inference Pattern for Long Context\n Retrieval","summary":" In this paper, we introduce Writing in the Margins (WiM), a new inference\npattern for Large Language Models designed to optimize the handling of long\ninput sequences in retrieval-oriented tasks. This approach leverages the\nchunked prefill of the key-value cache to perform segment-wise inference, which\nenables efficient processing of extensive contexts along with the generation\nand classification of intermediate information (\"margins\") that guide the model\ntowards specific tasks. This method increases computational overhead marginally\nwhile significantly enhancing the performance of off-the-shelf models without\nthe need for fine-tuning. 
Specifically, we observe that WiM provides an average\nenhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG)\nand more than a 30.0% increase in the F1-score for aggregation tasks (CWE).\nAdditionally, we show how the proposed pattern fits into an interactive\nretrieval design that provides end-users with ongoing updates about the\nprogress of context processing, and pinpoints the integration of relevant\ninformation into the final response. We release our implementation of WiM using\nHugging Face Transformers library at\nhttps://github.com/writer/writing-in-the-margins.\n","authors":["Melisa Russak","Umar Jamil","Christopher Bryant","Kiran Kamble","Axel Magnuson","Mateusz Russak","Waseem AlShikh"],"pdf_url":"https://arxiv.org/pdf/2408.14906v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14895v1","updated":"2024-08-27T09:18:57Z","published":"2024-08-27T09:18:57Z","title":"VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view\n Videos of Daily Activities","summary":" Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data\n(e.g., images and videos) into symbols, have attracted attention as resources\nenabling knowledge processing and machine learning across modalities. However,\nthe construction of MMKGs for videos consisting of multiple events, such as\ndaily activities, is still in the early stages. In this paper, we construct an\nMMKG based on synchronized multi-view simulated videos of daily activities.\nBesides representing the content of daily life videos as event-centric\nknowledge, our MMKG also includes frame-by-frame fine-grained changes, such as\nbounding boxes within video frames. In addition, we provide support tools for\nquerying our MMKG. As an application example, we demonstrate that our MMKG\nfacilitates benchmarking vision-language models by providing the necessary\nvision-language datasets for a tailored task.\n","authors":["Shusaku Egami","Takahiro Ugai","Ken Fukuda"],"pdf_url":"https://arxiv.org/pdf/2408.14895v1.pdf","comment":"5 pages,4 figures, accepted by CIKM2024 Resource Track"},{"id":"http://arxiv.org/abs/2408.14892v1","updated":"2024-08-27T09:07:37Z","published":"2024-08-27T09:07:37Z","title":"A Functional Trade-off between Prosodic and Semantic Cues in Conveying\n Sarcasm","summary":" This study investigates the acoustic features of sarcasm and disentangles the\ninterplay between the propensity of an utterance being used sarcastically and\nthe presence of prosodic cues signaling sarcasm. Using a dataset of sarcastic\nutterances compiled from television shows, we analyze the prosodic features\nwithin utterances and key phrases belonging to three distinct sarcasm\ncategories (embedded, propositional, and illocutionary), which vary in the\ndegree of semantic cues present, and compare them to neutral expressions.\nResults show that in phrases where the sarcastic meaning is salient from the\nsemantics, the prosodic cues are less relevant than when the sarcastic meaning\nis not evident from the semantics, suggesting a trade-off between prosodic and\nsemantic cues of sarcasm at the phrase level. 
These findings highlight a\nlessened reliance on prosodic modulation in semantically dense sarcastic\nexpressions and a nuanced interaction that shapes the communication of\nsarcastic intent.\n","authors":["Zhu Li","Xiyuan Gao","Yuqing Zhang","Shekhar Nayak","Matt Coler"],"pdf_url":"https://arxiv.org/pdf/2408.14892v1.pdf","comment":"accepted at Interspeech 2024"},{"id":"http://arxiv.org/abs/2408.14874v1","updated":"2024-08-27T08:43:32Z","published":"2024-08-27T08:43:32Z","title":"Inverse-Q*: Token Level Reinforcement Learning for Aligning Large\n Language Models Without Preference Data","summary":" Reinforcement Learning from Human Feedback (RLHF) has proven effective in\naligning large language models with human intentions, yet it often relies on\ncomplex methodologies like Proximal Policy Optimization (PPO) that require\nextensive hyper-parameter tuning and present challenges in sample efficiency\nand stability. In this paper, we introduce Inverse-Q*, an innovative framework\nthat transcends traditional RL methods by optimizing token-level reinforcement\nlearning without the need for additional reward or value models. Inverse-Q*\nleverages direct preference optimization techniques but extends them by\nestimating the conditionally optimal policy directly from the model's\nresponses, facilitating more granular and flexible policy shaping. Our approach\nreduces reliance on human annotation and external supervision, making it\nespecially suitable for low-resource settings. We present extensive\nexperimental results demonstrating that Inverse-Q* not only matches but\npotentially exceeds the effectiveness of PPO in terms of convergence speed and\nthe alignment of model responses with human preferences. Our findings suggest\nthat Inverse-Q* offers a practical and robust alternative to conventional RLHF\napproaches, paving the way for more efficient and adaptable model training\napproaches.\n","authors":["Han Xia","Songyang Gao","Qiming Ge","Zhiheng Xi","Qi Zhang","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2408.14874v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14866v1","updated":"2024-08-27T08:38:48Z","published":"2024-08-27T08:38:48Z","title":"Advancing Adversarial Suffix Transfer Learning on Aligned Large Language\n Models","summary":" Large Language Models (LLMs) face safety concerns due to potential misuse\nby malicious users. Recent red-teaming efforts have identified adversarial\nsuffixes capable of jailbreaking LLMs using the gradient-based search algorithm\nGreedy Coordinate Gradient (GCG). However, GCG struggles with computational\ninefficiency, limiting further investigations regarding suffix transferability\nand scalability across models and data. In this work, we bridge the connection\nbetween search efficiency and suffix transferability. We propose a two-stage\ntransfer learning framework, DeGCG, which decouples the search process into\nbehavior-agnostic pre-searching and behavior-relevant post-searching.\nSpecifically, we employ direct first target token optimization in pre-searching\nto facilitate the search process. We apply our approach to cross-model,\ncross-data, and self-transfer scenarios. Furthermore, we introduce an\ninterleaved variant of our approach, i-DeGCG, which iteratively leverages\nself-transferability to accelerate the search process. 
Experiments on HarmBench\ndemonstrate the efficiency of our approach across various models and domains.\nNotably, our i-DeGCG outperforms the baseline on Llama2-chat-7b with ASRs of\n$43.9$ ($+22.2$) and $39.0$ ($+19.5$) on valid and test sets, respectively.\nFurther analysis on cross-model transfer indicates the pivotal role of first\ntarget token optimization in leveraging suffix transferability for efficient\nsearching.\n","authors":["Hongfu Liu","Yuxi Xie","Ye Wang","Michael Shieh"],"pdf_url":"https://arxiv.org/pdf/2408.14866v1.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2402.03848v7","updated":"2024-08-27T08:33:29Z","published":"2024-02-06T09:50:08Z","title":"ANLS* -- A Universal Document Processing Metric for Generative Large\n Language Models","summary":" Traditionally, discriminative models have been the predominant choice for\ntasks like document classification and information extraction. These models\nmake predictions that fall into a limited number of predefined classes,\nfacilitating a binary true or false evaluation and enabling the direct\ncalculation of metrics such as the F1 score. However, recent advancements in\ngenerative large language models (GLLMs) have prompted a shift in the field due\nto their enhanced zero-shot capabilities, which eliminate the need for a\ndownstream dataset and computationally expensive fine-tuning. However,\nevaluating GLLMs presents a challenge as the binary true or false evaluation\nused for discriminative models is not applicable to the predictions made by\nGLLMs.\n This paper introduces a new metric for generative models called ANLS* for\nevaluating a wide variety of tasks, including information extraction and\nclassification tasks. The ANLS* metric extends existing ANLS metrics as a\ndrop-in-replacement and is still compatible with previously reported ANLS\nscores. An evaluation of 7 different datasets, and more than 10 different GLLMs\ntogether with 3 different prompting methods using the ANLS* metric is also\nprovided, demonstrating the importance of the proposed metric.\n We also benchmark a novel approach to generate prompts for documents, called\nSFT, against other prompting techniques such as LATIN. In almost all cases, SFT\noutperforms other techniques and improves the state-of-the-art, sometimes by as\nmuch as $10$ percentage points.\n Sources are available at https://github.com/deepopinion/anls_star_metric\n","authors":["David Peer","Philemon Schöpf","Volckmar Nebendahl","Alexander Rietzler","Sebastian Stabinger"],"pdf_url":"https://arxiv.org/pdf/2402.03848v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03870v2","updated":"2024-08-27T08:31:04Z","published":"2024-03-06T17:23:28Z","title":"Learning to Decode Collaboratively with Multiple Language Models","summary":" We propose a method to teach multiple large language models (LLM) to\ncollaborate by interleaving their generations at the token level. We model the\ndecision of which LLM generates the next token as a latent variable. By\noptimizing the marginal likelihood of a training set under our latent variable\nmodel, the base LLM automatically learns when to generate itself and when to\ncall on one of the ``assistant'' language models to generate, all without\ndirect supervision. Token-level collaboration during decoding allows for a\nfusion of each model's expertise in a manner tailored to the specific task at\nhand. 
Our collaborative decoding is especially useful in cross-domain settings\nwhere a generalist base LLM learns to invoke domain expert models. On\ninstruction-following, domain-specific QA, and reasoning tasks, we show that\nthe performance of the joint system exceeds that of the individual models.\nThrough qualitative analysis of the learned latent decisions, we show models\ntrained with our method exhibit several interesting collaboration patterns,\ne.g., template-filling. Our code is available at\nhttps://github.com/clinicalml/co-llm.\n","authors":["Shannon Zejiang Shen","Hunter Lang","Bailin Wang","Yoon Kim","David Sontag"],"pdf_url":"https://arxiv.org/pdf/2403.03870v2.pdf","comment":"16 pages, 4 figures, 11 tables"},{"id":"http://arxiv.org/abs/2408.14853v1","updated":"2024-08-27T08:12:08Z","published":"2024-08-27T08:12:08Z","title":"Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language\n Models","summary":" Large Language Models (LLMs) have become a focal point in the rapidly\nevolving field of artificial intelligence. However, a critical concern is the\npresence of toxic content within the pre-training corpus of these models, which\ncan lead to the generation of inappropriate outputs. Investigating methods for\ndetecting internal faults in LLMs can help us understand their limitations and\nimprove their security. Existing methods primarily focus on jailbreaking\nattacks, which involve manually or automatically constructing adversarial\ncontent to prompt the target LLM to generate unexpected responses. These\nmethods rely heavily on prompt engineering, which is time-consuming and usually\nrequires specially designed questions. To address these challenges, this paper\nproposes a target-driven attack paradigm that focuses on directly eliciting the\ntarget response instead of optimizing the prompts. We introduce the use of\nanother LLM as the detector for toxic content, referred to as ToxDet. Given a\ntarget toxic response, ToxDet can generate a possible question and a\npreliminary answer to provoke the target model into producing desired toxic\nresponses with meanings equivalent to the provided one. ToxDet is trained by\ninteracting with the target LLM and receiving reward signals from it, utilizing\nreinforcement learning for the optimization process. While the primary focus of\nthe target models is on open-source LLMs, the fine-tuned ToxDet can also be\ntransferred to attack black-box models such as GPT-4o, achieving notable\nresults. Experimental results on AdvBench and HH-Harmless datasets demonstrate\nthe effectiveness of our methods in detecting the tendencies of target LLMs to\ngenerate harmful responses. This algorithm not only exposes vulnerabilities but\nalso provides a valuable resource for researchers to strengthen their models\nagainst such attacks.\n","authors":["Yuhao Du","Zhuo Li","Pengyu Cheng","Xiang Wan","Anningzhe Gao"],"pdf_url":"https://arxiv.org/pdf/2408.14853v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14053v2","updated":"2024-08-27T08:05:07Z","published":"2024-08-26T07:19:07Z","title":"Enhancing Depression Diagnosis with Chain-of-Thought Prompting","summary":" When using AI to detect signs of depressive disorder, AI models habitually\ndraw preemptive conclusions. We theorize that using chain-of-thought (CoT)\nprompting to evaluate Patient Health Questionnaire-8 (PHQ-8) scores will\nimprove the accuracy of the scores determined by AI models. 
In our findings,\nwhen the models reasoned with CoT, the estimated PHQ-8 scores were consistently\ncloser on average to the accepted true scores reported by each participant\ncompared to when not using CoT. Our goal is to expand upon AI models'\nunderstanding of the intricacies of human conversation, allowing them to more\neffectively assess a patient's feelings and tone, therefore being able to more\naccurately discern mental disorder symptoms; ultimately, we hope to augment AI\nmodels' abilities, so that they can be widely accessible and used in the\nmedical field.\n","authors":["Elysia Shi","Adithri Manda","London Chowdhury","Runeema Arun","Kevin Zhu","Michael Lam"],"pdf_url":"https://arxiv.org/pdf/2408.14053v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14849v1","updated":"2024-08-27T08:01:13Z","published":"2024-08-27T08:01:13Z","title":"Project SHADOW: Symbolic Higher-order Associative Deductive reasoning On\n Wikidata using LM probing","summary":" We introduce SHADOW, a fine-tuned language model trained on an intermediate\ntask using associative deductive reasoning, and measure its performance on a\nknowledge base construction task using Wikidata triple completion. We evaluate\nSHADOW on the LM-KBC 2024 challenge and show that it outperforms the baseline\nsolution by 20% with a F1 score of 68.72%.\n","authors":["Hanna Abi Akl"],"pdf_url":"https://arxiv.org/pdf/2408.14849v1.pdf","comment":"6 pages, 1 figure"},{"id":"http://arxiv.org/abs/2407.03600v2","updated":"2024-08-27T08:00:03Z","published":"2024-07-04T03:20:31Z","title":"Chain-of-Thought Augmentation with Logit Contrast for Enhanced Reasoning\n in Language Models","summary":" Rapidly increasing model scales coupled with steering methods such as\nchain-of-thought prompting have led to drastic improvements in language model\nreasoning. At the same time, models struggle with compositional generalization\nand are far from human performance on many reasoning-based benchmarks.\nLeveraging the success of chain-of-thought prompting, and also taking\ninspiration from context-aware decoding (CAD), we explore input-based\ncontrasting methods to further encourage the type of reasoning induced by\nchain-of-thought prompting. While work remains to stabilize these results\nacross datasets and models, the improvements we find warrant further\ninvestigation into input-based steering methods for context-aware reasoning.\n","authors":["Jay Shim","Grant Kruttschnitt","Alyssa Ma","Daniel Kim","Benjamin Chek","Athul Anand","Kevin Zhu","Sean O'Brien"],"pdf_url":"https://arxiv.org/pdf/2407.03600v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14845v1","updated":"2024-08-27T07:56:35Z","published":"2024-08-27T07:56:35Z","title":"AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark","summary":" Detecting biases in natural language understanding (NLU) for African American\nVernacular English (AAVE) is crucial to developing inclusive natural language\nprocessing (NLP) systems. To address dialect-induced performance discrepancies,\nwe introduce AAVENUE ({AAVE} {N}atural Language {U}nderstanding {E}valuation),\na benchmark for evaluating large language model (LLM) performance on NLU tasks\nin AAVE and Standard American English (SAE). 
AAVENUE builds upon and extends\nexisting benchmarks like VALUE, replacing deterministic syntactic and\nmorphological transformations with a more flexible methodology leveraging\nLLM-based translation with few-shot prompting, improving performance across our\nevaluation metrics when translating key tasks from the GLUE and SuperGLUE\nbenchmarks. We compare AAVENUE and VALUE translations using five popular LLMs\nand a comprehensive set of metrics including fluency, BARTScore, quality,\ncoherence, and understandability. Additionally, we recruit fluent AAVE speakers\nto validate our translations for authenticity. Our evaluations reveal that LLMs\nconsistently perform better on SAE tasks than AAVE-translated versions,\nunderscoring inherent biases and highlighting the need for more inclusive NLP\nmodels. We have open-sourced our source code on GitHub and created a website to\nshowcase our work at https://aavenue.live.\n","authors":["Abhay Gupta","Philip Meng","Ece Yurtseven","Sean O'Brien","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.14845v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14840v1","updated":"2024-08-27T07:51:26Z","published":"2024-08-27T07:51:26Z","title":"CL4KGE: A Curriculum Learning Method for Knowledge Graph Embedding","summary":" Knowledge graph embedding (KGE) constitutes a foundational task, directed\ntowards learning representations for entities and relations within knowledge\ngraphs (KGs), with the objective of crafting representations comprehensive\nenough to approximate the logical and symbolic interconnections among entities.\nIn this paper, we define a metric Z-counts to measure the difficulty of\ntraining each triple ($<$head entity, relation, tail entity$>$) in KGs with\ntheoretical analysis. Based on this metric, we propose \\textbf{CL4KGE}, an\nefficient \\textbf{C}urriculum \\textbf{L}earning based training strategy for\n\\textbf{KGE}. This method includes a difficulty measurer and a training\nscheduler that aids in the training of KGE models. Our approach possesses the\nflexibility to act as a plugin within a wide range of KGE models, with the\nadded advantage of adaptability to the majority of KGs in existence. The\nproposed method has been evaluated on popular KGE models, and the results\ndemonstrate that it enhances the state-of-the-art methods. The use of Z-counts\nas a metric has enabled the identification of challenging triples in KGs, which\nhelps in devising effective training strategies.\n","authors":["Yang Liu","Chuan Zhou","Peng Zhang","Yanan Cao","Yongchao Liu","Zhao Li","Hongyang Chen"],"pdf_url":"https://arxiv.org/pdf/2408.14840v1.pdf","comment":"16 pages, 3 figures"},{"id":"http://arxiv.org/abs/2408.14830v1","updated":"2024-08-27T07:27:16Z","published":"2024-08-27T07:27:16Z","title":"PolicyLR: A Logic Representation For Privacy Policies","summary":" Privacy policies are crucial in the online ecosystem, defining how services\nhandle user data and adhere to regulations such as GDPR and CCPA. However,\ntheir complexity and frequent updates often make them difficult for\nstakeholders to understand and analyze. Current automated analysis methods,\nwhich utilize natural language processing, have limitations. They typically\nfocus on individual tasks and fail to capture the full context of the policies.\nWe propose PolicyLR, a new paradigm that offers a comprehensive\nmachine-readable representation of privacy policies, serving as an all-in-one\nsolution for multiple downstream tasks. 
PolicyLR converts privacy policies into\na machine-readable format using valuations of atomic formulae, allowing for\nformal definitions of tasks like compliance and consistency. We have developed\na compiler that transforms unstructured policy text into this format using\noff-the-shelf Large Language Models (LLMs). This compiler breaks down the\ntransformation task into a two-stage translation and entailment procedure. This\nprocedure considers the full context of the privacy policy to infer a complex\nformula, where each formula consists of simpler atomic formulae. The advantage\nof this model is that PolicyLR is interpretable by design and grounded in\nsegments of the privacy policy. We evaluated the compiler using ToS;DR, a\ncommunity-annotated privacy policy entailment dataset. Utilizing open-source\nLLMs, our compiler achieves precision and recall values of 0.91 and 0.88,\nrespectively. Finally, we demonstrate the utility of PolicyLR in three privacy\ntasks: Policy Compliance, Inconsistency Detection, and Privacy Comparison\nShopping.\n","authors":["Ashish Hooda","Rishabh Khandelwal","Prasad Chalasani","Kassem Fawaz","Somesh Jha"],"pdf_url":"https://arxiv.org/pdf/2408.14830v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.16349v3","updated":"2024-08-27T07:22:50Z","published":"2023-08-30T22:50:32Z","title":"Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning\n Based on Visually Grounded Conversations","summary":" We introduce Affective Visual Dialog, an emotion explanation and reasoning\ntask as a testbed for research on understanding the formation of emotions in\nvisually grounded conversations. The task involves three skills: (1)\nDialog-based Question Answering (2) Dialog-based Emotion Prediction and (3)\nAffective emotion explanation generation based on the dialog. Our key\ncontribution is the collection of a large-scale dataset, dubbed AffectVisDial,\nconsisting of 50K 10-turn visually grounded dialogs as well as concluding\nemotion attributions and dialog-informed textual emotion explanations,\nresulting in a total of 27,180 working hours. We explain our design decisions\nin collecting the dataset and introduce the questioner and answerer tasks that\nare associated with the participants in the conversation. We train and\ndemonstrate solid Affective Visual Dialog baselines adapted from\nstate-of-the-art models. Remarkably, the responses generated by our models show\npromising emotional reasoning abilities in response to visually grounded\nconversations. Our project page is available at\nhttps://affective-visual-dialog.github.io.\n","authors":["Kilichbek Haydarov","Xiaoqian Shen","Avinash Madasu","Mahmoud Salem","Li-Jia Li","Gamaleldin Elsayed","Mohamed Elhoseiny"],"pdf_url":"https://arxiv.org/pdf/2308.16349v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14825v1","updated":"2024-08-27T07:11:45Z","published":"2024-08-27T07:11:45Z","title":"From Rule-Based Models to Deep Learning Transformers Architectures for\n Natural Language Processing and Sign Language Translation Systems: Survey,\n Taxonomy and Performance Evaluation","summary":" With the growing Deaf and Hard of Hearing population worldwide and the\npersistent shortage of certified sign language interpreters, there is a\npressing need for an efficient, signs-driven, integrated end-to-end translation\nsystem, from sign to gloss to text and vice-versa. There has been a wealth of\nresearch on machine translations and related reviews. 
However, there are few\nworks on sign language machine translation considering the particularity of the\nlanguage being continuous and dynamic. This paper aims to address this void,\nproviding a retrospective analysis of the temporal evolution of sign language\nmachine translation algorithms and a taxonomy of the Transformers\narchitectures, the most used approach in language translation. We also present\nthe requirements of a real-time Quality-of-Service sign language machine\ntranslation system underpinned by accurate deep learning algorithms. We propose\nfuture research directions for sign language translation systems.\n","authors":["Nada Shahin","Leila Ismail"],"pdf_url":"https://arxiv.org/pdf/2408.14825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.05885v2","updated":"2024-08-27T06:53:16Z","published":"2024-06-09T18:45:41Z","title":"Are Large Language Models Actually Good at Text Style Transfer?","summary":" We analyze the performance of large language models (LLMs) on Text Style\nTransfer (TST), specifically focusing on sentiment transfer and text\ndetoxification across three languages: English, Hindi, and Bengali. Text Style\nTransfer involves modifying the linguistic style of a text while preserving its\ncore content. We evaluate the capabilities of pre-trained LLMs using zero-shot\nand few-shot prompting as well as parameter-efficient finetuning on publicly\navailable datasets. Our evaluation using automatic metrics, GPT-4 and human\nevaluations reveals that while some prompted LLMs perform well in English,\ntheir performance on other languages (Hindi, Bengali) remains average.\nHowever, finetuning significantly improves results compared to zero-shot and\nfew-shot prompting, making them comparable to previous state-of-the-art. This\nunderscores the necessity of dedicated datasets and specialized models for\neffective TST.\n","authors":["Sourabrata Mukherjee","Atul Kr. Ojha","Ondřej Dušek"],"pdf_url":"https://arxiv.org/pdf/2406.05885v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20805v3","updated":"2024-08-27T06:51:00Z","published":"2024-05-31T14:05:27Z","title":"Multilingual Text Style Transfer: Datasets & Models for Indian Languages","summary":" Text style transfer (TST) involves altering the linguistic style of a text\nwhile preserving its core content. This paper focuses on sentiment transfer, a\npopular TST subtask, across a spectrum of Indian languages: Hindi, Magahi,\nMalayalam, Marathi, Punjabi, Odia, Telugu, and Urdu, expanding upon previous\nwork on English-Bangla sentiment transfer (Mukherjee et al., 2023). We\nintroduce dedicated datasets of 1,000 positive and 1,000 negative\nstyle-parallel sentences for each of these eight languages. We then evaluate\nthe performance of various benchmark models categorized into parallel,\nnon-parallel, cross-lingual, and shared learning approaches, including the\nLlama2 and GPT-3.5 large language models (LLMs). Our experiments highlight the\nsignificance of parallel data in TST and demonstrate the effectiveness of the\nMasked Style Filling (MSF) approach (Mukherjee et al., 2023) in non-parallel\ntechniques. Moreover, cross-lingual and joint multilingual learning methods\nshow promise, offering insights into selecting optimal models tailored to the\nspecific language and task requirements. To the best of our knowledge, this\nwork represents the first comprehensive exploration of the TST task as\nsentiment transfer across a diverse set of languages.\n","authors":["Sourabrata Mukherjee","Atul Kr. 
Ojha","Akanksha Bansal","Deepak Alok","John P. McCrae","Ondřej Dušek"],"pdf_url":"https://arxiv.org/pdf/2405.20805v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14809v1","updated":"2024-08-27T06:44:28Z","published":"2024-08-27T06:44:28Z","title":"GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer\n Based Fusion Network for Multimodal Sentiment Analysis","summary":" Multimodal Sentiment Analysis (MSA) leverages multiple modals to analyze\nsentiments. Typically, advanced fusion methods and representation\nlearning-based methods are designed to tackle it. Our proposed GSIFN solves two\nkey problems to be solved in MSA: (i) In multimodal fusion, the decoupling of\nmodal combinations and tremendous parameter redundancy in existing fusion\nmethods, which lead to poor fusion performance and efficiency. (ii) The\ntrade-off between representation capability and computation overhead of the\nunimodal feature extractors and enhancers. GSIFN incorporates two main\ncomponents to solve these problems: (i) Graph-Structured and Interlaced-Masked\nMultimodal Transformer. It adopts the Interlaced Mask mechanism to construct\nrobust multimodal graph embedding, achieve all-modal-in-one Transformer-based\nfusion, and greatly reduce the computation overhead. (ii) A self-supervised\nlearning framework with low computation overhead and high performance, which\nutilizes a parallelized LSTM with matrix memory to enhance non-verbal modal\nfeature for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI,\nCMU-MOSEI, and CH-SIMS, GSIFN demonstrates superior performance with\nsignificantly lower computation overhead compared with state-of-the-art\nmethods.\n","authors":["Yijie Jin"],"pdf_url":"https://arxiv.org/pdf/2408.14809v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08841v2","updated":"2024-08-27T06:23:45Z","published":"2024-08-16T17:00:11Z","title":"FLEXTAF: Enhancing Table Reasoning with Flexible Tabular Formats","summary":" The table reasoning task aims to answer the question according to the given\ntable. Currently, using Large Language Models (LLMs) is the predominant method\nfor table reasoning. Most existing methods employ a fixed tabular format to\nrepresent the table, which could limit the performance. Given that each\ninstance requires different capabilities and models possess varying abilities,\nwe assert that different instances and models suit different tabular formats.\nWe prove the aforementioned claim through quantitative analysis of experimental\nresults, where different instances and models achieve different performances\nusing various tabular formats. Building on this discussion, we propose\nFLEXTAF-Single and FLEXTAF-Vote to enhance table reasoning performance by\nemploying flexible tabular formats. Specifically, (i) FLEXTAF-Single trains a\nclassifier to predict the most suitable tabular format based on the instance\nand the LLM. 
(ii) FLEXTAF-Vote integrates the results across different formats.\nOur experiments on WikiTableQuestions and TabFact reveal significant\nimprovements, with average gains of 2.3% and 4.8% compared to the best\nperformance achieved using a fixed tabular format with greedy decoding and\nself-consistency decoding, thereby validating the effectiveness of our methods.\n","authors":["Xuanliang Zhang","Dingzirui Wang","Longxu Dou","Baoxin Wang","Dayong Wu","Qingfu Zhu","Wanxiang Che"],"pdf_url":"https://arxiv.org/pdf/2408.08841v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14043v2","updated":"2024-08-27T06:18:05Z","published":"2024-06-20T07:06:58Z","title":"Taxonomy-Guided Zero-Shot Recommendations with LLMs","summary":" With the emergence of large language models (LLMs) and their ability to\nperform a variety of tasks, their application in recommender systems (RecSys)\nhas shown promise. However, we are facing significant challenges when deploying\nLLMs into RecSys, such as limited prompt length, unstructured item information,\nand un-constrained generation of recommendations, leading to sub-optimal\nperformance. To address these issues, we propose a novel method using a\ntaxonomy dictionary. This method provides a systematic framework for\ncategorizing and organizing items, improving the clarity and structure of item\ninformation. By incorporating the taxonomy dictionary into LLM prompts, we\nachieve efficient token utilization and controlled feature generation, leading\nto more accurate and contextually relevant recommendations. Our Taxonomy-guided\nRecommendation (TaxRec) approach features a two-step process: one-time taxonomy\ncategorization and LLM-based recommendation, enabling zero-shot recommendations\nwithout the need for domain-specific fine-tuning. Experimental results\ndemonstrate TaxRec significantly enhances recommendation quality compared to\ntraditional zero-shot approaches, showcasing its efficacy as personal\nrecommender with LLMs. Code is available at\nhttps://github.com/yueqingliang1/TaxRec.\n","authors":["Yueqing Liang","Liangwei Yang","Chen Wang","Xiongxiao Xu","Philip S. Yu","Kai Shu"],"pdf_url":"https://arxiv.org/pdf/2406.14043v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08779v2","updated":"2024-08-27T06:14:54Z","published":"2024-08-16T14:43:15Z","title":"DAC: Decomposed Automation Correction for Text-to-SQL","summary":" Text-to-SQL is an important task that helps people obtain information from\ndatabases by automatically generating SQL queries. Considering the brilliant\nperformance, approaches based on Large Language Models (LLMs) become the\nmainstream for text-to-SQL. Among these approaches, automated correction is an\neffective approach that further enhances performance by correcting the mistakes\nin the generated results. The existing correction methods require LLMs to\ndirectly correct with generated SQL, while previous research shows that LLMs do\nnot know how to detect mistakes, leading to poor performance. Therefore, in\nthis paper, we propose to employ the decomposed correction to enhance\ntext-to-SQL performance. We first demonstrate that decomposed correction\noutperforms direct correction since detecting and fixing mistakes with the\nresults of the decomposed sub-tasks is easier than with SQL. Based on this\nanalysis, we introduce Decomposed Automation Correction (DAC), which corrects\nSQL by decomposing text-to-SQL into entity linking and skeleton parsing. 
DAC\nfirst generates the entity and skeleton corresponding to the question and then\ncompares the differences between the initial SQL and the generated entities and\nskeleton as feedback for correction. Experimental results show that our method\nimproves performance by $3.7\\%$ on average of Spider, Bird, and KaggleDBQA\ncompared with the baseline method, demonstrating the effectiveness of DAC.\n","authors":["Dingzirui Wang","Longxu Dou","Xuanliang Zhang","Qingfu Zhu","Wanxiang Che"],"pdf_url":"https://arxiv.org/pdf/2408.08779v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08072v2","updated":"2024-08-27T04:50:12Z","published":"2024-08-15T10:44:38Z","title":"I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative\n Self-Enhancement Paradigm","summary":" Large Language Models (LLMs) have achieved significant advancements, however,\nthe common learning paradigm treats LLMs as passive information repositories,\nneglecting their potential for active learning and alignment. Some approaches\ntrain LLMs using their own generated synthetic data, exploring the possibility\nof active alignment. However, there is still a huge gap between these one-time\nalignment methods and the continuous automatic alignment of humans. In this\npaper, we introduce \\textbf{I-SHEEP}, an \\textbf{I}terative\n\\textbf{S}elf-En\\textbf{H}anc\\textbf{E}m\\textbf{E}nt \\textbf{P}aradigm.This\nhuman-like paradigm enables LLMs to \\textbf{continuously self-align from\nscratch with nothing}. Compared to the one-time alignment method Dromedary\n\\cite{sun2023principledriven}, which refers to the first iteration in this\npaper, I-SHEEP can significantly enhance capacities on both Qwen and Llama\nmodels. I-SHEEP achieves a maximum relative improvement of 78.2\\% in the Alpaca\nEval, 24.0\\% in the MT Bench, and an absolute increase of 8.88\\% in the IFEval\naccuracy over subsequent iterations in Qwen-1.5 72B model. Additionally,\nI-SHEEP surpasses the base model in various standard benchmark generation\ntasks, achieving an average improvement of 24.77\\% in code generation tasks,\n12.04\\% in TrivialQA, and 20.29\\% in SQuAD. We also provide new insights based\non the experiment results. Our codes, datasets, and models are available at\n\\textbf{https://anonymous.4open.science/r/I-SHEEP}.\n","authors":["Yiming Liang","Ge Zhang","Xingwei Qu","Tianyu Zheng","Jiawei Guo","Xinrun Du","Zhenzhu Yang","Jiaheng Liu","Chenghua Lin","Lei Ma","Wenhao Huang","Jiajun Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.08072v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.06537v2","updated":"2024-08-27T04:43:59Z","published":"2024-07-09T04:17:39Z","title":"Efficient and Accurate Memorable Conversation Model using DPO based on\n sLLM","summary":" In multi-session dialog system, it is essential to continuously update the\nmemory as the session progresses. Simply accumulating memory can make it\ndifficult to focus on the content of the conversation for inference due to the\nlimited input sentence size. Therefore, efficient and accurate conversation\nmodel that is capable of managing memory to reflect the conversation history\ncontinuously is necessary. This paper presents a conversation model that\nefficiently manages memory as sessions progress and incorporates this into the\nmodel to reflect the conversation history accurately with 3 methodologies: SFT,\nDPO and DPO with SFT model. 
Our model using DPO algorithm shows an improvement\nabout 0.0591 of BERTScore in memory accuracy, and the rate of responses\nreflecting the memory increased as well. Also, response generation performance\nenhanced about 4.292 in fluency, 3.935 in coherence, and 2.896 in consistency.\nThis paper describes a training method that yields better performance than\nmodels with more than twice the parameter size, even when the model size is\nsmaller. Thus, our model demonstrates efficiency not only in terms of accuracy\nbut also in resource utilization.\n","authors":["Youngkyung Seo","Yoonseok Heo","Jun-Seok Koh","Du-Seong Chang"],"pdf_url":"https://arxiv.org/pdf/2407.06537v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08374v2","updated":"2024-08-27T04:35:20Z","published":"2023-06-14T09:04:29Z","title":"SpeechGLUE: How Well Can Self-Supervised Speech Models Capture\n Linguistic Knowledge?","summary":" Self-supervised learning (SSL) for speech representation has been\nsuccessfully applied in various downstream tasks, such as speech and speaker\nrecognition. More recently, speech SSL models have also been shown to be\nbeneficial in advancing spoken language understanding tasks, implying that the\nSSL models have the potential to learn not only acoustic but also linguistic\ninformation. In this paper, we aim to clarify if speech SSL techniques can well\ncapture linguistic knowledge. For this purpose, we introduce SpeechGLUE, a\nspeech version of the General Language Understanding Evaluation (GLUE)\nbenchmark. Since GLUE comprises a variety of natural language understanding\ntasks, SpeechGLUE can elucidate the degree of linguistic ability of speech SSL\nmodels. Experiments demonstrate that speech SSL models, although inferior to\ntext-based SSL models, perform better than baselines, suggesting that they can\nacquire a certain amount of general linguistic knowledge from just unlabeled\nspeech data.\n","authors":["Takanori Ashihara","Takafumi Moriya","Kohei Matsuura","Tomohiro Tanaka","Yusuke Ijima","Taichi Asami","Marc Delcroix","Yukinori Honma"],"pdf_url":"https://arxiv.org/pdf/2306.08374v2.pdf","comment":"Accepted at INTERSPEECH 2023. This paper has been extended in a\n subsequent journal paper, see\n https://ieeexplore.ieee.org/abstract/document/10597571"},{"id":"http://arxiv.org/abs/2408.14774v1","updated":"2024-08-27T04:31:58Z","published":"2024-08-27T04:31:58Z","title":"Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning","summary":" We introduce Instruct-SkillMix, an automated approach for creating diverse,\nhigh quality SFT data. The Instruct-SkillMix pipeline involves two stages, each\nleveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to\nextract core \"skills\" for instruction-following, either from existing datasets,\nor by directly prompting the model; (2) Data generation: uses the powerful LLM\nto generate (instruction, response) data that exhibit a randomly chosen pair of\nthese skills. Here, the use of random skill combinations promotes diversity and\ndifficulty.\n Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from\nInstruct-SkillMix leads to strong gains on instruction following benchmarks\nsuch as AlpacaEval 2.0, MT-Bench, and WildBench. 
With just $4$K examples,\nLLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0.\nTo our knowledge, this achieves state-of-the-art performance among all models\nthat have only undergone SFT (no RL methods) and competes with proprietary\nmodels such as Claude 3 Opus and LLaMA-3.1-405B-Instruct.\n Ablation studies also suggest plausible reasons for why creating open\ninstruction-tuning datasets via naive crowd-sourcing has proved difficult.\nIntroducing low quality answers (\"shirkers\") in $20\\%$ of Instruct-SkillMix\nexamples causes performance to plummet, sometimes catastrophically.\n The Instruct-SkillMix pipeline is flexible and is adaptable to other\nsettings.\n","authors":["Simran Kaur","Simon Park","Anirudh Goyal","Sanjeev Arora"],"pdf_url":"https://arxiv.org/pdf/2408.14774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.14856v2","updated":"2024-08-27T04:30:29Z","published":"2023-07-27T13:37:06Z","title":"Exploiting the Potential of Seq2Seq Models as Robust Few-Shot Learners","summary":" In-context learning, which offers substantial advantages over fine-tuning, is\npredominantly observed in decoder-only models, while encoder-decoder (i.e.,\nseq2seq) models excel in methods that rely on weight updates. Recently, a few\nstudies have demonstrated the feasibility of few-shot learning with seq2seq\nmodels; however, this has been limited to tasks that align well with the\nseq2seq architecture, such as summarization and translation. Inspired by these\ninitial studies, we provide a first-ever extensive experiment comparing the\nin-context few-shot learning capabilities of decoder-only and encoder-decoder\nmodels on a broad range of tasks. Furthermore, we propose two methods to more\neffectively elicit in-context learning ability in seq2seq models:\nobjective-aligned prompting and a fusion-based approach. Remarkably, our\napproach outperforms a decoder-only model that is six times larger and exhibits\nsignificant performance improvements compared to conventional seq2seq models\nacross a variety of settings. We posit that, with the right configuration and\nprompt design, seq2seq models can be highly effective few-shot learners for a\nwide spectrum of applications.\n","authors":["Jihyeon Lee","Dain Kim","Doohae Jung","Boseop Kim","Kyoung-Woon On"],"pdf_url":"https://arxiv.org/pdf/2307.14856v2.pdf","comment":"Accepted to COLM'2024"},{"id":"http://arxiv.org/abs/2408.14772v1","updated":"2024-08-27T04:20:10Z","published":"2024-08-27T04:20:10Z","title":"A global AI community requires language-diverse publishing","summary":" In this provocation, we discuss the English dominance of the AI research\ncommunity, arguing that the requirement for English language publishing upholds\nand reinforces broader regimes of extraction in AI. While large language models\nand machine translation have been celebrated as a way to break down barriers,\nwe regard their use as a symptom of linguistic exclusion of scientists and\npotential readers. We propose alternative futures for a healthier publishing\nculture, organized around three themes: administering conferences in the\nlanguages of the country in which they are held, instructing peer reviewers not\nto adjudicate the language appropriateness of papers, and offering\nopportunities to publish and present in multiple languages. We welcome new\ntranslations of this piece. 
Please contact the authors if you would like to\ncontribute one.\n","authors":["Haley Lepp","Parth Sarin"],"pdf_url":"https://arxiv.org/pdf/2408.14772v1.pdf","comment":"Translations by Michael Hardy (Guarani), Vandana Sarin and Vivek\n Sarin (Hindi), Roshna Omer Abdulrahman (Soran\\^i Kurdish), Gabriel Poesia\n (Portuguese), and Mat\\'ias Grinberg (Spanish). In the proceedings of the\n Global AI Cultures Workshop at the Twelfth International Conference on\n Learning Representations (ICLR) 2024, Vienna, Austria, May 7-11, 2024"},{"id":"http://arxiv.org/abs/2408.14470v2","updated":"2024-08-27T03:56:11Z","published":"2024-08-26T17:58:53Z","title":"Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large\n Language Models","summary":" Fine-tuning large language models (LLMs) on downstream tasks requires\nsubstantial computational resources. A class of parameter-efficient fine-tuning\n(PEFT) aims to mitigate these computational challenges by selectively\nfine-tuning only a small fraction of the model parameters. Although\ncomputationally efficient, these techniques often fail to match the performance\nof fully fine-tuned models, primarily due to inherent biases introduced during\nparameter selection. Traditional selective PEFT techniques use a fixed set of\nparameters based on a predefined budget (a process also known as unmasking),\nfailing to capture parameter importance dynamically and often ending up\nexceeding the budget. We introduce $\\text{ID}^3$, a novel selective PEFT method\nthat calculates parameter importance continually and dynamically unmasks\nparameters by balancing exploration and exploitation in parameter selection.\nOur empirical study on 15 tasks spanning natural language understanding and\ngenerative tasks demonstrates the effectiveness of our method compared to\nfixed-masking-based PEFT techniques. We analytically show that $\\text{ID}^3$\nreduces the number of gradient updates by a factor of two, enhancing\ncomputational efficiency. $\\text{ID}^3$ is robust to random initialization of\nneurons and, therefore, can be seamlessly integrated into existing additive and\nreparametrization-based PEFT modules such as adapters and LoRA for dynamic\nsparsification.\n","authors":["Aradhye Agarwal","Suhas K Ramesh","Ayan Sengupta","Tanmoy Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2408.14470v2.pdf","comment":"15 pages, 7 tables, 9 figures"},{"id":"http://arxiv.org/abs/2402.10260v2","updated":"2024-08-27T03:32:47Z","published":"2024-02-15T18:58:09Z","title":"A StrongREJECT for Empty Jailbreaks","summary":" Most jailbreak papers claim the jailbreaks they propose are highly effective,\noften boasting near-100% attack success rates. However, it is perhaps more\ncommon than not for jailbreak developers to substantially exaggerate the\neffectiveness of their jailbreaks. We suggest this problem arises because\njailbreak researchers lack a standard, high-quality benchmark for evaluating\njailbreak performance, leaving researchers to create their own. To create a\nbenchmark, researchers must choose a dataset of forbidden prompts to which a\nvictim model will respond, along with an evaluation method that scores the\nharmfulness of the victim model's responses. We show that existing benchmarks\nsuffer from significant shortcomings and introduce the StrongREJECT benchmark\nto address these issues. 
StrongREJECT's dataset contains prompts that victim\nmodels must answer with specific, harmful information, while its automated\nevaluator measures the extent to which a response gives useful information to\nforbidden prompts. In doing so, the StrongREJECT evaluator achieves\nstate-of-the-art agreement with human judgments of jailbreak effectiveness.\nNotably, we find that existing evaluation methods significantly overstate\njailbreak effectiveness compared to human judgments and the StrongREJECT\nevaluator. We describe a surprising and novel phenomenon that explains this\ndiscrepancy: jailbreaks bypassing a victim model's safety fine-tuning tend to\nreduce its capabilities. Together, our findings underscore the need for\nresearchers to use a high-quality benchmark, such as StrongREJECT, when\ndeveloping new jailbreak attacks. We release the StrongREJECT code and data at\nhttps://strong-reject.readthedocs.io/en/latest/.\n","authors":["Alexandra Souly","Qingyuan Lu","Dillon Bowen","Tu Trinh","Elvis Hsieh","Sana Pandey","Pieter Abbeel","Justin Svegliato","Scott Emmons","Olivia Watkins","Sam Toyer"],"pdf_url":"https://arxiv.org/pdf/2402.10260v2.pdf","comment":"Code and data at https://strong-reject.readthedocs.io/en/latest/"},{"id":"http://arxiv.org/abs/2408.13184v2","updated":"2024-08-27T03:27:08Z","published":"2024-08-23T16:02:54Z","title":"Can LLM be a Good Path Planner based on Prompt Engineering? Mitigating\n the Hallucination for Path Planning","summary":" Spatial reasoning in Large Language Models (LLMs) is the foundation for\nembodied intelligence. However, even in simple maze environments, LLMs still\nencounter challenges in long-term path-planning, primarily influenced by their\nspatial hallucination and context inconsistency hallucination by long-term\nreasoning. To address this challenge, this study proposes an innovative model,\nSpatial-to-Relational Transformation and Curriculum Q-Learning (S2RCQL). To\naddress the spatial hallucination of LLMs, we propose the Spatial-to-Relational\napproach, which transforms spatial prompts into entity relations and paths\nrepresenting entity relation chains. This approach fully taps the potential of\nLLMs in terms of sequential thinking. As a result, we design a path-planning\nalgorithm based on Q-learning to mitigate the context inconsistency\nhallucination, which enhances the reasoning ability of LLMs. Using the Q-value\nof state-action as auxiliary information for prompts, we correct the\nhallucinations of LLMs, thereby guiding LLMs to learn the optimal path.\nFinally, we propose a reverse curriculum learning technique based on LLMs to\nfurther mitigate the context inconsistency hallucination. LLMs can rapidly\naccumulate successful experiences by reducing task difficulty and leveraging\nthem to tackle more complex tasks. We performed comprehensive experiments based\non Baidu's self-developed LLM: ERNIE-Bot 4.0. 
The results showed that our\nS2RCQL achieved a 23%--40% improvement in both success and optimality rates\ncompared with advanced prompt engineering.\n","authors":["Hourui Deng","Hongjie Zhang","Jie Ou","Chaosheng Feng"],"pdf_url":"https://arxiv.org/pdf/2408.13184v2.pdf","comment":"Submitted to ICASSP"},{"id":"http://arxiv.org/abs/2408.01262v3","updated":"2024-08-27T03:13:50Z","published":"2024-08-02T13:35:11Z","title":"RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework","summary":" Retrieval-Augmented Generation (RAG) systems have demonstrated their\nadvantages in alleviating the hallucination of Large Language Models (LLMs).\nExisting RAG benchmarks mainly focus on evaluating whether LLMs can correctly\nanswer the general knowledge. However, they are unable to evaluate the\neffectiveness of the RAG system in dealing with the data from different\nvertical domains. This paper introduces RAGEval, a framework for automatically\ngenerating evaluation datasets to evaluate the knowledge usage ability of\ndifferent LLMs in different scenarios. Specifically, RAGEval summarizes a\nschema from seed documents, applies the configurations to generate diverse\ndocuments, and constructs question-answering pairs according to both articles\nand configurations. We propose three novel metrics, Completeness,\nHallucination, and Irrelevance, to carefully evaluate the responses generated\nby LLMs. By benchmarking RAG models in vertical domains, RAGEval has the\nability to better evaluate the knowledge usage ability of LLMs, which avoids\nthe confusion regarding the source of knowledge in answering question in\nexisting QA datasets--whether it comes from parameterized memory or retrieval.\nThe code and dataset will be released.\n","authors":["Kunlun Zhu","Yifan Luo","Dingling Xu","Ruobing Wang","Shi Yu","Shuo Wang","Yukun Yan","Zhenghao Liu","Xu Han","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2408.01262v3.pdf","comment":"add github repo"},{"id":"http://arxiv.org/abs/2408.14750v1","updated":"2024-08-27T03:01:48Z","published":"2024-08-27T03:01:48Z","title":"LyCon: Lyrics Reconstruction from the Bag-of-Words Using Large Language\n Models","summary":" This paper addresses the unique challenge of conducting research in lyric\nstudies, where direct use of lyrics is often restricted due to copyright\nconcerns. Unlike typical data, internet-sourced lyrics are frequently protected\nunder copyright law, necessitating alternative approaches. Our study introduces\na novel method for generating copyright-free lyrics from publicly available\nBag-of-Words (BoW) datasets, which contain the vocabulary of lyrics but not the\nlyrics themselves. Utilizing metadata associated with BoW datasets and large\nlanguage models, we successfully reconstructed lyrics. We have compiled and\nmade available a dataset of reconstructed lyrics, LyCon, aligned with metadata\nfrom renowned sources including the Million Song Dataset, Deezer Mood Detection\nDataset, and AllMusic Genre Dataset, available for public access. 
We believe\nthat the integration of metadata such as mood annotations or genres enables a\nvariety of academic experiments on lyrics, such as conditional lyric\ngeneration.\n","authors":["Haven Kim","Kahyun Choi"],"pdf_url":"https://arxiv.org/pdf/2408.14750v1.pdf","comment":"Dataset downlodable at https://github.com/havenpersona/lycon"},{"id":"http://arxiv.org/abs/2408.10903v4","updated":"2024-08-27T02:58:39Z","published":"2024-08-20T14:47:38Z","title":"BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General\n Role-Playing Language Model","summary":" The rapid advancement of large language models (LLMs) has revolutionized\nrole-playing, enabling the development of general role-playing models. However,\ncurrent role-playing training has two significant issues: (I) Using a\npredefined role profile to prompt dialogue training for specific scenarios\nusually leads to inconsistencies and even conflicts between the dialogue and\nthe profile, resulting in training biases. (II) The model learns to imitate the\nrole based solely on the profile, neglecting profile-dialogue alignment at the\nsentence level. In this work, we propose a simple yet effective framework\ncalled BEYOND DIALOGUE, designed to overcome these hurdles. This framework\ninnovatively introduces \"beyond dialogue\" tasks to align dialogue with profile\ntraits based on each specific scenario, thereby eliminating biases during\ntraining. Furthermore, by adopting an innovative prompting mechanism that\ngenerates reasoning outcomes for training, the framework allows the model to\nachieve fine-grained alignment between profile and dialogue at the sentence\nlevel. The aforementioned methods are fully automated and low-cost.\nAdditionally, the integration of automated dialogue and objective evaluation\nmethods forms a comprehensive framework, paving the way for general\nrole-playing. Experimental results demonstrate that our model excels in\nadhering to and reflecting various dimensions of role profiles, outperforming\nmost proprietary general and specialized role-playing baselines. All code and\ndatasets are available at https://github.com/yuyouyu32/BeyondDialogue.\n","authors":["Yeyong Yu","Runsheng Yu","Haojie Wei","Zhanqiu Zhang","Quan Qian"],"pdf_url":"https://arxiv.org/pdf/2408.10903v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.05706v2","updated":"2024-08-27T02:24:30Z","published":"2024-02-08T14:35:09Z","title":"Integrating Paralinguistics in Speech-Empowered Large Language Models\n for Natural Conversation","summary":" Recent work shows promising results in expanding the capabilities of large\nlanguage models (LLM) to directly understand and synthesize speech. However, an\nLLM-based strategy for modeling spoken dialogs remains elusive, calling for\nfurther investigation. This paper introduces an extensive speech-text LLM\nframework, the Unified Spoken Dialog Model (USDM), designed to generate\ncoherent spoken responses with naturally occurring prosodic features relevant\nto the given input speech without relying on explicit automatic speech\nrecognition (ASR) or text-to-speech (TTS) systems. We have verified the\ninclusion of prosody in speech tokens that predominantly contain semantic\ninformation and have used this foundation to construct a prosody-infused\nspeech-text model. Additionally, we propose a generalized speech-text\npretraining scheme that enhances the capture of cross-modal semantics. 
To\nconstruct USDM, we fine-tune our speech-text model on spoken dialog data using\na multi-step spoken dialog template that stimulates the chain-of-reasoning\ncapabilities exhibited by the underlying LLM. Automatic and human evaluations\non the DailyTalk dataset demonstrate that our approach effectively generates\nnatural-sounding spoken responses, surpassing previous and cascaded baselines.\nWe will make our code and checkpoints publicly available.\n","authors":["Heeseung Kim","Soonshin Seo","Kyeongseok Jeong","Ohsung Kwon","Soyoon Kim","Jungwhan Kim","Jaehong Lee","Eunwoo Song","Myungwoo Oh","Jung-Woo Ha","Sungroh Yoon","Kang Min Yoo"],"pdf_url":"https://arxiv.org/pdf/2402.05706v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.12654v3","updated":"2024-08-27T02:22:00Z","published":"2024-02-20T02:04:38Z","title":"OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech\n Recognition, Translation, and Language Identification","summary":" There has been an increasing interest in large speech models that can perform\nmultiple tasks in a single model. Such models usually adopt an encoder-decoder\nor decoder-only architecture due to their popularity and good performance in\nmany domains. However, autoregressive models can be slower during inference\ncompared to non-autoregressive models and also have potential risks of\nhallucination. Though prior studies observed promising results of\nnon-autoregressive models for certain tasks at small scales, it remains unclear\nif they can be scaled to speech-to-text generation in diverse languages and\ntasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we\npropose OWSM-CTC, a novel encoder-only speech foundation model based on\nConnectionist Temporal Classification (CTC). It is trained on 180k hours of\npublic audio data for multilingual automatic speech recognition (ASR), speech\ntranslation (ST), and language identification (LID). Compared to\nencoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up\nto 24% relative improvement on ST, while it is more robust and 3 to 4 times\nfaster for inference. OWSM-CTC also improves the long-form ASR result with 20x\nspeed-up. We will publicly release our code, pre-trained model, and training\nlogs to promote open science in speech foundation models.\n","authors":["Yifan Peng","Yui Sudo","Muhammad Shakeel","Shinji Watanabe"],"pdf_url":"https://arxiv.org/pdf/2402.12654v3.pdf","comment":"Accepted at ACL 2024 main conference"},{"id":"http://arxiv.org/abs/2401.16658v3","updated":"2024-08-27T02:15:49Z","published":"2024-01-30T01:22:18Z","title":"OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on\n E-Branchformer","summary":" Recent studies have highlighted the importance of fully open foundation\nmodels. The Open Whisper-style Speech Model (OWSM) is an initial step towards\nreproducing OpenAI Whisper using public data and open-source toolkits. However,\nprevious versions of OWSM (v1 to v3) are still based on standard Transformer,\nwhich might lead to inferior performance compared to state-of-the-art speech\nencoder architectures. This work aims to improve the performance and efficiency\nof OWSM without additional data. We present a series of E-Branchformer-based\nmodels named OWSM v3.1, ranging from 100M to 1B parameters. OWSM v3.1\noutperforms its predecessor, OWSM v3, in most evaluation benchmarks, while\nshowing an improved inference speed of up to 25%. 
We further reveal the\nemergent ability of OWSM v3.1 in zero-shot contextual biasing speech\nrecognition. We also provide a model trained on a subset of data with low\nlicense restrictions. We will publicly release the code, pre-trained models,\nand training logs.\n","authors":["Yifan Peng","Jinchuan Tian","William Chen","Siddhant Arora","Brian Yan","Yui Sudo","Muhammad Shakeel","Kwanghee Choi","Jiatong Shi","Xuankai Chang","Jee-weon Jung","Shinji Watanabe"],"pdf_url":"https://arxiv.org/pdf/2401.16658v3.pdf","comment":"Accepted at INTERSPEECH 2024. Webpage:\n https://www.wavlab.org/activities/2024/owsm/"},{"id":"http://arxiv.org/abs/2404.02342v2","updated":"2024-08-27T02:12:57Z","published":"2024-04-02T22:31:38Z","title":"A Computational Analysis of Lyric Similarity Perception","summary":" In musical compositions that include vocals, lyrics significantly contribute\nto artistic expression. Consequently, previous studies have introduced the\nconcept of a recommendation system that suggests lyrics similar to a user's\nfavorites or personalized preferences, aiding in the discovery of lyrics among\nmillions of tracks. However, many of these systems do not fully consider human\nperceptions of lyric similarity, primarily due to limited research in this\narea. To bridge this gap, we conducted a comparative analysis of computational\nmethods for modeling lyric similarity with human perception. Results indicated\nthat computational models based on similarities between embeddings from\npre-trained BERT-based models, the audio from which the lyrics are derived, and\nphonetic components are indicative of perceptual lyric similarity. This finding\nunderscores the importance of semantic, stylistic, and phonetic similarities in\nhuman perception about lyric similarity. We anticipate that our findings will\nenhance the development of similarity-based lyric recommendation systems by\noffering pseudo-labels for neural network development and introducing objective\nevaluation metrics.\n","authors":["Haven Kim","Taketo Akama"],"pdf_url":"https://arxiv.org/pdf/2404.02342v2.pdf","comment":"In the process of a detailed revision"},{"id":"http://arxiv.org/abs/2408.11247v2","updated":"2024-08-27T02:11:32Z","published":"2024-08-20T23:54:26Z","title":"Unboxing Occupational Bias: Grounded Debiasing of LLMs with U.S. Labor\n Data","summary":" Large Language Models (LLMs) are prone to inheriting and amplifying societal\nbiases embedded within their training data, potentially reinforcing harmful\nstereotypes related to gender, occupation, and other sensitive categories. This\nissue becomes particularly problematic as biased LLMs can have far-reaching\nconsequences, leading to unfair practices and exacerbating social inequalities\nacross various domains, such as recruitment, online content moderation, or even\nthe criminal justice system. Although prior research has focused on detecting\nbias in LLMs using specialized datasets designed to highlight intrinsic biases,\nthere has been a notable lack of investigation into how these findings\ncorrelate with authoritative datasets, such as those from the U.S. National\nBureau of Labor Statistics (NBLS). To address this gap, we conduct empirical\nresearch that evaluates LLMs in a ``bias-out-of-the-box\" setting, analyzing how\nthe generated outputs compare with the distributions found in NBLS data.\nFurthermore, we propose a straightforward yet effective debiasing mechanism\nthat directly incorporates NBLS instances to mitigate bias within LLMs. 
Our\nstudy spans seven different LLMs, including instructable, base, and\nmixture-of-expert models, and reveals significant levels of bias that are often\noverlooked by existing bias detection techniques. Importantly, our debiasing\nmethod, which does not rely on external datasets, demonstrates a substantial\nreduction in bias scores, highlighting the efficacy of our approach in creating\nfairer and more reliable LLMs.\n","authors":["Atmika Gorti","Manas Gaur","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2408.11247v2.pdf","comment":"Accepted in AAAI Spring Symposium 2024"},{"id":"http://arxiv.org/abs/2312.06635v6","updated":"2024-08-27T01:27:29Z","published":"2023-12-11T18:51:59Z","title":"Gated Linear Attention Transformers with Hardware-Efficient Training","summary":" Transformers with linear attention allow for efficient parallel training but\ncan simultaneously be formulated as an RNN with 2D (matrix-valued) hidden\nstates, thus enjoying linear-time inference complexity. However, linear\nattention generally underperforms ordinary softmax attention. Moreover, current\nimplementations of linear attention lack I/O-awareness and are thus slower than\nhighly optimized implementations of softmax attention. This work describes a\nhardware-efficient algorithm for linear attention that trades off memory\nmovement against parallelizability. The resulting implementation, dubbed\nFLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a\nstandalone layer even on short sequence lengths (e.g., 1K). We then generalize\nthis algorithm to a more expressive variant of linear attention with\ndata-dependent gates. When used as a replacement for the standard attention\nlayer in Transformers, the resulting gated linear attention (GLA) Transformer\nis found to perform competitively against the LLaMA-architecture Transformer\n(Touvron et al., 2023) as well as recent linear-time-inference baselines such as\nRetNet (Sun et al., 2023a) and Mamba (Gu & Dao, 2023) on moderate-scale\nlanguage modeling experiments. GLA Transformer is especially effective at\nlength generalization, enabling a model trained on 2K to generalize to\nsequences longer than 20K without significant perplexity degradations. For\ntraining speed, the GLA Transformer has higher throughput than a\nsimilarly-sized Mamba model.\n","authors":["Songlin Yang","Bailin Wang","Yikang Shen","Rameswar Panda","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2312.06635v6.pdf","comment":"minor update"},{"id":"http://arxiv.org/abs/2408.14721v1","updated":"2024-08-27T01:04:14Z","published":"2024-08-27T01:04:14Z","title":"PAT: Pruning-Aware Tuning for Large Language Models","summary":" Large language models (LLMs) excel in language tasks, especially with\nsupervised fine-tuning after pre-training. However, their substantial memory\nand computational requirements hinder practical applications. Structural\npruning, which reduces less significant weight dimensions, is one solution.\nYet, traditional post-hoc pruning often leads to significant performance loss,\nwith limited recovery from further fine-tuning due to reduced capacity. Since\nthe model fine-tuning refines the general and chaotic knowledge in pre-trained\nmodels, we aim to incorporate structural pruning with the fine-tuning, and\npropose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy\nwhile preserving the model performance to the maximum extent. 
Specifically, we\ninsert the innovative Hybrid Sparsification Modules (HSMs) between the\nAttention and FFN components to accordingly sparsify the upstream and\ndownstream linear modules. The HSM comprises a lightweight operator and a\nglobally shared trainable mask. The lightweight operator maintains a training\noverhead comparable to that of LoRA, while the trainable mask unifies the\nchannels to be sparsified, ensuring structural pruning. Additionally, we\npropose the Identity Loss which decouples the transformation and scaling\nproperties of the HSMs to enhance training robustness. Extensive experiments\ndemonstrate that PAT excels in both performance and efficiency. For example,\nour Llama2-7b model with a 25\\% pruning ratio achieves 1.33$\\times$ speedup\nwhile outperforming the LoRA-finetuned model by up to 1.26\\% in accuracy with a\nsimilar training cost. Code:\nhttps://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning\n","authors":["Yijiang Liu","Huanrui Yang","Youxin Chen","Rongyu Zhang","Miao Wang","Yuan Du","Li Du"],"pdf_url":"https://arxiv.org/pdf/2408.14721v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.12023v4","updated":"2024-08-27T00:48:35Z","published":"2023-11-20T18:57:41Z","title":"LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient\n Language Model Finetuning","summary":" We propose a simple approach for memory-efficient adaptation of pretrained\nlanguage models. Our approach uses an iterative algorithm to decompose each\npretrained matrix into a high-precision low-rank component and a\nmemory-efficient quantized component. During finetuning, the quantized\ncomponent remains fixed and only the low-rank component is updated. We present\nan integer linear programming formulation of the quantization component which\nenables dynamic configuration of quantization parameters (e.g., bit-width,\nblock size) for each matrix given an overall target memory budget. We further\nexplore a data-aware version of the algorithm which uses an approximation of\nthe Fisher information matrix to weight the reconstruction objective during\nmatrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and\n70B) demonstrate that our low-rank plus quantized matrix decomposition approach\n(LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables\naggressive quantization to sub-3 bits with only minor performance degradations.\nWhen finetuned on a language modeling calibration dataset, LQ-LoRA can also be\nused for model compression; in this setting our 2.75-bit LLaMA-2-70B model\n(which has 2.85 bits on average when including the low-rank components and\nrequires 27GB of GPU memory) performs respectably compared to the 16-bit\nbaseline.\n","authors":["Han Guo","Philip Greengard","Eric P. Xing","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2311.12023v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.10960v2","updated":"2024-08-27T00:27:12Z","published":"2024-07-15T17:55:42Z","title":"Fast Matrix Multiplications for Lookup Table-Quantized LLMs","summary":" The deployment of large language models (LLMs) is often constrained by memory\nbandwidth, where the primary bottleneck is the cost of transferring model\nparameters from the GPU's global memory to its registers. When coupled with\ncustom kernels that fuse the dequantization and matmul operations, weight-only\nquantization can thus enable faster inference by reducing the amount of memory\nmovement. 
However, developing high-performance kernels for weight-quantized\nLLMs presents substantial challenges, especially when the weights are\ncompressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform,\nlookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup\ntable engine for LUT-quantized LLMs, which uses offline restructuring of the\nquantized weight matrix to minimize bit manipulations associated with\nunpacking, and vectorization and duplication of the lookup table to mitigate\nshared memory bandwidth constraints. At batch sizes < 32 and quantization group\nsize of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster\nthan existing GEMM kernels. As an application of FLUTE, we explore a simple\nextension to lookup table-based NormalFloat quantization and apply it to\nquantize LLaMA3 to various configurations, obtaining competitive quantization\nperformance against strong baselines while obtaining an end-to-end throughput\nincrease of 1.5 to 2 times.\n","authors":["Han Guo","William Brandon","Radostin Cholakov","Jonathan Ragan-Kelley","Eric P. Xing","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2407.10960v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.08816v2","updated":"2024-08-27T22:51:57Z","published":"2024-04-12T21:16:53Z","title":"Measuring the Quality of Answers in Political Q&As with Large Language\n Models","summary":" This paper introduces a new approach for measuring the quality of answers in\npolitical question-and-answer sessions. We propose to measure answer quality\nbased on the degree to which it allows to infer the initial question\naccurately. This measure of answer quality reflects how well the answer engages\nwith and addresses the initial question. Drawing an analogy with semantic\nsearch, we demonstrate that this measurement approach can be implemented by\nfine-tuning a large language model on the corpus of observed questions and\nanswers without additional labeled data. We showcase our approach within the\ncontext of the Question Period in the Canadian House of Commons, providing\nvaluable insights into the correlates of answer quality. Our findings reveal\nsignificant variations in answer quality based on the party affiliation of the\nmembers of Parliament asking the question. Additionally, we find a meaningful\ncorrelation between answer quality and the topic raised in the question.\n","authors":["R. Michael Alvarez","Jacob Morrier"],"pdf_url":"https://arxiv.org/pdf/2404.08816v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.17678v2","updated":"2024-08-27T22:06:20Z","published":"2024-07-25T00:27:07Z","title":"Efficient LLM Training and Serving with Heterogeneous Context Sharding\n among Attention Heads","summary":" Existing LLM training and inference frameworks struggle in boosting\nefficiency with sparsity while maintaining the integrity of context and model\narchitecture. Inspired by the sharding concept in database and the fact that\nattention parallelizes over heads on accelerators, we propose Sparsely-Sharded\n(S2) Attention, an attention algorithm that allocates heterogeneous context\npartitions for different attention heads to divide and conquer. S2-Attention\nenforces each attention head to only attend to a partition of contexts\nfollowing a strided sparsity pattern, while the full context is preserved as\nthe union of all the shards. As attention heads are processed in separate\nthread blocks, the context reduction for each head can thus produce end-to-end\nspeed-up and memory reduction. 
At inference, LLMs trained with S2-Attention can\nthen take the KV cache reduction as free meals with guaranteed model quality\npreservation. In experiments, we show S2-Attention can provide as much as (1) 25.3X\nwall-clock attention speed-up over FlashAttention-2, resulting in 6X reduction\nin end-to-end training time and 10X inference latency, (2) on-par model\ntraining quality compared to default attention, (3) perfect needle retrieval\naccuracy over 32K context window. On top of the algorithm, we build DKernel, an\nLLM training and inference kernel library that allows users to customize\nsparsity patterns for their own models. We open-sourced DKernel and make it\ncompatible with Megatron, Pytorch, and vLLM.\n","authors":["Xihui Lin","Yunan Zhang","Suyu Ge","Barun Patra","Vishrav Chaudhary","Hao Peng","Xia Song"],"pdf_url":"https://arxiv.org/pdf/2407.17678v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2408.15417v1","updated":"2024-08-27T21:46:47Z","published":"2024-08-27T21:46:47Z","title":"Implicit Geometry of Next-token Prediction: From Language Sparsity\n Patterns to Model Representations","summary":" Next-token prediction (NTP) over large text corpora has become the go-to\nparadigm to train large language models. Yet, it remains unclear how NTP\ninfluences the mapping of linguistic patterns to geometric properties of the\nresulting model representations. We frame training of large language models as\nsoft-label classification over sparse probabilistic label vectors, coupled with\nan analytical approximation that allows unrestricted generation of context\nembeddings. This approach links NTP training to rank-constrained, nuclear-norm\nregularized optimization in the logit domain, offering a framework for\nanalyzing the geometry of word and context embeddings. In large embedding\nspaces, we find that NTP implicitly favors learning logits with a sparse plus\nlow-rank structure. While the sparse component captures the co-occurrence\nfrequency of context-word pairs, the orthogonal low-rank component, which\nbecomes dominant as training progresses, depends solely on the sparsity pattern\nof the co-occurrence matrix. Consequently, when projected onto an appropriate\nsubspace, representations of contexts that are followed by the same set of\nnext-tokens collapse, a phenomenon we term subspace-collapse. We validate our\nfindings on synthetic and small-scale real language datasets. Finally, we\noutline potential research directions aimed at deepening the understanding of\nNTP's influence on the learning of linguistic patterns and regularities.\n","authors":["Yize Zhao","Tina Behnia","Vala Vakilian","Christos Thrampoulidis"],"pdf_url":"https://arxiv.org/pdf/2408.15417v1.pdf","comment":"Accepted at COLM 2024"},{"id":"http://arxiv.org/abs/2310.07819v3","updated":"2024-08-27T21:37:57Z","published":"2023-10-11T19:00:40Z","title":"Faithfulness Measurable Masked Language Models","summary":" A common approach to explaining NLP models is to use importance measures that\nexpress which tokens are important for a prediction. Unfortunately, such\nexplanations are often wrong despite being persuasive. Therefore, it is\nessential to measure their faithfulness. One such metric is if tokens are truly\nimportant, then masking them should result in worse model performance. However,\ntoken masking introduces out-of-distribution issues, and existing solutions\nthat address this are computationally expensive and employ proxy models.\nFurthermore, other metrics are very limited in scope. 
This work proposes an\ninherently faithfulness measurable model that addresses these challenges. This\nis achieved using a novel fine-tuning method that incorporates masking, such\nthat masking tokens become in-distribution by design. This differs from\nexisting approaches, which are completely model-agnostic but are inapplicable\nin practice. We demonstrate the generality of our approach by applying it to 16\ndifferent datasets and validate it using statistical in-distribution tests. The\nfaithfulness is then measured with 9 different importance measures. Because\nmasking is in-distribution, importance measures that themselves use masking\nbecome consistently more faithful. Additionally, because the model makes\nfaithfulness cheap to measure, we can optimize explanations towards maximal\nfaithfulness; thus, our model becomes indirectly inherently explainable.\n","authors":["Andreas Madsen","Siva Reddy","Sarath Chandar"],"pdf_url":"https://arxiv.org/pdf/2310.07819v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15409v1","updated":"2024-08-27T21:19:37Z","published":"2024-08-27T21:19:37Z","title":"Awes, Laws, and Flaws From Today's LLM Research","summary":" We perform a critical examination of the scientific methodology behind\ncontemporary large language model (LLM) research. For this we assess over 2,000\nresearch works based on criteria typical of what is considered good research\n(e.g. presence of statistical tests and reproducibility) and cross-validate it\nwith arguments that are at the centre of controversy (e.g., claims of emergent\nbehaviour, the use of LLMs as evaluators). We find multiple trends, such as\ndeclines in claims of emergent behaviour and the presence of ethics\ndisclaimers; and the rise of LLMs as evaluators. This paper underscores the\nneed for more scrutiny and rigour by and from this field. Critical reading and\nfamiliarity with the literature are crucial to live up to the fundamentals of a\nresponsible scientific method that is ethical, reproducible, systematic, and\nopen to criticism.\n","authors":["Adrian de Wynter"],"pdf_url":"https://arxiv.org/pdf/2408.15409v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2408.15406v1","updated":"2024-08-27T21:03:42Z","published":"2024-08-27T21:03:42Z","title":"Intertwined Biases Across Social Media Spheres: Unpacking Correlations\n in Media Bias Dimensions","summary":" Media bias significantly shapes public perception by reinforcing stereotypes\nand exacerbating societal divisions. Prior research has often focused on\nisolated media bias dimensions such as \\textit{political bias} or\n\\textit{racial bias}, neglecting the complex interrelationships among various\nbias dimensions across different topic domains. Moreover, we observe that\nmodels trained on existing media bias benchmarks fail to generalize effectively\non recent social media posts, particularly in certain bias identification\ntasks. This shortfall primarily arises because these benchmarks do not\nadequately reflect the rapidly evolving nature of social media content, which\nis characterized by shifting user behaviors and emerging trends. In response to\nthese limitations, our research introduces a novel dataset collected from\nYouTube and Reddit over the past five years. Our dataset includes automated\nannotations for YouTube content across a broad spectrum of bias dimensions,\nsuch as gender, racial, and political biases, as well as hate speech, among\nothers. 
It spans diverse domains including politics, sports, healthcare,\neducation, and entertainment, reflecting the complex interplay of biases across\ndifferent societal sectors. Through comprehensive statistical analysis, we\nidentify significant differences in bias expression patterns and intra-domain\nbias correlations across these domains. By utilizing our understanding of the\ncorrelations among various bias dimensions, we lay the groundwork for creating\nadvanced systems capable of detecting multiple biases simultaneously. Overall,\nour dataset advances the field of media bias identification, contributing to\nthe development of tools that promote fairer media consumption. The\ncomprehensive awareness of existing media bias fosters more ethical journalism,\npromotes cultural sensitivity, and supports a more informed and equitable\npublic discourse.\n","authors":["Yifan Liu","Yike Li","Dong Wang"],"pdf_url":"https://arxiv.org/pdf/2408.15406v1.pdf","comment":"Accepted to ASONAM 2024"},{"id":"http://arxiv.org/abs/2408.15399v1","updated":"2024-08-27T20:51:06Z","published":"2024-08-27T20:51:06Z","title":"A Statistical Framework for Data-dependent Retrieval-Augmented Models","summary":" Modern ML systems increasingly augment input instances with additional\nrelevant information to enhance final prediction. Despite growing interest in\nsuch retrieval-augmented models, their fundamental properties and training are\nnot well understood. We propose a statistical framework to study such models\nwith two components: 1) a {\\em retriever} to identify the relevant information\nout of a large corpus via a data-dependent metric; and 2) a {\\em predictor}\nthat consumes the input instances along with the retrieved information to make\nthe final predictions. We present a principled method for end-to-end training\nof both components and draw connections with various training approaches in the\nliterature. Furthermore, we establish excess risk bounds for\nretrieval-augmented models while delineating the contributions of both\nretriever and predictor towards the model performance. We validate the utility\nof our proposed training methods along with the key takeaways from our\nstatistical analysis on open domain question answering task where retrieval\naugmentation is important.\n","authors":["Soumya Basu","Ankit Singh Rawat","Manzil Zaheer"],"pdf_url":"https://arxiv.org/pdf/2408.15399v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.01966v2","updated":"2024-08-27T20:05:59Z","published":"2024-08-04T09:04:44Z","title":"ML-EAT: A Multilevel Embedding Association Test for Interpretable and\n Transparent Social Science","summary":" This research introduces the Multilevel Embedding Association Test (ML-EAT),\na method designed for interpretable and transparent measurement of intrinsic\nbias in language technologies. The ML-EAT addresses issues of ambiguity and\ndifficulty in interpreting the traditional EAT measurement by quantifying bias\nat three levels of increasing granularity: the differential association between\ntwo target concepts with two attribute concepts; the individual effect size of\neach target concept with two attribute concepts; and the association between\neach individual target concept and each individual attribute concept. Using the\nML-EAT, this research defines a taxonomy of EAT patterns describing the nine\npossible outcomes of an embedding association test, each of which is associated\nwith a unique EAT-Map, a novel four-quadrant visualization for interpreting the\nML-EAT. 
Empirical analysis of static and diachronic word embeddings, GPT-2\nlanguage models, and a CLIP language-and-image model shows that EAT patterns\nadd otherwise unobservable information about the component biases that make up\nan EAT; reveal the effects of prompting in zero-shot models; and can also\nidentify situations when cosine similarity is an ineffective metric, rendering\nan EAT unreliable. Our work contributes a method for rendering bias more\nobservable and interpretable, improving the transparency of computational\ninvestigations into human minds and societies.\n","authors":["Robert Wolfe","Alexis Hiniker","Bill Howe"],"pdf_url":"https://arxiv.org/pdf/2408.01966v2.pdf","comment":"Accepted at Artificial Intelligence, Ethics, and Society 2024"},{"id":"http://arxiv.org/abs/2408.01959v2","updated":"2024-08-27T19:57:45Z","published":"2024-08-04T08:26:58Z","title":"Dataset Scale and Societal Consistency Mediate Facial Impression Bias in\n Vision-Language AI","summary":" Multimodal AI models capable of associating images and text hold promise for\nnumerous domains, ranging from automated image captioning to accessibility\napplications for blind and low-vision users. However, uncertainty about bias\nhas in some cases limited their adoption and availability. In the present work,\nwe study 43 CLIP vision-language models to determine whether they learn\nhuman-like facial impression biases, and we find evidence that such biases are\nreflected across three distinct CLIP model families. We show for the first time\nthat the degree to which a bias is shared across a society predicts the\ndegree to which it is reflected in a CLIP model. Human-like impressions of\nvisually unobservable attributes, like trustworthiness and sexuality, emerge\nonly in models trained on the largest dataset, indicating that a better fit to\nuncurated cultural data results in the reproduction of increasingly subtle\nsocial biases. Moreover, we use a hierarchical clustering approach to show that\ndataset size predicts the extent to which the underlying structure of facial\nimpression bias resembles that of facial impression bias in humans. Finally, we\nshow that Stable Diffusion models employing CLIP as a text encoder learn facial\nimpression biases, and that these biases intersect with racial biases in Stable\nDiffusion XL-Turbo. While pretrained CLIP models may prove useful for\nscientific studies of bias, they will also require significant dataset curation\nwhen intended for use as general-purpose models in a zero-shot setting.\n","authors":["Robert Wolfe","Aayushi Dangol","Alexis Hiniker","Bill Howe"],"pdf_url":"https://arxiv.org/pdf/2408.01959v2.pdf","comment":"Accepted at Artificial Intelligence, Ethics, and Society 2024"},{"id":"http://arxiv.org/abs/2408.15379v1","updated":"2024-08-27T19:33:15Z","published":"2024-08-27T19:33:15Z","title":"DualKanbaFormer: Kolmogorov-Arnold Networks and State Space Model\n Transformer for Multimodal Aspect-based Sentiment Analysis","summary":" Multimodal aspect-based sentiment analysis (MABSA) enhances sentiment\ndetection by combining text with other data types like images. However, despite\nsetting significant benchmarks, attention mechanisms exhibit limitations in\nefficiently modelling long-range dependencies between aspect and opinion\ntargets within the text. They also face challenges in capturing global-context\ndependencies for visual representations.
To this end, we propose\nKolmogorov-Arnold Networks (KANs) and Selective State Space model (Mamba)\ntransformer (DualKanbaFormer), a novel architecture to address the above\nissues. We leverage the power of Mamba to capture global context dependencies,\nMulti-head Attention (MHA) to capture local context dependencies, and KANs to\ncapture non-linear modelling patterns for both textual representations (textual\nKanbaFormer) and visual representations (visual KanbaFormer). Furthermore, we\nfuse the textual KanbaFormer and visual KanbaFormer with a gated fusion layer to\ncapture the inter-modality dynamics. According to extensive experimental\nresults, our model outperforms some state-of-the-art (SOTA) studies on two\npublic datasets.\n","authors":["Adamu Lawan","Juhua Pu","Haruna Yunusa","Muhammad Lawan","Aliyu Umar","Adamu Sani Yahya"],"pdf_url":"https://arxiv.org/pdf/2408.15379v1.pdf","comment":"10 pages, 2 figures, and 3 tables"},{"id":"http://arxiv.org/abs/2408.15366v1","updated":"2024-08-27T19:03:11Z","published":"2024-08-27T19:03:11Z","title":"Pitfalls and Outlooks in Using COMET","summary":" Since its introduction, the COMET metric has blazed a trail in the machine\ntranslation community, given its strong correlation with human judgements of\ntranslation quality. Its success stems from being a modified pre-trained\nmultilingual model finetuned for quality assessment. However, being a\nmachine learning model, it also gives rise to a new set of pitfalls that may not be\nwidely known. We investigate these unexpected behaviours from three aspects: 1)\ntechnical: obsolete software versions and compute precision; 2) data: empty\ncontent, language mismatch, and translationese at test time as well as\ndistribution and domain biases in training; 3) usage and reporting:\nmulti-reference support and model referencing in the literature. All of these\nproblems imply that COMET scores are not comparable between papers or even\ntechnical setups, and we put forward our perspective on fixing each issue.\nFurthermore, we release the SacreCOMET package that can generate a signature\nfor the software and model configuration as well as an appropriate citation.\nThe goal of this work is to help the community make more sound use of the COMET\nmetric.\n","authors":["Vilém Zouhar","Pinzhen Chen","Tsz Kin Lam","Nikita Moghe","Barry Haddow"],"pdf_url":"https://arxiv.org/pdf/2408.15366v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.06062v3","updated":"2024-08-27T18:47:13Z","published":"2024-08-12T11:23:24Z","title":"On Tables with Numbers, with Numbers","summary":" This paper is a critical reflection on the epistemic culture of contemporary\ncomputational linguistics, framed in the context of its growing obsession with\ntables with numbers. We argue against tables with numbers on the basis of their\nepistemic irrelevance, their environmental impact, their role in enabling and\nexacerbating social inequalities, and their deep ties to commercial\napplications and profit-driven research.
We substantiate our arguments with\nempirical evidence drawn from a meta-analysis of computational linguistics\nresearch over the last decade.\n","authors":["Konstantinos Kogkalidis","Stergios Chatzikyriakidis"],"pdf_url":"https://arxiv.org/pdf/2408.06062v3.pdf","comment":"v3: Stergios' acknowledgements"},{"id":"http://arxiv.org/abs/2408.15339v1","updated":"2024-08-27T18:04:07Z","published":"2024-08-27T18:04:07Z","title":"UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized\n Implicit Reward Function","summary":" An LLM is pretrained on trillions of tokens, but the pretrained LLM may still\ngenerate undesired responses. To solve this problem, alignment techniques such\nas RLHF, DPO and KTO are proposed. However, these alignment techniques have\nlimitations. For example, RLHF requires training the reward model and policy\nseparately, which is complex, time-consuming, memory-intensive and unstable\nduring training. DPO proposes a mapping between an optimal policy and\na reward, greatly simplifying the training process of RLHF. However, it cannot\ntake full advantage of a reward model, and it is limited to pairwise preference\ndata.\n In this paper, we propose \\textbf{UN}ified \\textbf{A}lignment (UNA) which\nunifies RLHF/PPO, DPO and KTO. Firstly, we mathematically prove that given the\nclassical RLHF objective, the optimal policy is induced by a generalized\nimplicit reward function. With this novel mapping between a reward model and an\noptimal policy, UNA can 1. unify RLHF/PPO, DPO and KTO into a supervised\nlearning problem of minimizing the difference between an implicit reward and an\nexplicit reward; 2. outperform RLHF/PPO while simplifying, stabilizing, speeding up and\nreducing the memory burden of the RL fine-tuning process; 3. accommodate different\nfeedback types including pairwise, binary and scalar feedback. Downstream\nexperiments show UNA outperforms DPO, KTO and RLHF.\n","authors":["Zhichao Wang","Bin Bi","Can Huang","Shiva Kumar Pentyala","Zixu James Zhu","Sitaram Asur","Na Claire Cheng"],"pdf_url":"https://arxiv.org/pdf/2408.15339v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15313v1","updated":"2024-08-27T17:31:21Z","published":"2024-08-27T17:31:21Z","title":"Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in\n Language Models","summary":" Fine-tuning large language models (LLMs) on human preferences, typically\nthrough reinforcement learning from human feedback (RLHF), has proven\nsuccessful in enhancing their capabilities. However, ensuring the safety of\nLLMs during fine-tuning remains a critical concern, and mitigating the\npotential conflicts between safety and helpfulness is costly in RLHF. To address\nthis issue, we propose a supervised learning framework called Bi-Factorial\nPreference Optimization (BFPO), which re-parameterizes a joint RLHF objective\nof both safety and helpfulness into a single supervised learning objective. In\nthe supervised optimization, a labeling function is used to capture global\npreference rankings to balance both safety and helpfulness. To evaluate BFPO,\nwe develop a benchmark including comprehensive discriminative and generative\ntasks for helpfulness and harmlessness. The results indicate that our method\nsignificantly outperforms existing approaches in both safety and helpfulness.\nMoreover, BFPO eliminates the need for human prompting and annotation in LLM\nfine-tuning while achieving the same level of safety as methods that heavily\nrely on human labor, with less than 10% of the computational resources.
The\ntraining recipes and models will be released.\n","authors":["Wenxuan Zhang","Philip H. S. Torr","Mohamed Elhoseiny","Adel Bibi"],"pdf_url":"https://arxiv.org/pdf/2408.15313v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15297v1","updated":"2024-08-27T11:31:12Z","published":"2024-08-27T11:31:12Z","title":"YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection","summary":" Dysfluent speech detection is the bottleneck for disordered speech analysis\nand spoken language learning. Current state-of-the-art models are governed by\nrule-based systems which lack efficiency and robustness, and are sensitive to\ntemplate design. In this paper, we propose YOLO-Stutter: the first end-to-end\nmethod that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes\nimperfect speech-text alignment as input, followed by a spatial feature\naggregator, and a temporal dependency extractor to perform region-wise boundary\nand class predictions. We also introduce two dysfluency corpora, VCTK-Stutter\nand VCTK-TTS, that simulate natural spoken dysfluencies including repetition,\nblock, missing, replacement, and prolongation. Our end-to-end method achieves\nstate-of-the-art performance with a minimal number of trainable parameters\non both simulated data and real aphasia speech. Code and datasets are\nopen-sourced at https://github.com/rorizzz/YOLO-Stutter\n","authors":["Xuanru Zhou","Anshul Kashyap","Steve Li","Ayati Sharma","Brittany Morin","David Baquirin","Jet Vonk","Zoe Ezzes","Zachary Miller","Maria Luisa Gorno Tempini","Jiachen Lian","Gopala Krishna Anumanchipalli"],"pdf_url":"https://arxiv.org/pdf/2408.15297v1.pdf","comment":"Interspeech 2024"},{"id":"http://arxiv.org/abs/2408.15293v1","updated":"2024-08-27T08:19:34Z","published":"2024-08-27T08:19:34Z","title":"Learning Granularity Representation for Temporal Knowledge Graph\n Completion","summary":" Temporal Knowledge Graphs (TKGs) incorporate temporal information to reflect\nthe dynamic structural knowledge and evolutionary patterns of real-world facts.\nNevertheless, TKGs are still limited in downstream applications due to the\nproblem of incompleteness. Consequently, TKG completion (also known as link\nprediction) has been widely studied, with recent research focusing on\nincorporating independent embeddings of time or combining them with entities\nand relations to form temporal representations. However, most existing methods\noverlook the impact of history from a multi-granularity aspect. The inherent\nsemantics of human-defined temporal granularities, such as ordinal dates,\nreveal general patterns to which facts typically adhere. To counter this\nlimitation, this paper proposes \\textbf{L}earning \\textbf{G}ranularity\n\\textbf{Re}presentation (termed $\\mathsf{LGRe}$) for TKG completion. It\ncomprises two main components: Granularity Representation Learning (GRL) and\nAdaptive Granularity Balancing (AGB). Specifically, GRL employs time-specific\nmulti-layer convolutional neural networks to capture interactions between\nentities and relations at different granularities. After that, AGB generates\nadaptive weights for these embeddings according to temporal semantics,\nresulting in expressive representations of predictions. Moreover, to reflect\nsimilar semantics of adjacent timestamps, a temporal loss function is\nintroduced.
Extensive experimental results on four event benchmarks demonstrate\nthe effectiveness of $\\mathsf{LGRe}$ in learning time-related representations.\nTo ensure reproducibility, our code is available at\nhttps://github.com/KcAcoZhang/LGRe.\n","authors":["Jinchuan Zhang","Tianqi Wan","Chong Mu","Guangxi Lu","Ling Tian"],"pdf_url":"https://arxiv.org/pdf/2408.15293v1.pdf","comment":"15 pages. Accepted at ICONIP 2024"},{"id":"http://arxiv.org/abs/2408.13609v2","updated":"2024-08-27T04:49:46Z","published":"2024-08-24T15:43:02Z","title":"GNN: Graph Neural Network and Large Language Model for Data Discovery","summary":" Our algorithm GNN: Graph Neural Network and Large Language Model for Data\nDiscovery inherit the benefits of \\cite{hoang2024plod} (PLOD: Predictive\nLearning Optimal Data Discovery), \\cite{Hoang2024BODBO} (BOD: Blindly Optimal\nData Discovery) in terms of overcoming the challenges of having to predefine\nutility function and the human input for attribute ranking, which helps prevent\nthe time-consuming loop process. In addition to these previous works, our\nalgorithm GNN leverages the advantages of graph neural networks and large\nlanguage models to understand text type values that cannot be understood by\nPLOD and MOD, thus making the task of predicting outcomes more reliable. GNN\ncould be seen as an extension of PLOD in terms of understanding the text type\nvalue and the user's preferences, not only numerical values but also text\nvalues, making the promise of data science and analytics purposes.\n","authors":["Thomas Hoang"],"pdf_url":"https://arxiv.org/pdf/2408.13609v2.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2408.15242v1","updated":"2024-08-27T17:59:55Z","published":"2024-08-27T17:59:55Z","title":"Drone-assisted Road Gaussian Splatting with Cross-view Uncertainty","summary":" Robust and realistic rendering for large-scale road scenes is essential in\nautonomous driving simulation. Recently, 3D Gaussian Splatting (3D-GS) has made\ngroundbreaking progress in neural rendering, but the general fidelity of\nlarge-scale road scene renderings is often limited by the input imagery, which\nusually has a narrow field of view and focuses mainly on the street-level local\narea. Intuitively, the data from the drone's perspective can provide a\ncomplementary viewpoint for the data from the ground vehicle's perspective,\nenhancing the completeness of scene reconstruction and rendering. However,\ntraining naively with aerial and ground images, which exhibit large view\ndisparity, poses a significant convergence challenge for 3D-GS, and does not\ndemonstrate remarkable improvements in performance on road views. In order to\nenhance the novel view synthesis of road views and to effectively use the\naerial information, we design an uncertainty-aware training method that allows\naerial images to assist in the synthesis of areas where ground images have poor\nlearning outcomes instead of weighting all pixels equally in 3D-GS training\nlike prior work did. We are the first to introduce the cross-view uncertainty\nto 3D-GS by matching the car-view ensemble-based rendering uncertainty to\naerial images, weighting the contribution of each pixel to the training\nprocess. 
Additionally, to systematically quantify evaluation metrics, we\nassemble a high-quality synthesized dataset comprising both aerial and ground\nimages for road scenes.\n","authors":["Saining Zhang","Baijun Ye","Xiaoxue Chen","Yuantao Chen","Zongzheng Zhang","Cheng Peng","Yongliang Shi","Hao Zhao"],"pdf_url":"https://arxiv.org/pdf/2408.15242v1.pdf","comment":"BMVC2024 Project Page: https://sainingzhang.github.io/project/uc-gs/\n Code: https://github.com/SainingZhang/uc-gs/"},{"id":"http://arxiv.org/abs/2408.15241v1","updated":"2024-08-27T17:59:41Z","published":"2024-08-27T17:59:41Z","title":"GenRec: Unifying Video Generation and Recognition with Diffusion Models","summary":" Video diffusion models are able to generate high-quality videos by learning\nstrong spatial-temporal priors on large-scale datasets. In this paper, we aim\nto investigate whether such priors derived from a generative process are\nsuitable for video recognition, and eventually joint optimization of generation\nand recognition. Building upon Stable Video Diffusion, we introduce GenRec, the\nfirst unified framework trained with a random-frame conditioning process so as\nto learn generalized spatial-temporal representations. The resulting framework\ncan naturally support generation and recognition, and more importantly is\nrobust even when visual inputs contain limited information. Extensive\nexperiments demonstrate the efficacy of GenRec for both recognition and\ngeneration. In particular, GenRec achieves competitive recognition performance,\noffering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also\ndelivers the best class-conditioned image-to-video generation results,\nachieving 46.5 and 49.3 FVD scores on the SSV2 and EK-100 datasets. Furthermore,\nGenRec demonstrates extraordinary robustness in scenarios where only limited\nframes can be observed.\n","authors":["Zejia Weng","Xitong Yang","Zhen Xing","Zuxuan Wu","Yu-Gang Jiang"],"pdf_url":"https://arxiv.org/pdf/2408.15241v1.pdf","comment":"17 pages, 6 figures, 7 tables"},{"id":"http://arxiv.org/abs/2408.15239v1","updated":"2024-08-27T17:57:14Z","published":"2024-08-27T17:57:14Z","title":"Generative Inbetweening: Adapting Image-to-Video Models for Keyframe\n Interpolation","summary":" We present a method for generating video sequences with coherent motion\nbetween a pair of input key frames. We adapt a pretrained large-scale\nimage-to-video diffusion model (originally trained to generate videos moving\nforward in time from a single input image) for key frame interpolation, i.e.,\nto produce a video in between two input frames. We accomplish this adaptation\nthrough a lightweight fine-tuning technique that produces a version of the\nmodel that instead predicts videos moving backwards in time from a single input\nimage. This model (along with the original forward-moving model) is\nsubsequently used in a dual-directional diffusion sampling process that\ncombines the overlapping model estimates starting from each of the two\nkeyframes. Our experiments show that our method outperforms both existing\ndiffusion-based methods and traditional frame interpolation techniques.\n","authors":["Xiaojuan Wang","Boyang Zhou","Brian Curless","Ira Kemelmacher-Shlizerman","Aleksander Holynski","Steven M.
Seitz"],"pdf_url":"https://arxiv.org/pdf/2408.15239v1.pdf","comment":"project page: https://svd-keyframe-interpolation.github.io/"},{"id":"http://arxiv.org/abs/2408.15235v1","updated":"2024-08-27T17:53:18Z","published":"2024-08-27T17:53:18Z","title":"Learning-based Multi-View Stereo: A Survey","summary":" 3D reconstruction aims to recover the dense 3D structure of a scene. It plays\nan essential role in various applications such as Augmented/Virtual Reality\n(AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene\ncaptured from different viewpoints, Multi-View Stereo (MVS) algorithms\nsynthesize a comprehensive 3D representation, enabling precise reconstruction\nin complex environments. Due to its efficiency and effectiveness, MVS has\nbecome a pivotal method for image-based 3D reconstruction. Recently, with the\nsuccess of deep learning, many learning-based MVS methods have been proposed,\nachieving impressive performance against traditional methods. We categorize\nthese learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D\nGaussian Splatting-based, and large feed-forward methods. Among these, we focus\nsignificantly on depth map-based methods, which are the main family of MVS due\nto their conciseness, flexibility and scalability. In this survey, we provide a\ncomprehensive review of the literature at the time of this writing. We\ninvestigate these learning-based methods, summarize their performances on\npopular benchmarks, and discuss promising future research directions in this\narea.\n","authors":["Fangjinhua Wang","Qingtian Zhu","Di Chang","Quankai Gao","Junlin Han","Tong Zhang","Richard Hartley","Marc Pollefeys"],"pdf_url":"https://arxiv.org/pdf/2408.15235v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15231v1","updated":"2024-08-27T17:48:29Z","published":"2024-08-27T17:48:29Z","title":"DCT-CryptoNets: Scaling Private Inference in the Frequency Domain","summary":" The convergence of fully homomorphic encryption (FHE) and machine learning\noffers unprecedented opportunities for private inference of sensitive data. FHE\nenables computation directly on encrypted data, safeguarding the entire machine\nlearning pipeline, including data and model confidentiality. However, existing\nFHE-based implementations for deep neural networks face significant challenges\nin computational cost, latency, and scalability, limiting their practical\ndeployment. This paper introduces DCT-CryptoNets, a novel approach that\nleverages frequency-domain learning to tackle these issues. Our method operates\ndirectly in the frequency domain, utilizing the discrete cosine transform (DCT)\ncommonly employed in JPEG compression. This approach is inherently compatible\nwith remote computing services, where images are usually transmitted and stored\nin compressed formats. DCT-CryptoNets reduces the computational burden of\nhomomorphic operations by focusing on perceptually relevant low-frequency\ncomponents. This is demonstrated by substantial latency reduction of up to\n5.3$\\times$ compared to prior work on image classification tasks, including a\nnovel demonstration of ImageNet inference within 2.5 hours, down from 12.5\nhours compared to prior work on equivalent compute resources. Moreover,\nDCT-CryptoNets improves the reliability of encrypted accuracy by reducing\nvariability (e.g., from $\\pm$2.5\\% to $\\pm$1.0\\% on ImageNet). 
This study\ndemonstrates a promising avenue for achieving efficient and practical\nprivacy-preserving deep learning on high resolution images seen in real-world\napplications.\n","authors":["Arjun Roy","Kaushik Roy"],"pdf_url":"https://arxiv.org/pdf/2408.15231v1.pdf","comment":"Under Review; 10 pages content, 3 pages appendix, 4 figures, 8\n tables; Code TBD"},{"id":"http://arxiv.org/abs/2408.15224v1","updated":"2024-08-27T17:39:33Z","published":"2024-08-27T17:39:33Z","title":"SAM & SAM 2 in 3D Slicer: SegmentWithSAM Extension for Annotating\n Medical Images","summary":" Creating annotations for 3D medical data is time-consuming and often requires\nhighly specialized expertise. Various tools have been implemented to aid this\nprocess. Segment Anything Model 2 (SAM 2) offers a general-purpose prompt-based\nsegmentation algorithm designed to annotate videos. In this paper, we adapt\nthis model to the annotation of 3D medical images and offer our implementation\nin the form of an extension to the popular annotation software: 3D Slicer. Our\nextension allows users to place point prompts on 2D slices to generate\nannotation masks and propagate these annotations across entire volumes in\neither single-directional or bi-directional manners. Our code is publicly\navailable on https://github.com/mazurowski-lab/SlicerSegmentWithSAM and can be\neasily installed directly from the Extension Manager of 3D Slicer as well.\n","authors":["Zafer Yildiz","Yuwen Chen","Maciej A. Mazurowski"],"pdf_url":"https://arxiv.org/pdf/2408.15224v1.pdf","comment":"Future work: support for box and mask inputs for the video predictor\n of SAM 2"},{"id":"http://arxiv.org/abs/2408.15218v1","updated":"2024-08-27T17:31:00Z","published":"2024-08-27T17:31:00Z","title":"Histo-Diffusion: A Diffusion Super-Resolution Method for Digital\n Pathology with Comprehensive Quality Assessment","summary":" Digital pathology has advanced significantly over the last decade, with Whole\nSlide Images (WSIs) encompassing vast amounts of data essential for accurate\ndisease diagnosis. High-resolution WSIs are essential for precise diagnosis but\ntechnical limitations in scanning equipment and variablity in slide preparation\ncan hinder obtaining these images. Super-resolution techniques can enhance\nlow-resolution images; while Generative Adversarial Networks (GANs) have been\neffective in natural image super-resolution tasks, they often struggle with\nhistopathology due to overfitting and mode collapse. Traditional evaluation\nmetrics fall short in assessing the complex characteristics of histopathology\nimages, necessitating robust histology-specific evaluation methods.\n We introduce Histo-Diffusion, a novel diffusion-based method specially\ndesigned for generating and evaluating super-resolution images in digital\npathology. It includes a restoration module for histopathology prior and a\ncontrollable diffusion module for generating high-quality images. We have\ncurated two histopathology datasets and proposed a comprehensive evaluation\nstrategy which incorporates both full-reference and no-reference metrics to\nthoroughly assess the quality of digital pathology images.\n Comparative analyses on multiple datasets with state-of-the-art methods\nreveal that Histo-Diffusion outperforms GANs. 
Our method offers a versatile\nsolution for histopathology image super-resolution, capable of handling\nmulti-resolution generation from varied input sizes, providing valuable support\nin diagnostic processes.\n","authors":["Xuan Xu","Saarthak Kapse","Prateek Prasanna"],"pdf_url":"https://arxiv.org/pdf/2408.15218v1.pdf","comment":"We have submitted our paper to Medical Image Analysis and are\n currently awaiting feedback"},{"id":"http://arxiv.org/abs/2408.15217v1","updated":"2024-08-27T17:30:49Z","published":"2024-08-27T17:30:49Z","title":"Fundus2Video: Cross-Modal Angiography Video Generation from Static\n Fundus Photography with Clinical Knowledge Guidance","summary":" Fundus Fluorescein Angiography (FFA) is a critical tool for assessing retinal\nvascular dynamics and aiding in the diagnosis of eye diseases. However, its\ninvasive nature and less accessibility compared to Color Fundus (CF) images\npose significant challenges. Current CF to FFA translation methods are limited\nto static generation. In this work, we pioneer dynamic FFA video generation\nfrom static CF images. We introduce an autoregressive GAN for smooth,\nmemory-saving frame-by-frame FFA synthesis. To enhance the focus on dynamic\nlesion changes in FFA regions, we design a knowledge mask based on clinical\nexperience. Leveraging this mask, our approach integrates innovative knowledge\nmask-guided techniques, including knowledge-boosted attention, knowledge-aware\ndiscriminators, and mask-enhanced patchNCE loss, aimed at refining generation\nin critical areas and addressing the pixel misalignment challenge. Our method\nachieves the best FVD of 1503.21 and PSNR of 11.81 compared to other common\nvideo generation approaches. Human assessment by an ophthalmologist confirms\nits high generation quality. Notably, our knowledge mask surpasses supervised\nlesion segmentation masks, offering a promising non-invasive alternative to\ntraditional FFA for research and clinical applications. The code is available\nat https://github.com/Michi-3000/Fundus2Video.\n","authors":["Weiyi Zhang","Siyu Huang","Jiancheng Yang","Ruoyu Chen","Zongyuan Ge","Yingfeng Zheng","Danli Shi","Mingguang He"],"pdf_url":"https://arxiv.org/pdf/2408.15217v1.pdf","comment":"The paper has been accepted by Medical Image Computing and Computer\n Assisted Intervention Society (MICCAI) 2024"},{"id":"http://arxiv.org/abs/2407.05921v2","updated":"2024-08-27T17:14:16Z","published":"2024-07-08T13:28:47Z","title":"TAPVid-3D: A Benchmark for Tracking Any Point in 3D","summary":" We introduce a new benchmark, TAPVid-3D, for evaluating the task of\nlong-range Tracking Any Point in 3D (TAP-3D). While point tracking in two\ndimensions (TAP) has many benchmarks measuring performance on real-world\nvideos, such as TAPVid-DAVIS, three-dimensional point tracking has none. To\nthis end, leveraging existing footage, we build a new benchmark for 3D point\ntracking featuring 4,000+ real-world videos, composed of three different data\nsources spanning a variety of object types, motion patterns, and indoor and\noutdoor environments. To measure performance on the TAP-3D task, we formulate a\ncollection of metrics that extend the Jaccard-based metric used in TAP to\nhandle the complexities of ambiguous depth scales across models, occlusions,\nand multi-track spatio-temporal smoothness. 
We manually verify a large sample\nof trajectories to ensure correct video annotations, and assess the current\nstate of the TAP-3D task by constructing competitive baselines using existing\ntracking models. We anticipate this benchmark will serve as a guidepost to\nimprove our ability to understand precise 3D motion and surface deformation\nfrom monocular video. Code for dataset download, generation, and model\nevaluation is available at https://tapvid3d.github.io\n","authors":["Skanda Koppula","Ignacio Rocco","Yi Yang","Joe Heyward","João Carreira","Andrew Zisserman","Gabriel Brostow","Carl Doersch"],"pdf_url":"https://arxiv.org/pdf/2407.05921v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15205v1","updated":"2024-08-27T17:06:22Z","published":"2024-08-27T17:06:22Z","title":"Leveraging Hallucinations to Reduce Manual Prompt Dependency in\n Promptable Segmentation","summary":" Promptable segmentation typically requires instance-specific manual prompts\nto guide the segmentation of each desired object. To minimize such a need,\ntask-generic promptable segmentation has been introduced, which employs a\nsingle task-generic prompt to segment various images of different objects in\nthe same task. Current methods use Multimodal Large Language Models (MLLMs) to\nreason detailed instance-specific prompts from a task-generic prompt for\nimproving segmentation accuracy. The effectiveness of this segmentation heavily\ndepends on the precision of these derived prompts. However, MLLMs often suffer\nfrom hallucinations during reasoning, resulting in inaccurate prompting. While\nexisting methods focus on eliminating hallucinations to improve a model, we\nargue that MLLM hallucinations can reveal valuable contextual insights when\nleveraged correctly, as they represent pre-trained large-scale knowledge beyond\nindividual images. In this paper, we utilize hallucinations to mine\ntask-related information from images and verify its accuracy for enhancing\nthe precision of the generated prompts. Specifically, we introduce an iterative\nPrompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a\nmask generator. The prompt generator uses multi-scale chain-of-thought\nprompting, initially exploring hallucinations for extracting extended\ncontextual knowledge on a test image. These hallucinations are then reduced to\nformulate precise instance-specific prompts, directing the mask generator to\nproduce masks that are consistent with task semantics by mask semantic\nalignment. The generated masks iteratively induce the prompt generator to focus\nmore on task-relevant image areas and reduce irrelevant hallucinations,\nresulting jointly in better prompts and masks. Experiments on 5 benchmarks\ndemonstrate the effectiveness of ProMaC.
Code given in\nhttps://lwpyh.github.io/ProMaC/.\n","authors":["Jian Hu","Jiayi Lin","Junchi Yan","Shaogang Gong"],"pdf_url":"https://arxiv.org/pdf/2408.15205v1.pdf","comment":"We propose using hallucinations as prior knowledge to extract and\n validate task-related information, which helps generate instance-specific\n prompts for reducing reliance on manual prompts in promptable segmentation"},{"id":"http://arxiv.org/abs/2408.15201v1","updated":"2024-08-27T17:02:03Z","published":"2024-08-27T17:02:03Z","title":"An Investigation on The Position Encoding in Vision-Based Dynamics\n Prediction","summary":" Despite the success of vision-based dynamics prediction models, which predict\nobject states by utilizing RGB images and simple object descriptions, they were\nchallenged by environment misalignments. Although the literature has\ndemonstrated that unifying visual domains with both environment context and\nobject abstract, such as semantic segmentation and bounding boxes, can\neffectively mitigate the visual domain misalignment challenge, discussions were\nfocused on the abstract of environment context, and the insight of using\nbounding box as the object abstract is under-explored. Furthermore, we notice\nthat, as empirical results shown in the literature, even when the visual\nappearance of objects is removed, object bounding boxes alone, instead of being\ndirectly fed into the network, can indirectly provide sufficient position\ninformation via the Region of Interest Pooling operation for dynamics\nprediction. However, previous literature overlooked discussions regarding how\nsuch position information is implicitly encoded in the dynamics prediction\nmodel. Thus, in this paper, we provide detailed studies to investigate the\nprocess and necessary conditions for encoding position information via using\nthe bounding box as the object abstract into output features. Furthermore, we\nstudy the limitation of solely using object abstracts, such that the dynamics\nprediction performance will be jeopardized when the environment context varies.\n","authors":["Jiageng Zhu","Hanchen Xie","Jiazhi Li","Mahyar Khayatkhoei","Wael AbdAlmageed"],"pdf_url":"https://arxiv.org/pdf/2408.15201v1.pdf","comment":"13 pages, 4 tables, and 3 figures. Accepted to ECCV2024 eXCV workshop"},{"id":"http://arxiv.org/abs/2408.02088v3","updated":"2024-08-27T16:46:53Z","published":"2024-08-04T16:54:49Z","title":"KAN-RCBEVDepth: A multi-modal fusion algorithm in object detection for\n autonomous driving","summary":" Accurate 3D object detection in autonomous driving is critical yet\nchallenging due to occlusions, varying object sizes, and complex urban\nenvironments. This paper introduces the KAN-RCBEVDepth method, an innovative\napproach aimed at enhancing 3D object detection by fusing multimodal sensor\ndata from cameras, LiDAR, and millimeter-wave radar. Our unique Bird's Eye\nView-based approach significantly improves detection accuracy and efficiency by\nseamlessly integrating diverse sensor inputs, refining spatial relationship\nunderstanding, and optimizing computational procedures. Experimental results\nshow that the proposed method outperforms existing techniques across multiple\ndetection metrics, achieving a higher Mean Distance AP (0.389, 23\\%\nimprovement), a better ND Score (0.485, 17.1\\% improvement), and a faster\nEvaluation Time (71.28s, 8\\% faster). 
Additionally, the KAN-RCBEVDepth method\nsignificantly reduces errors compared to BEVDepth, with lower Transformation\nError (0.6044, 13.8\\% improvement), Scale Error (0.2780, 2.6\\% improvement),\nOrientation Error (0.5830, 7.6\\% improvement), Velocity Error (0.4244, 28.3\\%\nimprovement), and Attribute Error (0.2129, 3.2\\% improvement). These findings\nsuggest that our method offers enhanced accuracy, reliability, and efficiency,\nmaking it well-suited for dynamic and demanding autonomous driving scenarios.\nThe code will be released in \\url{https://github.com/laitiamo/RCBEVDepth-KAN}.\n","authors":["Zhihao Lai","Chuanhao Liu","Shihui Sheng","Zhiqiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.02088v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08789v4","updated":"2024-08-27T16:43:17Z","published":"2023-07-17T19:17:10Z","title":"Creating Image Datasets in Agricultural Environments using DALL.E:\n Generative AI-Powered Large Language Model","summary":" This research investigated the role of artificial intelligence (AI),\nspecifically the DALL.E model by OpenAI, in advancing data generation and\nvisualization techniques in agriculture. DALL.E, an advanced AI image\ngenerator, works alongside ChatGPT's language processing to transform text\ndescriptions and image clues into realistic visual representations of the\ncontent. The study used both approaches of image generation: text-to-image and\nimage-to image (variation). Six types of datasets depicting fruit crop\nenvironment were generated. These AI-generated images were then compared\nagainst ground truth images captured by sensors in real agricultural fields.\nThe comparison was based on Peak Signal-to-Noise Ratio (PSNR) and Feature\nSimilarity Index (FSIM) metrics. The image-to-image generation exhibited a\n5.78% increase in average PSNR over text-to-image methods, signifying superior\nimage clarity and quality. However, this method also resulted in a 10.23%\ndecrease in average FSIM, indicating a diminished structural and textural\nsimilarity to the original images. Similar to these measures, human evaluation\nalso showed that images generated using image-to-image-based method were more\nrealistic compared to those generated with text-to-image approach. The results\nhighlighted DALL.E's potential in generating realistic agricultural image\ndatasets and thus accelerating the development and adoption of imaging-based\nprecision agricultural solutions.\n","authors":["Ranjan Sapkota","Manoj Karkee"],"pdf_url":"https://arxiv.org/pdf/2307.08789v4.pdf","comment":"9 Figures, 1 table, 17 pages"},{"id":"http://arxiv.org/abs/2408.15185v1","updated":"2024-08-27T16:40:14Z","published":"2024-08-27T16:40:14Z","title":"PoseWatch: A Transformer-based Architecture for Human-centric Video\n Anomaly Detection Using Spatio-temporal Pose Tokenization","summary":" Video Anomaly Detection (VAD) presents a significant challenge in computer\nvision, particularly due to the unpredictable and infrequent nature of\nanomalous events, coupled with the diverse and dynamic environments in which\nthey occur. Human-centric VAD, a specialized area within this domain, faces\nadditional complexities, including variations in human behavior, potential\nbiases in data, and substantial privacy concerns related to human subjects.\nThese issues complicate the development of models that are both robust and\ngeneralizable. 
To address these challenges, recent advancements have focused on\npose-based VAD, which leverages human pose as a high-level feature to mitigate\nprivacy concerns, reduce appearance biases, and minimize background\ninterference. In this paper, we introduce PoseWatch, a novel transformer-based\narchitecture designed specifically for human-centric pose-based VAD. PoseWatch\nfeatures an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP)\ntokenization method that enhances the representation of human motion over time,\nwhich is also beneficial for broader human behavior analysis tasks. The\narchitecture's core, a Unified Encoder Twin Decoders (UETD) transformer,\nsignificantly improves the detection of anomalous behaviors in video data.\nExtensive evaluations across multiple benchmark datasets demonstrate that\nPoseWatch consistently outperforms existing methods, establishing a new\nstate-of-the-art in pose-based VAD. This work not only demonstrates the\nefficacy of PoseWatch but also highlights the potential of integrating Natural\nLanguage Processing techniques with computer vision to advance human behavior\nanalysis.\n","authors":["Ghazal Alinezhad Noghre","Armin Danesh Pazho","Hamed Tabkhi"],"pdf_url":"https://arxiv.org/pdf/2408.15185v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.12040v3","updated":"2024-08-27T16:25:47Z","published":"2024-07-01T17:59:55Z","title":"Comprehensive Performance Evaluation of YOLOv10, YOLOv9 and YOLOv8 on\n Detecting and Counting Fruitlet in Complex Orchard Environments","summary":" This study performed an extensive evaluation of the performances of all\nconfigurations of YOLOv8, YOLOv9, and YOLOv10 object detection algorithms for\nfruitlet (of green fruit) detection in commercial orchards. Additionally, this\nresearch performed and validated in-field counting of fruitlets using an iPhone\nand machine vision sensors in 5 different apple varieties (Scifresh, Scilate,\nHoneycrisp, Cosmic crisp & Golden delicious). This comprehensive investigation\nof total 17 different configurations (5 for YOLOv8, 6 for YOLOv9 and 6 for\nYOLOv10) revealed that YOLOv9 outperforms YOLOv10 and YOLOv8 in terms of\nmAP@50, while YOLOv10x outperformed all 17 configurations tested in terms of\nprecision and recall. Specifically, YOLOv9 Gelan-e achieved the highest mAP@50\nof 0.935, outperforming YOLOv10n's 0.921 and YOLOv8s's 0.924. In terms of\nprecision, YOLOv10x achieved the highest precision of 0.908, indicating\nsuperior object identification accuracy compared to other configurations tested\n(e.g. YOLOv9 Gelan-c with a precision of 0.903 and YOLOv8m with 0.897. In terms\nof recall, YOLOv10s achieved the highest in its series (0.872), while YOLOv9\nGelan m performed the best among YOLOv9 configurations (0.899), and YOLOv8n\nperformed the best among the YOLOv8 configurations (0.883). Meanwhile, three\nconfigurations of YOLOv10: YOLOv10b, YOLOv10l, and YOLOv10x achieved superior\npost-processing speeds of 1.5 milliseconds, outperforming all other\nconfigurations within the YOLOv9 and YOLOv8 families. Specifically, YOLOv9\nGelan-e recorded a post-processing speed of 1.9 milliseconds, and YOLOv8m\nachieved 2.1 milliseconds. 
Furthermore, YOLOv8n exhibited the highest inference\nspeed among all configurations tested, achieving a processing time of 4.1\nmilliseconds while YOLOv9 Gelan-t and YOLOv10n also demonstrated comparatively\nslower inference speeds of 9.3 ms and 5.5 ms, respectively.\n","authors":["Ranjan Sapkota","Zhichao Meng","Martin Churuvija","Xiaoqiang Du","Zenghong Ma","Manoj Karkee"],"pdf_url":"https://arxiv.org/pdf/2407.12040v3.pdf","comment":"14 figures, 2 tables"},{"id":"http://arxiv.org/abs/2408.15178v1","updated":"2024-08-27T16:22:18Z","published":"2024-08-27T16:22:18Z","title":"A Review of Transformer-Based Models for Computer Vision Tasks:\n Capturing Global Context and Spatial Relationships","summary":" Transformer-based models have transformed the landscape of natural language\nprocessing (NLP) and are increasingly applied to computer vision tasks with\nremarkable success. These models, renowned for their ability to capture\nlong-range dependencies and contextual information, offer a promising\nalternative to traditional convolutional neural networks (CNNs) in computer\nvision. In this review paper, we provide an extensive overview of various\ntransformer architectures adapted for computer vision tasks. We delve into how\nthese models capture global context and spatial relationships in images,\nempowering them to excel in tasks such as image classification, object\ndetection, and segmentation. Analyzing the key components, training\nmethodologies, and performance metrics of transformer-based models, we\nhighlight their strengths, limitations, and recent advancements. Additionally,\nwe discuss potential research directions and applications of transformer-based\nmodels in computer vision, offering insights into their implications for future\nadvancements in the field.\n","authors":["Gracile Astlin Pereira","Muhammad Hussain"],"pdf_url":"https://arxiv.org/pdf/2408.15178v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.10636v2","updated":"2024-08-27T16:19:22Z","published":"2024-08-20T08:22:29Z","title":"UWF-RI2FA: Generating Multi-frame Ultrawide-field Fluorescein\n Angiography from Ultrawide-field Retinal Imaging Improves Diabetic\n Retinopathy Stratification","summary":" Ultrawide-field fluorescein angiography (UWF-FA) facilitates diabetic\nretinopathy (DR) detection by providing a clear visualization of peripheral\nretinal lesions. However, the intravenous dye injection with potential risks\nhamper its application. We aim to acquire dye-free UWF-FA images from\nnoninvasive UWF retinal imaging (UWF-RI) using generative artificial\nintelligence (GenAI) and evaluate its effectiveness in DR screening. A total of\n18,321 UWF-FA images of different phases were registered with corresponding\nUWF-RI images and fed into a generative adversarial networks (GAN)-based model\nfor training. The quality of generated UWF-FA images was evaluated through\nquantitative metrics and human evaluation. The DeepDRiD dataset was used to\nexternally assess the contribution of generated UWF-FA images to DR\nclassification, using area under the receiver operating characteristic curve\n(AUROC) as outcome metrics. The generated early, mid, and late phase UWF-FA\nimages achieved high authenticity, with multi-scale similarity scores ranging\nfrom 0.70 to 0.91 and qualitative visual scores ranging from 1.64 to 1.98\n(1=real UWF-FA quality). In fifty randomly selected images, 56% to 76% of the\ngenerated images were difficult to distinguish from real images in the Turing\ntest. 
Moreover, adding these generated UWF-FA images for DR classification\nsignificantly increased the AUROC from 0.869 to 0.904 compared to the baseline\nmodel using UWF-RI images (P < .001). The model successfully generates\nrealistic multi-frame UWF-FA images for enhancing DR stratification without\nintravenous dye injection.\n","authors":["Ruoyu Chen","Kezheng Xu","Kangyan Zheng","Weiyi Zhang","Yan Lu","Danli Shi","Mingguang He"],"pdf_url":"https://arxiv.org/pdf/2408.10636v2.pdf","comment":"22 pages, 2 figures"},{"id":"http://arxiv.org/abs/2408.13698v2","updated":"2024-08-27T16:11:44Z","published":"2024-08-25T01:27:35Z","title":"CNN-Transformer Rectified Collaborative Learning for Medical Image\n Segmentation","summary":" Automatic and precise medical image segmentation (MIS) is of vital importance\nfor clinical diagnosis and analysis. Current MIS methods mainly rely on the\nconvolutional neural network (CNN) or self-attention mechanism (Transformer)\nfor feature modeling. However, CNN-based methods suffer from the inaccurate\nlocalization owing to the limited global dependency while Transformer-based\nmethods always present the coarse boundary for the lack of local emphasis.\nAlthough some CNN-Transformer hybrid methods are designed to synthesize the\ncomplementary local and global information for better performance, the\ncombination of CNN and Transformer introduces numerous parameters and increases\nthe computation cost. To this end, this paper proposes a CNN-Transformer\nrectified collaborative learning (CTRCL) framework to learn stronger CNN-based\nand Transformer-based models for MIS tasks via the bi-directional knowledge\ntransfer between them. Specifically, we propose a rectified logit-wise\ncollaborative learning (RLCL) strategy which introduces the ground truth to\nadaptively select and rectify the wrong regions in student soft labels for\naccurate knowledge transfer in the logit space. We also propose a class-aware\nfeature-wise collaborative learning (CFCL) strategy to achieve effective\nknowledge transfer between CNN-based and Transformer-based models in the\nfeature space by granting their intermediate features the similar capability of\ncategory perception. Extensive experiments on three popular MIS benchmarks\ndemonstrate that our CTRCL outperforms most state-of-the-art collaborative\nlearning methods under different evaluation metrics.\n","authors":["Lanhu Wu","Miao Zhang","Yongri Piao","Zhenyan Yao","Weibing Sun","Feng Tian","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2408.13698v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15172v1","updated":"2024-08-27T16:10:21Z","published":"2024-08-27T16:10:21Z","title":"X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation","summary":" Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been\nshown to enhance the effectiveness of enriching item descriptions, thereby\nimproving the accuracy of recommendation systems. However, most existing\napproaches either rely on text-only prompting or employ basic multimodal\nstrategies that do not fully exploit the complementary information available\nfrom both textual and visual modalities. This paper introduces a novel\nframework, Cross-Reflection Prompting, termed X-Reflect, designed to address\nthese limitations by prompting LMMs to explicitly identify and reconcile\nsupportive and conflicting information between text and images. 
By capturing\nnuanced insights from both modalities, this approach generates more\ncomprehensive and contextually richer item representations. Extensive\nexperiments conducted on two widely used benchmarks demonstrate that our method\noutperforms existing prompting baselines in downstream recommendation accuracy.\nAdditionally, we evaluate the generalizability of our framework across\ndifferent LMM backbones and the robustness of the prompting strategies,\noffering insights for optimization. This work underscores the importance of\nintegrating multimodal information and presents a novel solution for improving\nitem understanding in multimodal recommendation systems.\n","authors":["Hanjia Lyu","Ryan Rossi","Xiang Chen","Md Mehrab Tanjim","Stefano Petrangeli","Somdeb Sarkhel","Jiebo Luo"],"pdf_url":"https://arxiv.org/pdf/2408.15172v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.13993v3","updated":"2024-08-27T15:56:33Z","published":"2024-04-22T08:59:35Z","title":"Zero-Shot Character Identification and Speaker Prediction in Comics via\n Iterative Multimodal Fusion","summary":" Recognizing characters and predicting speakers of dialogue are critical for\ncomic processing tasks, such as voice generation or translation. However,\nbecause characters vary by comic title, supervised learning approaches like\ntraining character classifiers which require specific annotations for each\ncomic title are infeasible. This motivates us to propose a novel zero-shot\napproach, allowing machines to identify characters and predict speaker names\nbased solely on unannotated comic images. In spite of their importance in\nreal-world applications, these task have largely remained unexplored due to\nchallenges in story comprehension and multimodal integration. Recent large\nlanguage models (LLMs) have shown great capability for text understanding and\nreasoning, while their application to multimodal content analysis is still an\nopen problem. To address this problem, we propose an iterative multimodal\nframework, the first to employ multimodal information for both character\nidentification and speaker prediction tasks. Our experiments demonstrate the\neffectiveness of the proposed framework, establishing a robust baseline for\nthese tasks. Furthermore, since our method requires no training data or\nannotations, it can be used as-is on any comic series.\n","authors":["Yingxuan Li","Ryota Hinami","Kiyoharu Aizawa","Yusuke Matsui"],"pdf_url":"https://arxiv.org/pdf/2404.13993v3.pdf","comment":"Accepted to ACM Multimedia 2024"},{"id":"http://arxiv.org/abs/2408.15159v1","updated":"2024-08-27T15:55:18Z","published":"2024-08-27T15:55:18Z","title":"Empowering Sign Language Communication: Integrating Sentiment and\n Semantics for Facial Expression Synthesis","summary":" Translating written sentences from oral languages to a sequence of manual and\nnon-manual gestures plays a crucial role in building a more inclusive society\nfor deaf and hard-of-hearing people. Facial expressions (non-manual), in\nparticular, are responsible for encoding the grammar of the sentence to be\nspoken, applying punctuation, pronouns, or emphasizing signs. These non-manual\ngestures are closely related to the semantics of the sentence being spoken and\nalso to the utterance of the speaker's emotions. However, most Sign Language\nProduction (SLP) approaches are centered on synthesizing manual gestures and do\nnot focus on modeling the speakers expression. 
This paper introduces a new\nmethod focused on synthesizing facial expressions for sign language. Our goal\nis to improve sign language production by integrating sentiment information in\nfacial expression generation. The approach leverages sentence sentiment and\nsemantic features to sample from a meaningful representation space, integrating\nthe bias of the non-manual components into the sign language production\nprocess. To evaluate our method, we extend the Frechet Gesture Distance (FGD)\nand propose a new metric called Frechet Expression Distance (FED) and apply an\nextensive set of metrics to assess the quality of specific regions of the face.\nThe experimental results showed that our method achieved state-of-the-art\nperformance, surpassing the competitors on the How2Sign and PHOENIX14T datasets.\nMoreover, our architecture is based on a carefully designed graph pyramid that\nmakes it simpler, easier to train, and capable of leveraging emotions to\nproduce facial expressions.\n","authors":["Rafael Azevedo","Thiago Coutinho","João Ferreira","Thiago Gomes","Erickson Nascimento"],"pdf_url":"https://arxiv.org/pdf/2408.15159v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15143v1","updated":"2024-08-27T15:31:45Z","published":"2024-08-27T15:31:45Z","title":"A Preliminary Exploration Towards General Image Restoration","summary":" Despite the tremendous success of deep models in various individual image\nrestoration tasks, there are at least two major technical challenges preventing\nthese works from being applied to real-world usages: (1) the lack of\ngeneralization ability and (2) the complex and unknown degradations in\nreal-world scenarios. Existing deep models, tailored for specific individual\nimage restoration tasks, often fall short in effectively addressing these\nchallenges. In this paper, we present a new problem called general image\nrestoration (GIR) which aims to address these challenges within a unified\nmodel. GIR covers most individual image restoration tasks (\\eg, image\ndenoising, deblurring, deraining and super-resolution) and their combinations\nfor general purposes. This paper proceeds to delineate the essential aspects of\nGIR, including problem definition and the overarching significance of\ngeneralization performance. Moreover, the establishment of new datasets and a\nthorough evaluation framework for GIR models is discussed. We conduct a\ncomprehensive evaluation of existing approaches for tackling the GIR challenge,\nilluminating their strengths and pragmatic challenges. By analyzing these\napproaches, we not only underscore the effectiveness of GIR but also highlight\nthe difficulties in its practical implementation. Finally, we also try to\nunderstand and interpret these models' behaviors to inspire the future\ndirection. 
Our work can open up new valuable research directions and contribute\nto the research of general vision.\n","authors":["Xiangtao Kong","Jinjin Gu","Yihao Liu","Wenlong Zhang","Xiangyu Chen","Yu Qiao","Chao Dong"],"pdf_url":"https://arxiv.org/pdf/2408.15143v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13896v2","updated":"2024-08-27T15:13:01Z","published":"2024-08-25T17:33:40Z","title":"RT-Attack: Jailbreaking Text-to-Image Models via Random Token","summary":" Recently, Text-to-Image(T2I) models have achieved remarkable success in image\ngeneration and editing, yet these models still have many potential issues,\nparticularly in generating inappropriate or Not-Safe-For-Work(NSFW) content.\nStrengthening attacks and uncovering such vulnerabilities can advance the\ndevelopment of reliable and practical T2I models. Most of the previous works\ntreat T2I models as white-box systems, using gradient optimization to generate\nadversarial prompts. However, accessing the model's gradient is often\nimpossible in real-world scenarios. Moreover, existing defense methods, those\nusing gradient masking, are designed to prevent attackers from obtaining\naccurate gradient information. While some black-box jailbreak attacks have been\nexplored, these typically rely on simply replacing sensitive words, leading to\nsuboptimal attack performance. To address this issue, we introduce a two-stage\nquery-based black-box attack method utilizing random search. In the first\nstage, we establish a preliminary prompt by maximizing the semantic similarity\nbetween the adversarial and target harmful prompts. In the second stage, we use\nthis initial prompt to refine our approach, creating a detailed adversarial\nprompt aimed at jailbreaking and maximizing the similarity in image features\nbetween the images generated from this prompt and those produced by the target\nharmful prompt. Extensive experiments validate the effectiveness of our method\nin attacking the latest prompt checkers, post-hoc image checkers, securely\ntrained T2I models, and online commercial models.\n","authors":["Sensen Gao","Xiaojun Jia","Yihao Huang","Ranjie Duan","Jindong Gu","Yang Liu","Qing Guo"],"pdf_url":"https://arxiv.org/pdf/2408.13896v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15127v1","updated":"2024-08-27T15:07:58Z","published":"2024-08-27T15:07:58Z","title":"T-FAKE: Synthesizing Thermal Images for Facial Landmarking","summary":" Facial analysis is a key component in a wide range of applications such as\nsecurity, autonomous driving, entertainment, and healthcare. Despite the\navailability of various facial RGB datasets, the thermal modality, which plays\na crucial role in life sciences, medicine, and biometrics, has been largely\noverlooked. To address this gap, we introduce the T-FAKE dataset, a new\nlarge-scale synthetic thermal dataset with sparse and dense landmarks. To\nfacilitate the creation of the dataset, we propose a novel RGB2Thermal loss\nfunction, which enables the transfer of thermal style to RGB faces. By\nutilizing the Wasserstein distance between thermal and RGB patches and the\nstatistical analysis of clinical temperature distributions on faces, we ensure\nthat the generated thermal images closely resemble real samples. 
Using\nRGB2Thermal style transfer based on our RGB2Thermal loss function, we create\nthe T-FAKE dataset, a large-scale synthetic thermal dataset of faces.\nLeveraging our novel T-FAKE dataset, probabilistic landmark prediction, and\nlabel adaptation networks, we demonstrate significant improvements in landmark\ndetection methods on thermal images across different landmark conventions. Our\nmodels show excellent performance with both sparse 70-point landmarks and dense\n478-point landmark annotations. Our code and models are available at\nhttps://github.com/phflot/tfake.\n","authors":["Philipp Flotho","Moritz Piening","Anna Kukleva","Gabriele Steidl"],"pdf_url":"https://arxiv.org/pdf/2408.15127v1.pdf","comment":"22 pages, 12 figures, Philipp Flotho and Moritz Piening share equal\n contribution"},{"id":"http://arxiv.org/abs/2408.15122v1","updated":"2024-08-27T15:03:20Z","published":"2024-08-27T15:03:20Z","title":"Machine Learning for Methane Detection and Quantification from Space --\n A survey","summary":" Methane (CH_4) is a potent anthropogenic greenhouse gas, contributing 86\ntimes more to global warming than Carbon Dioxide (CO_2) over 20 years, and it\nalso acts as an air pollutant. Given its high radiative forcing potential and\nrelatively short atmospheric lifetime (9\\textpm1 years), methane has important\nimplications for climate change, therefore, cutting methane emissions is\ncrucial for effective climate change mitigation. This work expands existing\ninformation on operational methane point source detection sensors in the\nShort-Wave Infrared (SWIR) bands. It reviews the state-of-the-art for\ntraditional as well as Machine Learning (ML) approaches. The architecture and\ndata used in such ML models will be discussed separately for methane plume\nsegmentation and emission rate estimation. Traditionally, experts rely on\nlabor-intensive manually adjusted methods for methane detection. However, ML\napproaches offer greater scalability. Our analysis reveals that ML models\noutperform traditional methods, particularly those based on convolutional\nneural networks (CNN), which are based on the U-net and transformer\narchitectures. These ML models extract valuable information from\nmethane-sensitive spectral data, enabling a more accurate detection. Challenges\narise when comparing these methods due to variations in data, sensor\nspecifications, and evaluation metrics. To address this, we discuss existing\ndatasets and metrics, providing an overview of available resources and\nidentifying open research problems. Finally, we explore potential future\nadvances in ML, emphasizing approaches for model comparability, large dataset\ncreation, and the European Union's forthcoming methane strategy.\n","authors":["Enno Tiemann","Shanyu Zhou","Alexander Kläser","Konrad Heidler","Rochelle Schneider","Xiao Xiang Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.15122v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.05892v3","updated":"2024-08-27T15:00:53Z","published":"2024-08-12T02:10:18Z","title":"Polyp SAM 2: Advancing Zero shot Polyp Segmentation in Colorectal Cancer\n Detection","summary":" Polyp segmentation plays a crucial role in the early detection and diagnosis\nof colorectal cancer. However, obtaining accurate segmentations often requires\nlabor-intensive annotations and specialized models. Recently, Meta AI Research\nreleased a general Segment Anything Model 2 (SAM 2), which has demonstrated\npromising performance in several segmentation tasks. 
In this manuscript, we\nevaluate the performance of SAM 2 in segmenting polyps under various prompted\nsettings. We hope this report will provide insights to advance the field of\npolyp segmentation and promote more interesting work in the future. This\nproject is publicly available at https://github.com/sajjad-sh33/Polyp-SAM-2.\n","authors":["Mobina Mansoori","Sajjad Shahabodini","Jamshid Abouei","Konstantinos N. Plataniotis","Arash Mohammadi"],"pdf_url":"https://arxiv.org/pdf/2408.05892v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15119v1","updated":"2024-08-27T14:58:13Z","published":"2024-08-27T14:58:13Z","title":"Urdu Digital Text Word Optical Character Recognition Using Permuted Auto\n Regressive Sequence Modeling","summary":" This research paper introduces an innovative word-level Optical Character\nRecognition (OCR) model specifically designed for digital Urdu text\nrecognition. Utilizing transformer-based architectures and attention\nmechanisms, the model was trained on a comprehensive dataset of approximately\n160,000 Urdu text images, achieving a character error rate (CER) of 0.178,\nwhich highlights its superior accuracy in recognizing Urdu characters. The\nmodel's strength lies in its unique architecture, incorporating the permuted\nautoregressive sequence (PARSeq) model, which allows for context-aware\ninference and iterative refinement by leveraging bidirectional context\ninformation to enhance recognition accuracy. Furthermore, its capability to\nhandle a diverse range of Urdu text styles, fonts, and variations enhances its\napplicability in real-world scenarios. Despite its promising results, the model\nhas some limitations, such as difficulty with blurred images, non-horizontal\norientations, and overlays of patterns, lines, or other text, which can\noccasionally lead to suboptimal performance. Additionally, trailing or\nfollowing punctuation marks can introduce noise into the recognition process.\nAddressing these challenges will be a focus of future research, aiming to\nrefine the model further, explore data augmentation techniques, optimize\nhyperparameters, and integrate contextual improvements for more accurate and\nefficient Urdu text recognition.\n","authors":["Ahmed Mustafa","Ijlal Baig","Hasan Sajid"],"pdf_url":"https://arxiv.org/pdf/2408.15119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15118v1","updated":"2024-08-27T14:58:08Z","published":"2024-08-27T14:58:08Z","title":"DIFR3CT: Latent Diffusion for Probabilistic 3D CT Reconstruction from\n Few Planar X-Rays","summary":" Computed Tomography (CT) scans are the standard-of-care for the visualization\nand diagnosis of many clinical ailments, and are needed for the treatment\nplanning of external beam radiotherapy. Unfortunately, the availability of CT\nscanners in low- and mid-resource settings is highly variable. Planar x-ray\nradiography units, in comparison, are far more prevalent, but can only provide\nlimited 2D observations of the 3D anatomy. In this work we propose DIFR3CT, a\n3D latent diffusion model, that can generate a distribution of plausible CT\nvolumes from one or few (<10) planar x-ray observations. DIFR3CT works by\nfusing 2D features from each x-ray into a joint 3D space, and performing\ndiffusion conditioned on these fused features in a low-dimensional latent\nspace. 
We conduct extensive experiments demonstrating that DIFR3CT is better\nthan recent sparse CT reconstruction baselines in terms of standard pixel-level\n(PSNR, SSIM) on both the public LIDC and in-house post-mastectomy CT datasets.\nWe also show that DIFR3CT supports uncertainty quantification via Monte Carlo\nsampling, which provides an opportunity to measure reconstruction reliability.\nFinally, we perform a preliminary pilot study evaluating DIFR3CT for automated\nbreast radiotherapy contouring and planning -- and demonstrate promising\nfeasibility. Our code is available at https://github.com/yransun/DIFR3CT.\n","authors":["Yiran Sun","Hana Baroudi","Tucker Netherton","Laurence Court","Osama Mawlawi","Ashok Veeraraghavan","Guha Balakrishnan"],"pdf_url":"https://arxiv.org/pdf/2408.15118v1.pdf","comment":"11 pages, 9 figures"},{"id":"http://arxiv.org/abs/2408.15114v1","updated":"2024-08-27T14:54:33Z","published":"2024-08-27T14:54:33Z","title":"Few-Shot Unsupervised Implicit Neural Shape Representation Learning with\n Spatial Adversaries","summary":" Implicit Neural Representations have gained prominence as a powerful\nframework for capturing complex data modalities, encompassing a wide range from\n3D shapes to images and audio. Within the realm of 3D shape representation,\nNeural Signed Distance Functions (SDF) have demonstrated remarkable potential\nin faithfully encoding intricate shape geometry. However, learning SDFs from\nsparse 3D point clouds in the absence of ground truth supervision remains a\nvery challenging task. While recent methods rely on smoothness priors to\nregularize the learning, our method introduces a regularization term that\nleverages adversarial samples around the shape to improve the learned SDFs.\nThrough extensive experiments and evaluations, we illustrate the efficacy of\nour proposed method, highlighting its capacity to improve SDF learning with\nrespect to baselines and the state-of-the-art using synthetic and real data.\n","authors":["Amine Ouasfi","Adnane Boukhayma"],"pdf_url":"https://arxiv.org/pdf/2408.15114v1.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2408.15113v1","updated":"2024-08-27T14:51:34Z","published":"2024-08-27T14:51:34Z","title":"AnomalousPatchCore: Exploring the Use of Anomalous Samples in Industrial\n Anomaly Detection","summary":" Visual inspection, or industrial anomaly detection, is one of the most common\nquality control types in manufacturing. The task is to identify the presence of\nan anomaly given an image, e.g., a missing component on an image of a circuit\nboard, for subsequent manual inspection. While industrial anomaly detection has\nseen a surge in recent years, most anomaly detection methods still utilize\nknowledge only from normal samples, failing to leverage the information from\nthe frequently available anomalous samples. Additionally, they heavily rely on\nvery general feature extractors pre-trained on common image classification\ndatasets. In this paper, we address these shortcomings and propose the new\nanomaly detection system AnomalousPatchCore~(APC) based on a feature extractor\nfine-tuned with normal and anomalous in-domain samples and a subsequent memory\nbank for identifying unusual features. To fine-tune the feature extractor in\nAPC, we propose three auxiliary tasks that address the different aspects of\nanomaly detection~(classification vs. localization) and mitigate the effect of\nthe imbalance between normal and anomalous samples. 
Our extensive evaluation on\nthe MVTec dataset shows that APC outperforms state-of-the-art systems in\ndetecting anomalies, which is especially important in industrial anomaly\ndetection given the subsequent manual inspection. In detailed ablation studies,\nwe further investigate the properties of our APC.\n","authors":["Mykhailo Koshil","Tilman Wegener","Detlef Mentrup","Simone Frintrop","Christian Wilms"],"pdf_url":"https://arxiv.org/pdf/2408.15113v1.pdf","comment":"Accepted at the 2nd workshop on Vision-based InduStrial InspectiON\n (VISION) @ ECCV"},{"id":"http://arxiv.org/abs/2408.15103v1","updated":"2024-08-27T14:40:19Z","published":"2024-08-27T14:40:19Z","title":"Enhancing License Plate Super-Resolution: A Layout-Aware and\n Character-Driven Approach","summary":" Despite significant advancements in License Plate Recognition (LPR) through\ndeep learning, most improvements rely on high-resolution images with clear\ncharacters. This scenario does not reflect real-world conditions where traffic\nsurveillance often captures low-resolution and blurry images. Under these\nconditions, characters tend to blend with the background or neighboring\ncharacters, making accurate LPR challenging. To address this issue, we\nintroduce a novel loss function, Layout and Character Oriented Focal Loss\n(LCOFL), which considers factors such as resolution, texture, and structural\ndetails, as well as the performance of the LPR task itself. We enhance\ncharacter feature learning using deformable convolutions and shared weights in\nan attention module and employ a GAN-based training approach with an Optical\nCharacter Recognition (OCR) model as the discriminator to guide the\nsuper-resolution process. Our experimental results show significant\nimprovements in character reconstruction quality, outperforming two\nstate-of-the-art methods in both quantitative and qualitative measures. Our\ncode is publicly available at https://github.com/valfride/lpsr-lacd\n","authors":["Valfride Nascimento","Rayson Laroca","Rafael O. Ribeiro","William Robson Schwartz","David Menotti"],"pdf_url":"https://arxiv.org/pdf/2408.15103v1.pdf","comment":"Accepted for presentation at the Conference on Graphics, Patterns and\n Images (SIBGRAPI) 2024"},{"id":"http://arxiv.org/abs/2408.15101v1","updated":"2024-08-27T14:36:46Z","published":"2024-08-27T14:36:46Z","title":"MTMamba++: Enhancing Multi-Task Dense Scene Understanding via\n Mamba-Based Decoders","summary":" Multi-task dense scene understanding, which trains a model for multiple dense\nprediction tasks, has a wide range of application scenarios. Capturing\nlong-range dependency and enhancing cross-task interactions are crucial to\nmulti-task dense prediction. In this paper, we propose MTMamba++, a novel\narchitecture for multi-task scene understanding featuring with a Mamba-based\ndecoder. It contains two types of core blocks: self-task Mamba (STM) block and\ncross-task Mamba (CTM) block. STM handles long-range dependency by leveraging\nstate-space models, while CTM explicitly models task interactions to facilitate\ninformation exchange across tasks. We design two types of CTM block, namely\nF-CTM and S-CTM, to enhance cross-task interaction from feature and semantic\nperspectives, respectively. Experiments on NYUDv2, PASCAL-Context, and\nCityscapes datasets demonstrate the superior performance of MTMamba++ over\nCNN-based and Transformer-based methods. 
The code is available at\nhttps://github.com/EnVision-Research/MTMamba.\n","authors":["Baijiong Lin","Weisen Jiang","Pengguang Chen","Shu Liu","Ying-Cong Chen"],"pdf_url":"https://arxiv.org/pdf/2408.15101v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2407.02228"},{"id":"http://arxiv.org/abs/2408.15098v1","updated":"2024-08-27T14:30:36Z","published":"2024-08-27T14:30:36Z","title":"CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality\n Assessment with CLIP","summary":" With the rapid development of generative technologies, AI-Generated Images\n(AIGIs) have been widely applied in various aspects of daily life. However, due\nto the immaturity of the technology, the quality of the generated images\nvaries, so it is important to develop quality assessment techniques for the\ngenerated images. Although some models have been proposed to assess the quality\nof generated images, they are inadequate when faced with the ever-increasing\nand diverse categories of generated images. Consequently, the development of\nmore advanced and effective models for evaluating the quality of generated\nimages is urgently needed. Recent research has explored the significant\npotential of the visual language model CLIP in image quality assessment,\nfinding that it performs well in evaluating the quality of natural images.\nHowever, its application to generated images has not been thoroughly\ninvestigated. In this paper, we build on this idea and further explore the\npotential of CLIP in evaluating the quality of generated images. We design\nCLIP-AGIQA, a CLIP-based regression model for quality assessment of generated\nimages, leveraging rich visual and textual knowledge encapsulated in CLIP.\nParticularly, we implement multi-category learnable prompts to fully utilize\nthe textual knowledge in CLIP for quality assessment. Extensive experiments on\nseveral generated image quality assessment benchmarks, including AGIQA-3K and\nAIGCIQA2023, demonstrate that CLIP-AGIQA outperforms existing IQA models,\nachieving excellent results in evaluating the quality of generated images.\n","authors":["Zhenchen Tang","Zichuan Wang","Bo Peng","Jing Dong"],"pdf_url":"https://arxiv.org/pdf/2408.15098v1.pdf","comment":"accepted by ICPR2024"},{"id":"http://arxiv.org/abs/2408.15094v1","updated":"2024-08-27T14:25:42Z","published":"2024-08-27T14:25:42Z","title":"Constrained Diffusion Models via Dual Training","summary":" Diffusion models have attained prominence for their ability to synthesize a\nprobability distribution for a given dataset via a diffusion process, enabling\nthe generation of new data points with high fidelity. However, diffusion\nprocesses are prone to generating biased data based on the training dataset. To\naddress this issue, we develop constrained diffusion models by imposing\ndiffusion constraints based on desired distributions that are informed by\nrequirements. Specifically, we cast the training of diffusion models under\nrequirements as a constrained distribution optimization problem that aims to\nreduce the distribution difference between original and generated data while\nobeying constraints on the distribution of generated data. We show that our\nconstrained diffusion models generate new data from a mixture data distribution\nthat achieves the optimal trade-off among objective and constraints. To train\nconstrained diffusion models, we develop a dual training algorithm and\ncharacterize the optimality of the trained constrained diffusion model. 
We\nempirically demonstrate the effectiveness of our constrained models in two\nconstrained generation tasks: (i) we consider a dataset with one or more\nunderrepresented classes where we train the model with constraints to ensure\nfairly sampling from all classes during inference; (ii) we fine-tune a\npre-trained diffusion model to sample from a new dataset while avoiding\noverfitting.\n","authors":["Shervin Khalafi","Dongsheng Ding","Alejandro Ribeiro"],"pdf_url":"https://arxiv.org/pdf/2408.15094v1.pdf","comment":"41 pages, 4 figures, 2 tables"},{"id":"http://arxiv.org/abs/2307.10895v4","updated":"2024-08-27T14:23:51Z","published":"2023-07-20T14:18:44Z","title":"Variational Autoencoding of Dental Point Clouds","summary":" Digital dentistry has made significant advancements, yet numerous challenges\nremain. This paper introduces the FDI 16 dataset, an extensive collection of\ntooth meshes and point clouds. Additionally, we present a novel approach:\nVariational FoldingNet (VF-Net), a fully probabilistic variational autoencoder\nfor point clouds. Notably, prior latent variable models for point clouds lack a\none-to-one correspondence between input and output points. Instead, they rely\non optimizing Chamfer distances, a metric that lacks a normalized\ndistributional counterpart, rendering it unsuitable for probabilistic modeling.\nWe replace the explicit minimization of Chamfer distances with a suitable\nencoder, increasing computational efficiency while simplifying the\nprobabilistic extension. This allows for straightforward application in various\ntasks, including mesh generation, shape completion, and representation\nlearning. Empirically, we provide evidence of lower reconstruction error in\ndental reconstruction and interpolation, showcasing state-of-the-art\nperformance in dental sample generation while identifying valuable latent\nrepresentations\n","authors":["Johan Ziruo Ye","Thomas Ørkild","Peter Lempel Søndergaard","Søren Hauberg"],"pdf_url":"https://arxiv.org/pdf/2307.10895v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.01870v2","updated":"2024-08-27T14:16:06Z","published":"2022-09-05T10:06:03Z","title":"Unsupervised Domain Adaptation via Style-Aware Self-intermediate Domain","summary":" Unsupervised domain adaptation (UDA) has attracted considerable attention,\nwhich transfers knowledge from a label-rich source domain to a related but\nunlabeled target domain. Reducing inter-domain differences has always been a\ncrucial factor to improve performance in UDA, especially for tasks where there\nis a large gap between source and target domains. To this end, we propose a\nnovel style-aware feature fusion method (SAFF) to bridge the large domain gap\nand transfer knowledge while alleviating the loss of class-discriminative\ninformation. Inspired by the human transitive inference and learning ability, a\nnovel style-aware self-intermediate domain (SSID) is investigated to link two\nseemingly unrelated concepts through a series of intermediate auxiliary\nsynthesized concepts. 
Specifically, we propose a novel learning strategy of\nSSID, which selects samples from both source and target domains as anchors, and\nthen randomly fuses the object and style features of these anchors to generate\nlabeled and style-rich intermediate auxiliary features for knowledge transfer.\nMoreover, we design an external memory bank to store and update specified\nlabeled features to obtain stable class features and class-wise style features.\nBased on the proposed memory bank, the intra- and inter-domain loss functions\nare designed to improve the class recognition ability and feature\ncompatibility, respectively. Meanwhile, we simulate the rich latent feature\nspace of SSID by infinite sampling and the convergence of the loss function by\nmathematical theory. Finally, we conduct comprehensive experiments on commonly\nused domain adaptive benchmarks to evaluate the proposed SAFF, and the\nexperimental results show that the proposed SAFF can be easily combined with\ndifferent backbone networks and obtain better performance as a plug-in-plug-out\nmodule.\n","authors":["Lianyu Wang","Meng Wang","Daoqiang Zhang","Huazhu Fu"],"pdf_url":"https://arxiv.org/pdf/2209.01870v2.pdf","comment":"13 pages, 7 figures"},{"id":"http://arxiv.org/abs/2408.13627v2","updated":"2024-08-27T14:14:51Z","published":"2024-08-24T16:48:25Z","title":"Recent Event Camera Innovations: A Survey","summary":" Event-based vision, inspired by the human visual system, offers\ntransformative capabilities such as low latency, high dynamic range, and\nreduced power consumption. This paper presents a comprehensive survey of event\ncameras, tracing their evolution over time. It introduces the fundamental\nprinciples of event cameras, compares them with traditional frame cameras, and\nhighlights their unique characteristics and operational differences. The survey\ncovers various event camera models from leading manufacturers, key\ntechnological milestones, and influential research contributions. It explores\ndiverse application areas across different domains and discusses essential\nreal-world and synthetic datasets for research advancement. Additionally, the\nrole of event camera simulators in testing and development is discussed. This\nsurvey aims to consolidate the current state of event cameras and inspire\nfurther innovation in this rapidly evolving field. To support the research\ncommunity, a GitHub page\n(https://github.com/chakravarthi589/Event-based-Vision_Resources) categorizes\npast and future research articles and consolidates valuable resources.\n","authors":["Bharatesh Chakravarthi","Aayush Atul Verma","Kostas Daniilidis","Cornelia Fermuller","Yezhou Yang"],"pdf_url":"https://arxiv.org/pdf/2408.13627v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15077v1","updated":"2024-08-27T14:05:48Z","published":"2024-08-27T14:05:48Z","title":"MMASD+: A Novel Dataset for Privacy-Preserving Behavior Analysis of\n Children with Autism Spectrum Disorder","summary":" Autism spectrum disorder (ASD) is characterized by significant challenges in\nsocial interaction and comprehending communication signals. Recently,\ntherapeutic interventions for ASD have increasingly utilized Deep learning\npowered-computer vision techniques to monitor individual progress over time.\nThese models are trained on private, non-public datasets from the autism\ncommunity, creating challenges in comparing results across different models due\nto privacy-preserving data-sharing issues. This work introduces MMASD+. 
MMASD+\nconsists of diverse data modalities, including 3D-Skeleton, 3D Body Mesh, and\nOptical Flow data. It integrates the capabilities of Yolov8 and Deep SORT\nalgorithms to distinguish between the therapist and children, addressing a\nsignificant barrier in the original dataset. Additionally, a Multimodal\nTransformer framework is proposed to predict 11 action types and the presence\nof ASD. This framework achieves an accuracy of 95.03% for predicting action\ntypes and 96.42% for predicting ASD presence, demonstrating over a 10%\nimprovement compared to models trained on single data modalities. These\nfindings highlight the advantages of integrating multiple data modalities\nwithin the Multimodal Transformer framework.\n","authors":["Pavan Uttej Ravva","Behdokht Kiafar","Pinar Kullu","Jicheng Li","Anjana Bhat","Roghayeh Leila Barmaki"],"pdf_url":"https://arxiv.org/pdf/2408.15077v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15069v1","updated":"2024-08-27T13:56:48Z","published":"2024-08-27T13:56:48Z","title":"Geometric Artifact Correction for Symmetric Multi-Linear Trajectory CT:\n Theory, Method, and Generalization","summary":" For extending CT field-of-view to perform non-destructive testing, the\nSymmetric Multi-Linear trajectory Computed Tomography (SMLCT) has been\ndeveloped as a successful example of non-standard CT scanning modes. However,\ninevitable geometric errors can cause severe artifacts in the reconstructed\nimages. The existing calibration method for SMLCT is both crude and\ninefficient. It involves reconstructing hundreds of images by exhaustively\nsubstituting each potential error, and then manually identifying the images\nwith the fewest geometric artifacts to estimate the final geometric errors for\ncalibration. In this paper, we comprehensively and efficiently address the\nchallenging geometric artifacts in SMLCT, and the corresponding works mainly\ninvolve theory, method, and generalization. In particular, after identifying\nsensitive parameters and conducting some theoretical analysis of geometric\nartifacts, we summarize several key properties between sensitive geometric\nparameters and artifact characteristics. Then, we further construct\nmathematical relationships that relate sensitive geometric errors to the pixel\noffsets of reconstruction images with artifact characteristics. To accurately\nextract pixel bias, we innovatively adapt the Generalized Cross-Correlation\nwith Phase Transform (GCC-PHAT) algorithm, commonly used in sound processing,\nfor our image registration task for each paired symmetric LCT. This adaptation\nleads to the design of a highly efficient rigid translation registration\nmethod. Simulation and physical experiments have validated the excellent\nperformance of this work. Additionally, our results demonstrate significant\ngeneralization to common rotated CT and a variant of SMLCT.\n","authors":["Zhisheng Wang","Yanxu Sun","Shangyu Li","Legeng Lin","Shunli Wang","Junning Cui"],"pdf_url":"https://arxiv.org/pdf/2408.15069v1.pdf","comment":"15 pages, 10 figures"},{"id":"http://arxiv.org/abs/2312.11470v2","updated":"2024-08-27T13:55:17Z","published":"2023-11-14T11:36:20Z","title":"An Improved Anomaly Detection Model for Automated Inspection of Power\n Line Insulators","summary":" Inspection of insulators is important to ensure reliable operation of the\npower system. Deep learning is being increasingly exploited to automate the\ninspection process by leveraging object detection models to analyse aerial\nimages captured by drones. 
A purely object detection-based approach, however,\nsuffers from class imbalance-induced poor performance, which can be accentuated\nfor infrequent and hard-to-detect incipient faults. This article proposes the\nuse of anomaly detection along with object detection in a two-stage approach\nfor incipient fault detection in a data-efficient manner. An explainable\nconvolutional one-class classifier is adopted for anomaly detection. The\none-class formulation reduces the reliance on plentifully available images of\nfaulty insulators, while the explainability of the model is expected to promote\nadoption by the industry. A modified loss function is developed that addresses\ncomputational and interpretability issues with the existing model, also\nallowing for the integration of other losses. The superiority of the novel loss\nfunction is demonstrated with the MVTec-AD dataset. The models are trained for\ninsulator inspection with two datasets -- representing data-abundant and\ndata-scarce scenarios -- in unsupervised and semi-supervised settings. The\nresults suggest that including as few as five real anomalies in the training\ndataset significantly improves the model's performance and enables reliable\ndetection of rarely occurring incipient faults in insulators.\n","authors":["Laya Das","Blazhe Gjorgiev","Giovanni Sansavini"],"pdf_url":"https://arxiv.org/pdf/2312.11470v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15063v1","updated":"2024-08-27T13:47:31Z","published":"2024-08-27T13:47:31Z","title":"Adapting Segment Anything Model to Multi-modal Salient Object Detection\n with Semantic Feature Fusion Guidance","summary":" Although most existing multi-modal salient object detection (SOD) methods\ndemonstrate effectiveness through training models from scratch, the limited\nmulti-modal data hinders these methods from reaching optimality. In this paper,\nwe propose a novel framework to explore and exploit the powerful feature\nrepresentation and zero-shot generalization ability of the pre-trained Segment\nAnything Model (SAM) for multi-modal SOD. Despite serving as a recent vision\nfundamental model, driving the class-agnostic SAM to comprehend and detect\nsalient objects accurately is non-trivial, especially in challenging scenes. To\nthis end, we develop \\underline{SAM} with se\\underline{m}antic\nf\\underline{e}ature fu\\underline{s}ion guidanc\\underline{e} (Sammese), which\nincorporates multi-modal saliency-specific knowledge into SAM to adapt SAM to\nmulti-modal SOD tasks. However, it is difficult for SAM trained on single-modal\ndata to directly mine the complementary benefits of multi-modal inputs and\ncomprehensively utilize them to achieve accurate saliency prediction. To address\nthese issues, we first design a multi-modal complementary fusion module to\nextract robust multi-modal semantic features by integrating information from\nvisible and thermal or depth image pairs. Then, we feed the extracted\nmulti-modal semantic features into both the SAM image encoder and mask decoder\nfor fine-tuning and prompting, respectively. Specifically, in the image\nencoder, a multi-modal adapter is proposed to adapt the single-modal SAM to\nmulti-modal information. In the mask decoder, a semantic-geometric prompt\ngeneration strategy is proposed to produce corresponding embeddings with\nvarious saliency cues. 
Extensive experiments on both RGB-D and RGB-T SOD\nbenchmarks show the effectiveness of the proposed framework.\n","authors":["Kunpeng Wang","Keke Chen","Chenglong Li","Zhengzheng Tu","Bin Luo"],"pdf_url":"https://arxiv.org/pdf/2408.15063v1.pdf","comment":"10 pages, 9 figures"},{"id":"http://arxiv.org/abs/2407.04833v3","updated":"2024-08-27T13:45:49Z","published":"2024-07-05T19:38:10Z","title":"3D Adaptive Structural Convolution Network for Domain-Invariant Point\n Cloud Recognition","summary":" Adapting deep learning networks for point cloud data recognition in\nself-driving vehicles faces challenges due to the variability in datasets and\nsensor technologies, emphasizing the need for adaptive techniques to maintain\naccuracy across different conditions. In this paper, we introduce the 3D\nAdaptive Structural Convolution Network (3D-ASCN), a cutting-edge framework for\n3D point cloud recognition. It combines 3D convolution kernels, a structural\ntree structure, and adaptive neighborhood sampling for effective geometric\nfeature extraction. This method obtains domain-invariant features and\ndemonstrates robust, adaptable performance on a variety of point cloud\ndatasets, ensuring compatibility across diverse sensor configurations without\nthe need for parameter adjustments. This highlights its potential to\nsignificantly enhance the reliability and efficiency of self-driving vehicle\ntechnology.\n","authors":["Younggun Kim","Beomsik Cho","Seonghoon Ryoo","Soomok Lee"],"pdf_url":"https://arxiv.org/pdf/2407.04833v3.pdf","comment":"11 pages, 3 figures"},{"id":"http://arxiv.org/abs/2209.11200v3","updated":"2024-08-27T13:44:11Z","published":"2022-09-22T17:42:44Z","title":"Attention is All They Need: Exploring the Media Archaeology of the\n Computer Vision Research Paper","summary":" Research papers, in addition to textual documents, are a designed interface\nthrough which researchers communicate. Recently, rapid growth has transformed\nthat interface in many fields of computing. In this work, we examine the\neffects of this growth from a media archaeology perspective, through the\nchanges to figures and tables in research papers. Specifically, we study these\nchanges in computer vision over the past decade, as the deep learning\nrevolution has driven unprecedented growth in the discipline. We ground our\ninvestigation through interviews with veteran researchers spanning computer\nvision, graphics, and visualization. Our analysis focuses on the research\nattention economy: how research paper elements contribute towards advertising,\nmeasuring, and disseminating an increasingly commodified \"contribution.\"\nThrough this work, we seek to motivate future discussion surrounding the design\nof both the research paper itself as well as the larger sociotechnical research\npublishing system, including tools for finding, reading, and writing research\npapers.\n","authors":["Samuel Goree","Gabriel Appleby","David Crandall","Norman Su"],"pdf_url":"https://arxiv.org/pdf/2209.11200v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.14745v3","updated":"2024-08-27T13:36:12Z","published":"2024-04-23T04:54:32Z","title":"TAAT: Think and Act from Arbitrary Texts in Text2Motion","summary":" Text to Motion aims to generate human motions from texts. Existing settings\nassume that texts include action labels, which limits flexibility in practical\nscenarios. This paper extends this task with a more realistic assumption that\nthe texts are arbitrary. 
Specifically, in our setting, arbitrary texts include\nexisting action texts composed of action labels and introduce scene texts\nwithout explicit action labels. To address this practical issue, we extend the\naction texts in the HUMANML3D dataset by incorporating additional scene texts,\nthereby creating a new dataset, HUMANML3D++. Concurrently, we propose a simple\nframework that extracts action representations from arbitrary texts using a\nLarge Language Model (LLM) and subsequently generates motions. Furthermore, we\nenhance the existing evaluation methodologies to address their inadequacies.\nExtensive experiments are conducted under different application scenarios to\nvalidate the effectiveness of the proposed framework on existing and proposed\ndatasets. The results indicate that Text to Motion in this realistic setting is\nvery challenging, fostering new research in this practical direction. Our\ndataset and code will be released.\n","authors":["Runqi Wang","Caoyuan Ma","Guopeng Li","Zheng Wang"],"pdf_url":"https://arxiv.org/pdf/2404.14745v3.pdf","comment":"Updated errors in author information"},{"id":"http://arxiv.org/abs/2408.15045v1","updated":"2024-08-27T13:13:38Z","published":"2024-08-27T13:13:38Z","title":"DocLayLLM: An Efficient and Effective Multi-modal Extension of Large\n Language Models for Text-rich Document Understanding","summary":" Text-rich document understanding (TDU) refers to analyzing and comprehending\ndocuments containing substantial textual content. With the rapid evolution of\nlarge language models (LLMs), they have been widely leveraged for TDU due to\ntheir remarkable versatility and generalization. In this paper, we introduce\nDocLayLLM, an efficient and effective multi-modal extension of LLMs\nspecifically designed for TDU. By integrating visual patch tokens and 2D\npositional tokens into LLMs and encoding the document content using the LLMs\nthemselves, we fully take advantage of the document comprehension capability of\nLLMs and enhance their perception of OCR information. We have also deeply\nconsidered the role of the chain-of-thought (CoT) and innovatively proposed the\ntechniques of CoT Pre-training and CoT Annealing. Our DocLayLLM can achieve\nremarkable performances with lightweight training settings, showcasing its\nefficiency and effectiveness. Experimental results demonstrate that our\nDocLayLLM surpasses existing OCR-dependent methods and also outperforms\nOCR-free competitors.\n","authors":["Wenhui Liao","Jiapeng Wang","Hongliang Li","Chengyu Wang","Jun Huang","Lianwen Jin"],"pdf_url":"https://arxiv.org/pdf/2408.15045v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15038v1","updated":"2024-08-27T13:07:09Z","published":"2024-08-27T13:07:09Z","title":"Interactive Occlusion Boundary Estimation through Exploitation of\n Synthetic Data","summary":" Occlusion boundaries (OBs) geometrically localize the occlusion events in a\n2D image, and contain useful information for addressing various scene\nunderstanding problems. To advance their study, we have led the investigation\nin the following three aspects. Firstly, we have studied interactive estimation\nof OBs, which is the first in the literature, and proposed an efficient\ndeep-network-based method using multiple-scribble intervention, named DNMMSI,\nwhich significantly improves the performance over the state-of-the-art\nfully-automatic methods. 
Secondly, we propose to exploit the synthetic\nbenchmark for the training process, thanks to the particularity that OBs are\ndetermined geometrically and unambiguously from the 3D scene. To this end, we\nhave developed an efficient tool, named Mesh2OB, for the automatic generation\nof 2D images together with their ground-truth OBs, using which we have\nconstructed a synthetic benchmark, named OB-FUTURE. Abundant experimental\nresults demonstrate that leveraging such a synthetic benchmark for training\nachieves promising performance, even without the use of domain adaptation\ntechniques. Finally, to achieve a more compelling and robust evaluation in\nOB-related research, we have created a real benchmark, named OB-LabName,\nconsisting of 120 high-resolution images together with their ground-truth OBs,\nwith precision surpassing that of previous benchmarks. We will release DNMMSI\nwith pre-trained parameters, Mesh2OB, OB-FUTURE, and OB-LabName to support\nfurther research.\n","authors":["Lintao Xu","Chaohui Wang"],"pdf_url":"https://arxiv.org/pdf/2408.15038v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.20710v2","updated":"2024-08-27T13:05:27Z","published":"2023-10-31T17:59:58Z","title":"FPO++: Efficient Encoding and Rendering of Dynamic Neural Radiance\n Fields by Analyzing and Enhancing Fourier PlenOctrees","summary":" Fourier PlenOctrees have shown to be an efficient representation for\nreal-time rendering of dynamic Neural Radiance Fields (NeRF). Despite its many\nadvantages, this method suffers from artifacts introduced by the involved\ncompression when combining it with recent state-of-the-art techniques for\ntraining the static per-frame NeRF models. In this paper, we perform an\nin-depth analysis of these artifacts and leverage the resulting insights to\npropose an improved representation. In particular, we present a novel density\nencoding that adapts the Fourier-based compression to the characteristics of\nthe transfer function used by the underlying volume rendering procedure and\nleads to a substantial reduction of artifacts in the dynamic model.\nFurthermore, we show an augmentation of the training data that relaxes the\nperiodicity assumption of the compression. We demonstrate the effectiveness of\nour enhanced Fourier PlenOctrees in the scope of quantitative and qualitative\nevaluations on synthetic and real-world scenes.\n","authors":["Saskia Rabich","Patrick Stotko","Reinhard Klein"],"pdf_url":"https://arxiv.org/pdf/2310.20710v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15032v1","updated":"2024-08-27T13:01:19Z","published":"2024-08-27T13:01:19Z","title":"Mamba2MIL: State Space Duality Based Multiple Instance Learning for\n Computational Pathology","summary":" Computational pathology (CPath) has significantly advanced the clinical\npractice of pathology. Despite the progress made, Multiple Instance Learning\n(MIL), a promising paradigm within CPath, continues to face challenges,\nparticularly related to incomplete information utilization. Existing\nframeworks, such as those based on Convolutional Neural Networks (CNNs),\nattention, and selective scan space state sequential model (SSM), lack\nsufficient flexibility and scalability in fusing diverse features, and cannot\neffectively fuse diverse features. Additionally, current approaches do not\nadequately exploit order-related and order-independent features, resulting in\nsuboptimal utilization of sequence information. To address these limitations,\nwe propose a novel MIL framework called Mamba2MIL. 
Our framework utilizes the\nstate space duality model (SSD) to model long sequences of patches of whole\nslide images (WSIs), which, combined with weighted feature selection, supports\nthe fusion processing of more branching features and can be extended according\nto specific application needs. Moreover, we introduce a sequence transformation\nmethod tailored to varying WSI sizes, which enhances sequence-independent\nfeatures while preserving local sequence information, thereby improving\nsequence information utilization. Extensive experiments demonstrate that\nMamba2MIL surpasses state-of-the-art MIL methods. We conducted extensive\nexperiments across multiple datasets, achieving improvements in nearly all\nperformance metrics. Specifically, on the NSCLC dataset, Mamba2MIL achieves a\nbinary tumor classification AUC of 0.9533 and an accuracy of 0.8794. On the\nBRACS dataset, it achieves a multiclass classification AUC of 0.7986 and an\naccuracy of 0.4981. The code is available at\nhttps://github.com/YuqiZhang-Buaa/Mamba2MIL.\n","authors":["Yuqi Zhang","Xiaoqian Zhang","Jiakai Wang","Yuancheng Yang","Taiying Peng","Chao Tong"],"pdf_url":"https://arxiv.org/pdf/2408.15032v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15026v1","updated":"2024-08-27T12:55:54Z","published":"2024-08-27T12:55:54Z","title":"Sequence-aware Pre-training for Echocardiography Probe Guidance","summary":" Cardiac ultrasound probe guidance aims to help novices adjust the 6-DOF probe\npose to obtain high-quality sectional images. Cardiac ultrasound faces two\nmajor challenges: (1) the inherently complex structure of the heart, and (2)\nsignificant individual variations. Previous works have only learned the\npopulation-averaged 2D and 3D structures of the heart rather than personalized\ncardiac structural features, leading to a performance bottleneck. Clinically,\nwe observed that sonographers adjust their understanding of a patient's cardiac\nstructure based on prior scanning sequences, thereby modifying their scanning\nstrategies. Inspired by this, we propose a sequence-aware self-supervised\npre-training method. Specifically, our approach learns personalized 2D and 3D\ncardiac structural features by predicting the masked-out images and actions in\na scanning sequence. We hypothesize that if the model can predict the missing\ncontent it has acquired a good understanding of the personalized cardiac\nstructure. In the downstream probe guidance task, we also introduced a sequence\nmodeling approach that models individual cardiac structural information based\non the images and actions from historical scan data, enabling more accurate\nnavigation decisions. Experiments on a large-scale dataset with 1.36 million\nsamples demonstrated that our proposed sequence-aware paradigm can\nsignificantly reduce navigation errors, with translation errors decreasing by\n15.90% to 36.87% and rotation errors decreasing by 11.13% to 20.77%, compared\nto state-of-the-art methods.\n","authors":["Haojun Jiang","Zhenguo Sun","Yu Sun","Ning Jia","Meng Li","Shaqi Luo","Shiji Song","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2408.15026v1.pdf","comment":"Tech Report"},{"id":"http://arxiv.org/abs/2408.15020v1","updated":"2024-08-27T12:53:25Z","published":"2024-08-27T12:53:25Z","title":"Hierarchical Graph Interaction Transformer with Dynamic Token Clustering\n for Camouflaged Object Detection","summary":" Camouflaged object detection (COD) aims to identify the objects that\nseamlessly blend into the surrounding backgrounds. 
Due to the intrinsic\nsimilarity between the camouflaged objects and the background region, it is\nextremely challenging to precisely distinguish the camouflaged objects by\nexisting approaches. In this paper, we propose a hierarchical graph interaction\nnetwork termed HGINet for camouflaged object detection, which is capable of\ndiscovering imperceptible objects via effective graph interaction among the\nhierarchical tokenized features. Specifically, we first design a region-aware\ntoken focusing attention (RTFA) with dynamic token clustering to excavate the\npotentially distinguishable tokens in the local region. Afterwards, a\nhierarchical graph interaction transformer (HGIT) is proposed to construct\nbi-directional aligned communication between hierarchical features in the\nlatent interaction space for visual semantics enhancement. Furthermore, we\npropose a decoder network with confidence aggregated feature fusion (CAFF)\nmodules, which progressively fuses the hierarchical interacted features to\nrefine the local detail in ambiguous regions. Extensive experiments conducted\non the prevalent datasets, i.e. COD10K, CAMO, NC4K and CHAMELEON demonstrate\nthe superior performance of HGINet compared to existing state-of-the-art\nmethods. Our code is available at https://github.com/Garyson1204/HGINet.\n","authors":["Siyuan Yao","Hao Sun","Tian-Zhu Xiang","Xiao Wang","Xiaochun Cao"],"pdf_url":"https://arxiv.org/pdf/2408.15020v1.pdf","comment":"Submitted to IEEE Transactions on Image Processing"},{"id":"http://arxiv.org/abs/2408.15015v1","updated":"2024-08-27T12:50:12Z","published":"2024-08-27T12:50:12Z","title":"Alternating Minimization Schemes for Computing\n Rate-Distortion-Perception Functions with $f$-Divergence Perception\n Constraints","summary":" We study the computation of the rate-distortion-perception function (RDPF)\nfor discrete memoryless sources subject to a single-letter average distortion\nconstraint and a perception constraint that belongs to the family of\n$f$-divergences. In this setting, the RDPF forms a convex programming problem\nfor which we characterize the optimal parametric solutions. We employ the\ndeveloped solutions in an alternating minimization scheme, namely Optimal\nAlternating Minimization (OAM), for which we provide convergence guarantees.\nNevertheless, the OAM scheme does not lead to a direct implementation of a\ngeneralized Blahut-Arimoto (BA) type of algorithm due to the presence of\nimplicit equations in the structure of the iteration. To overcome this\ndifficulty, we propose two alternative minimization approaches whose\napplicability depends on the smoothness of the used perception metric: a\nNewton-based Alternating Minimization (NAM) scheme, relying on Newton's\nroot-finding method for the approximation of the optimal iteration solution,\nand a Relaxed Alternating Minimization (RAM) scheme, based on a relaxation of\nthe OAM iterates. Both schemes are shown, via the derivation of necessary and\nsufficient conditions, to guarantee convergence to a globally optimal solution.\nWe also provide sufficient conditions on the distortion and the perception\nconstraints which guarantee that the proposed algorithms converge exponentially\nfast in the number of iteration steps. We corroborate our theoretical results\nwith numerical simulations and draw connections with existing results.\n","authors":["Giuseppe Serra","Photios A. 
Stavrou","Marios Kountouris"],"pdf_url":"https://arxiv.org/pdf/2408.15015v1.pdf","comment":"This work has been submitted for possible publication"},{"id":"http://arxiv.org/abs/2408.15011v1","updated":"2024-08-27T12:48:46Z","published":"2024-08-27T12:48:46Z","title":"Pre-training Everywhere: Parameter-Efficient Fine-Tuning for Medical\n Image Analysis via Target Parameter Pre-training","summary":" Parameter-efficient fine-tuning (PEFT) techniques have emerged to address\nissues of overfitting and high computational costs associated with fully\nfine-tuning in the paradigm of self-supervised learning. Mainstream methods\nbased on PEFT involve adding a few trainable parameters while keeping the\npre-trained parameters of the backbone fixed. These methods achieve\ncomparative, and often superior, performance to fully fine-tuning,\ndemonstrating the powerful representation ability of the pre-trained backbone.\nDespite its success, these methods typically ignore the initialization of the\nnew parameters, often relying solely on random initialization. We argue that if\npre-training is significantly beneficial, it should be applied to all\nparameters requiring representational capacity. Motivated by this insight, we\npropose a simple yet effective fine-tuning framework based on Target Parameter\nPre-training (TPP). The target parameters refer to the new parameters\nintroduced during fine-tuning. TPP includes an additional stage before PEFT to\npre-train these target parameters. During this stage, the pre-trained backbone\nparameters are frozen, and only the target parameters are trainable. A defined\npre-text task is used to encourage the target parameters to learn specific\nrepresentations of downstream data. When PEFT is subsequently employed, the\npre-trained target parameters are loaded to enhance fine-tuning efficiency. The\nproposed TPP framework is versatile, allowing for the integration of various\npretext tasks for pre-training and supporting different PEFT methods as\nbackbones. We evaluated the fine-tining performance of our method using five\npublic datasets, including three modalities and two task types. The results\ndemonstrate that the proposed TPP can be easily integrated into existing PEFT\nmethods, significantly improving performance.\n","authors":["Xingliang Lei","Yiwen Ye","Ziyang Chen","Minglei Shu","Yong Xia"],"pdf_url":"https://arxiv.org/pdf/2408.15011v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.03209v2","updated":"2024-08-27T12:39:18Z","published":"2024-08-06T14:08:22Z","title":"IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning\n using Instruct Prompts","summary":" Diffusion models continuously push the boundary of state-of-the-art image\ngeneration, but the process is hard to control with any nuance: practice proves\nthat textual prompts are inadequate for accurately describing image style or\nfine structural details (such as faces). ControlNet and IPAdapter address this\nshortcoming by conditioning the generative process on imagery instead, but each\nindividual instance is limited to modeling a single conditional posterior: for\npractical use-cases, where multiple different posteriors are desired within the\nsame workflow, training and using multiple adapters is cumbersome. We propose\nIPAdapter-Instruct, which combines natural-image conditioning with ``Instruct''\nprompts to swap between interpretations for the same conditioning image: style\ntransfer, object extraction, both, or something else still? 
IPAdapterInstruct\nefficiently learns multiple tasks with minimal loss in quality compared to\ndedicated per-task models.\n","authors":["Ciara Rowles","Shimon Vainer","Dante De Nigris","Slava Elizarov","Konstantin Kutsy","Simon Donné"],"pdf_url":"https://arxiv.org/pdf/2408.03209v2.pdf","comment":"17 pages, 10 figures, Project page:\n https://unity-research.github.io/IP-Adapter-Instruct.github.io/"},{"id":"http://arxiv.org/abs/2408.15002v1","updated":"2024-08-27T12:34:41Z","published":"2024-08-27T12:34:41Z","title":"Knowledge Discovery in Optical Music Recognition: Enhancing Information\n Retrieval with Instance Segmentation","summary":" Optical Music Recognition (OMR) automates the transcription of musical\nnotation from images into machine-readable formats like MusicXML, MEI, or MIDI,\nsignificantly reducing the costs and time of manual transcription. This study\nexplores knowledge discovery in OMR by applying instance segmentation using\nMask R-CNN to enhance the detection and delineation of musical symbols in sheet\nmusic. Unlike Optical Character Recognition (OCR), OMR must handle the\nintricate semantics of Common Western Music Notation (CWMN), where symbol\nmeanings depend on shape, position, and context. Our approach leverages\ninstance segmentation to manage the density and overlap of musical symbols,\nfacilitating more precise information retrieval from music scores. Evaluations\non the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with\nour method achieving a mean Average Precision (mAP) of up to 59.70\\% in dense\nsymbol environments, achieving comparable results to object detection.\nFurthermore, using traditional computer vision techniques, we add a parallel\nstep for staff detection to infer the pitch for the recognised symbols. This\nstudy emphasises the role of pixel-wise segmentation in advancing accurate\nmusic symbol recognition, contributing to knowledge discovery in OMR. Our\nfindings indicate that instance segmentation provides more precise\nrepresentations of musical symbols, particularly in densely populated scores,\nadvancing OMR technology. We make our implementation, pre-processing scripts,\ntrained models, and evaluation results publicly available to support further\nresearch and development.\n","authors":["Elona Shatri","George Fazekas"],"pdf_url":"https://arxiv.org/pdf/2408.15002v1.pdf","comment":"8 pages content and one references, accepted version at the\n International Conference on Knowledge Discovery and Information Retrieval\n 2024, Porto, Portugal"},{"id":"http://arxiv.org/abs/2408.14998v1","updated":"2024-08-27T12:28:41Z","published":"2024-08-27T12:28:41Z","title":"FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene\n Text Spotting","summary":" The proliferation of scene text in both structured and unstructured\nenvironments presents significant challenges in optical character recognition\n(OCR), necessitating more efficient and robust text spotting solutions. This\npaper presents FastTextSpotter, a framework that integrates a Swin Transformer\nvisual backbone with a Transformer Encoder-Decoder architecture, enhanced by a\nnovel, faster self-attention unit, SAC2, to improve processing speeds while\nmaintaining accuracy. 
FastTextSpotter has been validated across multiple\ndatasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for\narbitrary-shaped texts, benchmarking against current state-of-the-art models.\nOur results indicate that FastTextSpotter not only achieves superior accuracy\nin detecting and recognizing multilingual scene text (English and Vietnamese)\nbut also improves model efficiency, thereby setting new benchmarks in the\nfield. This study underscores the potential of advanced transformer\narchitectures in improving the adaptability and speed of text spotting\napplications in diverse real-world settings. The dataset, code, and pre-trained\nmodels have been released in our Github.\n","authors":["Alloy Das","Sanket Biswas","Umapada Pal","Josep Lladós","Saumik Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2408.14998v1.pdf","comment":"Accepted in ICPR 2024"},{"id":"http://arxiv.org/abs/2408.14997v1","updated":"2024-08-27T12:25:12Z","published":"2024-08-27T12:25:12Z","title":"Depth Restoration of Hand-Held Transparent Objects for Human-to-Robot\n Handover","summary":" Transparent objects are common in daily life, while their unique optical\nproperties pose challenges for RGB-D cameras, which struggle to capture\naccurate depth information. For assistant robots, accurately perceiving\ntransparent objects held by humans is essential for effective human-robot\ninteraction. This paper presents a Hand-Aware Depth Restoration (HADR) method\nfor hand-held transparent objects based on creating an implicit neural\nrepresentation function from a single RGB-D image. The proposed method\nintroduces the hand posture as an important guidance to leverage semantic and\ngeometric information. To train and evaluate the proposed method, we create a\nhigh-fidelity synthetic dataset called TransHand-14K with a real-to-sim data\ngeneration scheme. Experiments show that our method has a better performance\nand generalization ability compared with existing methods. We further develop a\nreal-world human-to-robot handover system based on the proposed depth\nrestoration method, demonstrating its application value in human-robot\ninteraction.\n","authors":["Ran Yu","Haixin Yu","Huang Yan","Ziwu Song","Shoujie Li","Wenbo Ding"],"pdf_url":"https://arxiv.org/pdf/2408.14997v1.pdf","comment":"7 pages, 7 figures, conference"},{"id":"http://arxiv.org/abs/2407.21687v2","updated":"2024-08-27T12:03:00Z","published":"2024-07-31T15:29:34Z","title":"Dynamic Object Queries for Transformer-based Incremental Object\n Detection","summary":" Incremental object detection (IOD) aims to sequentially learn new classes,\nwhile maintaining the capability to locate and identify old ones. As the\ntraining data arrives with annotations only with new classes, IOD suffers from\ncatastrophic forgetting. Prior methodologies mainly tackle the forgetting issue\nthrough knowledge distillation and exemplar replay, ignoring the conflict\nbetween limited model capacity and increasing knowledge. In this paper, we\nexplore \\textit{dynamic object queries} for incremental object detection built\non Transformer architecture. We propose the \\textbf{Dy}namic object\n\\textbf{Q}uery-based \\textbf{DE}tection \\textbf{TR}ansformer (DyQ-DETR), which\nincrementally expands the model representation ability to achieve\nstability-plasticity tradeoff. First, a new set of learnable object queries are\nfed into the decoder to represent new classes. 
These new object queries are\naggregated with those from previous phases to adapt both old and new knowledge\nwell. Second, we propose the isolated bipartite matching for object queries in\ndifferent phases, based on disentangled self-attention. The interaction among\nthe object queries at different phases is eliminated to reduce inter-class\nconfusion. Thanks to the separate supervision and computation over object\nqueries, we further present the risk-balanced partial calibration for effective\nexemplar replay. Extensive experiments demonstrate that DyQ-DETR significantly\nsurpasses the state-of-the-art methods, with limited parameter overhead. Code\nwill be made publicly available.\n","authors":["Jichuan Zhang","Wei Li","Shuang Cheng","Ya-Li Li","Shengjin Wang"],"pdf_url":"https://arxiv.org/pdf/2407.21687v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.07257v2","updated":"2024-08-27T11:44:56Z","published":"2024-05-12T11:41:44Z","title":"Listen, Disentangle, and Control: Controllable Speech-Driven Talking\n Head Generation","summary":" Most earlier investigations on talking face generation have focused on the\nsynchronization of lip motion and speech content. However, human head pose and\nfacial emotions are equally important characteristics of natural human faces.\nWhile audio-driven talking face generation has seen notable advancements,\nexisting methods either overlook facial emotions or are limited to specific\nindividuals and cannot be applied to arbitrary subjects. In this paper, we\npropose a one-shot Talking Head Generation framework (SPEAK) that distinguishes\nitself from general Talking Face Generation by enabling emotional and postural\ncontrol. Specifically, we introduce the Inter-Reconstructed Feature\nDisentanglement (IRFD) method to decouple human facial features into three\nlatent spaces. We then design a face editing module that modifies speech\ncontent and facial latent codes into a single latent space. Subsequently, we\npresent a novel generator that employs modified latent codes derived from the\nediting module to regulate emotional expression, head poses, and speech content\nin synthesizing facial animations. Extensive trials demonstrate that our method\ncan generate realistic talking head with coordinated lip motions, authentic\nfacial emotions, and smooth head movements. The demo video is available at the\nanonymous link: https://anonymous.4open.science/r/SPEAK-F56E\n","authors":["Changpeng Cai","Guinan Guo","Jiao Li","Junhao Su","Chenghao He","Jing Xiao","Yuanxu Chen","Lei Dai","Feiyu Zhu"],"pdf_url":"https://arxiv.org/pdf/2405.07257v2.pdf","comment":"Due to our negligence, there are factual errors in the experimental\n results, so we are considering resubmitting the paper after an overhaul"},{"id":"http://arxiv.org/abs/2408.14977v1","updated":"2024-08-27T11:40:23Z","published":"2024-08-27T11:40:23Z","title":"LN-Gen: Rectal Lymph Nodes Generation via Anatomical Features","summary":" Accurate segmentation of rectal lymph nodes is crucial for the staging and\ntreatment planning of rectal cancer. However, the complexity of the surrounding\nanatomical structures and the scarcity of annotated data pose significant\nchallenges. This study introduces a novel lymph node synthesis technique aimed\nat generating diverse and realistic synthetic rectal lymph node samples to\nmitigate the reliance on manual annotation. 
Unlike direct diffusion methods,\nwhich often produce masks that are discontinuous and of suboptimal quality, our\napproach leverages an implicit SDF-based method for mask generation, ensuring\nthe production of continuous, stable, and morphologically diverse masks.\nExperimental results demonstrate that our synthetic data significantly improves\nsegmentation performance. Our work highlights the potential of diffusion model\nfor accurately synthesizing structurally complex lesions, such as lymph nodes\nin rectal cancer, alleviating the challenge of limited annotated data in this\nfield and aiding in advancements in rectal cancer diagnosis and treatment.\n","authors":["Weidong Guo","Hantao Zhang","Shouhong Wan","Bingbing Zou","Wanqin Wang","Peiquan Jin"],"pdf_url":"https://arxiv.org/pdf/2408.14977v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2408.14976v1","updated":"2024-08-27T11:38:01Z","published":"2024-08-27T11:38:01Z","title":"Prior-free Balanced Replay: Uncertainty-guided Reservoir Sampling for\n Long-Tailed Continual Learning","summary":" Even in the era of large models, one of the well-known issues in continual\nlearning (CL) is catastrophic forgetting, which is significantly challenging\nwhen the continual data stream exhibits a long-tailed distribution, termed as\nLong-Tailed Continual Learning (LTCL). Existing LTCL solutions generally\nrequire the label distribution of the data stream to achieve re-balance\ntraining. However, obtaining such prior information is often infeasible in real\nscenarios since the model should learn without pre-identifying the majority and\nminority classes. To this end, we propose a novel Prior-free Balanced Replay\n(PBR) framework to learn from long-tailed data stream with less forgetting.\nConcretely, motivated by our experimental finding that the minority classes are\nmore likely to be forgotten due to the higher uncertainty, we newly design an\nuncertainty-guided reservoir sampling strategy to prioritize rehearsing\nminority data without using any prior information, which is based on the mutual\ndependence between the model and samples. Additionally, we incorporate two\nprior-free components to further reduce the forgetting issue: (1) Boundary\nconstraint is to preserve uncertain boundary supporting samples for continually\nre-estimating task boundaries. (2) Prototype constraint is to maintain the\nconsistency of learned class prototypes along with training. Our approach is\nevaluated on three standard long-tailed benchmarks, demonstrating superior\nperformance to existing CL methods and previous SOTA LTCL approach in both\ntask- and class-incremental learning settings, as well as ordered- and\nshuffled-LTCL settings.\n","authors":["Lei Liu","Li Liu","Yawen Cui"],"pdf_url":"https://arxiv.org/pdf/2408.14976v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14975v1","updated":"2024-08-27T11:31:47Z","published":"2024-08-27T11:31:47Z","title":"MegActor-$Σ$: Unlocking Flexible Mixed-Modal Control in Portrait\n Animation with Diffusion Transformer","summary":" Diffusion models have demonstrated superior performance in the field of\nportrait animation. However, current approaches relied on either visual or\naudio modality to control character movements, failing to exploit the potential\nof mixed-modal control. This challenge arises from the difficulty in balancing\nthe weak control strength of audio modality and the strong control strength of\nvisual modality. 
To address this issue, we introduce MegActor-$\\Sigma$: a\nmixed-modal conditional diffusion transformer (DiT), which can flexibly inject\naudio and visual modality control signals into portrait animation.\nSpecifically, we make substantial advancements over its predecessor, MegActor,\nby leveraging the promising model structure of DiT and integrating audio and\nvisual conditions through advanced modules within the DiT framework. To further\nachieve flexible combinations of mixed-modal control signals, we propose a\n``Modality Decoupling Control\" training strategy to balance the control\nstrength between visual and audio modalities, along with the ``Amplitude\nAdjustment\" inference strategy to freely regulate the motion amplitude of each\nmodality. Finally, to facilitate extensive studies in this field, we design\nseveral dataset evaluation metrics to filter out public datasets and solely use\nthis filtered dataset to train MegActor-$\\Sigma$. Extensive experiments\ndemonstrate the superiority of our approach in generating vivid portrait\nanimations, outperforming previous methods trained on private dataset.\n","authors":["Shurong Yang","Huadong Li","Juhao Wu","Minhao Jing","Linze Li","Renhe Ji","Jiajun Liang","Haoqiang Fan","Jin Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14962v1","updated":"2024-08-27T11:09:34Z","published":"2024-08-27T11:09:34Z","title":"Deep Learning-based Average Shear Wave Velocity Prediction using\n Accelerometer Records","summary":" Assessing seismic hazards and thereby designing earthquake-resilient\nstructures or evaluating structural damage that has been incurred after an\nearthquake are important objectives in earthquake engineering. Both tasks\nrequire critical evaluation of strong ground motion records, and the knowledge\nof site conditions at the earthquake stations plays a major role in achieving\nthe aforementioned objectives. Site conditions are generally represented by the\ntime-averaged shear wave velocity in the upper 30 meters of the geological\nmaterials (Vs30). Several strong motion stations lack Vs30 measurements\nresulting in potentially inaccurate assessment of seismic hazards and\nevaluation of ground motion records. In this study, we present a deep\nlearning-based approach for predicting Vs30 at strong motion station locations\nusing three-channel earthquake records. For this purpose, Convolutional Neural\nNetworks (CNNs) with dilated and causal convolutional layers are used to\nextract deep features from accelerometer records collected from over 700\nstations located in Turkey. In order to overcome the limited availability of\nlabeled data, we propose a two-phase training approach. In the first phase, a\nCNN is trained to estimate the epicenters, for which ground truth is available\nfor all records. After the CNN is trained, the pre-trained encoder is\nfine-tuned based on the Vs30 ground truth. The performance of the proposed\nmethod is compared with machine learning models that utilize hand-crafted\nfeatures. 
The results demonstrate that the deep convolutional encoder based\nVs30 prediction model outperforms the machine learning models that rely on\nhand-crafted features.\n","authors":["Barış Yılmaz","Melek Türkmen","Sanem Meral","Erdem Akagündüz","Salih Tileylioglu"],"pdf_url":"https://arxiv.org/pdf/2408.14962v1.pdf","comment":"12 pages, 14 figures, Accepted by 18th World Conference on Earthquake\n Engineering WCEE2024"},{"id":"http://arxiv.org/abs/2408.14961v1","updated":"2024-08-27T11:07:19Z","published":"2024-08-27T11:07:19Z","title":"CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task","summary":" In recent years, the rapid expansion of model sizes has led to large-scale\npre-trained models demonstrating remarkable capabilities. Consequently, there\nhas been a trend towards increasing the scale of models. However, this trend\nintroduces significant challenges, including substantial computational costs of\ntraining and transfer to downstream tasks. To address these issues,\nParameter-Efficient Fine-Tuning (PEFT) methods have been introduced. These\nmethods optimize large-scale pre-trained models for specific tasks by\nfine-tuning a select group of parameters. Among these PEFT methods,\nadapter-based and prompt-based methods are the primary techniques.\nSpecifically, in the field of visual fine-tuning, adapters gain prominence over\nprompts because of the latter's relatively weaker performance and efficiency.\nUnder the circumstances, we refine the widely-used Visual Prompt Tuning (VPT)\nmethod, proposing Cross Visual Prompt Tuning (CVPT). CVPT calculates\ncross-attention between the prompt tokens and the embedded tokens, which allows\nus to compute the semantic relationship between them and conduct the\nfine-tuning of models exactly to adapt visual tasks better. Furthermore, we\nintroduce the weight-sharing mechanism to initialize the parameters of\ncross-attention, which avoids massive learnable parameters from cross-attention\nand enhances the representative capability of cross-attention. We conduct\ncomprehensive testing across 25 datasets and the result indicates that CVPT\nsignificantly improves VPT's performance and efficiency in visual tasks. For\nexample, on the VTAB-1K benchmark, CVPT outperforms VPT over 4% in average\naccuracy, rivaling the advanced adapter-based methods in performance and\nefficiency. Our experiments confirm that prompt-based methods can achieve\nexceptional results in visual fine-tuning.\n","authors":["Lingyun Huang","Jianxu Mao","Yaonan Wang","Junfei Yi","Ziming Tao"],"pdf_url":"https://arxiv.org/pdf/2408.14961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14957v1","updated":"2024-08-27T11:04:53Z","published":"2024-08-27T11:04:53Z","title":"Applying ViT in Generalized Few-shot Semantic Segmentation","summary":" This paper explores the capability of ViT-based models under the generalized\nfew-shot semantic segmentation (GFSS) framework. We conduct experiments with\nvarious combinations of backbone models, including ResNets and pretrained\nVision Transformer (ViT)-based models, along with decoders featuring a linear\nclassifier, UPerNet, and Mask Transformer. The structure made of DINOv2 and\nlinear classifier takes the lead on popular few-shot segmentation bench mark\nPASCAL-$5^i$, substantially outperforming the best of ResNet structure by 116%\nin one-shot scenario. We demonstrate the great potential of large pretrained\nViT-based model on GFSS task, and expect further improvement on testing\nbenchmarks. 
However, a potential caveat is that when applying pure ViT-based\nmodel and large scale ViT decoder, the model is easy to overfit.\n","authors":["Liyuan Geng","Jinhong Xia","Yuanhe Guo"],"pdf_url":"https://arxiv.org/pdf/2408.14957v1.pdf","comment":"7 pages, 4 figures"},{"id":"http://arxiv.org/abs/2406.17640v2","updated":"2024-08-27T11:00:47Z","published":"2024-06-25T15:24:06Z","title":"BayTTA: Uncertainty-aware medical image classification with optimized\n test-time augmentation using Bayesian model averaging","summary":" Test-time augmentation (TTA) is a well-known technique employed during the\ntesting phase of computer vision tasks. It involves aggregating multiple\naugmented versions of input data. Combining predictions using a simple average\nformulation is a common and straightforward approach after performing TTA. This\npaper introduces a novel framework for optimizing TTA, called BayTTA\n(Bayesian-based TTA), which is based on Bayesian Model Averaging (BMA). First,\nwe generate a prediction list associated with different variations of the input\ndata created through TTA. Then, we use BMA to combine predictions weighted by\nthe respective posterior probabilities. Such an approach allows one to take\ninto account model uncertainty, and thus to enhance the predictive performance\nof the related machine learning or deep learning model. We evaluate the\nperformance of BayTTA on various public data, including three medical image\ndatasets comprising skin cancer, breast cancer, and chest X-ray images and two\nwell-known gene editing datasets, CRISPOR and GUIDE-seq. Our experimental\nresults indicate that BayTTA can be effectively integrated into\nstate-of-the-art deep learning models used in medical image analysis as well as\ninto some popular pre-trained CNN models such as VGG-16, MobileNetV2,\nDenseNet201, ResNet152V2, and InceptionRes-NetV2, leading to the enhancement in\ntheir accuracy and robustness performance. The source code of the proposed\nBayTTA method is freely available at: \\underline\n{https://github.com/Z-Sherkat/BayTTA}.\n","authors":["Zeinab Sherkatghanad","Moloud Abdar","Mohammadreza Bakhtyari","Pawel Plawiak","Vladimir Makarenkov"],"pdf_url":"https://arxiv.org/pdf/2406.17640v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14950v1","updated":"2024-08-27T10:54:37Z","published":"2024-08-27T10:54:37Z","title":"NeuralOOD: Improving Out-of-Distribution Generalization Performance with\n Brain-machine Fusion Learning Framework","summary":" Deep Neural Networks (DNNs) have demonstrated exceptional recognition\ncapabilities in traditional computer vision (CV) tasks. However, existing CV\nmodels often suffer a significant decrease in accuracy when confronted with\nout-of-distribution (OOD) data. In contrast to these DNN models, human can\nmaintain a consistently low error rate when facing OOD scenes, partly\nattributed to the rich prior cognitive knowledge stored in the human brain.\nPrevious OOD generalization researches only focus on the single modal,\noverlooking the advantages of multimodal learning method. In this paper, we\nutilize the multimodal learning method to improve the OOD generalization and\npropose a novel Brain-machine Fusion Learning (BMFL) framework. We adopt the\ncross-attention mechanism to fuse the visual knowledge from CV model and prior\ncognitive knowledge from the human brain. 
Specially, we employ a pre-trained\nvisual neural encoding model to predict the functional Magnetic Resonance\nImaging (fMRI) from visual features which eliminates the need for the fMRI data\ncollection and pre-processing, effectively reduces the workload associated with\nconventional BMFL methods. Furthermore, we construct a brain transformer to\nfacilitate the extraction of knowledge inside the fMRI data. Moreover, we\nintroduce the Pearson correlation coefficient maximization regularization\nmethod into the training process, which improves the fusion capability with\nbetter constrains. Our model outperforms the DINOv2 and baseline models on the\nImageNet-1k validation dataset as well as six curated OOD datasets, showcasing\nits superior performance in diverse scenarios.\n","authors":["Shuangchen Zhao","Changde Du","Hui Li","Huiguang He"],"pdf_url":"https://arxiv.org/pdf/2408.14950v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.19513v2","updated":"2024-08-27T10:50:13Z","published":"2024-04-30T12:45:41Z","title":"A Smartphone-Based Method for Assessing Tomato Nutrient Status through\n Trichome Density Measurement","summary":" Early detection of fertilizer-induced stress in tomato plants is crucial for\ntimely crop management interventions and yield optimization. Conventional\noptical methods detect fertilizer stress in young leaves with difficulty. This\nstudy proposes a novel, noninvasive technique for quantifying the density of\ntrichomes-elongated hair-like structures found on plant surfaces-on young\nleaves using a smartphone. This method exhibits superior detection latency,\nenabling earlier and more accurate identification of fertilizer stress in\ntomato plants. Our approach combines augmented reality technology and image\nprocessing algorithms to analyze smartphone images of a specialized measurement\npaper. This measurement paper is applied to a tomato leaf to transfer trichomes\nonto its adhesive surface. The captured images are then processed through a\npipeline involving region of interest extraction, perspective transformation,\nand illumination correction. Trichome detection and spatial distribution\nanalysis of these preprocessed images yield a robust density metric. We\nvalidated our method through experiments on hydroponically grown tomatoes under\nvarying fertilizer concentrations. Using leave-one-out cross-validation\n(LOOCV), our model achieves a mean area under the precision-recall curve of\n0.824 and a receiver operating characteristic curve of 0.641 for predicting\nadditional fertilization needs. Based on LOOCV, quantitative analysis revealed\na strong relationship between trichome density and explanatory variables,\nincluding nitrate ion concentration, explaining 62.48% of the variation ($R^2 =\n0.625$). The predicted and actual trichome densities were strongly correlated\n($r = 0.794$). This straightforward and cost-effective method overcomes the\nlimitations of traditional techniques, demonstrating the potential of using\nsmartphones for practical plant nutrition diagnosis.\n","authors":["Sho Ueda","Xujun Ye"],"pdf_url":"https://arxiv.org/pdf/2404.19513v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14253v2","updated":"2024-08-27T10:50:13Z","published":"2024-08-26T13:16:03Z","title":"Text3DAug -- Prompted Instance Augmentation for LiDAR Perception","summary":" LiDAR data of urban scenarios poses unique challenges, such as heterogeneous\ncharacteristics and inherent class imbalance. 
Therefore, large-scale datasets\nare necessary to apply deep learning methods. Instance augmentation has emerged\nas an efficient method to increase dataset diversity. However, current methods\nrequire the time-consuming curation of 3D models or costly manual data\nannotation. To overcome these limitations, we propose Text3DAug, a novel\napproach leveraging generative models for instance augmentation. Text3DAug does\nnot depend on labeled data and is the first of its kind to generate instances\nand annotations from text. This allows for a fully automated pipeline,\neliminating the need for manual effort in practical applications. Additionally,\nText3DAug is sensor agnostic and can be applied regardless of the LiDAR sensor\nused. Comprehensive experimental analysis on LiDAR segmentation, detection and\nnovel class discovery demonstrates that Text3DAug is effective in supplementing\nexisting methods or as a standalone method, performing on par or better than\nestablished methods, however while overcoming their specific drawbacks. The\ncode is publicly available.\n","authors":["Laurenz Reichardt","Luca Uhr","Oliver Wasenmüller"],"pdf_url":"https://arxiv.org/pdf/2408.14253v2.pdf","comment":"Accepted at the 2024 IEEE/RSJ International Conference on Intelligent\n Robots and Systems (IROS 2024)"},{"id":"http://arxiv.org/abs/2308.13997v2","updated":"2024-08-27T10:47:47Z","published":"2023-08-27T03:54:55Z","title":"Adaptive Fusion of Radiomics and Deep Features for Lung Adenocarcinoma\n Subtype Recognition","summary":" The most common type of lung cancer, lung adenocarcinoma (LUAD), has been\nincreasingly detected since the advent of low-dose computed tomography\nscreening technology. In clinical practice, pre-invasive LUAD (Pre-IAs) should\nonly require regular follow-up care, while invasive LUAD (IAs) should receive\nimmediate treatment with appropriate lung cancer resection, based on the cancer\nsubtype. However, prior research on diagnosing LUAD has mainly focused on\nclassifying Pre-IAs/IAs, as techniques for distinguishing different subtypes of\nIAs have been lacking. In this study, we proposed a multi-head attentional\nfeature fusion (MHA-FF) model for not only distinguishing IAs from Pre-IAs, but\nalso for distinguishing the different subtypes of IAs. To predict the subtype\nof each nodule accurately, we leveraged both radiomics and deep features\nextracted from computed tomography images. Furthermore, those features were\naggregated through an adaptive fusion module that can learn attention-based\ndiscriminative features. The utility of our proposed method is demonstrated\nhere by means of real-world data collected from a multi-center cohort.\n","authors":["Jing Zhou","Xiaotong Fu","Xirong Li","Ying Ji"],"pdf_url":"https://arxiv.org/pdf/2308.13997v2.pdf","comment":"7 pages, 5 figures and 4 tables"},{"id":"http://arxiv.org/abs/2408.14947v1","updated":"2024-08-27T10:44:34Z","published":"2024-08-27T10:44:34Z","title":"ERX: A Fast Real-Time Anomaly Detection Algorithm for Hyperspectral\n Line-Scanning","summary":" Detecting unexpected objects (anomalies) in real-time has great potential for\nmonitoring, managing, and protecting the environment. Hyperspectral line-scan\ncameras are a low-cost solution that enhance confidence in anomaly detection\nover RGB and multispectral imagery. 
However, real-time algorithms for these\ncameras must be fast when using small computers (e.g., those onboard a drone or\nsmall satellite), scalable to high dimensions, adaptable to changing scenery,\nand robust against geometric and radiometric distortions. This paper introduces\nthe Exponentially moving RX algorithm (ERX) and compares it to existing\nRX-based anomaly detection methods for real-time line-scanning. ERX was tested\nusing a Jetson Xavier NX compute module, achieving the best combination of\nspeed and detection across three novel datasets compared to the other\nalgorithms. This research paves the way for future studies in grouping and\nlocating anomalous objects, adaptive and automatic threshold selection, and\nreal-time field tests. The Python code for the algorithms and experiments is\navailable at https://github.com/WiseGamgee/HyperAD.\n","authors":["Samuel Garske","Bradley Evans","Christopher Artlett","KC Wong"],"pdf_url":"https://arxiv.org/pdf/2408.14947v1.pdf","comment":"10 pages, 9 figures, 3 tables, code and datasets accessible at\n https://github.com/WiseGamgee/HyperAD"},{"id":"http://arxiv.org/abs/2408.14941v1","updated":"2024-08-27T10:26:05Z","published":"2024-08-27T10:26:05Z","title":"BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and\n Localization","summary":" Object detection and global localization play a crucial role in robotics,\nspanning across a great spectrum of applications from autonomous cars to\nmulti-layered 3D Scene Graphs for semantic scene understanding. This article\nproposes BOX3D, a novel multi-modal and lightweight scheme for localizing\nobjects of interest by fusing the information from RGB camera and 3D LiDAR.\nBOX3D is structured around a three-layered architecture, building up from the\nlocal perception of the incoming sequential sensor data to the global\nperception refinement that covers for outliers and the general consistency of\neach object's observation. More specifically, the first layer handles the\nlow-level fusion of camera and LiDAR data for initial 3D bounding box\nextraction. The second layer converts each LiDAR's scan 3D bounding boxes to\nthe world coordinate frame and applies a spatial pairing and merging mechanism\nto maintain the uniqueness of objects observed from different viewpoints.\nFinally, BOX3D integrates the third layer that supervises the consistency of\nthe results on the global map iteratively, using a point-to-voxel comparison\nfor identifying all points in the global map that belong to the object.\nBenchmarking results of the proposed novel architecture are showcased in\nmultiple experimental trials on public state-of-the-art large-scale dataset of\nurban environments.\n","authors":["Mario A. V. 
Saucedo","Nikolaos Stathoulopoulos","Vidya Sumathy","Christoforos Kanellakis","George Nikolakopoulos"],"pdf_url":"https://arxiv.org/pdf/2408.14941v1.pdf","comment":"Presented in MED 2024"},{"id":"http://arxiv.org/abs/2408.14930v1","updated":"2024-08-27T10:09:17Z","published":"2024-08-27T10:09:17Z","title":"Cross-Modal Temporal Alignment for Event-guided Video Deblurring","summary":" Video deblurring aims to enhance the quality of restored results in\nmotion-blurred videos by effectively gathering information from adjacent video\nframes to compensate for the insufficient data in a single blurred frame.\nHowever, when faced with consecutively severe motion blur situations,\nframe-based video deblurring methods often fail to find accurate temporal\ncorrespondence among neighboring video frames, leading to diminished\nperformance. To address this limitation, we aim to solve the video deblurring\ntask by leveraging an event camera with micro-second temporal resolution. To\nfully exploit the dense temporal resolution of the event camera, we propose two\nmodules: 1) Intra-frame feature enhancement operates within the exposure time\nof a single blurred frame, iteratively enhancing cross-modality features in a\nrecurrent manner to better utilize the rich temporal information of events, 2)\nInter-frame temporal feature alignment gathers valuable long-range temporal\ninformation to target frames, aggregating sharp features leveraging the\nadvantages of the events. In addition, we present a novel dataset composed of\nreal-world blurred RGB videos, corresponding sharp videos, and event data. This\ndataset serves as a valuable resource for evaluating event-guided deblurring\nmethods. We demonstrate that our proposed methods outperform state-of-the-art\nframe-based and event-based motion deblurring methods through extensive\nexperiments conducted on both synthetic and real-world deblurring datasets. The\ncode and dataset are available at https://github.com/intelpro/CMTA.\n","authors":["Taewoo Kim","Hoonhee Cho","Kuk-Jin Yoon"],"pdf_url":"https://arxiv.org/pdf/2408.14930v1.pdf","comment":"Accepted in ECCV2024"},{"id":"http://arxiv.org/abs/2408.14927v1","updated":"2024-08-27T10:01:58Z","published":"2024-08-27T10:01:58Z","title":"Automatic Detection of COVID-19 from Chest X-ray Images Using Deep\n Learning Model","summary":" The infectious disease caused by novel corona virus (2019-nCoV) has been\nwidely spreading since last year and has shaken the entire world. It has caused\nan unprecedented effect on daily life, global economy and public health. Hence\nthis disease detection has life-saving importance for both patients as well as\ndoctors. Due to limited test kits, it is also a daunting task to test every\npatient with severe respiratory problems using conventional techniques\n(RT-PCR). Thus implementing an automatic diagnosis system is urgently required\nto overcome the scarcity problem of Covid-19 test kits at hospital, health care\nsystems. The diagnostic approach is mainly classified into two\ncategories-laboratory based and Chest radiography approach. In this paper, a\nnovel approach for computerized corona virus (2019-nCoV) detection from lung\nx-ray images is presented. Here, we propose models using deep learning to show\nthe effectiveness of diagnostic systems. 
In the experimental result, we\nevaluate proposed models on publicly available data-set which exhibit\nsatisfactory performance and promising results compared with other previous\nexisting methods.\n","authors":["Alloy Das","Rohit Agarwal","Rituparna Singh","Arindam Chowdhury","Debashis Nandi"],"pdf_url":"https://arxiv.org/pdf/2408.14927v1.pdf","comment":"Accepted in AIP Conference Proceedings (Vol. 2424, No. 1)"},{"id":"http://arxiv.org/abs/2311.12084v2","updated":"2024-08-27T09:55:37Z","published":"2023-11-20T11:08:06Z","title":"ODDR: Outlier Detection & Dimension Reduction Based Defense Against\n Adversarial Patches","summary":" Adversarial attacks present a significant challenge to the dependable\ndeployment of machine learning models, with patch-based attacks being\nparticularly potent. These attacks introduce adversarial perturbations in\nlocalized regions of an image, deceiving even well-trained models. In this\npaper, we propose Outlier Detection and Dimension Reduction (ODDR), a\ncomprehensive defense strategy engineered to counteract patch-based adversarial\nattacks through advanced statistical methodologies. Our approach is based on\nthe observation that input features corresponding to adversarial\npatches-whether naturalistic or synthetic-deviate from the intrinsic\ndistribution of the remaining image data and can thus be identified as\noutliers. ODDR operates through a robust three-stage pipeline: Fragmentation,\nSegregation, and Neutralization. This model-agnostic framework is versatile,\noffering protection across various tasks, including image classification,\nobject detection, and depth estimation, and is proved effective in both\nCNN-based and Transformer-based architectures. In the Fragmentation stage,\nimage samples are divided into smaller segments, preparing them for the\nSegregation stage, where advanced outlier detection techniques isolate\nanomalous features linked to adversarial perturbations. The Neutralization\nstage then applies dimension reduction techniques to these outliers,\neffectively neutralizing the adversarial impact while preserving critical\ninformation for the machine learning task. Extensive evaluation on benchmark\ndatasets against state-of-the-art adversarial patches underscores the efficacy\nof ODDR. Our method enhances model accuracy from 39.26% to 79.1% under the\nGoogleAp attack, outperforming leading defenses such as LGS (53.86%), Jujutsu\n(60%), and Jedi (64.34%).\n","authors":["Nandish Chattopadhyay","Amira Guesmi","Muhammad Abdullah Hanif","Bassem Ouni","Muhammad Shafique"],"pdf_url":"https://arxiv.org/pdf/2311.12084v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14916v1","updated":"2024-08-27T09:44:54Z","published":"2024-08-27T09:44:54Z","title":"Towards Real-world Event-guided Low-light Video Enhancement and\n Deblurring","summary":" In low-light conditions, capturing videos with frame-based cameras often\nrequires long exposure times, resulting in motion blur and reduced visibility.\nWhile frame-based motion deblurring and low-light enhancement have been\nstudied, they still pose significant challenges. Event cameras have emerged as\na promising solution for improving image quality in low-light environments and\naddressing motion blur. They provide two key advantages: capturing scene\ndetails well even in low light due to their high dynamic range, and effectively\ncapturing motion information during long exposures due to their high temporal\nresolution. 
Despite efforts to tackle low-light enhancement and motion\ndeblurring using event cameras separately, previous work has not addressed both\nsimultaneously. To explore the joint task, we first establish real-world\ndatasets for event-guided low-light enhancement and deblurring using a hybrid\ncamera system based on beam splitters. Subsequently, we introduce an end-to-end\nframework to effectively handle these tasks. Our framework incorporates a\nmodule to efficiently leverage temporal information from events and frames.\nFurthermore, we propose a module to utilize cross-modal feature information to\nemploy a low-pass filter for noise suppression while enhancing the main\nstructural information. Our proposed method significantly outperforms existing\napproaches in addressing the joint task. Our project pages are available at\nhttps://github.com/intelpro/ELEDNet.\n","authors":["Taewoo Kim","Jaeseok Jeong","Hoonhee Cho","Yuhwan Jeong","Kuk-Jin Yoon"],"pdf_url":"https://arxiv.org/pdf/2408.14916v1.pdf","comment":"Accepted in ECCV2024"},{"id":"http://arxiv.org/abs/2408.14899v1","updated":"2024-08-27T09:23:18Z","published":"2024-08-27T09:23:18Z","title":"MeshUp: Multi-Target Mesh Deformation via Blended Score Distillation","summary":" We propose MeshUp, a technique that deforms a 3D mesh towards multiple target\nconcepts, and intuitively controls the region where each concept is expressed.\nConveniently, the concepts can be defined as either text queries, e.g., \"a dog\"\nand \"a turtle,\" or inspirational images, and the local regions can be selected\nas any number of vertices on the mesh. We can effectively control the influence\nof the concepts and mix them together using a novel score distillation\napproach, referred to as the Blended Score Distillation (BSD). BSD operates on\neach attention layer of the denoising U-Net of a diffusion model as it extracts\nand injects the per-objective activations into a unified denoising pipeline\nfrom which the deformation gradients are calculated. To localize the expression\nof these activations, we create a probabilistic Region of Interest (ROI) map on\nthe surface of the mesh, and turn it into 3D-consistent masks that we use to\ncontrol the expression of these activations. We demonstrate the effectiveness\nof BSD empirically and show that it can deform various meshes towards multiple\nobjectives.\n","authors":["Hyunwoo Kim","Itai Lang","Noam Aigerman","Thibault Groueix","Vladimir G. Kim","Rana Hanocka"],"pdf_url":"https://arxiv.org/pdf/2408.14899v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14895v1","updated":"2024-08-27T09:18:57Z","published":"2024-08-27T09:18:57Z","title":"VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view\n Videos of Daily Activities","summary":" Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data\n(e.g., images and videos) into symbols, have attracted attention as resources\nenabling knowledge processing and machine learning across modalities. However,\nthe construction of MMKGs for videos consisting of multiple events, such as\ndaily activities, is still in the early stages. In this paper, we construct an\nMMKG based on synchronized multi-view simulated videos of daily activities.\nBesides representing the content of daily life videos as event-centric\nknowledge, our MMKG also includes frame-by-frame fine-grained changes, such as\nbounding boxes within video frames. In addition, we provide support tools for\nquerying our MMKG. 
As an application example, we demonstrate that our MMKG\nfacilitates benchmarking vision-language models by providing the necessary\nvision-language datasets for a tailored task.\n","authors":["Shusaku Egami","Takahiro Ugai","Ken Fukuda"],"pdf_url":"https://arxiv.org/pdf/2408.14895v1.pdf","comment":"5 pages,4 figures, accepted by CIKM2024 Resource Track"},{"id":"http://arxiv.org/abs/2408.00591v2","updated":"2024-08-27T09:09:18Z","published":"2024-08-01T14:20:47Z","title":"Regional quality estimation for echocardiography using deep learning","summary":" Automatic estimation of cardiac ultrasound image quality can be beneficial\nfor guiding operators and ensuring the accuracy of clinical measurements.\nPrevious work often fails to distinguish the view correctness of the\nechocardiogram from the image quality. Additionally, previous studies only\nprovide a global image quality value, which limits their practical utility. In\nthis work, we developed and compared three methods to estimate image quality:\n1) classic pixel-based metrics like the generalized contrast-to-noise ratio\n(gCNR) on myocardial segments as region of interest and left ventricle lumen as\nbackground, obtained using a U-Net segmentation 2) local image coherence\nderived from a U-Net model that predicts coherence from B-Mode images 3) a deep\nconvolutional network that predicts the quality of each region directly in an\nend-to-end fashion. We evaluate each method against manual regional image\nquality annotations by three experienced cardiologists. The results indicate\npoor performance of the gCNR metric, with Spearman correlation to the\nannotations of rho = 0.24. The end-to-end learning model obtains the best\nresult, rho = 0.69, comparable to the inter-observer correlation, rho = 0.63.\nFinally, the coherence-based method, with rho = 0.58, outperformed the\nclassical metrics and is more generic than the end-to-end approach.\n","authors":["Gilles Van De Vyver","Svein-Erik Måsøy","Håvard Dalen","Bjørnar Leangen Grenne","Espen Holte","Sindre Hellum Olaisen","John Nyberg","Andreas Østvik","Lasse Løvstakken","Erik Smistad"],"pdf_url":"https://arxiv.org/pdf/2408.00591v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14879v1","updated":"2024-08-27T08:48:21Z","published":"2024-08-27T08:48:21Z","title":"Adversarial Manhole: Challenging Monocular Depth Estimation and Semantic\n Segmentation Models with Patch Attack","summary":" Monocular depth estimation (MDE) and semantic segmentation (SS) are crucial\nfor the navigation and environmental interpretation of many autonomous driving\nsystems. However, their vulnerability to practical adversarial attacks is a\nsignificant concern. This paper presents a novel adversarial attack using\npractical patches that mimic manhole covers to deceive MDE and SS models. The\ngoal is to cause these systems to misinterpret scenes, leading to false\ndetections of near obstacles or non-passable objects. We use Depth Planar\nMapping to precisely position these patches on road surfaces, enhancing the\nattack's effectiveness. Our experiments show that these adversarial patches\ncause a 43% relative error in MDE and achieve a 96% attack success rate in SS.\nThese patches create affected error regions over twice their size in MDE and\napproximately equal to their size in SS. 
Our studies also confirm the patch's\neffectiveness in physical simulations, the adaptability of the patches across\ndifferent target models, and the effectiveness of our proposed modules,\nhighlighting their practical implications.\n","authors":["Naufal Suryanto","Andro Aprila Adiputra","Ahmada Yusril Kadiptya","Yongsu Kim","Howon Kim"],"pdf_url":"https://arxiv.org/pdf/2408.14879v1.pdf","comment":"Accepted for WISA 2024. Code and dataset:\n https://github.com/naufalso/adversarial-manhole"},{"id":"http://arxiv.org/abs/2408.14868v1","updated":"2024-08-27T08:39:47Z","published":"2024-08-27T08:39:47Z","title":"ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning","summary":" Zero-shot learning (ZSL) aims to recognize unseen classes by transferring\nsemantic knowledge from seen classes to unseen ones, guided by semantic\ninformation. To this end, existing works have demonstrated remarkable\nperformance by utilizing global visual features from Convolutional Neural\nNetworks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions.\nDue to the limited receptive fields of CNNs and the quadratic complexity of\nViTs, however, these visual backbones achieve suboptimal visual-semantic\ninteractions. In this paper, motivated by the visual state space model (i.e.,\nVision Mamba), which is capable of capturing long-range dependencies and\nmodeling complex visual dynamics, we propose a parameter-efficient ZSL\nframework called ZeroMamba to advance ZSL. Our ZeroMamba comprises three key\ncomponents: Semantic-aware Local Projection (SLP), Global Representation\nLearning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates\nsemantic embeddings to map visual features to local semantic-related\nrepresentations, while GRL encourages the model to learn global semantic\nrepresentations. SeF combines these two semantic representations to enhance the\ndiscriminability of semantic features. We incorporate these designs into Vision\nMamba, forming an end-to-end ZSL framework. As a result, the learned semantic\nrepresentations are better suited for classification. Through extensive\nexperiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior\nperformance, significantly outperforming the state-of-the-art (i.e., CNN-based\nand ViT-based) methods under both conventional ZSL (CZSL) and generalized ZSL\n(GZSL) settings. Code is available at:\nhttps://anonymous.4open.science/r/ZeroMamba.\n","authors":["Wenjin Hou","Dingjie Fu","Kun Li","Shiming Chen","Hehe Fan","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2408.14868v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.00737v2","updated":"2024-08-27T08:33:31Z","published":"2024-06-30T15:50:32Z","title":"LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image\n Generation","summary":" Diffusion models have exhibited substantial success in text-to-image\ngeneration. However, they often encounter challenges when dealing with complex\nand dense prompts involving multiple objects, attribute binding, and long\ndescriptions. In this paper, we propose a novel framework called\n\\textbf{LLM4GEN}, which enhances the semantic understanding of text-to-image\ndiffusion models by leveraging the representation of Large Language Models\n(LLMs). It can be seamlessly incorporated into various diffusion models as a\nplug-and-play component. A specially designed Cross-Adapter Module (CAM)\nintegrates the original text features of text-to-image models with LLM\nfeatures, thereby enhancing text-to-image generation. 
Additionally, to\nfacilitate and correct entity-attribute relationships in text prompts, we\ndevelop an entity-guided regularization loss to further improve generation\nperformance. We also introduce DensePrompts, which contains $7,000$ dense\nprompts to provide a comprehensive evaluation for the text-to-image generation\ntask. Experiments indicate that LLM4GEN significantly improves the semantic\nalignment of SD1.5 and SDXL, demonstrating increases of 9.69\\% and 12.90\\% in\ncolor on T2I-CompBench, respectively. Moreover, it surpasses existing models in\nterms of sample quality, image-text alignment, and human evaluation.\n","authors":["Mushui Liu","Yuhang Ma","Yang Zhen","Jun Dan","Yunlong Yu","Zeng Zhao","Zhipeng Hu","Bai Liu","Changjie Fan"],"pdf_url":"https://arxiv.org/pdf/2407.00737v2.pdf","comment":"11 pages, 13 figures"},{"id":"http://arxiv.org/abs/2408.14860v1","updated":"2024-08-27T08:28:01Z","published":"2024-08-27T08:28:01Z","title":"DiffSurf: A Transformer-based Diffusion Model for Generating and\n Reconstructing 3D Surfaces in Pose","summary":" This paper presents DiffSurf, a transformer-based denoising diffusion model\nfor generating and reconstructing 3D surfaces. Specifically, we design a\ndiffusion transformer architecture that predicts noise from noisy 3D surface\nvertices and normals. With this architecture, DiffSurf is able to generate 3D\nsurfaces in various poses and shapes, such as human bodies, hands, animals and\nman-made objects. Further, DiffSurf is versatile in that it can address various\n3D downstream tasks including morphing, body shape variation and 3D human mesh\nfitting to 2D keypoints. Experimental results on 3D human model benchmarks\ndemonstrate that DiffSurf can generate shapes with greater diversity and higher\nquality than previous generative models. Furthermore, when applied to the task\nof single-image 3D human mesh recovery, DiffSurf achieves accuracy comparable\nto prior techniques at a near real-time rate.\n","authors":["Yusuke Yoshiyasu","Leyuan Sun"],"pdf_url":"https://arxiv.org/pdf/2408.14860v1.pdf","comment":"Accepted at ECCV2024"},{"id":"http://arxiv.org/abs/2405.18911v2","updated":"2024-08-27T08:22:54Z","published":"2024-05-29T09:13:30Z","title":"Exploring Human-in-the-Loop Test-Time Adaptation by Synergizing Active\n Learning and Model Selection","summary":" Existing test-time adaptation (TTA) approaches often adapt models with the\nunlabeled testing data stream. A recent attempt relaxed the assumption by\nintroducing limited human annotation, referred to as Human-In-the-Loop\nTest-Time Adaptation (HILTTA) in this study. The focus of existing HILTTA\nstudies lies in selecting the most informative samples to label, a.k.a. active\nlearning. In this work, we are motivated by a pitfall of TTA, i.e. sensitivity\nto hyper-parameters, and propose to approach HILTTA by synergizing active\nlearning and model selection. Specifically, we first select samples for human\nannotation (active learning) and then use the labeled data to select optimal\nhyper-parameters (model selection). To prevent the model selection process from\noverfitting to local distributions, multiple regularization techniques are\nemployed to complement the validation objective. A sample selection strategy is\nfurther tailored by considering the balance between active learning and model\nselection purposes. 
We demonstrate on 5 TTA datasets that the proposed HILTTA\napproach is compatible with off-the-shelf TTA methods and such combinations\nsubstantially outperform the state-of-the-art HILTTA methods. Importantly, our\nproposed method can always prevent choosing the worst hyper-parameters on all\noff-the-shelf TTA methods. The source code will be released upon publication.\n","authors":["Yushu Li","Yongyi Su","Xulei Yang","Kui Jia","Xun Xu"],"pdf_url":"https://arxiv.org/pdf/2405.18911v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09126v2","updated":"2024-08-27T08:19:22Z","published":"2024-08-17T07:27:14Z","title":"Barbie: Text to Barbie-Style 3D Avatars","summary":" Recent advances in text-guided 3D avatar generation have made substantial\nprogress by distilling knowledge from diffusion models. Despite the plausible\ngenerated appearance, existing methods cannot achieve fine-grained\ndisentanglement or high-fidelity modeling between inner body and outfit. In\nthis paper, we propose Barbie, a novel framework for generating 3D avatars that\ncan be dressed in diverse and high-quality Barbie-like garments and\naccessories. Instead of relying on a holistic model, Barbie achieves\nfine-grained disentanglement on avatars by semantic-aligned separated models\nfor human body and outfits. These disentangled 3D representations are then\noptimized by different expert models to guarantee the domain-specific fidelity.\nTo balance geometry diversity and reasonableness, we propose a series of losses\nfor template-preserving and human-prior evolving. The final avatar is enhanced\nby unified texture refinement for superior texture consistency. Extensive\nexperiments demonstrate that Barbie outperforms existing methods in both\ndressed human and outfit generation, supporting flexible apparel combination\nand animation. The code will be released for research purposes. Our project\npage is: https://xiaokunsun.github.io/Barbie.github.io/.\n","authors":["Xiaokun Sun","Zhenyu Zhang","Ying Tai","Qian Wang","Hao Tang","Zili Yi","Jian Yang"],"pdf_url":"https://arxiv.org/pdf/2408.09126v2.pdf","comment":"9 pages, 7 figures"},{"id":"http://arxiv.org/abs/2407.21705v2","updated":"2024-08-27T08:14:16Z","published":"2024-07-31T15:53:20Z","title":"Tora: Trajectory-oriented Diffusion Transformer for Video Generation","summary":" Recent advancements in Diffusion Transformer (DiT) have demonstrated\nremarkable proficiency in producing high-quality video content. Nonetheless,\nthe potential of transformer-based diffusion models for effectively generating\nvideos with controllable motion remains an area of limited exploration. This\npaper introduces Tora, the first trajectory-oriented DiT framework that\nconcurrently integrates textual, visual, and trajectory conditions, thereby\nenabling scalable video generation with effective motion guidance.\nSpecifically, Tora consists of a Trajectory Extractor(TE), a Spatial-Temporal\nDiT, and a Motion-guidance Fuser(MGF). The TE encodes arbitrary trajectories\ninto hierarchical spacetime motion patches with a 3D video compression network.\nThe MGF integrates the motion patches into the DiT blocks to generate\nconsistent videos that accurately follow designated trajectories. 
Our design\naligns seamlessly with DiT's scalability, allowing precise control of video\ncontent's dynamics with diverse durations, aspect ratios, and resolutions.\nExtensive experiments demonstrate Tora's excellence in achieving high motion\nfidelity, while also meticulously simulating the intricate movement of the\nphysical world.\n","authors":["Zhenghao Zhang","Junchao Liao","Menghao Li","Zuozhuo Dai","Bingxue Qiu","Siyu Zhu","Long Qin","Weizhi Wang"],"pdf_url":"https://arxiv.org/pdf/2407.21705v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16580v4","updated":"2024-08-27T08:13:01Z","published":"2023-05-26T02:09:48Z","title":"TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection","summary":" Pedestrian detection plays a critical role in computer vision as it\ncontributes to ensuring traffic safety. Existing methods that rely solely on\nRGB images suffer from performance degradation under low-light conditions due\nto the lack of useful information. To address this issue, recent multispectral\ndetection approaches have combined thermal images to provide complementary\ninformation and have obtained enhanced performances. Nevertheless, few\napproaches focus on the negative effects of false positives caused by noisy\nfused feature maps. Different from them, we comprehensively analyze the impacts\nof false positives on the detection performance and find that enhancing feature\ncontrast can significantly reduce these false positives. In this paper, we\npropose a novel target-aware fusion strategy for multispectral pedestrian\ndetection, named TFDet. TFDet achieves state-of-the-art performance on two\nmultispectral pedestrian benchmarks, KAIST and LLVIP. TFDet can easily extend\nto multi-class object detection scenarios. It outperforms the previous best\napproaches on two multispectral object detection benchmarks, FLIR and M3FD.\nImportantly, TFDet has comparable inference efficiency to the previous\napproaches, and has remarkably good detection performance even under low-light\nconditions, which is a significant advancement for ensuring road safety.\n","authors":["Xue Zhang","Xiaohan Zhang","Jiangtao Wang","Jiacheng Ying","Zehua Sheng","Heng Yu","Chunguang Li","Hui-Liang Shen"],"pdf_url":"https://arxiv.org/pdf/2305.16580v4.pdf","comment":"This paper has been accepted by IEEE T-NNLS journal. Please jump to\n External DOI to view the official version"},{"id":"http://arxiv.org/abs/2408.13766v2","updated":"2024-08-27T08:07:20Z","published":"2024-08-25T08:23:06Z","title":"Enhancing Robustness of Human Detection Algorithms in Maritime SAR\n through Augmented Aerial Images to Simulate Weather Conditions","summary":" 7,651 cases of Search and Rescue Missions (SAR) were reported by the United\nStates Coast Guard in 2024, with over 1322 SAR helicopters deployed in the 6\nfirst months alone. Through the utilizations of YOLO, we were able to run\ndifferent weather conditions and lighting from our augmented dataset for\ntraining. YOLO then utilizes CNNs to apply a series of convolutions and pooling\nlayers to the input image, where the convolution layers are able to extract the\nmain features of the image. Through this, our YOLO model is able to learn to\ndifferentiate different objects which may considerably improve its accuracy,\npossibly enhancing the efficiency of SAR operations through enhanced detection\naccuracy. 
This paper aims to improve the model's accuracy of human detection in\nmaritime SAR by evaluating a robust datasets containing various elevations and\ngeological locations, as well as through data augmentation which simulates\ndifferent weather and lighting. We observed that models trained on augmented\ndatasets outperformed their non-augmented counterparts in which the human\nrecall scores ranged from 0.891 to 0.911 with an improvement rate of 3.4\\% on\nthe YOLOv5l model. Results showed that these models demonstrate greater\nrobustness to real-world conditions in varying of weather, brightness, tint,\nand contrast.\n","authors":["Miguel Tjia","Artem Kim","Elaine Wynette Wijaya","Hanna Tefara","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.13766v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.19698v2","updated":"2024-08-27T08:06:38Z","published":"2024-07-29T04:43:58Z","title":"Classification Matters: Improving Video Action Detection with\n Class-Specific Attention","summary":" Video action detection (VAD) aims to detect actors and classify their actions\nin a video. We figure that VAD suffers more from classification rather than\nlocalization of actors. Hence, we analyze how prevailing methods form features\nfor classification and find that they prioritize actor regions, yet often\noverlooking the essential contextual information necessary for accurate\nclassification. Accordingly, we propose to reduce the bias toward actor and\nencourage paying attention to the context that is relevant to each action\nclass. By assigning a class-dedicated query to each action class, our model can\ndynamically determine where to focus for effective classification. The proposed\nmodel demonstrates superior performance on three challenging benchmarks with\nsignificantly fewer parameters and less computation.\n","authors":["Jinsung Lee","Taeoh Kim","Inwoong Lee","Minho Shim","Dongyoon Wee","Minsu Cho","Suha Kwak"],"pdf_url":"https://arxiv.org/pdf/2407.19698v2.pdf","comment":"31 pages, accepted to ECCV 2024 (oral)"},{"id":"http://arxiv.org/abs/2408.14847v1","updated":"2024-08-27T07:58:08Z","published":"2024-08-27T07:58:08Z","title":"Intraoperative Glioma Segmentation with YOLO + SAM for Improved Accuracy\n in Tumor Resection","summary":" Gliomas, a common type of malignant brain tumor, present significant surgical\nchallenges due to their similarity to healthy tissue. Preoperative Magnetic\nResonance Imaging (MRI) images are often ineffective during surgery due to\nfactors such as brain shift, which alters the position of brain structures and\ntumors. This makes real-time intraoperative MRI (ioMRI) crucial, as it provides\nupdated imaging that accounts for these shifts, ensuring more accurate tumor\nlocalization and safer resections. This paper presents a deep learning pipeline\ncombining You Only Look Once Version 8 (YOLOv8) and Segment Anything Model\nVision Transformer-base (SAM ViT-b) to enhance glioma detection and\nsegmentation during ioMRI. Our model was trained using the Brain Tumor\nSegmentation 2021 (BraTS 2021) dataset, which includes standard magnetic\nresonance imaging (MRI) images, and noise-augmented MRI images that simulate\nioMRI images. Noised MRI images are harder for a deep learning pipeline to\nsegment, but they are more representative of surgical conditions. Achieving a\nDice Similarity Coefficient (DICE) score of 0.79, our model performs comparably\nto state-of-the-art segmentation models tested on noiseless data. 
This\nperformance demonstrates the model's potential to assist surgeons in maximizing\ntumor resection and improving surgical outcomes.\n","authors":["Samir Kassam","Angelo Markham","Katie Vo","Yashas Revanakara","Michael Lam","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.14847v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14846v1","updated":"2024-08-27T07:57:58Z","published":"2024-08-27T07:57:58Z","title":"Diffusion-Occ: 3D Point Cloud Completion via Occupancy Diffusion","summary":" Point clouds are crucial for capturing three-dimensional data but often\nsuffer from incompleteness due to limitations such as resolution and occlusion.\nTraditional methods typically rely on point-based approaches within\ndiscriminative frameworks for point cloud completion. In this paper, we\nintroduce \\textbf{Diffusion-Occ}, a novel framework for Diffusion Point Cloud\nCompletion. Diffusion-Occ utilizes a two-stage coarse-to-fine approach. In the\nfirst stage, the Coarse Density Voxel Prediction Network (CDNet) processes\npartial points to predict coarse density voxels, streamlining global feature\nextraction through voxel classification, as opposed to previous\nregression-based methods. In the second stage, we introduce the Occupancy\nGeneration Network (OccGen), a conditional occupancy diffusion model based on a\ntransformer architecture and enhanced by our Point-Voxel Fuse (PVF) block. This\nblock integrates coarse density voxels with partial points to leverage both\nglobal and local features for comprehensive completion. By thresholding the\noccupancy field, we convert it into a complete point cloud. Additionally, our\nmethod employs diverse training mixtures and efficient diffusion\nparameterization to enable effective one-step sampling during both training and\ninference. Experimental results demonstrate that Diffusion-Occ outperforms\nexisting discriminative and generative methods.\n","authors":["Guoqing Zhang","Jian Liu"],"pdf_url":"https://arxiv.org/pdf/2408.14846v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14842v1","updated":"2024-08-27T07:54:01Z","published":"2024-08-27T07:54:01Z","title":"From Bias to Balance: Detecting Facial Expression Recognition Biases in\n Large Multimodal Foundation Models","summary":" This study addresses the racial biases in facial expression recognition (FER)\nsystems within Large Multimodal Foundation Models (LMFMs). Despite advances in\ndeep learning and the availability of diverse datasets, FER systems often\nexhibit higher error rates for individuals with darker skin tones. Existing\nresearch predominantly focuses on traditional FER models (CNNs, RNNs, ViTs),\nleaving a gap in understanding racial biases in LMFMs. We benchmark four\nleading LMFMs: GPT-4o, PaliGemma, Gemini, and CLIP to assess their performance\nin facial emotion detection across different racial demographics. A linear\nclassifier trained on CLIP embeddings obtains accuracies of 95.9\\% for RADIATE,\n90.3\\% for Tarr, and 99.5\\% for Chicago Face. Furthermore, we identify that\nAnger is misclassified as Disgust 2.1 times more often in Black Females than\nWhite Females. 
This study highlights the need for fairer FER systems and\nestablishes a foundation for developing unbiased, accurate FER technologies.\nVisit https://kvjvhub.github.io/FERRacialBias/ for further information\nregarding the biases within facial expression recognition.\n","authors":["Kaylee Chhua","Zhoujinyi Wen","Vedant Hathalia","Kevin Zhu","Sean O'Brien"],"pdf_url":"https://arxiv.org/pdf/2408.14842v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14841v1","updated":"2024-08-27T07:52:44Z","published":"2024-08-27T07:52:44Z","title":"Diffusion based Semantic Outlier Generation via Nuisance Awareness for\n Out-of-Distribution Detection","summary":" Out-of-distribution (OOD) detection, which determines whether a given sample\nis part of the in-distribution (ID), has recently shown promising results\nthrough training with synthetic OOD datasets. Nonetheless, existing methods\noften produce outliers that are considerably distant from the ID, showing\nlimited efficacy for capturing subtle distinctions between ID and OOD. To\naddress these issues, we propose a novel framework, Semantic Outlier generation\nvia Nuisance Awareness (SONA), which notably produces challenging outliers by\ndirectly leveraging pixel-space ID samples through diffusion models. Our\napproach incorporates SONA guidance, providing separate control over semantic\nand nuisance regions of ID samples. Thereby, the generated outliers achieve two\ncrucial properties: (i) they present explicit semantic-discrepant information,\nwhile (ii) maintaining various levels of nuisance resemblance with ID.\nFurthermore, the improved OOD detector training with SONA outliers facilitates\nlearning with a focus on semantic distinctions. Extensive experiments\ndemonstrate the effectiveness of our framework, achieving an impressive AUROC\nof 88% on near-OOD datasets, which surpasses the performance of baseline\nmethods by a significant margin of approximately 6%.\n","authors":["Suhee Yoon","Sanghyu Yoon","Hankook Lee","Ye Seul Sim","Sungik Choi","Kyungeun Lee","Hye-Seung Cho","Woohyung Lim"],"pdf_url":"https://arxiv.org/pdf/2408.14841v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14837v1","updated":"2024-08-27T07:46:07Z","published":"2024-08-27T07:46:07Z","title":"Diffusion Models Are Real-Time Game Engines","summary":" We present GameNGen, the first game engine powered entirely by a neural model\nthat enables real-time interaction with a complex environment over long\ntrajectories at high quality. GameNGen can interactively simulate the classic\ngame DOOM at over 20 frames per second on a single TPU. Next frame prediction\nachieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are\nonly slightly better than random chance at distinguishing short clips of the\ngame from clips of the simulation. GameNGen is trained in two phases: (1) an\nRL-agent learns to play the game and the training sessions are recorded, and\n(2) a diffusion model is trained to produce the next frame, conditioned on the\nsequence of past frames and actions. 
Conditioning augmentations enable stable\nauto-regressive generation over long trajectories.\n","authors":["Dani Valevski","Yaniv Leviathan","Moab Arar","Shlomi Fruchter"],"pdf_url":"https://arxiv.org/pdf/2408.14837v1.pdf","comment":"Project page: https://gamengen.github.io/"},{"id":"http://arxiv.org/abs/2407.16232v2","updated":"2024-08-27T07:31:37Z","published":"2024-07-23T07:17:10Z","title":"Channel-Partitioned Windowed Attention And Frequency Learning for Single\n Image Super-Resolution","summary":" Recently, window-based attention methods have shown great potential for\ncomputer vision tasks, particularly in Single Image Super-Resolution (SISR).\nHowever, it may fall short in capturing long-range dependencies and\nrelationships between distant tokens. Additionally, we find that learning on\nspatial domain does not convey the frequency content of the image, which is a\ncrucial aspect in SISR. To tackle these issues, we propose a new\nChannel-Partitioned Attention Transformer (CPAT) to better capture long-range\ndependencies by sequentially expanding windows along the height and width of\nfeature maps. In addition, we propose a novel Spatial-Frequency Interaction\nModule (SFIM), which incorporates information from spatial and frequency\ndomains to provide a more comprehensive information from feature maps. This\nincludes information about the frequency content and enhances the receptive\nfield across the entire image. Experimental findings show the effectiveness of\nour proposed modules and architecture. In particular, CPAT surpasses current\nstate-of-the-art methods by up to 0.31dB at x2 SR on Urban100.\n","authors":["Dinh Phu Tran","Dao Duy Hung","Daeyoung Kim"],"pdf_url":"https://arxiv.org/pdf/2407.16232v2.pdf","comment":"Camera ready version, BMVC 2024"},{"id":"http://arxiv.org/abs/2408.14829v1","updated":"2024-08-27T07:26:10Z","published":"2024-08-27T07:26:10Z","title":"Time-Aware Face Anti-Spoofing with Rotation Invariant Local Binary\n Patterns and Deep Learning","summary":" Facial recognition systems have become an integral part of the modern world.\nThese methods accomplish the task of human identification in an automatic,\nfast, and non-interfering way. Past research has uncovered high vulnerability\nto simple imitation attacks that could lead to erroneous identification and\nsubsequent authentication of attackers. Similar to face recognition, imitation\nattacks can also be detected with Machine Learning. Attack detection systems\nuse a variety of facial features and advanced machine learning models for\nuncovering the presence of attacks. In this work, we assess existing work on\nliveness detection and propose a novel approach that promises high\nclassification accuracy by combining previously unused features with time-aware\ndeep learning strategies.\n","authors":["Moritz Finke","Alexandra Dmitrienko"],"pdf_url":"https://arxiv.org/pdf/2408.14829v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.12017v3","updated":"2024-08-27T07:23:22Z","published":"2023-08-23T09:20:05Z","title":"Distribution-Aware Calibration for Object Detection with Noisy Bounding\n Boxes","summary":" Large-scale well-annotated datasets are of great importance for training an\neffective object detector. However, obtaining accurate bounding box annotations\nis laborious and demanding. Unfortunately, the resultant noisy bounding boxes\ncould cause corrupt supervision signals and thus diminish detection\nperformance. 
Motivated by the observation that the real ground-truth is usually\nsituated in the aggregation region of the proposals assigned to a noisy\nground-truth, we propose DIStribution-aware CalibratiOn (DISCO) to model the\nspatial distribution of proposals for calibrating supervision signals. In\nDISCO, spatial distribution modeling is performed to statistically extract the\npotential locations of objects. Based on the modeled distribution, three\ndistribution-aware techniques, i.e., distribution-aware proposal augmentation\n(DA-Aug), distribution-aware box refinement (DA-Ref), and distribution-aware\nconfidence estimation (DA-Est), are developed to improve classification,\nlocalization, and interpretability, respectively. Extensive experiments on\nlarge-scale noisy image datasets (i.e., Pascal VOC and MS-COCO) demonstrate\nthat DISCO can achieve state-of-the-art detection performance, especially at\nhigh noise levels. Code is available at https://github.com/Correr-Zhou/DISCO.\n","authors":["Donghao Zhou","Jialin Li","Jinpeng Li","Jiancheng Huang","Qiang Nie","Yong Liu","Bin-Bin Gao","Qiong Wang","Pheng-Ann Heng","Guangyong Chen"],"pdf_url":"https://arxiv.org/pdf/2308.12017v3.pdf","comment":"Accepted by BMVC2024"},{"id":"http://arxiv.org/abs/2408.11413v2","updated":"2024-08-27T07:21:02Z","published":"2024-08-21T08:19:12Z","title":"Pano2Room: Novel View Synthesis from a Single Indoor Panorama","summary":" Recent single-view 3D generative methods have made significant advancements\nby leveraging knowledge distilled from extensive 3D object datasets. However,\nchallenges persist in the synthesis of 3D scenes from a single view, primarily\ndue to the complexity of real-world environments and the limited availability\nof high-quality prior resources. In this paper, we introduce a novel approach\ncalled Pano2Room, designed to automatically reconstruct high-quality 3D indoor\nscenes from a single panoramic image. These panoramic images can be easily\ngenerated using a panoramic RGBD inpainter from captures at a single location\nwith any camera. The key idea is to initially construct a preliminary mesh from\nthe input panorama, and iteratively refine this mesh using a panoramic RGBD\ninpainter while collecting photo-realistic 3D-consistent pseudo novel views.\nFinally, the refined mesh is converted into a 3D Gaussian Splatting field and\ntrained with the collected pseudo novel views. This pipeline enables the\nreconstruction of real-world 3D scenes, even in the presence of large\nocclusions, and facilitates the synthesis of photo-realistic novel views with\ndetailed geometry. Extensive qualitative and quantitative experiments have been\nconducted to validate the superiority of our method in single-panorama indoor\nnovel synthesis compared to the state-of-the-art. 
Our code and data are\navailable at \\url{https://github.com/TrickyGo/Pano2Room}.\n","authors":["Guo Pu","Yiming Zhao","Zhouhui Lian"],"pdf_url":"https://arxiv.org/pdf/2408.11413v2.pdf","comment":"SIGGRAPH Asia 2024 Conference Papers (SA Conference Papers '24),\n December 3--6, 2024, Tokyo, Japan"},{"id":"http://arxiv.org/abs/2408.14826v1","updated":"2024-08-27T07:13:44Z","published":"2024-08-27T07:13:44Z","title":"Alfie: Democratising RGBA Image Generation With No $$$","summary":" Designs and artworks are ubiquitous across various creative fields, requiring\ngraphic design skills and dedicated software to create compositions that\ninclude many graphical elements, such as logos, icons, symbols, and art scenes,\nwhich are integral to visual storytelling. Automating the generation of such\nvisual elements improves graphic designers' productivity, democratizes and\ninnovates the creative industry, and helps generate more realistic synthetic\ndata for related tasks. These illustration elements are mostly RGBA images with\nirregular shapes and cutouts, facilitating blending and scene composition.\nHowever, most image generation models are incapable of generating such images\nand achieving this capability requires expensive computational resources,\nspecific training recipes, or post-processing solutions. In this work, we\npropose a fully-automated approach for obtaining RGBA illustrations by\nmodifying the inference-time behavior of a pre-trained Diffusion Transformer\nmodel, exploiting the prompt-guided controllability and visual quality offered\nby such models with no additional computational cost. We force the generation\nof entire subjects without sharp croppings, whose background is easily removed\nfor seamless integration into design projects or artistic scenes. We show with\na user study that, in most cases, users prefer our solution over generating and\nthen matting an image, and we show that our generated illustrations yield good\nresults when used as inputs for composite scene generation pipelines. We\nrelease the code at https://github.com/aimagelab/Alfie.\n","authors":["Fabio Quattrini","Vittorio Pippi","Silvia Cascianelli","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2408.14826v1.pdf","comment":"Accepted at ECCV AI for Visual Arts Workshop and Challenges"},{"id":"http://arxiv.org/abs/2408.13423v2","updated":"2024-08-27T07:12:52Z","published":"2024-08-24T01:33:28Z","title":"Training-free Long Video Generation with Chain of Diffusion Model\n Experts","summary":" Video generation models hold substantial potential in areas such as\nfilmmaking. However, current video diffusion models need high computational\ncosts and produce suboptimal results due to high complexity of video generation\ntask. In this paper, we propose \\textbf{ConFiner}, an efficient high-quality\nvideo generation framework that decouples video generation into easier\nsubtasks: structure \\textbf{con}trol and spatial-temporal re\\textbf{fine}ment.\nIt can generate high-quality videos with chain of off-the-shelf diffusion model\nexperts, each expert responsible for a decoupled subtask. During the\nrefinement, we introduce coordinated denoising, which can merge multiple\ndiffusion experts' capabilities into a single sampling. Furthermore, we design\nConFiner-Long framework, which can generate long coherent video with three\nconstraint strategies on ConFiner. 
Experimental results indicate that with only\n10\\% of the inference cost, our ConFiner surpasses representative models like\nLavie and Modelscope across all objective and subjective metrics. And\nConFiner-Long can generate high-quality and coherent videos with up to 600\nframes.\n","authors":["Wenhao Li","Yichao Cao","Xiu Su","Xi Lin","Shan You","Mingkai Zheng","Yi Chen","Chang Xu"],"pdf_url":"https://arxiv.org/pdf/2408.13423v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14825v1","updated":"2024-08-27T07:11:45Z","published":"2024-08-27T07:11:45Z","title":"From Rule-Based Models to Deep Learning Transformers Architectures for\n Natural Language Processing and Sign Language Translation Systems: Survey,\n Taxonomy and Performance Evaluation","summary":" With the growing Deaf and Hard of Hearing population worldwide and the\npersistent shortage of certified sign language interpreters, there is a\npressing need for an efficient, signs-driven, integrated end-to-end translation\nsystem, from sign to gloss to text and vice-versa. There has been a wealth of\nresearch on machine translations and related reviews. However, there are few\nworks on sign language machine translation considering the particularity of the\nlanguage being continuous and dynamic. This paper aims to address this void,\nproviding a retrospective analysis of the temporal evolution of sign language\nmachine translation algorithms and a taxonomy of the Transformers\narchitectures, the most used approach in language translation. We also present\nthe requirements of a real-time Quality-of-Service sign language ma-chine\ntranslation system underpinned by accurate deep learning algorithms. We propose\nfuture research directions for sign language translation systems.\n","authors":["Nada Shahin","Leila Ismail"],"pdf_url":"https://arxiv.org/pdf/2408.14825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14823v1","updated":"2024-08-27T07:06:49Z","published":"2024-08-27T07:06:49Z","title":"LapisGS: Layered Progressive 3D Gaussian Splatting for Adaptive\n Streaming","summary":" The rise of Extended Reality (XR) requires efficient streaming of 3D online\nworlds, challenging current 3DGS representations to adapt to\nbandwidth-constrained environments. This paper proposes LapisGS, a layered 3DGS\nthat supports adaptive streaming and progressive rendering. Our method\nconstructs a layered structure for cumulative representation, incorporates\ndynamic opacity optimization to maintain visual fidelity, and utilizes\noccupancy maps to efficiently manage Gaussian splats. This proposed model\noffers a progressive representation supporting a continuous rendering quality\nadapted for bandwidth-aware streaming. 
Extensive experiments validate the\neffectiveness of our approach in balancing visual fidelity with the compactness\nof the model, with up to 50.71% improvement in SSIM, 286.53% improvement in\nLPIPS, and 318.41% reduction in model size, and shows its potential for\nbandwidth-adapted 3D streaming and rendering applications.\n","authors":["Yuang Shi","Simone Gasparini","Géraldine Morin","Wei Tsang Ooi"],"pdf_url":"https://arxiv.org/pdf/2408.14823v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.09193v2","updated":"2024-08-27T07:02:07Z","published":"2024-04-14T09:01:26Z","title":"FaceCat: Enhancing Face Recognition Security with a Unified Diffusion\n Model","summary":" Face anti-spoofing (FAS) and adversarial detection (FAD) have been regarded\nas critical technologies to ensure the safety of face recognition systems.\nHowever, due to limited practicality, complex deployment, and the additional\ncomputational overhead, it is necessary to implement both detection techniques\nwithin a unified framework. This paper aims to achieve this goal by breaking\nthrough two primary obstacles: 1) the suboptimal face feature representation\nand 2) the scarcity of training data. To address the limited performance caused\nby existing feature representations, motivated by the rich structural and\ndetailed features of face diffusion models, we propose FaceCat, the first\napproach leveraging the diffusion model to simultaneously enhance the\nperformance of FAS and FAD. Specifically, FaceCat elaborately designs a\nhierarchical fusion mechanism to capture rich face semantic features of the\ndiffusion model. These features then serve as a robust foundation for a\nlightweight head, designed to execute FAS and FAD simultaneously. Due to the\nlimitations in feature representation that arise from relying solely on\nsingle-modality image data, we further propose a novel text-guided multi-modal\nalignment strategy that utilizes text prompts to enrich feature representation,\nthereby enhancing performance. To combat data scarcity, we build a\ncomprehensive dataset with a wide range of 28 attack types, offering greater\npotential for a unified framework in facial security. Extensive experiments\nvalidate the effectiveness of FaceCat generalizes significantly better and\nobtains excellent robustness against common input transformations.\n","authors":["Jiawei Chen","Xiao Yang","Yinpeng Dong","Hang Su","Zhaoxia Yin"],"pdf_url":"https://arxiv.org/pdf/2404.09193v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2408.14819v1","updated":"2024-08-27T07:01:56Z","published":"2024-08-27T07:01:56Z","title":"Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image\n Generation","summary":" We propose a diffusion-based approach for Text-to-Image (T2I) generation with\ninteractive 3D layout control. Layout control has been widely studied to\nalleviate the shortcomings of T2I diffusion models in understanding objects'\nplacement and relationships from text descriptions. Nevertheless, existing\napproaches for layout control are limited to 2D layouts, require the user to\nprovide a static layout beforehand, and fail to preserve generated images under\nlayout changes. This makes these approaches unsuitable for applications that\nrequire 3D object-wise control and iterative refinements, e.g., interior design\nand complex scene generation. To this end, we leverage the recent advancements\nin depth-conditioned T2I models and propose a novel approach for interactive 3D\nlayout control. 
We replace the traditional 2D boxes used in layout control with\n3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation\nprocess, where at each stage, the user can insert, change, and move an object\nin 3D while preserving objects from earlier stages. We achieve this through our\nproposed Dynamic Self-Attention (DSA) module and the consistent 3D object\ntranslation strategy. Experiments show that our approach can generate\ncomplicated scenes based on 3D layouts, boosting the object generation success\nrate over the standard depth-conditioned T2I methods by 2x. Moreover, it\noutperforms other methods in comparison in preserving objects under layout\nchanges. Project Page: \\url{https://abdo-eldesokey.github.io/build-a-scene/}\n","authors":["Abdelrahman Eldesokey","Peter Wonka"],"pdf_url":"https://arxiv.org/pdf/2408.14819v1.pdf","comment":"Project Page: https://abdo-eldesokey.github.io/build-a-scene/"},{"id":"http://arxiv.org/abs/2408.14812v1","updated":"2024-08-27T06:50:28Z","published":"2024-08-27T06:50:28Z","title":"HPT++: Hierarchically Prompting Vision-Language Models with\n Multi-Granularity Knowledge Generation and Improved Structure Modeling","summary":" Prompt learning has become a prevalent strategy for adapting vision-language\nfoundation models (VLMs) such as CLIP to downstream tasks. With the emergence\nof large language models (LLMs), recent studies have explored the potential of\nusing category-related descriptions to enhance prompt effectiveness. However,\nconventional descriptions lack explicit structured information necessary to\nrepresent the interconnections among key elements like entities or attributes\nwith relation to a particular category. Since existing prompt tuning methods\ngive little consideration to managing structured knowledge, this paper\nadvocates leveraging LLMs to construct a graph for each description to\nprioritize such structured knowledge. Consequently, we propose a novel approach\ncalled Hierarchical Prompt Tuning (HPT), enabling simultaneous modeling of both\nstructured and conventional linguistic knowledge. Specifically, we introduce a\nrelationship-guided attention module to capture pair-wise associations among\nentities and attributes for low-level prompt learning. In addition, by\nincorporating high-level and global-level prompts modeling overall semantics,\nthe proposed hierarchical structure forges cross-level interlinks and empowers\nthe model to handle more complex and long-term relationships. Finally, by\nenhancing multi-granularity knowledge generation, redesigning the\nrelationship-driven attention re-weighting module, and incorporating consistent\nconstraints on the hierarchical text encoder, we propose HPT++, which further\nimproves the performance of HPT. Our experiments are conducted across a wide\nrange of evaluation settings, including base-to-new generalization,\ncross-dataset evaluation, and domain generalization. Extensive results and\nablation studies demonstrate the effectiveness of our methods, which\nconsistently outperform existing SOTA methods.\n","authors":["Yubin Wang","Xinyang Jiang","De Cheng","Wenli Sun","Dongsheng Li","Cairong Zhao"],"pdf_url":"https://arxiv.org/pdf/2408.14812v1.pdf","comment":"19 pages, 7 figures, 7 tables. 
arXiv admin note: substantial text\n overlap with arXiv:2312.06323"},{"id":"http://arxiv.org/abs/2408.14810v1","updated":"2024-08-27T06:49:21Z","published":"2024-08-27T06:49:21Z","title":"Generalist Segmentation Algorithm for Photoreceptors Analysis in\n Adaptive Optics Imaging","summary":" Analyzing the cone photoreceptor pattern in images obtained from the living\nhuman retina using quantitative methods can be crucial for the early detection\nand management of various eye conditions. Confocal adaptive optics scanning\nlight ophthalmoscope (AOSLO) imaging enables visualization of the cones from\nreflections of waveguiding cone photoreceptors. While there have been\nsignificant improvements in automated algorithms for segmenting cones in\nconfocal AOSLO images, the process of labelling data remains labor-intensive\nand manual. This paper introduces a method based on deep learning (DL) for\ndetecting and segmenting cones in AOSLO images. The models were trained on a\nsemi-automatically labelled dataset of 20 AOSLO batches of images of 18\nparticipants for 0$^{\\circ}$, 1$^{\\circ}$, and 2$^{\\circ}$ from the foveal\ncenter. F1 scores were 0.968, 0.958, and 0.954 for 0$^{\\circ}$, 1$^{\\circ}$,\nand 2$^{\\circ}$, respectively, which is better than previously reported DL\napproaches. Our method minimizes the need for labelled data by only\nnecessitating a fraction of labelled cones, which is especially beneficial in\nthe field of ophthalmology, where labelled data can often be limited.\n","authors":["Mikhail Kulyabin","Aline Sindel","Hilde Pedersen","Stuart Gilson","Rigmor Baraas","Andreas Maier"],"pdf_url":"https://arxiv.org/pdf/2408.14810v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.19271v2","updated":"2024-08-27T06:34:00Z","published":"2024-07-27T14:45:34Z","title":"Sewer Image Super-Resolution with Depth Priors and Its Lightweight\n Network","summary":" The Quick-view (QV) technique serves as a primary method for detecting\ndefects within sewerage systems. However, the effectiveness of QV is impeded by\nthe limited visual range of its hardware, resulting in suboptimal image quality\nfor distant portions of the sewer network. Image super-resolution is an\neffective way to improve image quality and has been applied in a variety of\nscenes. However, research on super-resolution for sewer images remains\nconsiderably unexplored. In response, this study leverages the inherent depth\nrelationships present within QV images and introduces a novel Depth-guided,\nReference-based Super-Resolution framework denoted as DSRNet. It comprises two\ncore components: a depth extraction module and a depth information matching\nmodule (DMM). DSRNet utilizes the adjacent frames of the low-resolution image\nas reference images and helps them recover texture information based on the\ncorrelation. By combining these modules, the integration of depth priors\nsignificantly enhances both visual quality and performance benchmarks. Besides,\nin pursuit of computational efficiency and compactness, a super-resolution\nknowledge distillation model based on an attention mechanism is introduced.\nThis mechanism facilitates the acquisition of feature similarity between a more\ncomplex teacher model and a streamlined student model, with the latter being a\nlightweight version of DSRNet. Experimental results demonstrate that DSRNet\nsignificantly improves PSNR and SSIM compared with other methods. 
This study\nalso conducts experiments on sewer defect semantic segmentation, object\ndetection, and classification on the Pipe dataset and Sewer-ML dataset.\nExperiments show that the method can improve the performance of low-resolution\nsewer images in these tasks.\n","authors":["Gang Pan","Chen Wang","Zhijie Sui","Shuai Guo","Yaozhi Lv","Honglie Li","Di Sun","Zixia Xia"],"pdf_url":"https://arxiv.org/pdf/2407.19271v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14805v1","updated":"2024-08-27T06:24:51Z","published":"2024-08-27T06:24:51Z","title":"Platypus: A Generalized Specialist Model for Reading Text in Various\n Forms","summary":" Reading text from images (either natural scenes or documents) has been a\nlong-standing research topic for decades, due to the high technical challenge\nand wide application range. Previously, individual specialist models are\ndeveloped to tackle the sub-tasks of text reading (e.g., scene text\nrecognition, handwritten text recognition and mathematical expression\nrecognition). However, such specialist models usually cannot effectively\ngeneralize across different sub-tasks. Recently, generalist models (such as\nGPT-4V), trained on tremendous data in a unified way, have shown enormous\npotential in reading text in various scenarios, but with the drawbacks of\nlimited accuracy and low efficiency. In this work, we propose Platypus, a\ngeneralized specialist model for text reading. Specifically, Platypus combines\nthe best of both worlds: being able to recognize text of various forms with a\nsingle unified architecture, while achieving excellent accuracy and high\nefficiency. To better exploit the advantage of Platypus, we also construct a\ntext reading dataset (called Worms), the images of which are curated from\nprevious datasets and partially re-labeled. Experiments on standard benchmarks\ndemonstrate the effectiveness and superiority of the proposed Platypus model.\nModel and data will be made publicly available at\nhttps://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/Platypus.\n","authors":["Peng Wang","Zhaohai Li","Jun Tang","Humen Zhong","Fei Huang","Zhibo Yang","Cong Yao"],"pdf_url":"https://arxiv.org/pdf/2408.14805v1.pdf","comment":"Accepted by ECCV2024"},{"id":"http://arxiv.org/abs/2408.14802v1","updated":"2024-08-27T06:14:54Z","published":"2024-08-27T06:14:54Z","title":"RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images","summary":" sRGB images are now the predominant choice for pre-training visual models in\ncomputer vision research, owing to their ease of acquisition and efficient\nstorage. Meanwhile, the advantage of RAW images lies in their rich physical\ninformation under variable real-world challenging lighting conditions. For\ncomputer vision tasks directly based on camera RAW data, most existing studies\nadopt methods of integrating image signal processor (ISP) with backend\nnetworks, yet often overlook the interaction capabilities between the ISP\nstages and subsequent networks. Drawing inspiration from ongoing adapter\nresearch in NLP and CV areas, we introduce RAW-Adapter, a novel approach aimed\nat adapting sRGB pre-trained models to camera RAW data. RAW-Adapter comprises\ninput-level adapters that employ learnable ISP stages to adjust RAW inputs, as\nwell as model-level adapters to build connections between ISP stages and\nsubsequent high-level networks. Additionally, RAW-Adapter is a general\nframework that could be used in various computer vision frameworks. 
Abundant\nexperiments under different lighting conditions have shown our algorithm's\nstate-of-the-art (SOTA) performance, demonstrating its effectiveness and\nefficiency across a range of real-world and synthetic datasets.\n","authors":["Ziteng Cui","Tatsuya Harada"],"pdf_url":"https://arxiv.org/pdf/2408.14802v1.pdf","comment":"ECCV 2024, code link: https://github.com/cuiziteng/ECCV_RAW_Adapter"},{"id":"http://arxiv.org/abs/2408.14131v2","updated":"2024-08-27T05:54:42Z","published":"2024-08-26T09:26:08Z","title":"GenFormer -- Generated Images are All You Need to Improve Robustness of\n Transformers on Small Datasets","summary":" Recent studies showcase the competitive accuracy of Vision Transformers\n(ViTs) in relation to Convolutional Neural Networks (CNNs), along with their\nremarkable robustness. However, ViTs demand a large amount of data to achieve\nadequate performance, which makes their application to small datasets\nchallenging, falling behind CNNs. To overcome this, we propose GenFormer, a\ndata augmentation strategy utilizing generated images, thereby improving\ntransformer accuracy and robustness on small-scale image classification tasks.\nIn our comprehensive evaluation we propose Tiny ImageNetV2, -R, and -A as new\ntest set variants of Tiny ImageNet by transferring established ImageNet\ngeneralization and robustness benchmarks to the small-scale data domain.\nSimilarly, we introduce MedMNIST-C and EuroSAT-C as corrupted test set variants\nof established fine-grained datasets in the medical and aerial domain. Through\na series of experiments conducted on small datasets of various domains,\nincluding Tiny ImageNet, CIFAR, EuroSAT and MedMNIST datasets, we demonstrate\nthe synergistic power of our method, in particular when combined with common\ntrain and test time augmentations, knowledge distillation, and architectural\ndesign choices. Additionally, we prove the effectiveness of our approach under\nchallenging conditions with limited training data, demonstrating significant\nimprovements in both accuracy and robustness, bridging the gap between CNNs and\nViTs in the small-scale dataset domain.\n","authors":["Sven Oehri","Nikolas Ebert","Ahmed Abdullah","Didier Stricker","Oliver Wasenmüller"],"pdf_url":"https://arxiv.org/pdf/2408.14131v2.pdf","comment":"This paper has been accepted at International Conference on Pattern\n Recognition (ICPR), 2024"},{"id":"http://arxiv.org/abs/2406.18459v5","updated":"2024-08-27T05:46:06Z","published":"2024-06-26T16:10:31Z","title":"DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis\n through Structure Guidance","summary":" Large-scale generative models, such as text-to-image diffusion models, have\ngarnered widespread attention across diverse domains due to their creative and\nhigh-fidelity image generation. Nonetheless, existing large-scale diffusion\nmodels are confined to generating images of up to 1K resolution, which is far\nfrom meeting the demands of contemporary commercial applications. Directly\nsampling higher-resolution images often yields results marred by artifacts such\nas object repetition and distorted shapes. Addressing the aforementioned issues\ntypically necessitates training or fine-tuning models on higher-resolution\ndatasets. However, this poses a formidable challenge due to the difficulty in\ncollecting large-scale high-resolution images and substantial computational\nresources. 
While several preceding works have proposed alternatives to bypass\nthe cumbersome training process, they often fail to produce convincing results.\nIn this work, we probe the generative ability of diffusion models at higher\nresolution beyond their original capability and propose a novel progressive\napproach that fully utilizes generated low-resolution images to guide the\ngeneration of higher-resolution images. Our method obviates the need for\nadditional training or fine-tuning which significantly lowers the burden of\ncomputational costs. Extensive experiments and results validate the efficiency\nand efficacy of our method. Project page:\nhttps://yhyun225.github.io/DiffuseHigh/\n","authors":["Younghyun Kim","Geunmin Hwang","Junyu Zhang","Eunbyung Park"],"pdf_url":"https://arxiv.org/pdf/2406.18459v5.pdf","comment":"Project page: https://yhyun225.github.io/DiffuseHigh/"},{"id":"http://arxiv.org/abs/2408.14789v1","updated":"2024-08-27T05:31:30Z","published":"2024-08-27T05:31:30Z","title":"Revisiting Surgical Instrument Segmentation Without Human Intervention:\n A Graph Partitioning View","summary":" Surgical instrument segmentation (SIS) on endoscopic images stands as a\nlong-standing and essential task in the context of computer-assisted\ninterventions for boosting minimally invasive surgery. Given the recent surge\nof deep learning methodologies and their data-hungry nature, training a neural\npredictive model based on massive expert-curated annotations has been\ndominating and served as an off-the-shelf approach in the field, which could,\nhowever, impose prohibitive burden to clinicians for preparing fine-grained\npixel-wise labels corresponding to the collected surgical video frames. In this\nwork, we propose an unsupervised method by reframing the video frame\nsegmentation as a graph partitioning problem and regarding image pixels as\ngraph nodes, which is significantly different from the previous efforts. A\nself-supervised pre-trained model is firstly leveraged as a feature extractor\nto capture high-level semantic features. Then, Laplacian matrixs are computed\nfrom the features and are eigendecomposed for graph partitioning. On the \"deep\"\neigenvectors, a surgical video frame is meaningfully segmented into different\nmodules such as tools and tissues, providing distinguishable semantic\ninformation like locations, classes, and relations. The segmentation problem\ncan then be naturally tackled by applying clustering or threshold on the\neigenvectors. Extensive experiments are conducted on various datasets (e.g.,\nEndoVis2017, EndoVis2018, UCL, etc.) for different clinical endpoints. Across\nall the challenging scenarios, our method demonstrates outstanding performance\nand robustness higher than unsupervised state-of-the-art (SOTA) methods. The\ncode is released at https://github.com/MingyuShengSMY/GraphClusteringSIS.git.\n","authors":["Mingyu Sheng","Jianan Fan","Dongnan Liu","Ron Kikinis","Weidong Cai"],"pdf_url":"https://arxiv.org/pdf/2408.14789v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08345v2","updated":"2024-08-27T05:08:00Z","published":"2024-08-15T17:58:10Z","title":"5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual\n Recognition Tasks","summary":" Pre-training & fine-tuning can enhance the transferring efficiency and\nperformance in visual tasks. Recent delta-tuning methods provide more options\nfor visual classification tasks. 
Despite their success, existing visual\ndelta-tuning art fails to exceed the upper limit of full fine-tuning on\nchallenging tasks like object detection and segmentation. To find a competitive\nalternative to full fine-tuning, we propose the Multi-cognitive Visual Adapter\n(Mona) tuning, a novel adapter-based tuning method. First, we introduce\nmultiple vision-friendly filters into the adapter to enhance its ability to\nprocess visual signals, while previous methods mainly rely on language-friendly\nlinear filters. Second, we add the scaled normalization layer in the adapter to\nregulate the distribution of input features for visual filters. To fully\ndemonstrate the practicality and generality of Mona, we conduct experiments on\nmultiple representative visual tasks, including instance segmentation on COCO,\nsemantic segmentation on ADE20K, object detection on Pascal VOC, oriented\nobject detection on DOTA/STAR, and image classification on three common\ndatasets. Exciting results illustrate that Mona surpasses full fine-tuning on\nall these tasks, and is the only delta-tuning method outperforming full\nfine-tuning on the above various tasks. For example, Mona achieves 1%\nperformance gain on the COCO dataset compared to full fine-tuning.\nComprehensive results suggest that Mona-tuning is more suitable for retaining\nand utilizing the capabilities of pre-trained models than full fine-tuning. The\ncode will be released at https://github.com/Leiyi-Hu/mona.\n","authors":["Dongshuo Yin","Leiyi Hu","Bin Li","Youqun Zhang","Xue Yang"],"pdf_url":"https://arxiv.org/pdf/2408.08345v2.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2311.15010"},{"id":"http://arxiv.org/abs/2408.14176v2","updated":"2024-08-27T04:59:58Z","published":"2024-08-26T10:42:53Z","title":"SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its\n Teacher","summary":" In this paper, we aim to enhance the performance of SwiftBrush, a prominent\none-step text-to-image diffusion model, to be competitive with its multi-step\nStable Diffusion counterpart. Initially, we explore the quality-diversity\ntrade-off between SwiftBrush and SD Turbo: the former excels in image\ndiversity, while the latter excels in image quality. This observation motivates\nour proposed modifications in the training methodology, including better weight\ninitialization and efficient LoRA training. Moreover, our introduction of a\nnovel clamped CLIP loss enhances image-text alignment and results in improved\nimage quality. Remarkably, by combining the weights of models trained with\nefficient LoRA and full training, we achieve a new state-of-the-art one-step\ndiffusion model, achieving an FID of 8.14 and surpassing all GAN-based and\nmulti-step Stable Diffusion models. The project page is available at\nhttps://swiftbrushv2.github.io.\n","authors":["Trung Dao","Thuan Hoang Nguyen","Thanh Le","Duc Vu","Khoi Nguyen","Cuong Pham","Anh Tran"],"pdf_url":"https://arxiv.org/pdf/2408.14176v2.pdf","comment":"Accepted to ECCV'24"},{"id":"http://arxiv.org/abs/2408.14776v1","updated":"2024-08-27T04:45:53Z","published":"2024-08-27T04:45:53Z","title":"MROVSeg: Breaking the Resolution Curse of Vision-Language Models in\n Open-Vocabulary Semantic Segmentation","summary":" Open-vocabulary semantic segmentation aims to segment and recognize\nsemantically meaningful regions based on text-based descriptions during\ninference. 
A typical solution to address this task is to leverage powerful\nvision-language models (VLMs), such as CLIP, to bridge the gap between open-\nand close-vocabulary recognition. As VLMs are usually pretrained with\nlow-resolution images (e.g. $224\\times224$), most previous methods operate only\non downscaled images. We question this design as low resolution features often\nfail to preserve fine details. Although employing additional image backbones\nfor high-resolution inputs can mitigate this issue, it may also introduce\nsignificant computation overhead. Therefore, we propose MROVSeg, a\nmulti-resolution training framework for open-vocabulary semantic segmentation\nwith a single pretrained CLIP backbone, that uses sliding windows to slice the\nhigh-resolution input into uniform patches, each matching the input size of the\nwell-trained image encoder. Its key components include a Multi-Res Adapter,\nwhich restores the spatial geometry and grasps local-global correspondences\nacross patches by learnable convolutional and scale attention layers. To\nachieve accurate segmentation, we introduce Multi-grained Masked Attention\nscheme to aggregate multi-grained semantics by performing cross-attention\nbetween object queries and multi-resolution CLIP features within the region of\ninterests. Through comprehensive experiments, we demonstrate the superiority of\nMROVSeg on well-established open-vocabulary semantic segmentation benchmarks,\nparticularly for high-resolution inputs, establishing new standards for\nopen-vocabulary semantic segmentation.\n","authors":["Yuanbing Zhu","Bingke Zhu","Zhen Chen","Huan Xu","Ming Tang","Jinqiao Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14776v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2407.15773v2","updated":"2024-08-27T04:41:40Z","published":"2024-07-22T16:25:41Z","title":"STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay","summary":" Test-time adaptation (TTA) aims to address the distribution shift between the\ntraining and test data with only unlabeled data at test time. Existing TTA\nmethods often focus on improving recognition performance specifically for test\ndata associated with classes in the training set. However, during the\nopen-world inference process, there are inevitably test data instances from\nunknown classes, commonly referred to as outliers. This paper pays attention to\nthe problem that conducts both sample recognition and outlier rejection during\ninference while outliers exist. To address this problem, we propose a new\napproach called STAble Memory rePlay (STAMP), which performs optimization over\na stable memory bank instead of the risky mini-batch. In particular, the memory\nbank is dynamically updated by selecting low-entropy and label-consistent\nsamples in a class-balanced manner. In addition, we develop a self-weighted\nentropy minimization strategy that assigns higher weight to low-entropy\nsamples. Extensive results demonstrate that STAMP outperforms existing TTA\nmethods in terms of both recognition and outlier detection performance. 
The\ncode is released at https://github.com/yuyongcan/STAMP.\n","authors":["Yongcan Yu","Lijun Sheng","Ran He","Jian Liang"],"pdf_url":"https://arxiv.org/pdf/2407.15773v2.pdf","comment":"Accepted by ECCV 2024; Fixed a bug in calculating OOD score of STAMP\n and updated the results"},{"id":"http://arxiv.org/abs/2408.14770v1","updated":"2024-08-27T04:18:18Z","published":"2024-08-27T04:18:18Z","title":"Text-guided Foundation Model Adaptation for Long-Tailed Medical Image\n Classification","summary":" In medical contexts, the imbalanced data distribution in long-tailed\ndatasets, due to scarce labels for rare diseases, greatly impairs the\ndiagnostic accuracy of deep learning models. Recent multimodal text-image\nsupervised foundation models offer new solutions to data scarcity through\neffective representation learning. However, their limited medical-specific\npretraining hinders their performance in medical image classification relative\nto natural images. To address this issue, we propose a novel Text-guided\nFoundation model Adaptation for Long-Tailed medical image classification\n(TFA-LT). We adopt a two-stage training strategy, integrating representations\nfrom the foundation model using just two linear adapters and a single ensembler\nfor balanced outcomes. Experimental results on two long-tailed medical image\ndatasets validate the simplicity, lightweight and efficiency of our approach:\nrequiring only 6.1% GPU memory usage of the current best-performing algorithm,\nour method achieves an accuracy improvement of up to 27.1%, highlighting the\nsubstantial potential of foundation model adaptation in this area.\n","authors":["Sirui Li","Li Lin","Yijin Huang","Pujin Cheng","Xiaoying Tang"],"pdf_url":"https://arxiv.org/pdf/2408.14770v1.pdf","comment":"Accepted by IEEE ISBI 2024"},{"id":"http://arxiv.org/abs/2408.14080v2","updated":"2024-08-27T04:14:14Z","published":"2024-08-26T08:02:57Z","title":"SONICS: Synthetic Or Not -- Identifying Counterfeit Songs","summary":" The recent surge in AI-generated songs presents exciting possibilities and\nchallenges. While these tools democratize music creation, they also necessitate\nthe ability to distinguish between human-composed and AI-generated songs for\nsafeguarding artistic integrity and content curation. Existing research and\ndatasets in fake song detection only focus on singing voice deepfake detection\n(SVDD), where the vocals are AI-generated but the instrumental music is sourced\nfrom real songs. However, this approach is inadequate for contemporary\nend-to-end AI-generated songs where all components (vocals, lyrics, music, and\nstyle) could be AI-generated. Additionally, existing datasets lack lyrics-music\ndiversity, long-duration songs, and open fake songs. To address these gaps, we\nintroduce SONICS, a novel dataset for end-to-end Synthetic Song Detection\n(SSD), comprising over 97k songs with over 49k synthetic songs from popular\nplatforms like Suno and Udio. Furthermore, we highlight the importance of\nmodeling long-range temporal dependencies in songs for effective authenticity\ndetection, an aspect overlooked in existing methods. To capture these patterns,\nwe propose a novel model, SpecTTTra, that is up to 3 times faster and 6 times\nmore memory efficient compared to popular CNN and Transformer-based models\nwhile maintaining competitive performance. 
Finally, we offer both AI-based and\nHuman evaluation benchmarks, addressing another deficiency in current research.\n","authors":["Md Awsafur Rahman","Zaber Ibn Abdul Hakim","Najibul Haque Sarker","Bishmoy Paul","Shaikh Anowarul Fattah"],"pdf_url":"https://arxiv.org/pdf/2408.14080v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17137v4","updated":"2024-08-27T04:02:58Z","published":"2024-05-27T12:54:09Z","title":"Jump-teaching: Ultra Efficient and Robust Learning with Noisy Label","summary":" Sample selection is the most straightforward technique to combat label noise,\naiming to distinguish mislabeled samples during training and avoid the\ndegradation of the robustness of the model. In the workflow, $\\textit{selecting\npossibly clean data}$ and $\\textit{model update}$ are iterative. However, their\ninterplay and intrinsic characteristics hinder the robustness and efficiency of\nlearning with noisy labels: 1) The model chooses clean data with selection\nbias, leading to the accumulated error in the model update. 2) Most selection\nstrategies leverage partner networks or supplementary information to mitigate\nlabel corruption, albeit with increased computation resources and lower\nthroughput speed. Therefore, we employ only one network with the jump manner\nupdate to decouple the interplay and mine more semantic information from the\nloss for a more precise selection. Specifically, the selection of clean data\nfor each model update is based on one of the prior models, excluding the last\niteration. The strategy of model update exhibits a jump behavior in the form.\nMoreover, we map the outputs of the network and labels into the same semantic\nfeature space, respectively. In this space, a detailed and simple loss\ndistribution is generated to distinguish clean samples more effectively. Our\nproposed approach achieves almost up to $2.53\\times$ speedup, $0.46\\times$ peak\nmemory footprint, and superior robustness over state-of-the-art works with\nvarious noise settings.\n","authors":["Kangye Ji","Fei Cheng","Zeqing Wang","Bohu Huang"],"pdf_url":"https://arxiv.org/pdf/2405.17137v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19730v4","updated":"2024-08-27T03:45:18Z","published":"2024-05-30T06:21:34Z","title":"Research on the Spatial Data Intelligent Foundation Model","summary":" This report focuses on spatial data intelligent large models, delving into\nthe principles, methods, and cutting-edge applications of these models. It\nprovides an in-depth discussion on the definition, development history, current\nstatus, and trends of spatial data intelligent large models, as well as the\nchallenges they face. The report systematically elucidates the key technologies\nof spatial data intelligent large models and their applications in urban\nenvironments, aerospace remote sensing, geography, transportation, and other\nscenarios. 
Additionally, it summarizes the latest application cases of spatial\ndata intelligent large models in themes such as urban development, multimodal\nsystems, remote sensing, smart transportation, and resource environments.\nFinally, the report concludes with an overview and outlook on the development\nprospects of spatial data intelligent large models.\n","authors":["Shaohua Wang","Xing Xie","Yong Li","Danhuai Guo","Zhi Cai","Yu Liu","Yang Yue","Xiao Pan","Feng Lu","Huayi Wu","Zhipeng Gui","Zhiming Ding","Bolong Zheng","Fuzheng Zhang","Jingyuan Wang","Zhengchao Chen","Hao Lu","Jiayi Li","Peng Yue","Wenhao Yu","Yao Yao","Leilei Sun","Yong Zhang","Longbiao Chen","Xiaoping Du","Xiang Li","Xueying Zhang","Kun Qin","Zhaoya Gong","Weihua Dong","Xiaofeng Meng"],"pdf_url":"https://arxiv.org/pdf/2405.19730v4.pdf","comment":"V1 and V2 are in Chinese language, other versions are in English"},{"id":"http://arxiv.org/abs/2408.14765v1","updated":"2024-08-27T03:41:44Z","published":"2024-08-27T03:41:44Z","title":"CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View\n Synthesis","summary":" Satellite-to-street view synthesis aims at generating a realistic street-view\nimage from its corresponding satellite-view image. Although stable diffusion\nmodels have exhibit remarkable performance in a variety of image generation\napplications, their reliance on similar-view inputs to control the generated\nstructure or texture restricts their application to the challenging cross-view\nsynthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion\nmodel for satellite-to-street view synthesis. To address the challenges posed\nby the large discrepancy across views, we design the satellite scene structure\nestimation and cross-view texture mapping modules to construct the structural\nand textural controls for street-view image synthesis. We further design a\ncross-view control guided denoising process that incorporates the above\ncontrols via an enhanced cross-view attention module. To achieve a more\ncomprehensive evaluation of the synthesis results, we additionally design a\nGPT-based scoring method as a supplement to standard evaluation metrics. We\nalso explore the effect of different data sources (e.g., text, maps, building\nheights, and multi-temporal satellite imagery) on this task. Results on three\npublic cross-view datasets show that CrossViewDiff outperforms current\nstate-of-the-art on both standard and GPT-based evaluation metrics, generating\nhigh-quality street-view panoramas with more realistic structures and textures\nacross rural, suburban, and urban scenes. The code and models of this work will\nbe released at https://opendatalab.github.io/CrossViewDiff/.\n","authors":["Weijia Li","Jun He","Junyan Ye","Huaping Zhong","Zhimeng Zheng","Zilong Huang","Dahua Lin","Conghui He"],"pdf_url":"https://arxiv.org/pdf/2408.14765v1.pdf","comment":"21 pages, 11 figures"},{"id":"http://arxiv.org/abs/2312.04822v2","updated":"2024-08-27T03:33:51Z","published":"2023-12-08T04:12:26Z","title":"SiCP: Simultaneous Individual and Cooperative Perception for 3D Object\n Detection in Connected and Automated Vehicles","summary":" Cooperative perception for connected and automated vehicles is traditionally\nachieved through the fusion of feature maps from two or more vehicles. However,\nthe absence of feature maps shared from other vehicles can lead to a\nsignificant decline in 3D object detection performance for cooperative\nperception models compared to standalone 3D detection models. 
This drawback\nimpedes the adoption of cooperative perception as vehicle resources are often\ninsufficient to concurrently employ two perception models. To tackle this\nissue, we present Simultaneous Individual and Cooperative Perception (SiCP), a\ngeneric framework that supports a wide range of the state-of-the-art standalone\nperception backbones and enhances them with a novel Dual-Perception Network\n(DP-Net) designed to facilitate both individual and cooperative perception. In\naddition to its lightweight nature with only 0.13M parameters, DP-Net is robust\nand retains crucial gradient information during feature map fusion. As\ndemonstrated in a comprehensive evaluation on the V2V4Real and OPV2V datasets,\nthanks to DP-Net, SiCP surpasses state-of-the-art cooperative perception\nsolutions while preserving the performance of standalone perception solutions.\n","authors":["Deyuan Qu","Qi Chen","Tianyu Bai","Hongsheng Lu","Heng Fan","Hao Zhang","Song Fu","Qing Yang"],"pdf_url":"https://arxiv.org/pdf/2312.04822v2.pdf","comment":"Accepted by IROS 2024"},{"id":"http://arxiv.org/abs/2408.14764v1","updated":"2024-08-27T03:31:24Z","published":"2024-08-27T03:31:24Z","title":"SynthDoc: Bilingual Documents Synthesis for Visual Document\n Understanding","summary":" This paper introduces SynthDoc, a novel synthetic document generation\npipeline designed to enhance Visual Document Understanding (VDU) by generating\nhigh-quality, diverse datasets that include text, images, tables, and charts.\nAddressing the challenges of data acquisition and the limitations of existing\ndatasets, SynthDoc leverages publicly available corpora and advanced rendering\ntools to create a comprehensive and versatile dataset. Our experiments,\nconducted using the Donut model, demonstrate that models trained with\nSynthDoc's data achieve superior performance in pre-training read tasks and\nmaintain robustness in downstream tasks, despite language inconsistencies. The\nrelease of a benchmark dataset comprising 5,000 image-text pairs not only\nshowcases the pipeline's capabilities but also provides a valuable resource for\nthe VDU community to advance research and development in document image\nrecognition. This work significantly contributes to the field by offering a\nscalable solution to data scarcity and by validating the efficacy of end-to-end\nmodels in parsing complex, real-world documents.\n","authors":["Chuanghao Ding","Xuejing Liu","Wei Tang","Juan Li","Xiaoliang Wang","Rui Zhao","Cam-Tu Nguyen","Fei Tan"],"pdf_url":"https://arxiv.org/pdf/2408.14764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14757v1","updated":"2024-08-27T03:17:52Z","published":"2024-08-27T03:17:52Z","title":"Learning effective pruning at initialization from iterative pruning","summary":" Pruning at initialization (PaI) reduces training costs by removing weights\nbefore training, which becomes increasingly crucial with the growing network\nsize. However, current PaI methods still have a large accuracy gap with\niterative pruning, especially at high sparsity levels. This raises an\nintriguing question: can we get inspiration from iterative pruning to improve\nthe PaI performance? In the lottery ticket hypothesis, the iterative rewind\npruning (IRP) finds subnetworks retroactively by rewinding the parameter to the\noriginal initialization in every pruning iteration, which means all the\nsubnetworks are based on the initial state. 
Here, we hypothesise the surviving\nsubnetworks are more important and bridge the initial feature and their\nsurviving score as the PaI criterion. We employ an end-to-end neural network\n(\\textbf{AutoS}parse) to learn this correlation, input the model's initial\nfeatures, output their score and then prune the lowest score parameters before\ntraining. To validate the accuracy and generalization of our method, we\nperformed PaI across various models. Results show that our approach outperforms\nexisting methods in high-sparsity settings. Notably, as the underlying logic of\nmodel pruning is consistent in different models, only one-time IRP on one model\nis needed (e.g., once IRP on ResNet-18/CIFAR-10, AutoS can be generalized to\nVGG-16/CIFAR-10, ResNet-18/TinyImageNet, et al.). As the first neural\nnetwork-based PaI method, we conduct extensive experiments to validate the\nfactors influencing this approach. These results reveal the learning tendencies\nof neural networks and provide new insights into our understanding and research\nof PaI from a practical perspective. Our code is available at:\nhttps://github.com/ChengYaofeng/AutoSparse.git.\n","authors":["Shengkai Liu","Yaofeng Cheng","Fusheng Zha","Wei Guo","Lining Sun","Zhenshan Bing","Chenguang Yang"],"pdf_url":"https://arxiv.org/pdf/2408.14757v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14754v1","updated":"2024-08-27T03:09:39Z","published":"2024-08-27T03:09:39Z","title":"Sequential-Scanning Dual-Energy CT Imaging Using High Temporal\n Resolution Image Reconstruction and Error-Compensated Material Basis Image\n Generation","summary":" Dual-energy computed tomography (DECT) has been widely used to obtain\nquantitative elemental composition of imaged subjects for personalized and\nprecise medical diagnosis. Compared with DECT leveraging advanced X-ray source\nand/or detector technologies, the use of the sequential-scanning data\nacquisition scheme to implement DECT may make a broader impact on clinical\npractice because this scheme requires no specialized hardware designs and can\nbe directly implemented into conventional CT systems. However, since the\nconcentration of iodinated contrast agent in the imaged subject varies over\ntime, sequentially scanned data sets acquired at two tube potentials are\ntemporally inconsistent. As existing material basis image reconstruction\napproaches assume that the data sets acquired at two tube potentials are\ntemporally consistent, the violation of this assumption results in inaccurate\nquantification of material concentration. In this work, we developed\nsequential-scanning DECT imaging using high temporal resolution image\nreconstruction and error-compensated material basis image generation,\nACCELERATION in short, to address the technical challenge induced by temporal\ninconsistency of sequentially scanned data sets and improve quantification\naccuracy of material concentration in sequential-scanning DECT. 
ACCELERATION\nhas been validated and evaluated using numerical simulation data sets generated\nfrom clinical human subject exams and experimental human subject studies.\nResults demonstrated the improvement of quantification accuracy and image\nquality using ACCELERATION.\n","authors":["Qiaoxin Li","Ruifeng Chen","Peng Wang","Guotao Quan","Yanfeng Du","Dong Liang","Yinsheng Li"],"pdf_url":"https://arxiv.org/pdf/2408.14754v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12340v2","updated":"2024-08-27T02:53:37Z","published":"2024-08-22T12:36:10Z","title":"VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand\n Priors Embedding","summary":" Although diffusion-based image virtual try-on has made considerable progress,\nemerging approaches still struggle to effectively address the issue of hand\nocclusion (i.e., clothing regions occluded by the hand part), leading to a\nnotable degradation of the try-on performance. To tackle this issue widely\nexisting in real-world scenarios, we propose VTON-HandFit, leveraging the power\nof hand priors to reconstruct the appearance and structure for hand occlusion\ncases. Firstly, we tailor a Handpose Aggregation Net using the ControlNet-based\nstructure explicitly and adaptively encoding the global hand and pose priors.\nBesides, to fully exploit the hand-related structure and appearance\ninformation, we propose Hand-feature Disentanglement Embedding module to\ndisentangle the hand priors into the hand structure-parametric and\nvisual-appearance features, and customize a masked cross attention for further\ndecoupled feature embedding. Lastly, we customize a hand-canny constraint loss\nto better learn the structure edge knowledge from the hand template of model\nimage. VTON-HandFit outperforms the baselines in qualitative and quantitative\nevaluations on the public dataset and our self-collected hand-occlusion\nHandfit-3K dataset particularly for the arbitrary hand pose occlusion cases in\nreal-world scenarios. The Code and dataset will be available at\n\\url{https://github.com/VTON-HandFit/VTON-HandFit}.\n","authors":["Yujie Liang","Xiaobin Hu","Boyuan Jiang","Donghao Luo","Kai WU","Wenhui Han","Taisong Jin","Chengjie Wang"],"pdf_url":"https://arxiv.org/pdf/2408.12340v2.pdf","comment":"The project page is \\url{https://vton-handfit.github.io}"},{"id":"http://arxiv.org/abs/2305.10662v2","updated":"2024-08-27T02:46:38Z","published":"2023-05-18T02:51:17Z","title":"Private Gradient Estimation is Useful for Generative Modeling","summary":" While generative models have proved successful in many domains, they may pose\na privacy leakage risk in practical deployment. To address this issue,\ndifferentially private generative model learning has emerged as a solution to\ntrain private generative models for different downstream tasks. However,\nexisting private generative modeling approaches face significant challenges in\ngenerating high-dimensional data due to the inherent complexity involved in\nmodeling such data. In this work, we present a new private generative modeling\napproach where samples are generated via Hamiltonian dynamics with gradients of\nthe private dataset estimated by a well-trained network. In the approach, we\nachieve differential privacy by perturbing the projection vectors in the\nestimation of gradients with sliced score matching. In addition, we enhance the\nreconstruction ability of the model by incorporating a residual enhancement\nmodule during the score matching. 
For sampling, we perform Hamiltonian dynamics\nwith gradients estimated by the well-trained network, allowing the sampled data\nclose to the private dataset's manifold step by step. In this way, our model is\nable to generate data with a resolution of 256x256. Extensive experiments and\nanalysis clearly demonstrate the effectiveness and rationality of the proposed\napproach.\n","authors":["Bochao Liu","Pengju Wang","Weijia Guo","Yong Li","Liansheng Zhuang","Weiping Wang","Shiming Ge"],"pdf_url":"https://arxiv.org/pdf/2305.10662v2.pdf","comment":"accepted by ACM MM 2024 Oral"},{"id":"http://arxiv.org/abs/2408.14744v1","updated":"2024-08-27T02:45:26Z","published":"2024-08-27T02:45:26Z","title":"RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with\n Rich Linguistic Semantics from Openly Available Data and Large Language\n Models","summary":" Abundant, well-annotated multimodal data in remote sensing are pivotal for\naligning complex visual remote sensing (RS) scenes with human language,\nenabling the development of specialized vision language models across diverse\nRS interpretation tasks. However, annotating RS images with rich linguistic\nsemantics at scale demands expertise in RS and substantial human labor, making\nit costly and often impractical. In this study, we propose a workflow that\nleverages large language models (LLMs) to generate multimodal datasets with\nsemantically rich captions at scale from plain OpenStreetMap (OSM) data for\nimages sourced from the Google Earth Engine (GEE) platform. This approach\nfacilitates the generation of paired remote sensing data and can be readily\nscaled up using openly available data. Within this framework, we present\nRSTeller, a multimodal dataset comprising over 1 million RS images, each\naccompanied by multiple descriptive captions. Extensive experiments demonstrate\nthat RSTeller enhances the performance of multiple existing vision language\nmodels for RS scene understanding through continual pre-training. Our\nmethodology significantly reduces the manual effort and expertise needed for\nannotating remote sensing imagery while democratizing access to high-quality\nannotated data. This advancement fosters progress in visual language modeling\nand encourages broader participation in remote sensing research and\napplications. The RSTeller dataset is available at\nhttps://github.com/SlytherinGe/RSTeller.\n","authors":["Junyao Ge","Yang Zheng","Kaitai Guo","Jimin Liang"],"pdf_url":"https://arxiv.org/pdf/2408.14744v1.pdf","comment":"Submitted to ISPRS"},{"id":"http://arxiv.org/abs/2408.14743v1","updated":"2024-08-27T02:43:40Z","published":"2024-08-27T02:43:40Z","title":"Personalized Video Summarization using Text-Based Queries and\n Conditional Modeling","summary":" The proliferation of video content on platforms like YouTube and Vimeo\npresents significant challenges in efficiently locating relevant information.\nAutomatic video summarization aims to address this by extracting and presenting\nkey content in a condensed form. This thesis explores enhancing video\nsummarization by integrating text-based queries and conditional modeling to\ntailor summaries to user needs. Traditional methods often produce fixed\nsummaries that may not align with individual requirements. To overcome this, we\npropose a multi-modal deep learning approach that incorporates both textual\nqueries and visual information, fusing them at different levels of the model\narchitecture. 
Evaluation metrics such as accuracy and F1-score assess the\nquality of the generated summaries. The thesis also investigates improving\ntext-based query representations using contextualized word embeddings and\nspecialized attention networks. This enhances the semantic understanding of\nqueries, leading to better video summaries. To emulate human-like\nsummarization, which accounts for both visual coherence and abstract factors\nlike storyline consistency, we introduce a conditional modeling approach. This\nmethod uses multiple random variables and joint distributions to capture key\nsummarization components, resulting in more human-like and explainable\nsummaries. Addressing data scarcity in fully supervised learning, the thesis\nproposes a segment-level pseudo-labeling approach. This self-supervised method\ngenerates additional data, improving model performance even with limited\nhuman-labeled datasets. In summary, this research aims to enhance automatic\nvideo summarization by incorporating text-based queries, improving query\nrepresentations, introducing conditional modeling, and addressing data\nscarcity, thereby creating more effective and personalized video summaries.\n","authors":["Jia-Hong Huang"],"pdf_url":"https://arxiv.org/pdf/2408.14743v1.pdf","comment":"Ph.D. thesis, 137 pages"},{"id":"http://arxiv.org/abs/2408.12569v3","updated":"2024-08-27T02:31:42Z","published":"2024-08-22T17:37:27Z","title":"Sapiens: Foundation for Human Vision Models","summary":" We present Sapiens, a family of models for four fundamental human-centric\nvision tasks -- 2D pose estimation, body-part segmentation, depth estimation,\nand surface normal prediction. Our models natively support 1K high-resolution\ninference and are extremely easy to adapt for individual tasks by simply\nfine-tuning models pretrained on over 300 million in-the-wild human images. We\nobserve that, given the same computational budget, self-supervised pretraining\non a curated dataset of human images significantly boosts the performance for a\ndiverse set of human-centric tasks. The resulting models exhibit remarkable\ngeneralization to in-the-wild data, even when labeled data is scarce or\nentirely synthetic. Our simple model design also brings scalability -- model\nperformance across tasks improves as we scale the number of parameters from 0.3\nto 2 billion. Sapiens consistently surpasses existing baselines across various\nhuman-centric benchmarks. We achieve significant improvements over the prior\nstate-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1\nmIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5%\nrelative angular error. Project page:\nhttps://about.meta.com/realitylabs/codecavatars/sapiens.\n","authors":["Rawal Khirodkar","Timur Bagautdinov","Julieta Martinez","Su Zhaoen","Austin James","Peter Selednik","Stuart Anderson","Shunsuke Saito"],"pdf_url":"https://arxiv.org/pdf/2408.12569v3.pdf","comment":"ECCV 2024 (Oral)"},{"id":"http://arxiv.org/abs/2408.13800v2","updated":"2024-08-27T02:30:47Z","published":"2024-08-25T10:42:07Z","title":"BCDNet: A Convolutional Neural Network For Breast Cancer Detection","summary":" Previous research has established that breast cancer is a prevalent cancer\ntype, with Invasive Ductal Carcinoma (IDC) being the most common subtype. The\nincidence of this dangerous cancer continues to rise, making accurate and rapid\ndiagnosis, particularly in the early stages, critically important. 
While modern\nComputer-Aided Diagnosis (CAD) systems can address most cases, medical\nprofessionals still face challenges in using them in the field without powerful\ncomputing resources. In this paper, we propose a novel CNN model called BCDNet,\nwhich effectively detects IDC in histopathological images with an accuracy of\nup to 89.5% and reduces training time effectively.\n","authors":["Yujia Lin","Aiwei Lian","Mingyu Liao","Yipeng Liu"],"pdf_url":"https://arxiv.org/pdf/2408.13800v2.pdf","comment":"5 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.14738v1","updated":"2024-08-27T02:29:29Z","published":"2024-08-27T02:29:29Z","title":"Learning Differentially Private Diffusion Models via Stochastic\n Adversarial Distillation","summary":" While the success of deep learning relies on large amounts of training\ndatasets, data is often limited in privacy-sensitive domains. To address this\nchallenge, generative model learning with differential privacy has emerged as a\nsolution to train private generative models for desensitized data generation.\nHowever, the quality of the images generated by existing methods is limited due\nto the complexity of modeling data distribution. We build on the success of\ndiffusion models and introduce DP-SAD, which trains a private diffusion model\nby a stochastic adversarial distillation method. Specifically, we first train a\ndiffusion model as a teacher and then train a student by distillation, in which\nwe achieve differential privacy by adding noise to the gradients from other\nmodels to the student. For better generation quality, we introduce a\ndiscriminator to distinguish whether an image is from the teacher or the\nstudent, which forms the adversarial training. Extensive experiments and\nanalysis clearly demonstrate the effectiveness of our proposed method.\n","authors":["Bochao Liu","Pengju Wang","Shiming Ge"],"pdf_url":"https://arxiv.org/pdf/2408.14738v1.pdf","comment":"accepted by ECCV 2024"},{"id":"http://arxiv.org/abs/2408.13623v2","updated":"2024-08-27T01:59:59Z","published":"2024-08-24T16:33:26Z","title":"Prompt-Softbox-Prompt: A free-text Embedding Control for Image Editing","summary":" Text-driven diffusion models have achieved remarkable success in image\nediting, but a crucial component in these models-text embeddings-has not been\nfully explored. The entanglement and opacity of text embeddings present\nsignificant challenges to achieving precise image editing. In this paper, we\nprovide a comprehensive and in-depth analysis of text embeddings in Stable\nDiffusion XL, offering three key insights. First, while the 'aug_embedding'\ncaptures the full semantic content of the text, its contribution to the final\nimage generation is relatively minor. Second, 'BOS' and 'Padding_embedding' do\nnot contain any semantic information. Lastly, the 'EOS' holds the semantic\ninformation of all words and contains the most style features. Each word\nembedding plays a unique role without interfering with one another. Based on\nthese insights, we propose a novel approach for controllable image editing\nusing a free-text embedding control method called PSP (Prompt-Softbox-Prompt).\nPSP enables precise image editing by inserting or adding text embeddings within\nthe cross-attention layers and using Softbox to define and control the specific\narea for semantic injection. This technique allows for obejct additions and\nreplacements while preserving other areas of the image. Additionally, PSP can\nachieve style transfer by simply replacing text embeddings. 
Extensive\nexperimental results show that PSP achieves significant results in tasks such\nas object replacement, object addition, and style transfer.\n","authors":["Yitong Yang","Yinglin Wang","Jing Wang","Tian Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.13623v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14732v1","updated":"2024-08-27T01:55:40Z","published":"2024-08-27T01:55:40Z","title":"OctFusion: Octree-based Diffusion Models for 3D Shape Generation","summary":" Diffusion models have emerged as a popular method for 3D generation. However,\nit is still challenging for diffusion models to efficiently generate diverse\nand high-quality 3D shapes. In this paper, we introduce OctFusion, which can\ngenerate 3D shapes with arbitrary resolutions in 2.5 seconds on a single Nvidia\n4090 GPU, and the extracted meshes are guaranteed to be continuous and\nmanifold. The key components of OctFusion are the octree-based latent\nrepresentation and the accompanying diffusion models. The representation\ncombines the benefits of both implicit neural representations and explicit\nspatial octrees and is learned with an octree-based variational autoencoder.\nThe proposed diffusion model is a unified multi-scale U-Net that enables\nweights and computation sharing across different octree levels and avoids the\ncomplexity of widely used cascaded diffusion schemes. We verify the\neffectiveness of OctFusion on the ShapeNet and Objaverse datasets and achieve\nstate-of-the-art performances on shape generation tasks. We demonstrate that\nOctFusion is extendable and flexible by generating high-quality color fields\nfor textured mesh generation and high-quality 3D shapes conditioned on text\nprompts, sketches, or category labels. Our code and pre-trained models are\navailable at \\url{https://github.com/octree-nn/octfusion}.\n","authors":["Bojun Xiong","Si-Tong Wei","Xin-Yang Zheng","Yan-Pei Cao","Zhouhui Lian","Peng-Shuai Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14732v1.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2408.14724v1","updated":"2024-08-27T01:28:15Z","published":"2024-08-27T01:28:15Z","title":"GeoTransfer : Generalizable Few-Shot Multi-View Reconstruction via\n Transfer Learning","summary":" This paper presents a novel approach for sparse 3D reconstruction by\nleveraging the expressive power of Neural Radiance Fields (NeRFs) and fast\ntransfer of their features to learn accurate occupancy fields. Existing 3D\nreconstruction methods from sparse inputs still struggle with capturing\nintricate geometric details and can suffer from limitations in handling\noccluded regions. On the other hand, NeRFs excel in modeling complex scenes but\ndo not offer means to extract meaningful geometry. Our proposed method offers\nthe best of both worlds by transferring the information encoded in NeRF\nfeatures to derive an accurate occupancy field representation. We utilize a\npre-trained, generalizable state-of-the-art NeRF network to capture detailed\nscene radiance information, and rapidly transfer this knowledge to train a\ngeneralizable implicit occupancy network. This process helps in leveraging the\nknowledge of the scene geometry encoded in the generalizable NeRF prior and\nrefining it to learn occupancy fields, facilitating a more precise\ngeneralizable representation of 3D space. The transfer learning approach leads\nto a dramatic reduction in training time, by orders of magnitude (i.e. 
from\nseveral days to 3.5 hrs), obviating the need to train generalizable sparse\nsurface reconstruction methods from scratch. Additionally, we introduce a novel\nloss on volumetric rendering weights that helps in the learning of accurate\noccupancy fields, along with a normal loss that helps in global smoothing of\nthe occupancy fields. We evaluate our approach on the DTU dataset and\ndemonstrate state-of-the-art performance in terms of reconstruction accuracy,\nespecially in challenging scenarios with sparse input data and occluded\nregions. We furthermore demonstrate the generalization capabilities of our\nmethod by showing qualitative results on the Blended MVS dataset without any\nretraining.\n","authors":["Shubhendu Jena","Franck Multon","Adnane Boukhayma"],"pdf_url":"https://arxiv.org/pdf/2408.14724v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.13621v5","updated":"2024-08-27T01:23:50Z","published":"2024-04-21T11:21:27Z","title":"Attack on Scene Flow using Point Clouds","summary":" Deep neural networks have made significant advancements in accurately\nestimating scene flow using point clouds, which is vital for many applications\nlike video analysis, action recognition, and navigation. The robustness of\nthese techniques, however, remains a concern, particularly in the face of\nadversarial attacks that have been proven to deceive state-of-the-art deep\nneural networks in many domains. Surprisingly, the robustness of scene flow\nnetworks against such attacks has not been thoroughly investigated. To address\nthis problem, the proposed approach aims to bridge this gap by introducing\nadversarial white-box attacks specifically tailored for scene flow networks.\nExperimental results show that the generated adversarial examples obtain up to\n33.7 relative degradation in average end-point error on the KITTI and\nFlyingThings3D datasets. The study also reveals the significant impact that\nattacks targeting point clouds in only one dimension or color channel have on\naverage end-point error. Analyzing the success and failure of these attacks on\nthe scene flow networks and their 2D optical flow network variants shows a\nhigher vulnerability for the optical flow networks. Code is available at\nhttps://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git.\n","authors":["Haniyeh Ehsani Oskouie","Mohammad-Shahram Moin","Shohreh Kasaei"],"pdf_url":"https://arxiv.org/pdf/2404.13621v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14723v1","updated":"2024-08-27T01:23:49Z","published":"2024-08-27T01:23:49Z","title":"Snap and Diagnose: An Advanced Multimodal Retrieval System for\n Identifying Plant Diseases in the Wild","summary":" Plant disease recognition is a critical task that ensures crop health and\nmitigates the damage caused by diseases. A handy tool that enables farmers to\nreceive a diagnosis based on query pictures or the text description of\nsuspicious plants is in high demand for initiating treatment before potential\ndiseases spread further. In this paper, we develop a multimodal plant disease\nimage retrieval system to support disease search based on either image or text\nprompts. Specifically, we utilize the largest in-the-wild plant disease dataset\nPlantWild, which includes over 18,000 images across 89 categories, to provide a\ncomprehensive view of potential diseases relating to the query. 
Furthermore,\ncross-modal retrieval is achieved in the developed system, facilitated by a\nnovel CLIP-based vision-language model that encodes both disease descriptions\nand disease images into the same latent space. Built on top of the retriever,\nour retrieval system allows users to upload either plant disease images or\ndisease descriptions to retrieve the corresponding images with similar\ncharacteristics from the disease dataset to suggest candidate diseases for end\nusers' consideration.\n","authors":["Tianqi Wei","Zhi Chen","Xin Yu"],"pdf_url":"https://arxiv.org/pdf/2408.14723v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.06889v3","updated":"2024-08-27T00:03:31Z","published":"2024-07-09T14:18:35Z","title":"A Neurosymbolic Approach to Adaptive Feature Extraction in SLAM","summary":" Autonomous robots, autonomous vehicles, and humans wearing mixed-reality\nheadsets require accurate and reliable tracking services for safety-critical\napplications in dynamically changing real-world environments. However, the\nexisting tracking approaches, such as Simultaneous Localization and Mapping\n(SLAM), do not adapt well to environmental changes and boundary conditions\ndespite extensive manual tuning. On the other hand, while deep learning-based\napproaches can better adapt to environmental changes, they typically demand\nsubstantial data for training and often lack flexibility in adapting to new\ndomains. To solve this problem, we propose leveraging the neurosymbolic program\nsynthesis approach to construct adaptable SLAM pipelines that integrate the\ndomain knowledge from traditional SLAM approaches while leveraging data to\nlearn complex relationships. While the approach can synthesize end-to-end SLAM\npipelines, we focus on synthesizing the feature extraction module. We first\ndevise a domain-specific language (DSL) that can encapsulate domain knowledge\non the important attributes for feature extraction and the real-world\nperformance of various feature extractors. Our neurosymbolic architecture then\nundertakes adaptive feature extraction, optimizing parameters via learning\nwhile employing symbolic reasoning to select the most suitable feature\nextractor. Our evaluations demonstrate that our approach, neurosymbolic Feature\nEXtraction (nFEX), yields higher-quality features. It also reduces the pose\nerror observed for the state-of-the-art baseline feature extractors ORB and\nSIFT by up to 90% and up to 66%, respectively, thereby enhancing the system's\nefficiency and adaptability to novel environments.\n","authors":["Yasra Chandio","Momin A. Khan","Khotso Selialia","Luis Garcia","Joseph DeGol","Fatima M. Anwar"],"pdf_url":"https://arxiv.org/pdf/2407.06889v3.pdf","comment":"8 pages, 6 figures, and 5 tables. Published at the 2024 IEEE/RSJ\n International Conference on Intelligent Robots and Systems (IROS).\n Corresponding author: Yasra Chandio (ychandio@umass.edu)"},{"id":"http://arxiv.org/abs/2408.15447v1","updated":"2024-08-27T23:53:52Z","published":"2024-08-27T23:53:52Z","title":"Fine-grained length controllable video captioning with ordinal\n embeddings","summary":" This paper proposes a method for video captioning that controls the length of\ngenerated captions. Previous work on length control often had few levels for\nexpressing length. In this study, we propose two methods of length embedding\nfor fine-grained length control. A traditional embedding method is linear,\nusing a one-hot vector and an embedding matrix. 
In this study, we propose\nmethods that represent length in multi-hot vectors. One is bit embedding that\nexpresses length in bit representation, and the other is ordinal embedding that\nuses the binary representation often used in ordinal regression. These length\nrepresentations of multi-hot vectors are converted into length embedding by a\nnonlinear MLP. This method allows for not only the length control of caption\nsentences but also the control of the time when reading the caption.\nExperiments using ActivityNet Captions and Spoken Moments in Time show that the\nproposed method effectively controls the length of the generated captions.\nAnalysis of the embedding vectors with ICA shows that length and semantics were\nlearned separately, demonstrating the effectiveness of the proposed embedding\nmethods.\n","authors":["Tomoya Nitta","Takumi Fukuzawa","Toru Tamaki"],"pdf_url":"https://arxiv.org/pdf/2408.15447v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.15891v4","updated":"2024-08-27T22:09:19Z","published":"2024-04-24T14:29:26Z","title":"OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian\n Segmentation","summary":" Recent advancements in 3D reconstruction technologies have paved the way for\nhigh-quality and real-time rendering of complex 3D scenes. Despite these\nachievements, a notable challenge persists: it is difficult to precisely\nreconstruct specific objects from large scenes. Current scene reconstruction\ntechniques frequently result in the loss of object detail textures and are\nunable to reconstruct object portions that are occluded or unseen in views. To\naddress this challenge, we delve into the meticulous 3D reconstruction of\nspecific objects within large scenes and propose a framework termed OMEGAS:\nObject Mesh Extraction from Large Scenes Guided by Gaussian Segmentation.\nSpecifically, we proposed a novel 3D target segmentation technique based on 2D\nGaussian Splatting, which segments 3D consistent target masks in multi-view\nscene images and generates a preliminary target model. Moreover, to reconstruct\nthe unseen portions of the target, we propose a novel target replenishment\ntechnique driven by large-scale generative diffusion priors. We demonstrate\nthat our method can accurately reconstruct specific targets from large scenes,\nboth quantitatively and qualitatively. Our experiments show that OMEGAS\nsignificantly outperforms existing reconstruction methods across various\nscenarios. Our project page is at: https://github.com/CrystalWlz/OMEGAS\n","authors":["Lizhi Wang","Feng Zhou","Bo yu","Pu Cao","Jianqin Yin"],"pdf_url":"https://arxiv.org/pdf/2404.15891v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15428v1","updated":"2024-08-27T22:05:44Z","published":"2024-08-27T22:05:44Z","title":"HEAD: A Bandwidth-Efficient Cooperative Perception Approach for\n Heterogeneous Connected and Autonomous Vehicles","summary":" In cooperative perception studies, there is often a trade-off between\ncommunication bandwidth and perception performance. While current feature\nfusion solutions are known for their excellent object detection performance,\ntransmitting the entire sets of intermediate feature maps requires substantial\nbandwidth. Furthermore, these fusion approaches are typically limited to\nvehicles that use identical detection models. Our goal is to develop a solution\nthat supports cooperative perception across vehicles equipped with different\nmodalities of sensors. 
This method aims to deliver improved perception\nperformance compared to late fusion techniques, while achieving precision\nsimilar to the state-of-art intermediate fusion, but requires an order of\nmagnitude less bandwidth. We propose HEAD, a method that fuses features from\nthe classification and regression heads in 3D object detection networks. Our\nmethod is compatible with heterogeneous detection networks such as LiDAR\nPointPillars, SECOND, VoxelNet, and camera Bird's-eye View (BEV) Encoder. Given\nthe naturally smaller feature size in the detection heads, we design a\nself-attention mechanism to fuse the classification head and a complementary\nfeature fusion layer to fuse the regression head. Our experiments,\ncomprehensively evaluated on the V2V4Real and OPV2V datasets, demonstrate that\nHEAD is a fusion method that effectively balances communication bandwidth and\nperception performance.\n","authors":["Deyuan Qu","Qi Chen","Yongqi Zhu","Yihao Zhu","Sergei S. Avedisov","Song Fu","Qing Yang"],"pdf_url":"https://arxiv.org/pdf/2408.15428v1.pdf","comment":"Accepted by ECCV 2024 Workshop"},{"id":"http://arxiv.org/abs/2307.11986v2","updated":"2024-08-27T21:25:39Z","published":"2023-07-22T05:34:18Z","title":"Expert Knowledge-Aware Image Difference Graph Representation Learning\n for Difference-Aware Medical Visual Question Answering","summary":" To contribute to automating the medical vision-language model, we propose a\nnovel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair\nof main and reference images, this task attempts to answer several questions on\nboth diseases and, more importantly, the differences between them. This is\nconsistent with the radiologist's diagnosis practice that compares the current\nimage with the reference before concluding the report. We collect a new\ndataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs\nof main and reference images. Compared to existing medical VQA datasets, our\nquestions are tailored to the Assessment-Diagnosis-Intervention-Evaluation\ntreatment procedure used by clinical professionals. Meanwhile, we also propose\na novel expert knowledge-aware graph representation learning model to address\nthis task. The proposed baseline model leverages expert knowledge such as\nanatomical structure prior, semantic, and spatial knowledge to construct a\nmulti-relationship graph, representing the image differences between two images\nfor the image difference VQA task. The dataset and code can be found at\nhttps://github.com/Holipori/MIMIC-Diff-VQA. We believe this work would further\npush forward the medical vision language model.\n","authors":["Xinyue Hu","Lin Gu","Qiyuan An","Mengliang Zhang","Liangchen Liu","Kazuma Kobayashi","Tatsuya Harada","Ronald M. Summers","Yingying Zhu"],"pdf_url":"https://arxiv.org/pdf/2307.11986v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.01949v2","updated":"2024-08-27T21:05:09Z","published":"2023-09-05T04:55:10Z","title":"Variational Bayesian Imaging with an Efficient Surrogate Score-based\n Prior","summary":" We propose a surrogate function for efficient yet principled use of\nscore-based priors in Bayesian imaging. We consider ill-posed inverse imaging\nproblems in which one aims for a clean image posterior given incomplete or\nnoisy measurements. Since the measurements do not uniquely determine a true\nimage, a prior is needed to constrain the solution space. 
Recent work turned\nscore-based diffusion models into principled priors for solving ill-posed\nimaging problems by appealing to an ODE-based log-probability function.\nHowever, evaluating the ODE is computationally inefficient and inhibits\nposterior estimation of high-dimensional images. Our proposed surrogate prior\nis based on the evidence lower bound of a score-based diffusion model. We\ndemonstrate the surrogate prior on variational inference for efficient\napproximate posterior sampling of large images. Compared to the exact prior in\nprevious work, our surrogate accelerates optimization of the variational image\ndistribution by at least two orders of magnitude. We also find that our\nprincipled approach gives more accurate posterior estimation than\nnon-variational diffusion-based approaches that involve hyperparameter-tuning\nat inference. Our work establishes a practical path forward for using\nscore-based diffusion models as general-purpose image priors.\n","authors":["Berthy T. Feng","Katherine L. Bouman"],"pdf_url":"https://arxiv.org/pdf/2309.01949v2.pdf","comment":"Published in Transactions on Machine Learning Research (TMLR) August\n 2024"},{"id":"http://arxiv.org/abs/2302.11552v5","updated":"2024-08-27T20:56:53Z","published":"2023-02-22T18:48:46Z","title":"Reduce, Reuse, Recycle: Compositional Generation with Energy-Based\n Diffusion Models and MCMC","summary":" Since their introduction, diffusion models have quickly become the prevailing\napproach to generative modeling in many domains. They can be interpreted as\nlearning the gradients of a time-varying sequence of log-probability density\nfunctions. This interpretation has motivated classifier-based and\nclassifier-free guidance as methods for post-hoc control of diffusion models.\nIn this work, we build upon these ideas using the score-based interpretation of\ndiffusion models, and explore alternative ways to condition, modify, and reuse\ndiffusion models for tasks involving compositional generation and guidance. In\nparticular, we investigate why certain types of composition fail using current\ntechniques and present a number of solutions. We conclude that the sampler (not\nthe model) is responsible for this failure and propose new samplers, inspired\nby MCMC, which enable successful compositional generation. Further, we propose\nan energy-based parameterization of diffusion models which enables the use of\nnew compositional operators and more sophisticated, Metropolis-corrected\nsamplers. Intriguingly we find these samplers lead to notable improvements in\ncompositional generation across a wide set of problems such as\nclassifier-guided ImageNet modeling and compositional text-to-image generation.\n","authors":["Yilun Du","Conor Durkan","Robin Strudel","Joshua B. Tenenbaum","Sander Dieleman","Rob Fergus","Jascha Sohl-Dickstein","Arnaud Doucet","Will Grathwohl"],"pdf_url":"https://arxiv.org/pdf/2302.11552v5.pdf","comment":"ICML 2023, Project Webpage:\n https://energy-based-model.github.io/reduce-reuse-recycle/"},{"id":"http://arxiv.org/abs/2408.15398v1","updated":"2024-08-27T20:49:11Z","published":"2024-08-27T20:49:11Z","title":"Evaluating Pre-Training Bias on Severe Acute Respiratory Syndrome\n Dataset","summary":" Machine learning (ML) is a growing field of computer science that has found\nmany practical applications in several domains, including Health. 
However, as\ndata grows in size and availability, and the number of models that aim to aid\nor replace human decisions, it raises the concern that these models can be\nsusceptible to bias, which can lead to harm to specific individuals by basing\nits decisions on protected attributes such as gender, religion, sexual\norientation, ethnicity, and others. Visualization techniques might generate\ninsights and help summarize large datasets, enabling data scientists to\nunderstand the data better before training a model by evaluating pre-training\nmetrics applied to the datasets before training, which might contribute to\nidentifying potential harm before any effort is put into training and deploying\nthe models. This work uses the severe acute respiratory syndrome dataset from\nOpenDataSUS to visualize three pre-training bias metrics and their distribution\nacross different regions in Brazil. A random forest model is trained in each\nregion and applied to the others. The aim is to compare the bias for the\ndifferent regions, focusing on their protected attributes and comparing the\nmodel's performance with the metric values.\n","authors":["Diego Dimer Rodrigues"],"pdf_url":"https://arxiv.org/pdf/2408.15398v1.pdf","comment":"short paper for eurovis, 5 pages"},{"id":"http://arxiv.org/abs/2408.15388v1","updated":"2024-08-27T20:14:42Z","published":"2024-08-27T20:14:42Z","title":"Panoptic Perception for Autonomous Driving: A Survey","summary":" Panoptic perception represents a forefront advancement in autonomous driving\ntechnology, unifying multiple perception tasks into a singular, cohesive\nframework to facilitate a thorough understanding of the vehicle's surroundings.\nThis survey reviews typical panoptic perception models for their unique inputs\nand architectures and compares them to performance, responsiveness, and\nresource utilization. It also delves into the prevailing challenges faced in\npanoptic perception and explores potential trajectories for future research.\nOur goal is to furnish researchers in autonomous driving with a detailed\nsynopsis of panoptic perception, positioning this survey as a pivotal reference\nin the ever-evolving landscape of autonomous driving technologies.\n","authors":["Yunge Li","Lanyu Xu"],"pdf_url":"https://arxiv.org/pdf/2408.15388v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15386v1","updated":"2024-08-27T20:08:33Z","published":"2024-08-27T20:08:33Z","title":"Multi-Feature Aggregation in Diffusion Models for Enhanced Face\n Super-Resolution","summary":" Super-resolution algorithms often struggle with images from surveillance\nenvironments due to adverse conditions such as unknown degradation, variations\nin pose, irregular illumination, and occlusions. However, acquiring multiple\nimages, even of low quality, is possible with surveillance cameras. In this\nwork, we develop an algorithm based on diffusion models that utilize a\nlow-resolution image combined with features extracted from multiple low-quality\nimages to generate a super-resolved image while minimizing distortions in the\nindividual's identity. Unlike other algorithms, our approach recovers facial\nfeatures without explicitly providing attribute information or without the need\nto calculate a gradient of a function during the reconstruction process. To the\nbest of our knowledge, this is the first time multi-features combined with\nlow-resolution images are used as conditioners to generate more reliable\nsuper-resolution images using stochastic differential equations. 
The FFHQ\ndataset was employed for training, resulting in state-of-the-art performance in\nfacial recognition and verification metrics when evaluated on the CelebA and\nQuis-Campi datasets. Our code is publicly available at\nhttps://github.com/marcelowds/fasr\n","authors":["Marcelo dos Santos","Rayson Laroca","Rafael O. Ribeiro","João C. Neves","David Menotti"],"pdf_url":"https://arxiv.org/pdf/2408.15386v1.pdf","comment":"Accepted for presentation at the Conference on Graphics, Patterns and\n Images (SIBGRAPI) 2024"},{"id":"http://arxiv.org/abs/2408.01959v2","updated":"2024-08-27T19:57:45Z","published":"2024-08-04T08:26:58Z","title":"Dataset Scale and Societal Consistency Mediate Facial Impression Bias in\n Vision-Language AI","summary":" Multimodal AI models capable of associating images and text hold promise for\nnumerous domains, ranging from automated image captioning to accessibility\napplications for blind and low-vision users. However, uncertainty about bias\nhas in some cases limited their adoption and availability. In the present work,\nwe study 43 CLIP vision-language models to determine whether they learn\nhuman-like facial impression biases, and we find evidence that such biases are\nreflected across three distinct CLIP model families. We show for the first time\nthat the the degree to which a bias is shared across a society predicts the\ndegree to which it is reflected in a CLIP model. Human-like impressions of\nvisually unobservable attributes, like trustworthiness and sexuality, emerge\nonly in models trained on the largest dataset, indicating that a better fit to\nuncurated cultural data results in the reproduction of increasingly subtle\nsocial biases. Moreover, we use a hierarchical clustering approach to show that\ndataset size predicts the extent to which the underlying structure of facial\nimpression bias resembles that of facial impression bias in humans. Finally, we\nshow that Stable Diffusion models employing CLIP as a text encoder learn facial\nimpression biases, and that these biases intersect with racial biases in Stable\nDiffusion XL-Turbo. While pretrained CLIP models may prove useful for\nscientific studies of bias, they will also require significant dataset curation\nwhen intended for use as general-purpose models in a zero-shot setting.\n","authors":["Robert Wolfe","Aayushi Dangol","Alexis Hiniker","Bill Howe"],"pdf_url":"https://arxiv.org/pdf/2408.01959v2.pdf","comment":"Accepted at Artificial Intelligence, Ethics, and Society 2024"},{"id":"http://arxiv.org/abs/2312.06731v6","updated":"2024-08-27T19:51:13Z","published":"2023-12-11T09:44:41Z","title":"Genixer: Empowering Multimodal Large Language Models as a Powerful Data\n Generator","summary":" Multimodal Large Language Models (MLLMs) demonstrate exceptional\nproblem-solving capabilities, but few research studies aim to gauge the ability\nto generate visual instruction tuning data. This paper proposes to explore the\npotential of empowering MLLMs to generate data independently without relying on\nGPT-4. We introduce Genixer, a comprehensive data generation pipeline\nconsisting of four key steps: (i) instruction data collection, (ii) instruction\ntemplate design, (iii) empowering MLLMs, and (iv) data generation and\nfiltering. Additionally, we outline two modes of data generation: task-agnostic\nand task-specific, enabling controllable output. We demonstrate that a\nsynthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out\nof 12 multimodal benchmarks. 
Additionally, the grounding MLLM Shikra, when\ntrained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC\ndatasets. Through experiments and synthetic data analysis, our findings are:\n(1) current MLLMs can serve as robust data generators without assistance from\nGPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in\ngenerating complex instruction tuning data; (3) synthetic datasets enhance\nperformance across various multimodal benchmarks and help mitigate model\nhallucinations. The data, code, and models can be found at\nhttps://github.com/zhaohengyuan1/Genixer.\n","authors":["Henry Hengyuan Zhao","Pan Zhou","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2312.06731v6.pdf","comment":"Accepted by ECCV 2024"},{"id":"http://arxiv.org/abs/2408.15374v1","updated":"2024-08-27T19:22:06Z","published":"2024-08-27T19:22:06Z","title":"CycleGAN with Better Cycles","summary":" CycleGAN provides a framework to train image-to-image translation with\nunpaired datasets using cycle consistency loss [4]. While results are great in\nmany applications, the pixel level cycle consistency can potentially be\nproblematic and causes unrealistic images in certain cases. In this project, we\npropose three simple modifications to cycle consistency, and show that such an\napproach achieves better results with fewer artifacts.\n","authors":["Tongzhou Wang","Yihan Lin"],"pdf_url":"https://arxiv.org/pdf/2408.15374v1.pdf","comment":"Technical Report 2018"},{"id":"http://arxiv.org/abs/2408.15373v1","updated":"2024-08-27T19:13:15Z","published":"2024-08-27T19:13:15Z","title":"Handling Geometric Domain Shifts in Semantic Segmentation of Surgical\n RGB and Hyperspectral Images","summary":" Robust semantic segmentation of intraoperative image data holds promise for\nenabling automatic surgical scene understanding and autonomous robotic surgery.\nWhile model development and validation are primarily conducted on idealistic\nscenes, geometric domain shifts, such as occlusions of the situs, are common in\nreal-world open surgeries. To close this gap, we (1) present the first analysis\nof state-of-the-art (SOA) semantic segmentation models when faced with\ngeometric out-of-distribution (OOD) data, and (2) propose an augmentation\ntechnique called \"Organ Transplantation\", to enhance generalizability. Our\ncomprehensive validation on six different OOD datasets, comprising 600 RGB and\nhyperspectral imaging (HSI) cubes from 33 pigs, each annotated with 19 classes,\nreveals a large performance drop in SOA organ segmentation models on geometric\nOOD data. This performance decline is observed not only in conventional RGB\ndata (with a dice similarity coefficient (DSC) drop of 46 %) but also in HSI\ndata (with a DSC drop of 45 %), despite the richer spectral information\ncontent. The performance decline increases with the spatial granularity of the\ninput data. Our augmentation technique improves SOA model performance by up to\n67 % for RGB data and 90 % for HSI data, achieving performance at the level of\nin-distribution performance on real OOD test data. Given the simplicity and\neffectiveness of our augmentation method, it is a valuable tool for addressing\ngeometric domain shifts in surgical scene segmentation, regardless of the\nunderlying model. Our code and pre-trained models are publicly available at\nhttps://github.com/IMSY-DKFZ/htc.\n","authors":["Silvia Seidlitz","Jan Sellner","Alexander Studier-Fischer","Alessandro Motta","Berkin Özdemir","Beat P. 
Müller-Stich","Felix Nickel","Lena Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2408.15373v1.pdf","comment":"Silvia Seidlitz and Jan Sellner contributed equally"},{"id":"http://arxiv.org/abs/2408.13912v2","updated":"2024-08-27T19:06:57Z","published":"2024-08-25T18:27:20Z","title":"Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs","summary":" In this paper, we introduce Splatt3R, a pose-free, feed-forward method for\nin-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given\nuncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without\nrequiring any camera parameters or depth information. For generalizability, we\nbuild Splatt3R upon a ``foundation'' 3D geometry reconstruction method, MASt3R,\nby extending it to deal with both 3D structure and appearance. Specifically,\nunlike the original MASt3R which reconstructs only 3D point clouds, we predict\nthe additional Gaussian attributes required to construct a Gaussian primitive\nfor each point. Hence, unlike other novel view synthesis methods, Splatt3R is\nfirst trained by optimizing the 3D point cloud's geometry loss, and then a\nnovel view synthesis objective. By doing this, we avoid the local minima\npresent in training 3D Gaussian Splats from stereo views. We also propose a\nnovel loss masking strategy that we empirically find is critical for strong\nperformance on extrapolated viewpoints. We train Splatt3R on the ScanNet++\ndataset and demonstrate excellent generalisation to uncalibrated, in-the-wild\nimages. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and\nthe resultant splats can be rendered in real-time.\n","authors":["Brandon Smart","Chuanxia Zheng","Iro Laina","Victor Adrian Prisacariu"],"pdf_url":"https://arxiv.org/pdf/2408.13912v2.pdf","comment":"Our project page can be found at: https://splatt3r.active.vision/"},{"id":"http://arxiv.org/abs/2301.06267v5","updated":"2024-08-27T19:00:47Z","published":"2023-01-16T05:40:42Z","title":"Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with\n Multimodal Models","summary":" The ability to quickly learn a new task with minimal instruction - known as\nfew-shot learning - is a central aspect of intelligent agents. Classical\nfew-shot benchmarks make use of few-shot samples from a single modality, but\nsuch samples may not be sufficient to characterize an entire concept class. In\ncontrast, humans use cross-modal information to learn new concepts efficiently.\nIn this work, we demonstrate that one can indeed build a better ${\\bf visual}$\ndog classifier by ${\\bf read}$ing about dogs and ${\\bf listen}$ing to them\nbark. To do so, we exploit the fact that recent multimodal foundation models\nsuch as CLIP learn cross-modal encoders that map different modalities to the\nsame representation space. Specifically, we propose a simple strategy for ${\\bf\ncross-modal}$ ${\\bf adaptation}$: we treat examples from different modalities\nas additional few-shot examples. For example, by simply repurposing class names\nas an additional training sample, we trivially turn any n-shot learning problem\ninto a (n+1)-shot problem. This allows us to produce SOTA results with\nembarrassingly simple linear classifiers. We show that our approach can be\ncombined with existing methods such as prefix tuning, adapters, and classifier\nensembling. 
Finally, to explore other modalities beyond vision and language, we\nconstruct the first (to our knowledge) audiovisual few-shot benchmark and use\ncross-modal training to improve the performance of both image and audio\nclassification.\n","authors":["Zhiqiu Lin","Samuel Yu","Zhiyi Kuang","Deepak Pathak","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2301.06267v5.pdf","comment":"Published at CVPR 2023. Project site:\n https://linzhiqiu.github.io/papers/cross_modal/"},{"id":"http://arxiv.org/abs/2406.14568v2","updated":"2024-08-27T18:42:09Z","published":"2024-04-29T23:53:42Z","title":"Policy Gradient-Driven Noise Mask","summary":" Deep learning classifiers face significant challenges when dealing with\nheterogeneous multi-modal and multi-organ biomedical datasets. The low-level\nfeature distinguishability limited to imaging-modality hinders the classifiers'\nability to learn high-level semantic relationships, resulting in sub-optimal\nperformance. To address this issue, image augmentation strategies are employed\nas regularization techniques. While additive noise input during network\ntraining is a well-established augmentation as regularization method, modern\npipelines often favor more robust techniques such as dropout and weight decay.\nThis preference stems from the observation that combining these established\ntechniques with noise input can adversely affect model performance.\n In this study, we propose a novel pretraining pipeline that learns to\ngenerate conditional noise mask specifically tailored to improve performance on\nmulti-modal and multi-organ datasets. As a reinforcement learning algorithm,\nour approach employs a dual-component system comprising a very light-weight\npolicy network that learns to sample conditional noise using a differentiable\nbeta distribution as well as a classifier network. The policy network is\ntrained using the reinforce algorithm to generate image-specific noise masks\nthat regularize the classifier during pretraining. A key aspect is that the\npolicy network's role is limited to obtaining an intermediate (or heated) model\nbefore fine-tuning. During inference, the policy network is omitted, allowing\ndirect comparison between the baseline and noise-regularized models.\n We conducted experiments and related analyses on RadImageNet datasets.\nResults demonstrate that fine-tuning the intermediate models consistently\noutperforms conventional training algorithms on both classification and\ngeneralization to unseen concept tasks.\n","authors":["Mehmet Can Yavuz","Yang Yang"],"pdf_url":"https://arxiv.org/pdf/2406.14568v2.pdf","comment":"13 pages; 8 figures; 5 tables"},{"id":"http://arxiv.org/abs/2403.10170v2","updated":"2024-08-27T18:36:12Z","published":"2024-03-15T10:26:52Z","title":"Computer User Interface Understanding. A New Dataset and a Learning\n Framework","summary":" User Interface (UI) understanding has been an increasingly popular topic over\nthe last few years. So far, there has been a vast focus solely on web and\nmobile applications. In this paper, we introduce the harder task of computer UI\nunderstanding. With the goal of enabling research in this field, we have\ngenerated a dataset with a set of videos where a user is performing a sequence\nof actions and each image shows the desktop contents at that time point. We\nalso present a framework that is composed of a synthetic sample generation\npipeline to augment the dataset with relevant characteristics, and a\ncontrastive learning method to classify images in the videos. 
We take advantage\nof the natural conditional, tree-like, relationship of the images'\ncharacteristics to regularize the learning of the representations by dealing\nwith multiple partial tasks simultaneously. Experimental results show that the\nproposed framework outperforms previously proposed hierarchical multi-label\ncontrastive losses in fine-grain UI classification.\n","authors":["Andrés Muñoz","Daniel Borrajo"],"pdf_url":"https://arxiv.org/pdf/2403.10170v2.pdf","comment":"14 pages main paper, 6 pages appendix"},{"id":"http://arxiv.org/abs/2408.15355v1","updated":"2024-08-27T18:27:47Z","published":"2024-08-27T18:27:47Z","title":"Optimizing Lung Cancer Detection in CT Imaging: A Wavelet Multi-Layer\n Perceptron (WMLP) Approach Enhanced by Dragonfly Algorithm (DA)","summary":" Lung cancer stands as the preeminent cause of cancer-related mortality\nglobally. Prompt and precise diagnosis, coupled with effective treatment, is\nimperative to reduce the fatality rates associated with this formidable\ndisease. This study introduces a cutting-edge deep learning framework for the\nclassification of lung cancer from CT scan imagery. The research encompasses a\nsuite of image pre-processing strategies, notably Canny edge detection, and\nwavelet transformations, which precede the extraction of salient features and\nsubsequent classification via a Multi-Layer Perceptron (MLP). The optimization\nprocess is further refined using the Dragonfly Algorithm (DA). The methodology\nput forth has attained an impressive training and testing accuracy of 99.82\\%,\nunderscoring its efficacy and reliability in the accurate diagnosis of lung\ncancer.\n","authors":["Bitasadat Jamshidi","Nastaran Ghorbani","Mohsen Rostamy-Malkhalifeh"],"pdf_url":"https://arxiv.org/pdf/2408.15355v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2408.15232v1","updated":"2024-08-27T17:50:03Z","published":"2024-08-27T17:50:03Z","title":"Into the Unknown Unknowns: Engaged Human Learning through Participation\n in Language Model Agent Conversations","summary":" While language model (LM)-powered chatbots and generative search engines\nexcel at answering concrete queries, discovering information in the terrain of\nunknown unknowns remains challenging for users. To emulate the common\neducational scenario where children/students learn by listening to and\nparticipating in conversations of their parents/teachers, we create\nCollaborative STORM (Co-STORM). Unlike QA systems that require users to ask all\nthe questions, Co-STORM lets users observe and occasionally steer the discourse\namong several LM agents. The agents ask questions on the user's behalf,\nallowing the user to discover unknown unknowns serendipitously. To facilitate\nuser interaction, Co-STORM assists users in tracking the discourse by\norganizing the uncovered information into a dynamic mind map, ultimately\ngenerating a comprehensive report as takeaways. For automatic evaluation, we\nconstruct the WildSeek dataset by collecting real information-seeking records\nwith user goals. Co-STORM outperforms baseline methods on both discourse trace\nand report quality. In a further human evaluation, 70% of participants prefer\nCo-STORM over a search engine, and 78% favor it over a RAG chatbot.\n","authors":["Yucheng Jiang","Yijia Shao","Dekun Ma","Sina J. Semnani","Monica S. 
Lam"],"pdf_url":"https://arxiv.org/pdf/2408.15232v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15172v1","updated":"2024-08-27T16:10:21Z","published":"2024-08-27T16:10:21Z","title":"X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation","summary":" Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been\nshown to enhance the effectiveness of enriching item descriptions, thereby\nimproving the accuracy of recommendation systems. However, most existing\napproaches either rely on text-only prompting or employ basic multimodal\nstrategies that do not fully exploit the complementary information available\nfrom both textual and visual modalities. This paper introduces a novel\nframework, Cross-Reflection Prompting, termed X-Reflect, designed to address\nthese limitations by prompting LMMs to explicitly identify and reconcile\nsupportive and conflicting information between text and images. By capturing\nnuanced insights from both modalities, this approach generates more\ncomprehensive and contextually richer item representations. Extensive\nexperiments conducted on two widely used benchmarks demonstrate that our method\noutperforms existing prompting baselines in downstream recommendation accuracy.\nAdditionally, we evaluate the generalizability of our framework across\ndifferent LMM backbones and the robustness of the prompting strategies,\noffering insights for optimization. This work underscores the importance of\nintegrating multimodal information and presents a novel solution for improving\nitem understanding in multimodal recommendation systems.\n","authors":["Hanjia Lyu","Ryan Rossi","Xiang Chen","Md Mehrab Tanjim","Stefano Petrangeli","Somdeb Sarkhel","Jiebo Luo"],"pdf_url":"https://arxiv.org/pdf/2408.15172v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.16828v2","updated":"2024-08-27T15:07:28Z","published":"2024-07-23T20:38:23Z","title":"Pareto Front Approximation for Multi-Objective Session-Based Recommender\n Systems","summary":" This work introduces MultiTRON, an approach that adapts Pareto front\napproximation techniques to multi-objective session-based recommender systems\nusing a transformer neural network. Our approach optimizes trade-offs between\nkey metrics such as click-through and conversion rates by training on sampled\npreference vectors. A significant advantage is that after training, a single\nmodel can access the entire Pareto front, allowing it to be tailored to meet\nthe specific requirements of different stakeholders by adjusting an additional\ninput vector that weights the objectives. We validate the model's performance\nthrough extensive offline and online evaluation. For broader application and\nresearch, the source code is made available at\nhttps://github.com/otto-de/MultiTRON. 
The results confirm the model's ability\nto manage multiple recommendation objectives effectively, offering a flexible\ntool for diverse business needs.\n","authors":["Timo Wilm","Philipp Normann","Felix Stepprath"],"pdf_url":"https://arxiv.org/pdf/2407.16828v2.pdf","comment":"Accepted at the Eighteenth ACM Conference on Recommender Systems\n (RecSys '24)"},{"id":"http://arxiv.org/abs/2402.09766v2","updated":"2024-08-27T13:01:56Z","published":"2024-02-15T07:35:52Z","title":"From Variability to Stability: Advancing RecSys Benchmarking Practices","summary":" In the rapidly evolving domain of Recommender Systems (RecSys), new\nalgorithms frequently claim state-of-the-art performance based on evaluations\nover a limited set of arbitrarily selected datasets. However, this approach may\nfail to holistically reflect their effectiveness due to the significant impact\nof dataset characteristics on algorithm performance. Addressing this\ndeficiency, this paper introduces a novel benchmarking methodology to\nfacilitate a fair and robust comparison of RecSys algorithms, thereby advancing\nevaluation practices. By utilizing a diverse set of $30$ open datasets,\nincluding two introduced in this work, and evaluating $11$ collaborative\nfiltering algorithms across $9$ metrics, we critically examine the influence of\ndataset characteristics on algorithm performance. We further investigate the\nfeasibility of aggregating outcomes from multiple datasets into a unified\nranking. Through rigorous experimental analysis, we validate the reliability of\nour methodology under the variability of datasets, offering a benchmarking\nstrategy that balances quality and computational demands. This methodology\nenables a fair yet effective means of evaluating RecSys algorithms, providing\nvaluable guidance for future research endeavors.\n","authors":["Valeriy Shevchenko","Nikita Belousov","Alexey Vasilev","Vladimir Zholobov","Artyom Sosedka","Natalia Semenova","Anna Volodkevich","Andrey Savchenko","Alexey Zaytsev"],"pdf_url":"https://arxiv.org/pdf/2402.09766v2.pdf","comment":"8 pages with 11 figures"},{"id":"http://arxiv.org/abs/2408.15004v1","updated":"2024-08-27T12:41:37Z","published":"2024-08-27T12:41:37Z","title":"Measuring publication relatedness using controlled vocabularies","summary":" Measuring the relatedness between scientific publications has important\napplications in many areas of bibliometrics and science policy. Controlled\nvocabularies provide a promising basis for measuring relatedness because they\naddress issues that arise when using citation or textual similarity to measure\nrelatedness. While several controlled-vocabulary-based relatedness measures\nhave been developed, there exists no comprehensive and direct test of their\naccuracy and suitability for different types of research questions. This paper\nreviews existing measures, develops a new measure, and benchmarks the measures\nusing TREC Genomics data as a ground truth of topics. The benchmark test show\nthat the new measure and the measure proposed by Ahlgren et al. (2020) have\ndiffering strengths and weaknesses. 
These results inform a discussion of which\nmethod to choose when studying interdisciplinarity, information retrieval,\nclustering of science, and researcher topic switching.\n","authors":["Emil Dolmer Alnor"],"pdf_url":"https://arxiv.org/pdf/2408.15004v1.pdf","comment":"Accepted for presentation at the 28th International Conference on\n Science, Technology and Innovation Indicators, 2024"},{"id":"http://arxiv.org/abs/2408.15002v1","updated":"2024-08-27T12:34:41Z","published":"2024-08-27T12:34:41Z","title":"Knowledge Discovery in Optical Music Recognition: Enhancing Information\n Retrieval with Instance Segmentation","summary":" Optical Music Recognition (OMR) automates the transcription of musical\nnotation from images into machine-readable formats like MusicXML, MEI, or MIDI,\nsignificantly reducing the costs and time of manual transcription. This study\nexplores knowledge discovery in OMR by applying instance segmentation using\nMask R-CNN to enhance the detection and delineation of musical symbols in sheet\nmusic. Unlike Optical Character Recognition (OCR), OMR must handle the\nintricate semantics of Common Western Music Notation (CWMN), where symbol\nmeanings depend on shape, position, and context. Our approach leverages\ninstance segmentation to manage the density and overlap of musical symbols,\nfacilitating more precise information retrieval from music scores. Evaluations\non the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with\nour method achieving a mean Average Precision (mAP) of up to 59.70\\% in dense\nsymbol environments, achieving comparable results to object detection.\nFurthermore, using traditional computer vision techniques, we add a parallel\nstep for staff detection to infer the pitch for the recognised symbols. This\nstudy emphasises the role of pixel-wise segmentation in advancing accurate\nmusic symbol recognition, contributing to knowledge discovery in OMR. Our\nfindings indicate that instance segmentation provides more precise\nrepresentations of musical symbols, particularly in densely populated scores,\nadvancing OMR technology. We make our implementation, pre-processing scripts,\ntrained models, and evaluation results publicly available to support further\nresearch and development.\n","authors":["Elona Shatri","George Fazekas"],"pdf_url":"https://arxiv.org/pdf/2408.15002v1.pdf","comment":"8 pages content and one references, accepted version at the\n International Conference on Knowledge Discovery and Information Retrieval\n 2024, Porto, Portugal"},{"id":"http://arxiv.org/abs/2408.14968v1","updated":"2024-08-27T11:21:19Z","published":"2024-08-27T11:21:19Z","title":"MRSE: An Efficient Multi-modality Retrieval System for Large Scale\n E-commerce","summary":" Providing high-quality item recall for text queries is crucial in large-scale\ne-commerce search systems. Current Embedding-based Retrieval Systems (ERS)\nembed queries and items into a shared low-dimensional space, but uni-modality\nERS rely too heavily on textual features, making them unreliable in complex\ncontexts. While multi-modality ERS incorporate various data sources, they often\noverlook individual preferences for different modalities, leading to suboptimal\nresults. To address these issues, we propose MRSE, a Multi-modality Retrieval\nSystem that integrates text, item images, and user preferences through\nlightweight mixture-of-expert (LMoE) modules to better align features across\nand within modalities. 
MRSE also builds user profiles at a multi-modality level\nand introduces a novel hybrid loss function that enhances consistency and\nrobustness using hard negative sampling. Experiments on a large-scale dataset\nfrom Shopee and online A/B testing show that MRSE achieves an 18.9% improvement\nin offline relevance and a 3.7% gain in online core metrics compared to\nShopee's state-of-the-art uni-modality system.\n","authors":["Hao Jiang","Haoxiang Zhang","Qingshan Hou","Chaofeng Chen","Weisi Lin","Jingchang Zhang","Annan Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14968v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14908v1","updated":"2024-08-27T09:35:13Z","published":"2024-08-27T09:35:13Z","title":"Triplètoile: Extraction of Knowledge from Microblogging Text","summary":" Numerous methods and pipelines have recently emerged for the automatic\nextraction of knowledge graphs from documents such as scientific publications\nand patents. However, adapting these methods to incorporate alternative text\nsources like micro-blogging posts and news has proven challenging as they\nstruggle to model open-domain entities and relations, typically found in these\nsources. In this paper, we propose an enhanced information extraction pipeline\ntailored to the extraction of a knowledge graph comprising open-domain entities\nfrom micro-blogging posts on social media platforms. Our pipeline leverages\ndependency parsing and classifies entity relations in an unsupervised manner\nthrough hierarchical clustering over word embeddings. We provide a use case on\nextracting semantic triples from a corpus of 100 thousand tweets about digital\ntransformation and publicly release the generated knowledge graph. On the same\ndataset, we conduct two experimental evaluations, showing that the system\nproduces triples with precision over 95% and outperforms similar pipelines of\naround 5% in terms of precision, while generating a comparatively higher number\nof triples.\n","authors":["Vanni Zavarella","Sergio Consoli","Diego Reforgiato Recupero","Gianni Fenu","Simone Angioni","Davide Buscaldi","Danilo Dessì","Francesco Osborne"],"pdf_url":"https://arxiv.org/pdf/2408.14908v1.pdf","comment":"42 pages, 6 figures"},{"id":"http://arxiv.org/abs/2408.14906v1","updated":"2024-08-27T09:34:38Z","published":"2024-08-27T09:34:38Z","title":"Writing in the Margins: Better Inference Pattern for Long Context\n Retrieval","summary":" In this paper, we introduce Writing in the Margins (WiM), a new inference\npattern for Large Language Models designed to optimize the handling of long\ninput sequences in retrieval-oriented tasks. This approach leverages the\nchunked prefill of the key-value cache to perform segment-wise inference, which\nenables efficient processing of extensive contexts along with the generation\nand classification of intermediate information (\"margins\") that guide the model\ntowards specific tasks. This method increases computational overhead marginally\nwhile significantly enhancing the performance of off-the-shelf models without\nthe need for fine-tuning. 
Specifically, we observe that WiM provides an average\nenhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG)\nand more than a 30.0% increase in the F1-score for aggregation tasks (CWE).\nAdditionally, we show how the proposed pattern fits into an interactive\nretrieval design that provides end-users with ongoing updates about the\nprogress of context processing, and pinpoints the integration of relevant\ninformation into the final response. We release our implementation of WiM using\nHugging Face Transformers library at\nhttps://github.com/writer/writing-in-the-margins.\n","authors":["Melisa Russak","Umar Jamil","Christopher Bryant","Kiran Kamble","Axel Magnuson","Mateusz Russak","Waseem AlShikh"],"pdf_url":"https://arxiv.org/pdf/2408.14906v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14851v1","updated":"2024-08-27T08:08:05Z","published":"2024-08-27T08:08:05Z","title":"Graph and Sequential Neural Networks in Session-based Recommendation: A\n Survey","summary":" Recent years have witnessed the remarkable success of recommendation systems\n(RSs) in alleviating the information overload problem. As a new paradigm of\nRSs, session-based recommendation (SR) specializes in users' short-term\npreference capture and aims to provide a more dynamic and timely recommendation\nbased on the ongoing interacted actions. In this survey, we will give a\ncomprehensive overview of the recent works on SR. First, we clarify the\ndefinitions of various SR tasks and introduce the characteristics of\nsession-based recommendation against other recommendation tasks. Then, we\nsummarize the existing methods in two categories: sequential neural network\nbased methods and graph neural network (GNN) based methods. The standard\nframeworks and technical are also introduced. Finally, we discuss the\nchallenges of SR and new research directions in this area.\n","authors":["Zihao Li","Chao Yang","Yakun Chen","Xianzhi Wang","Hongxu Chen","Guandong Xu","Lina Yao","Quan Z. Sheng"],"pdf_url":"https://arxiv.org/pdf/2408.14851v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14043v2","updated":"2024-08-27T06:18:05Z","published":"2024-06-20T07:06:58Z","title":"Taxonomy-Guided Zero-Shot Recommendations with LLMs","summary":" With the emergence of large language models (LLMs) and their ability to\nperform a variety of tasks, their application in recommender systems (RecSys)\nhas shown promise. However, we are facing significant challenges when deploying\nLLMs into RecSys, such as limited prompt length, unstructured item information,\nand un-constrained generation of recommendations, leading to sub-optimal\nperformance. To address these issues, we propose a novel method using a\ntaxonomy dictionary. This method provides a systematic framework for\ncategorizing and organizing items, improving the clarity and structure of item\ninformation. By incorporating the taxonomy dictionary into LLM prompts, we\nachieve efficient token utilization and controlled feature generation, leading\nto more accurate and contextually relevant recommendations. Our Taxonomy-guided\nRecommendation (TaxRec) approach features a two-step process: one-time taxonomy\ncategorization and LLM-based recommendation, enabling zero-shot recommendations\nwithout the need for domain-specific fine-tuning. Experimental results\ndemonstrate TaxRec significantly enhances recommendation quality compared to\ntraditional zero-shot approaches, showcasing its efficacy as personal\nrecommender with LLMs. 
Code is available at\nhttps://github.com/yueqingliang1/TaxRec.\n","authors":["Yueqing Liang","Liangwei Yang","Chen Wang","Xiongxiao Xu","Philip S. Yu","Kai Shu"],"pdf_url":"https://arxiv.org/pdf/2406.14043v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.01262v3","updated":"2024-08-27T03:13:50Z","published":"2024-08-02T13:35:11Z","title":"RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework","summary":" Retrieval-Augmented Generation (RAG) systems have demonstrated their\nadvantages in alleviating the hallucination of Large Language Models (LLMs).\nExisting RAG benchmarks mainly focus on evaluating whether LLMs can correctly\nanswer the general knowledge. However, they are unable to evaluate the\neffectiveness of the RAG system in dealing with the data from different\nvertical domains. This paper introduces RAGEval, a framework for automatically\ngenerating evaluation datasets to evaluate the knowledge usage ability of\ndifferent LLMs in different scenarios. Specifically, RAGEval summarizes a\nschema from seed documents, applies the configurations to generate diverse\ndocuments, and constructs question-answering pairs according to both articles\nand configurations. We propose three novel metrics, Completeness,\nHallucination, and Irrelevance, to carefully evaluate the responses generated\nby LLMs. By benchmarking RAG models in vertical domains, RAGEval has the\nability to better evaluate the knowledge usage ability of LLMs, which avoids\nthe confusion regarding the source of knowledge in answering question in\nexisting QA datasets--whether it comes from parameterized memory or retrieval.\nThe code and dataset will be released.\n","authors":["Kunlun Zhu","Yifan Luo","Dingling Xu","Ruobing Wang","Shi Yu","Shuo Wang","Yukun Yan","Zhenghao Liu","Xu Han","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2408.01262v3.pdf","comment":"add github repo"},{"id":"http://arxiv.org/abs/2408.14743v1","updated":"2024-08-27T02:43:40Z","published":"2024-08-27T02:43:40Z","title":"Personalized Video Summarization using Text-Based Queries and\n Conditional Modeling","summary":" The proliferation of video content on platforms like YouTube and Vimeo\npresents significant challenges in efficiently locating relevant information.\nAutomatic video summarization aims to address this by extracting and presenting\nkey content in a condensed form. This thesis explores enhancing video\nsummarization by integrating text-based queries and conditional modeling to\ntailor summaries to user needs. Traditional methods often produce fixed\nsummaries that may not align with individual requirements. To overcome this, we\npropose a multi-modal deep learning approach that incorporates both textual\nqueries and visual information, fusing them at different levels of the model\narchitecture. Evaluation metrics such as accuracy and F1-score assess the\nquality of the generated summaries. The thesis also investigates improving\ntext-based query representations using contextualized word embeddings and\nspecialized attention networks. This enhances the semantic understanding of\nqueries, leading to better video summaries. To emulate human-like\nsummarization, which accounts for both visual coherence and abstract factors\nlike storyline consistency, we introduce a conditional modeling approach. This\nmethod uses multiple random variables and joint distributions to capture key\nsummarization components, resulting in more human-like and explainable\nsummaries. 
Addressing data scarcity in fully supervised learning, the thesis\nproposes a segment-level pseudo-labeling approach. This self-supervised method\ngenerates additional data, improving model performance even with limited\nhuman-labeled datasets. In summary, this research aims to enhance automatic\nvideo summarization by incorporating text-based queries, improving query\nrepresentations, introducing conditional modeling, and addressing data\nscarcity, thereby creating more effective and personalized video summaries.\n","authors":["Jia-Hong Huang"],"pdf_url":"https://arxiv.org/pdf/2408.14743v1.pdf","comment":"Ph.D. thesis, 137 pages"},{"id":"http://arxiv.org/abs/2408.14723v1","updated":"2024-08-27T01:23:49Z","published":"2024-08-27T01:23:49Z","title":"Snap and Diagnose: An Advanced Multimodal Retrieval System for\n Identifying Plant Diseases in the Wild","summary":" Plant disease recognition is a critical task that ensures crop health and\nmitigates the damage caused by diseases. A handy tool that enables farmers to\nreceive a diagnosis based on query pictures or the text description of\nsuspicious plants is in high demand for initiating treatment before potential\ndiseases spread further. In this paper, we develop a multimodal plant disease\nimage retrieval system to support disease search based on either image or text\nprompts. Specifically, we utilize the largest in-the-wild plant disease dataset\nPlantWild, which includes over 18,000 images across 89 categories, to provide a\ncomprehensive view of potential diseases relating to the query. Furthermore,\ncross-modal retrieval is achieved in the developed system, facilitated by a\nnovel CLIP-based vision-language model that encodes both disease descriptions\nand disease images into the same latent space. Built on top of the retriever,\nour retrieval system allows users to upload either plant disease images or\ndisease descriptions to retrieve the corresponding images with similar\ncharacteristics from the disease dataset to suggest candidate diseases for end\nusers' consideration.\n","authors":["Tianqi Wei","Zhi Chen","Xin Yu"],"pdf_url":"https://arxiv.org/pdf/2408.14723v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15371v1","updated":"2024-08-27T19:10:21Z","published":"2024-08-27T19:10:21Z","title":"Temporal Graph Neural Network-Powered Paper Recommendation on Dynamic\n Citation Networks","summary":" Due to the rapid growth of scientific publications, identifying all related\nreference articles in the literature has become increasingly challenging yet\nhighly demanding. Existing methods primarily assess candidate publications from\na static perspective, focusing on the content of articles and their structural\ninformation, such as citation relationships. There is a lack of research\nregarding how to account for the evolving impact among papers on their\nembeddings. Toward this goal, this paper introduces a temporal dimension to\npaper recommendation strategies. The core idea is to continuously update a\npaper's embedding when new citation relationships appear, enhancing its\nrelevance for future recommendations. Whenever a citation relationship is added\nto the literature upon the publication of a paper, the embeddings of the two\nrelated papers are updated through a Temporal Graph Neural Network (TGN). A\nlearnable memory update module based on a Recurrent Neural Network (RNN) is\nutilized to study the evolution of the embedding of a paper in order to predict\nits reference impact in a future timestamp. 
Such a TGN-based model learns a\npattern of how people's views of the paper may evolve, aiming to guide paper\nrecommendations more precisely. Extensive experiments on an open citation\nnetwork dataset, including 313,278 articles from\nhttps://paperswithcode.com/about PaperWithCode, have demonstrated the\neffectiveness of the proposed approach.\n","authors":["Junhao Shen","Mohammad Ausaf Ali Haqqani","Beichen Hu","Cheng Huang","Xihao Xie","Tsengdar Lee","Jia Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.15371v1.pdf","comment":"10 pages, 4 figures, accepted by SDU@AAAI-2024. The AAAI Workshop on\n Scientific Document Understanding (2024)"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2408.15240v1","updated":"2024-08-27T17:57:45Z","published":"2024-08-27T17:57:45Z","title":"Generative Verifiers: Reward Modeling as Next-Token Prediction","summary":" Verifiers or reward models are often used to enhance the reasoning\nperformance of large language models (LLMs). A common approach is the Best-of-N\nmethod, where N candidate solutions generated by the LLM are ranked by a\nverifier, and the best one is selected. While LLM-based verifiers are typically\ntrained as discriminative classifiers to score solutions, they do not utilize\nthe text generation capabilities of pretrained LLMs. To overcome this\nlimitation, we instead propose training verifiers using the ubiquitous\nnext-token prediction objective, jointly on verification and solution\ngeneration. Compared to standard verifiers, such generative verifiers (GenRM)\ncan benefit from several advantages of LLMs: they integrate seamlessly with\ninstruction tuning, enable chain-of-thought reasoning, and can utilize\nadditional inference-time compute via majority voting for better verification.\nWe demonstrate that when using Gemma-based verifiers on algorithmic and\ngrade-school math reasoning tasks, GenRM outperforms discriminative verifiers\nand LLM-as-a-Judge, showing a 16-64% improvement in the percentage of problems\nsolved with Best-of-N. Furthermore, we show that GenRM scales favorably across\ndataset size, model capacity, and inference-time compute.\n","authors":["Lunjun Zhang","Arian Hosseini","Hritik Bansal","Mehran Kazemi","Aviral Kumar","Rishabh Agarwal"],"pdf_url":"https://arxiv.org/pdf/2408.15240v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15237v1","updated":"2024-08-27T17:56:11Z","published":"2024-08-27T17:56:11Z","title":"The Mamba in the Llama: Distilling and Accelerating Hybrid Models","summary":" Linear RNN architectures, like Mamba, can be competitive with Transformer\nmodels in language modeling while having advantageous deployment\ncharacteristics. Given the focus on training large-scale Transformer models, we\nconsider the challenge of converting these pretrained models for deployment. We\ndemonstrate that it is feasible to distill large Transformers into linear RNNs\nby reusing the linear projection weights from attention layers with academic\nGPU resources. The resulting hybrid model, which incorporates a quarter of the\nattention layers, achieves performance comparable to the original Transformer\nin chat benchmarks and outperforms open-source hybrid Mamba models trained from\nscratch with trillions of tokens in both chat benchmarks and general\nbenchmarks. 
Moreover, we introduce a hardware-aware speculative decoding\nalgorithm that accelerates the inference speed of Mamba and hybrid models.\nOverall we show how, with limited computation resources, we can remove many of\nthe original attention layers and generate from the resulting model more\nefficiently. Our top-performing model, distilled from Llama3-8B-Instruct,\nachieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and\n7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.\n","authors":["Junxiong Wang","Daniele Paliotta","Avner May","Alexander M. Rush","Tri Dao"],"pdf_url":"https://arxiv.org/pdf/2408.15237v1.pdf","comment":"Code is open-sourced at https://github.com/jxiw/MambaInLlama"},{"id":"http://arxiv.org/abs/2408.15231v1","updated":"2024-08-27T17:48:29Z","published":"2024-08-27T17:48:29Z","title":"DCT-CryptoNets: Scaling Private Inference in the Frequency Domain","summary":" The convergence of fully homomorphic encryption (FHE) and machine learning\noffers unprecedented opportunities for private inference of sensitive data. FHE\nenables computation directly on encrypted data, safeguarding the entire machine\nlearning pipeline, including data and model confidentiality. However, existing\nFHE-based implementations for deep neural networks face significant challenges\nin computational cost, latency, and scalability, limiting their practical\ndeployment. This paper introduces DCT-CryptoNets, a novel approach that\nleverages frequency-domain learning to tackle these issues. Our method operates\ndirectly in the frequency domain, utilizing the discrete cosine transform (DCT)\ncommonly employed in JPEG compression. This approach is inherently compatible\nwith remote computing services, where images are usually transmitted and stored\nin compressed formats. DCT-CryptoNets reduces the computational burden of\nhomomorphic operations by focusing on perceptually relevant low-frequency\ncomponents. This is demonstrated by substantial latency reduction of up to\n5.3$\\times$ compared to prior work on image classification tasks, including a\nnovel demonstration of ImageNet inference within 2.5 hours, down from 12.5\nhours compared to prior work on equivalent compute resources. Moreover,\nDCT-CryptoNets improves the reliability of encrypted accuracy by reducing\nvariability (e.g., from $\\pm$2.5\\% to $\\pm$1.0\\% on ImageNet). This study\ndemonstrates a promising avenue for achieving efficient and practical\nprivacy-preserving deep learning on high resolution images seen in real-world\napplications.\n","authors":["Arjun Roy","Kaushik Roy"],"pdf_url":"https://arxiv.org/pdf/2408.15231v1.pdf","comment":"Under Review; 10 pages content, 3 pages appendix, 4 figures, 8\n tables; Code TBD"},{"id":"http://arxiv.org/abs/2408.15221v1","updated":"2024-08-27T17:33:30Z","published":"2024-08-27T17:33:30Z","title":"LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet","summary":" Recent large language model (LLM) defenses have greatly improved models'\nability to refuse harmful queries, even when adversarially attacked. However,\nLLM defenses are primarily evaluated against automated adversarial attacks in a\nsingle turn of conversation, an insufficient threat model for real-world\nmalicious use. We demonstrate that multi-turn human jailbreaks uncover\nsignificant vulnerabilities, exceeding 70% attack success rate (ASR) on\nHarmBench against defenses that report single-digit ASRs with automated\nsingle-turn attacks. 
Human jailbreaks also reveal vulnerabilities in machine\nunlearning defenses, successfully recovering dual-use biosecurity knowledge\nfrom unlearned models. We compile these results into Multi-Turn Human\nJailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks.\nWe publicly release MHJ alongside a compendium of jailbreak tactics developed\nacross dozens of commercial red teaming engagements, supporting research\ntowards stronger LLM defenses.\n","authors":["Nathaniel Li","Ziwen Han","Ian Steneker","Willow Primack","Riley Goodside","Hugh Zhang","Zifan Wang","Cristina Menghini","Summer Yue"],"pdf_url":"https://arxiv.org/pdf/2408.15221v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.13643v2","updated":"2024-08-27T17:32:39Z","published":"2022-09-27T19:16:26Z","title":"MPC-Pipe: an Efficient Pipeline Scheme for Secure Multi-party Machine\n Learning Inference","summary":" Multi-party computing (MPC) has been gaining popularity as a secure computing\nmodel over the past few years. However, prior works have demonstrated that MPC\nprotocols still pay substantial performance penalties compared to plaintext,\nparticularly when applied to ML algorithms. The overhead is due to added\ncomputation and communication costs. Prior studies, as well as our own\nanalysis, found that most MPC protocols today sequentially perform\ncommunication and computation. The participating parties must compute on their\nshares first and then perform data communication to allow the distribution of\nnew secret shares before proceeding to the next computation step. In this work,\nwe show that serialization is unnecessary, particularly in the context of ML\ncomputations (both in Convolutional neural networks and in Transformer-based\nmodels). We demonstrate that it is possible to carefully orchestrate the\ncomputation and communication steps to overlap.\n We propose MPC-Pipe, an efficient MPC system for both training and inference\nof ML workloads, which pipelines computations and communications in an MPC\nprotocol during the online phase. MPC-Pipe proposes three pipeline schemes to\noptimize the online phase of ML in the semi-honest majority adversary setting.\nWe implement MPC-Pipe by augmenting a modified version of CrypTen, which\nseparates online and offline phases. We evaluate the end-to-end system\nperformance benefits of the online phase of MPC using deep neural networks\n(VGG16, ResNet50) and Transformers using different network settings. We show\nthat MPC-Pipe can improve the throughput and latency of ML workloads.\n","authors":["Yongqin Wang","Rachit Rajat","Murali Annavaram"],"pdf_url":"https://arxiv.org/pdf/2209.13643v2.pdf","comment":"To be appeared in ASPLOS'25"},{"id":"http://arxiv.org/abs/2209.04042v3","updated":"2024-08-27T17:24:51Z","published":"2022-09-08T21:46:12Z","title":"Assessing Lower Limb Strength using Internet-of-Things Enabled Chair","summary":" This project describes the application of the technologies of Machine\nLearning and Internet-of-Things to assess the lower limb strength of\nindividuals undergoing rehabilitation or therapy. Specifically, it seeks to\nmeasure and assess the progress of individuals by sensors attached to chairs\nand processing the data through Google GPU Tensorflow CoLab. Pressure sensors\nare attached to various locations on a chair, including but not limited to the\nseating area, backrest, hand rests, and legs. 
Sensor data from the individual\nperforming both sit-to-stand transition and stand-to-sit transition provides a\ntime series dataset regarding the pressure distribution and vibratory motion on\nthe chair. The dataset and timing information can then be fed into a machine\nlearning model to estimate the relative strength and weakness during various\nphases of the movement.\n","authors":["Chelsea Yeh","Hanna Kaitlin Dy","Phillip Schodinger","Hudson Kaleb Dy"],"pdf_url":"https://arxiv.org/pdf/2209.04042v3.pdf","comment":"12 Pages"},{"id":"http://arxiv.org/abs/2406.14507v2","updated":"2024-08-27T17:19:20Z","published":"2024-06-20T17:12:20Z","title":"On Newton's Method to Unlearn Neural Networks","summary":" With the widespread applications of neural networks (NNs) trained on personal\ndata, machine unlearning has become increasingly important for enabling\nindividuals to exercise their personal data ownership, particularly the \"right\nto be forgotten\" from trained NNs. Since retraining is computationally\nexpensive, we seek approximate unlearning algorithms for NNs that return\nidentical models to the retrained oracle. While Newton's method has been\nsuccessfully used to approximately unlearn linear models, we observe that\nadapting it for NN is challenging due to degenerate Hessians that make\ncomputing Newton's update impossible. Additionally, we show that when coupled\nwith popular techniques to resolve the degeneracy, Newton's method often incurs\noffensively large norm updates and empirically degrades model performance\npost-unlearning. To address these challenges, we propose CureNewton's method, a\nprinciple approach that leverages cubic regularization to handle the Hessian\ndegeneracy effectively. The added regularizer eliminates the need for manual\nfinetuning and affords a natural interpretation within the unlearning context.\nExperiments across different models and datasets show that our method can\nachieve competitive unlearning performance to the state-of-the-art algorithm in\npractical unlearning settings, while being theoretically justified and\nefficient in running time.\n","authors":["Nhung Bui","Xinyang Lu","Rachael Hwee Ling Sim","See-Kiong Ng","Bryan Kian Hsiang Low"],"pdf_url":"https://arxiv.org/pdf/2406.14507v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.05921v2","updated":"2024-08-27T17:14:16Z","published":"2024-07-08T13:28:47Z","title":"TAPVid-3D: A Benchmark for Tracking Any Point in 3D","summary":" We introduce a new benchmark, TAPVid-3D, for evaluating the task of\nlong-range Tracking Any Point in 3D (TAP-3D). While point tracking in two\ndimensions (TAP) has many benchmarks measuring performance on real-world\nvideos, such as TAPVid-DAVIS, three-dimensional point tracking has none. To\nthis end, leveraging existing footage, we build a new benchmark for 3D point\ntracking featuring 4,000+ real-world videos, composed of three different data\nsources spanning a variety of object types, motion patterns, and indoor and\noutdoor environments. To measure performance on the TAP-3D task, we formulate a\ncollection of metrics that extend the Jaccard-based metric used in TAP to\nhandle the complexities of ambiguous depth scales across models, occlusions,\nand multi-track spatio-temporal smoothness. We manually verify a large sample\nof trajectories to ensure correct video annotations, and assess the current\nstate of the TAP-3D task by constructing competitive baselines using existing\ntracking models. 
We anticipate this benchmark will serve as a guidepost to\nimprove our ability to understand precise 3D motion and surface deformation\nfrom monocular video. Code for dataset download, generation, and model\nevaluation is available at https://tapvid3d.github.io\n","authors":["Skanda Koppula","Ignacio Rocco","Yi Yang","Joe Heyward","João Carreira","Andrew Zisserman","Gabriel Brostow","Carl Doersch"],"pdf_url":"https://arxiv.org/pdf/2407.05921v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.14053v5","updated":"2024-08-27T17:03:12Z","published":"2023-09-25T11:35:10Z","title":"Revisiting LARS for Large Batch Training Generalization of Neural\n Networks","summary":" This paper explores Large Batch Training techniques using layer-wise adaptive\nscaling ratio (LARS) across diverse settings, uncovering insights. LARS\nalgorithms with warm-up tend to be trapped in sharp minimizers early on due to\nredundant ratio scaling. Additionally, a fixed steep decline in the latter\nphase restricts deep neural networks from effectively navigating early-phase\nsharp minimizers. Building on these findings, we propose Time Varying LARS\n(TVLARS), a novel algorithm that replaces warm-up with a configurable\nsigmoid-like function for robust training in the initial phase. TVLARS promotes\ngradient exploration early on, surpassing sharp optimizers and gradually\ntransitioning to LARS for robustness in later phases. Extensive experiments\ndemonstrate that TVLARS consistently outperforms LARS and LAMB in most cases,\nwith up to 2\\% improvement in classification scenarios. Notably, in all\nself-supervised learning cases, TVLARS dominates LARS and LAMB with performance\nimprovements of up to 10\\%.\n","authors":["Khoi Do","Duong Nguyen","Hoa Nguyen","Long Tran-Thanh","Nguyen-Hoang Tran","Quoc-Viet Pham"],"pdf_url":"https://arxiv.org/pdf/2309.14053v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15198v1","updated":"2024-08-27T16:58:23Z","published":"2024-08-27T16:58:23Z","title":"Automatic 8-tissue Segmentation for 6-month Infant Brains","summary":" Numerous studies have highlighted that atypical brain development,\nparticularly during infancy and toddlerhood, is linked to an increased\nlikelihood of being diagnosed with a neurodevelopmental condition, such as\nautism. Accurate brain tissue segmentations for morphological analysis are\nessential in numerous infant studies. However, due to ongoing white matter (WM)\nmyelination changing tissue contrast in T1- and T2-weighted images, automatic\ntissue segmentation in 6-month infants is particularly difficult. On the other\nhand, manual labelling by experts is time-consuming and labor-intensive. In\nthis study, we propose the first 8-tissue segmentation pipeline for\nsix-month-old infant brains. This pipeline utilizes domain adaptation (DA)\ntechniques to leverage our longitudinal data, including neonatal images\nsegmented with the neonatal Developing Human Connectome Project structural\npipeline. 
Our pipeline takes raw 6-month images as inputs and generates the\n8-tissue segmentation as outputs, forming an end-to-end segmentation pipeline.\nThe segmented tissues include WM, gray matter (GM), cerebrospinal fluid (CSF),\nventricles, cerebellum, basal ganglia, brainstem, and hippocampus/amygdala.\nCycle-Consistent Generative Adversarial Network (CycleGAN) and Attention U-Net\nwere employed to achieve the image contrast transformation between neonatal and\n6-month images and perform tissue segmentation on the synthesized 6-month\nimages (neonatal images with 6-month intensity contrast), respectively.\nMoreover, we incorporated the segmentation outputs from Infant Brain Extraction\nand Analysis Toolbox (iBEAT) and another Attention U-Net to further enhance the\nperformance and construct the end-to-end segmentation pipeline. Our evaluation\nwith real 6-month images achieved a DICE score of 0.92, an HD95 of 1.6, and an\nASSD of 0.42.\n","authors":["Yilan Dong","Vanessa Kyriakopoulou","Irina Grigorescu","Grainne McAlonan","Dafnis Batalle","Maria Deprez"],"pdf_url":"https://arxiv.org/pdf/2408.15198v1.pdf","comment":"11 pages, 4 figures, to be published in MICCAI PIPPI workshop"},{"id":"http://arxiv.org/abs/2408.14461v2","updated":"2024-08-27T16:43:52Z","published":"2024-08-26T17:50:47Z","title":"A domain decomposition-based autoregressive deep learning model for\n unsteady and nonlinear partial differential equations","summary":" In this paper, we propose a domain-decomposition-based deep learning (DL)\nframework, named transient-CoMLSim, for accurately modeling unsteady and\nnonlinear partial differential equations (PDEs). The framework consists of two\nkey components: (a) a convolutional neural network (CNN)-based autoencoder\narchitecture and (b) an autoregressive model composed of fully connected\nlayers. Unlike existing state-of-the-art methods that operate on the entire\ncomputational domain, our CNN-based autoencoder computes a lower-dimensional\nbasis for solution and condition fields represented on subdomains. Timestepping\nis performed entirely in the latent space, generating embeddings of the\nsolution variables from the time history of embeddings of solution and\ncondition variables. This approach not only reduces computational complexity\nbut also enhances scalability, making it well-suited for large-scale\nsimulations. Furthermore, to improve the stability of our rollouts, we employ a\ncurriculum learning (CL) approach during the training of the autoregressive\nmodel. The domain-decomposition strategy enables scaling to out-of-distribution\ndomain sizes while maintaining the accuracy of predictions -- a feature not\neasily integrated into popular DL-based approaches for physics simulations. We\nbenchmark our model against two widely-used DL architectures, Fourier Neural\nOperator (FNO) and U-Net, and demonstrate that our framework outperforms them\nin terms of accuracy, extrapolation to unseen timesteps, and stability for a\nwide range of use cases.\n","authors":["Sheel Nidhan","Haoliang Jiang","Lalit Ghule","Clancy Umphrey","Rishikesh Ranade","Jay Pathak"],"pdf_url":"https://arxiv.org/pdf/2408.14461v2.pdf","comment":"26 pages"},{"id":"http://arxiv.org/abs/2401.13054v3","updated":"2024-08-27T16:42:26Z","published":"2024-01-23T19:26:24Z","title":"Frustrated Random Walks: A Fast Method to Compute Node Distances on\n Hypergraphs","summary":" A hypergraph is a generalization of a graph that arises naturally when\nattribute-sharing among entities is considered. 
Compared to graphs, hypergraphs\nhave the distinct advantage that they contain explicit communities and are more\nconvenient to manipulate. An open problem in hypergraph research is how to\naccurately and efficiently calculate node distances on hypergraphs. Estimating\nnode distances enables us to find a node's nearest neighbors, which has\nimportant applications in such areas as recommender system, targeted\nadvertising, etc. In this paper, we propose using expected hitting times of\nrandom walks to compute hypergraph node distances. We note that simple random\nwalks (SRW) cannot accurately compute node distances on highly complex\nreal-world hypergraphs, which motivates us to introduce frustrated random walks\n(FRW) for this task. We further benchmark our method against DeepWalk, and show\nthat while the latter can achieve comparable results, FRW has a distinct\ncomputational advantage in cases where the number of targets is fairly small.\nFor such cases, we show that FRW runs in significantly shorter time than\nDeepWalk. Finally, we analyze the time complexity of our method, and show that\nfor large and sparse hypergraphs, the complexity is approximately linear,\nrendering it superior to the DeepWalk alternative.\n","authors":["Enzhi Li","Scott Nickleach","Bilal Fadlallah"],"pdf_url":"https://arxiv.org/pdf/2401.13054v3.pdf","comment":"15 pages, 6 figures"},{"id":"http://arxiv.org/abs/2408.15183v1","updated":"2024-08-27T16:35:06Z","published":"2024-08-27T16:35:06Z","title":"On latent dynamics learning in nonlinear reduced order modeling","summary":" In this work, we present the novel mathematical framework of latent dynamics\nmodels (LDMs) for reduced order modeling of parameterized nonlinear\ntime-dependent PDEs. Our framework casts this latter task as a nonlinear\ndimensionality reduction problem, while constraining the latent state to evolve\naccordingly to an (unknown) dynamical system. A time-continuous setting is\nemployed to derive error and stability estimates for the LDM approximation of\nthe full order model (FOM) solution. We analyze the impact of using an explicit\nRunge-Kutta scheme in the time-discrete setting, resulting in the\n$\\Delta\\text{LDM}$ formulation, and further explore the learnable setting,\n$\\Delta\\text{LDM}_\\theta$, where deep neural networks approximate the discrete\nLDM components, while providing a bounded approximation error with respect to\nthe FOM. Moreover, we extend the concept of parameterized Neural ODE - recently\nproposed as a possible way to build data-driven dynamical systems with varying\ninput parameters - to be a convolutional architecture, where the input\nparameters information is injected by means of an affine modulation mechanism,\nwhile designing a convolutional autoencoder neural network able to retain\nspatial-coherence, thus enhancing interpretability at the latent level.\nNumerical experiments, including the Burgers' and the\nadvection-reaction-diffusion equations, demonstrate the framework's ability to\nobtain, in a multi-query context, a time-continuous approximation of the FOM\nsolution, thus being able to query the LDM approximation at any given time\ninstance while retaining a prescribed level of accuracy. 
Our findings highlight\nthe remarkable potential of the proposed LDMs, representing a mathematically\nrigorous framework to enhance the accuracy and approximation capabilities of\nreduced order modeling for time-dependent parameterized PDEs.\n","authors":["Nicola Farenga","Stefania Fresca","Simone Brivio","Andrea Manzoni"],"pdf_url":"https://arxiv.org/pdf/2408.15183v1.pdf","comment":"43 pages"},{"id":"http://arxiv.org/abs/2408.15173v1","updated":"2024-08-27T16:11:20Z","published":"2024-08-27T16:11:20Z","title":"Exploiting Approximate Symmetry for Efficient Multi-Agent Reinforcement\n Learning","summary":" Mean-field games (MFG) have become significant tools for solving large-scale\nmulti-agent reinforcement learning problems under symmetry. However, the\nassumption of exact symmetry limits the applicability of MFGs, as real-world\nscenarios often feature inherent heterogeneity. Furthermore, most works on MFG\nassume access to a known MFG model, which might not be readily available for\nreal-world finite-agent games. In this work, we broaden the applicability of\nMFGs by providing a methodology to extend any finite-player, possibly\nasymmetric, game to an \"induced MFG\". First, we prove that $N$-player dynamic\ngames can be symmetrized and smoothly extended to the infinite-player continuum\nvia explicit Kirszbraun extensions. Next, we propose the notion of\n$\\alpha,\\beta$-symmetric games, a new class of dynamic population games that\nincorporate approximate permutation invariance. For $\\alpha,\\beta$-symmetric\ngames, we establish explicit approximation bounds, demonstrating that a Nash\npolicy of the induced MFG is an approximate Nash of the $N$-player dynamic\ngame. We show that TD learning converges up to a small bias using trajectories\nof the $N$-player game with finite-sample guarantees, permitting symmetrized\nlearning without building an explicit MFG model. Finally, for certain games\nsatisfying monotonicity, we prove a sample complexity of\n$\\widetilde{\\mathcal{O}}(\\varepsilon^{-6})$ for the $N$-agent game to learn an\n$\\varepsilon$-Nash up to symmetrization bias. Our theory is supported by\nevaluations on MARL benchmarks with thousands of agents.\n","authors":["Batuhan Yardim","Niao He"],"pdf_url":"https://arxiv.org/pdf/2408.15173v1.pdf","comment":"5 figures"},{"id":"http://arxiv.org/abs/2408.15165v1","updated":"2024-08-27T16:03:18Z","published":"2024-08-27T16:03:18Z","title":"Latent Ewald summation for machine learning of long-range interactions","summary":" Machine learning interatomic potentials (MLIPs) often neglect long-range\ninteractions, such as electrostatic and dispersion forces. In this work, we\nintroduce a straightforward and efficient method to account for long-range\ninteractions by learning a latent variable from local atomic descriptors and\napplying an Ewald summation to this variable. We demonstrate that in systems\nincluding charged, polar, or apolar molecular dimers, bulk water, and\nwater-vapor interface, standard short-ranged MLIPs can lead to unphysical\npredictions even when employing message passing. 
The long-range models\neffectively eliminate these artifacts, with only about twice the computational\ncost of short-range MLIPs.\n","authors":["Bingqing Cheng"],"pdf_url":"https://arxiv.org/pdf/2408.15165v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15158v1","updated":"2024-08-27T15:52:52Z","published":"2024-08-27T15:52:52Z","title":"Delay as Payoff in MAB","summary":" In this paper, we investigate a variant of the classical stochastic\nMulti-armed Bandit (MAB) problem, where the payoff received by an agent (either\ncost or reward) is both delayed, and directly corresponds to the magnitude of\nthe delay. This setting models faithfully many real world scenarios such as the\ntime it takes for a data packet to traverse a network given a choice of route\n(where delay serves as the agent's cost); or a user's time spent on a web page\ngiven a choice of content (where delay serves as the agent's reward).\n Our main contributions are tight upper and lower bounds for both the cost and\nreward settings. For the case that delays serve as costs, which we are the\nfirst to consider, we prove optimal regret that scales as $\\sum_{i:\\Delta_i >\n0}\\frac{\\log T}{\\Delta_i} + d^*$, where $T$ is the maximal number of steps,\n$\\Delta_i$ are the sub-optimality gaps and $d^*$ is the minimal expected delay\namongst arms. For the case that delays serves as rewards, we show optimal\nregret of $\\sum_{i:\\Delta_i > 0}\\frac{\\log T}{\\Delta_i} + \\bar{d}$, where $\\bar\nd$ is the second maximal expected delay. These improve over the regret in the\ngeneral delay-dependent payoff setting, which scales as $\\sum_{i:\\Delta_i >\n0}\\frac{\\log T}{\\Delta_i} + D$, where $D$ is the maximum possible delay. Our\nregret bounds highlight the difference between the cost and reward scenarios,\nshowing that the improvement in the cost scenario is more significant than for\nthe reward. Finally, we accompany our theoretical results with an empirical\nevaluation.\n","authors":["Ofir Schlisselberg","Ido Cohen","Tal Lancewicki","Yishay Mansour"],"pdf_url":"https://arxiv.org/pdf/2408.15158v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.11075v4","updated":"2024-08-27T15:39:53Z","published":"2024-07-13T04:29:36Z","title":"A Comprehensive Survey on Kolmogorov Arnold Networks (KAN)","summary":" Through this comprehensive survey of Kolmogorov-Arnold Networks(KAN), we have\ngained a thorough understanding of its theoretical foundation, architectural\ndesign, application scenarios, and current research progress. KAN, with its\nunique architecture and flexible activation functions, excels in handling\ncomplex data patterns and nonlinear relationships, demonstrating wide-ranging\napplication potential. While challenges remain, KAN is poised to pave the way\nfor innovative solutions in various fields, potentially revolutionizing how we\napproach complex computational problems.\n","authors":["Yuntian Hou","Di Zhang"],"pdf_url":"https://arxiv.org/pdf/2407.11075v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08669v2","updated":"2024-08-27T15:36:59Z","published":"2024-01-08T21:13:07Z","title":"Deep Reinforcement Learning for Multi-Truck Vehicle Routing Problems\n with Multi-Leg Demand Routes","summary":" Deep reinforcement learning (RL) has been shown to be effective in producing\napproximate solutions to some vehicle routing problems (VRPs), especially when\nusing policies generated by encoder-decoder attention mechanisms. 
While these\ntechniques have been quite successful for relatively simple problem instances,\nthere are still under-researched and highly complex VRP variants for which no\neffective RL method has been demonstrated. In this work we focus on one such\nVRP variant, which contains multiple trucks and multi-leg routing requirements.\nIn these problems, demand is required to move along sequences of nodes, instead\nof just from a start node to an end node. With the goal of making deep RL a\nviable strategy for real-world industrial-scale supply chain logistics, we\ndevelop new extensions to existing encoder-decoder attention models which allow\nthem to handle multiple trucks and multi-leg routing requirements. Our models\nhave the advantage that they can be trained for a small number of trucks and\nnodes, and then embedded into a large supply chain to yield solutions for\nlarger numbers of trucks and nodes. We test our approach on a real supply chain\nenvironment arising in the operations of Japanese automotive parts manufacturer\nAisin Corporation, and find that our algorithm outperforms Aisin's previous\nbest solution.\n","authors":["Joshua Levin","Randall Correll","Takanori Ide","Takafumi Suzuki","Takaho Saito","Alan Arai"],"pdf_url":"https://arxiv.org/pdf/2401.08669v2.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2408.10280v2","updated":"2024-08-27T15:34:49Z","published":"2024-08-18T12:18:56Z","title":"NoRA: Nested Low-Rank Adaptation for Efficient Fine-Tuning Large Models","summary":" In this paper, we introduce Nested Low-Rank Adaptation (NoRA), a novel\napproach to parameter-efficient fine-tuning that extends the capabilities of\nLow-Rank Adaptation (LoRA) techniques. Vanilla LoRA overlooks pre-trained\nweight inheritance and still requires fine-tuning numerous parameters. To\naddress these issues, our NoRA adopts a dual-layer nested structure with\nSingular Value Decomposition (SVD), effectively leveraging original matrix\nknowledge while reducing tunable parameters. Specifically, NoRA freezes the\nouter LoRA weights and utilizes an inner LoRA design, providing enhanced\ncontrol over model optimization. This approach allows the model to more\nprecisely adapt to specific tasks while maintaining a compact parameter space.\nBy freezing outer LoRA weights and using an inner LoRA design, NoRA enables\nprecise task adaptation with a compact parameter space. Evaluations on tasks\nincluding commonsense reasoning with large language models, fine-tuning\nvision-language models, and subject-driven generation demonstrate NoRA's\nsuperiority over LoRA and its variants. Code will be released upon acceptance.\n","authors":["Cheng Lin","Lujun Li","Dezhi Li","Jie Zou","Wei Xue","Yike Guo"],"pdf_url":"https://arxiv.org/pdf/2408.10280v2.pdf","comment":"Work in progress, revisions ongoing"},{"id":"http://arxiv.org/abs/2311.07596v2","updated":"2024-08-27T15:34:43Z","published":"2023-11-10T11:40:24Z","title":"Graph GOSPA metric: a metric to measure the discrepancy between graphs\n of different sizes","summary":" This paper proposes a metric to measure the dissimilarity between graphs that\nmay have a different number of nodes. The proposed metric extends the\ngeneralised optimal subpattern assignment (GOSPA) metric, which is a metric for\nsets, to graphs. The proposed graph GOSPA metric includes costs associated with\nnode attribute errors for properly assigned nodes, missed and false nodes and\nedge mismatches between graphs. 
The computation of this metric is based on\nfinding the optimal assignments between nodes in the two graphs, with the\npossibility of leaving some of the nodes unassigned. We also propose a lower\nbound for the metric, which is also a metric for graphs and is computable in\npolynomial time using linear programming. The metric is first derived for\nundirected unweighted graphs and it is then extended to directed and weighted\ngraphs. The properties of the metric are demonstrated via simulated and\nempirical datasets.\n","authors":["Jinhao Gu","Ángel F. García-Fernández","Robert E. Firth","Lennart Svensson"],"pdf_url":"https://arxiv.org/pdf/2311.07596v2.pdf","comment":"Accepted in IEEE Transactions on Signal Processing. The code is\n available at https://github.com/JinhaoGu/The-graph-GOSPA-metric"},{"id":"http://arxiv.org/abs/2405.14848v2","updated":"2024-08-27T15:28:33Z","published":"2024-05-23T17:56:38Z","title":"Local Causal Discovery for Structural Evidence of Direct Discrimination","summary":" Identifying the causal pathways of unfairness is a critical objective in\nimproving policy design and algorithmic decision-making. Prior work in causal\nfairness analysis often requires knowledge of the causal graph, hindering\npractical applications in complex or low-knowledge domains. Moreover, global\ndiscovery methods that learn causal structure from data can result in unstable\nperformance with finite samples, potentially leading to contradictory fairness\nconclusions. To mitigate these issues, we introduce local discovery for direct\ndiscrimination (LD3): a method that uncovers structural evidence of direct\ndiscrimination by identifying the causal parents of an outcome variable. LD3\nperforms a linear number of conditional independence tests relative to variable\nset size, and allows for latent confounding under the sufficient condition that\nno parent of the outcome is latent. We show that LD3 returns a valid adjustment\nset (VAS) under a new graphical criterion for the weighted controlled direct\neffect, a qualitative indicator of direct discrimination. LD3 limits\nunnecessary adjustment, providing interpretable VAS for assessing unfairness.\nWe use LD3 to analyze causal fairness in two complex decision systems: criminal\nrecidivism prediction and liver transplant allocation. LD3 was more\ntime-efficient and returned more plausible results on real-world data than\nbaselines, which took 46x to 5870x longer to execute.\n","authors":["Jacqueline Maasch","Kyra Gan","Violet Chen","Agni Orfanoudaki","Nil-Jana Akpinar","Fei Wang"],"pdf_url":"https://arxiv.org/pdf/2405.14848v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15138v1","updated":"2024-08-27T15:23:09Z","published":"2024-08-27T15:23:09Z","title":"How transformers learn structured data: insights from hierarchical\n filtering","summary":" We introduce a hierarchical filtering procedure for generative models of\nsequences on trees, enabling control over the range of positional correlations\nin the data. Leveraging this controlled setting, we provide evidence that\nvanilla encoder-only transformer architectures can implement the optimal Belief\nPropagation algorithm on both root classification and masked language modeling\ntasks. Correlations at larger distances corresponding to increasing layers of\nthe hierarchy are sequentially included as the network is trained. We analyze\nhow the transformer layers succeed by focusing on attention maps from models\ntrained with varying degrees of filtering. 
These attention maps show clear\nevidence for iterative hierarchical reconstruction of correlations, and we can\nrelate these observations to a plausible implementation of the exact inference\nalgorithm for the network sizes considered.\n","authors":["Jerome Garnier-Brun","Marc Mézard","Emanuele Moscato","Luca Saglietti"],"pdf_url":"https://arxiv.org/pdf/2408.15138v1.pdf","comment":"18 pages, 9 figures"},{"id":"http://arxiv.org/abs/2408.15136v1","updated":"2024-08-27T15:19:07Z","published":"2024-08-27T15:19:07Z","title":"Low-Budget Simulation-Based Inference with Bayesian Neural Networks","summary":" Simulation-based inference methods have been shown to be inaccurate in the\ndata-poor regime, when training simulations are limited or expensive. Under\nthese circumstances, the inference network is particularly prone to\noverfitting, and using it without accounting for the computational uncertainty\narising from the lack of identifiability of the network weights can lead to\nunreliable results. To address this issue, we propose using Bayesian neural\nnetworks in low-budget simulation-based inference, thereby explicitly\naccounting for the computational uncertainty of the posterior approximation. We\ndesign a family of Bayesian neural network priors that are tailored for\ninference and show that they lead to well-calibrated posteriors on tested\nbenchmarks, even when as few as $O(10)$ simulations are available. This opens\nup the possibility of performing reliable simulation-based inference using very\nexpensive simulators, as we demonstrate on a problem from the field of\ncosmology where single simulations are computationally expensive. We show that\nBayesian neural networks produce informative and well-calibrated posterior\nestimates with only a few hundred simulations.\n","authors":["Arnaud Delaunoy","Maxence de la Brassinne Bonardeaux","Siddharth Mishra-Sharma","Gilles Louppe"],"pdf_url":"https://arxiv.org/pdf/2408.15136v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.07531v2","updated":"2024-08-27T15:16:06Z","published":"2024-08-14T13:03:41Z","title":"Development of a Large Language Model-based Multi-Agent Clinical\n Decision Support System for Korean Triage and Acuity Scale (KTAS)-Based\n Triage and Treatment Planning in Emergency Departments","summary":" Emergency department (ED) overcrowding and the complexity of rapid\ndecision-making in critical care settings pose significant challenges to\nhealthcare systems worldwide. While clinical decision support systems (CDSS)\nhave shown promise, the integration of large language models (LLMs) offers new\npossibilities for enhancing triage accuracy and clinical decision-making. This\nstudy presents an LLM-driven CDSS designed to assist ED physicians and nurses\nin patient triage, treatment planning, and overall emergency care management.\n We developed a multi-agent CDSS utilizing Llama-3-70b as the base LLM,\norchestrated by CrewAI and Langchain. The system comprises four AI agents\nemulating key ED roles: Triage Nurse, Emergency Physician, Pharmacist, and ED\nCoordinator. It incorporates the Korean Triage and Acuity Scale (KTAS) for\ntriage assessment and integrates with the RxNorm API for medication management.\n The model was evaluated using the Asclepius dataset, with performance\nassessed by a clinical emergency medicine specialist. The CDSS demonstrated\nhigh accuracy in triage decision-making compared to the baseline of a\nsingle-agent system. 
Furthermore, the system exhibited strong performance in\ncritical areas, including primary diagnosis, critical findings identification,\ndisposition decision-making, treatment planning, and resource allocation.\n Our multi-agent CDSS demonstrates significant potential for supporting\ncomprehensive emergency care management. By leveraging state-of-the-art AI\ntechnologies, this system offers a scalable and adaptable tool that could\nenhance emergency medical care delivery, potentially alleviating ED\novercrowding and improving patient outcomes. This work contributes to the\ngrowing field of AI applications in emergency medicine and offers a promising\ndirection for future research and clinical implementation.\n","authors":["Seungjun Han","Wongyung Choi"],"pdf_url":"https://arxiv.org/pdf/2408.07531v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15133v1","updated":"2024-08-27T15:13:06Z","published":"2024-08-27T15:13:06Z","title":"Using LLMs for Explaining Sets of Counterfactual Examples to Final Users","summary":" Causality is vital for understanding true cause-and-effect relationships\nbetween variables within predictive models, rather than relying on mere\ncorrelations, making it highly relevant in the field of Explainable AI. In an\nautomated decision-making scenario, causal inference methods can analyze the\nunderlying data-generation process, enabling explanations of a model's decision\nby manipulating features and creating counterfactual examples. These\ncounterfactuals explore hypothetical scenarios where a minimal number of\nfactors are altered, providing end-users with valuable information on how to\nchange their situation. However, interpreting a set of multiple counterfactuals\ncan be challenging for end-users who are not used to analyzing raw data\nrecords. In our work, we propose a novel multi-step pipeline that uses\ncounterfactuals to generate natural language explanations of actions that will\nlead to a change in outcome in classifiers of tabular data using LLMs. This\npipeline is designed to guide the LLM through smaller tasks that mimic human\nreasoning when explaining a decision based on counterfactual cases. We\nconducted various experiments using a public dataset and proposed a method of\nclosed-loop evaluation to assess the coherence of the final explanation with\nthe counterfactuals, as well as the quality of the content. Results are\npromising, although further experiments with other datasets and human\nevaluations should be carried out.\n","authors":["Arturo Fredes","Jordi Vitria"],"pdf_url":"https://arxiv.org/pdf/2408.15133v1.pdf","comment":"Presented as a poster in the 2nd Workshop on Causal Inference and\n Machine Learning in Practice at KDD 2024"},{"id":"http://arxiv.org/abs/2408.15128v1","updated":"2024-08-27T15:08:06Z","published":"2024-08-27T15:08:06Z","title":"Evaluating the Energy Consumption of Machine Learning: Systematic\n Literature Review and Experiments","summary":" Monitoring, understanding, and optimizing the energy consumption of Machine\nLearning (ML) are various reasons why it is necessary to evaluate the energy\nusage of ML. However, there exists no universal tool that can answer this\nquestion for all use cases, and there may even be disagreement on how to\nevaluate energy consumption for a specific use case. Tools and methods are\nbased on different approaches, each with their own advantages and drawbacks,\nand they need to be mapped out and explained in order to select the most\nsuitable one for a given situation. 
We address this challenge through two\napproaches. First, we conduct a systematic literature review of all tools and\nmethods that permit to evaluate the energy consumption of ML (both at training\nand at inference), irrespective of whether they were originally designed for\nmachine learning or general software. Second, we develop and use an\nexperimental protocol to compare a selection of these tools and methods. The\ncomparison is both qualitative and quantitative on a range of ML tasks of\ndifferent nature (vision, language) and computational complexity. The\nsystematic literature review serves as a comprehensive guide for understanding\nthe array of tools and methods used in evaluating energy consumption of ML, for\nvarious use cases going from basic energy monitoring to consumption\noptimization. Two open-source repositories are provided for further\nexploration. The first one contains tools that can be used to replicate this\nwork or extend the current review. The second repository houses the\nexperimental protocol, allowing users to augment the protocol with new ML\ncomputing tasks and additional energy evaluation tools.\n","authors":["Charlotte Rodriguez","Laura Degioanni","Laetitia Kameni","Richard Vidal","Giovanni Neglia"],"pdf_url":"https://arxiv.org/pdf/2408.15128v1.pdf","comment":"52 pages,"},{"id":"http://arxiv.org/abs/2407.16828v2","updated":"2024-08-27T15:07:28Z","published":"2024-07-23T20:38:23Z","title":"Pareto Front Approximation for Multi-Objective Session-Based Recommender\n Systems","summary":" This work introduces MultiTRON, an approach that adapts Pareto front\napproximation techniques to multi-objective session-based recommender systems\nusing a transformer neural network. Our approach optimizes trade-offs between\nkey metrics such as click-through and conversion rates by training on sampled\npreference vectors. A significant advantage is that after training, a single\nmodel can access the entire Pareto front, allowing it to be tailored to meet\nthe specific requirements of different stakeholders by adjusting an additional\ninput vector that weights the objectives. We validate the model's performance\nthrough extensive offline and online evaluation. For broader application and\nresearch, the source code is made available at\nhttps://github.com/otto-de/MultiTRON. The results confirm the model's ability\nto manage multiple recommendation objectives effectively, offering a flexible\ntool for diverse business needs.\n","authors":["Timo Wilm","Philipp Normann","Felix Stepprath"],"pdf_url":"https://arxiv.org/pdf/2407.16828v2.pdf","comment":"Accepted at the Eighteenth ACM Conference on Recommender Systems\n (RecSys '24)"},{"id":"http://arxiv.org/abs/2408.15126v1","updated":"2024-08-27T15:07:27Z","published":"2024-08-27T15:07:27Z","title":"Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of\n Peptides","summary":" Molecular Dynamics (MD) simulations are irreplaceable and ubiquitous in\nfields of materials science, chemistry, pharmacology just to name a few.\nConventional MD simulations are plagued by numerical stability as well as long\nequilibration time issues, which limits broader applications of MD simulations.\nRecently, a surge of deep learning approaches have been devised for\ntime-coarsened dynamics, which learns the state transition mechanism over much\nlarger time scales to overcome these limitations. 
However, only a few methods\ntarget the underlying Boltzmann distribution by resampling techniques, where\nproposals are rarely accepted as new states with low efficiency. In this work,\nwe propose a force-guided bridge matching model, FBM, a novel framework that\nfirst incorporates physical priors into bridge matching for full-atom\ntime-coarsened dynamics. With the guidance of our well-designed intermediate\nforce field, FBM is feasible to target the Boltzmann-like distribution by\ndirect inference without extra steps. Experiments on small peptides verify our\nsuperiority in terms of comprehensive metrics and demonstrate transferability\nto unseen peptide systems.\n","authors":["Ziyang Yu","Wenbing Huang","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2408.15126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13960v2","updated":"2024-08-27T15:06:17Z","published":"2024-08-25T23:48:11Z","title":"Time Series Analysis for Education: Methods, Applications, and Future\n Directions","summary":" Recent advancements in the collection and analysis of sequential educational\ndata have brought time series analysis to a pivotal position in educational\nresearch, highlighting its essential role in facilitating data-driven\ndecision-making. However, there is a lack of comprehensive summaries that\nconsolidate these advancements. To the best of our knowledge, this paper is the\nfirst to provide a comprehensive review of time series analysis techniques\nspecifically within the educational context. We begin by exploring the\nlandscape of educational data analytics, categorizing various data sources and\ntypes relevant to education. We then review four prominent time series\nmethods-forecasting, classification, clustering, and anomaly\ndetection-illustrating their specific application points in educational\nsettings. Subsequently, we present a range of educational scenarios and\napplications, focusing on how these methods are employed to address diverse\neducational tasks, which highlights the practical integration of multiple time\nseries methods to solve complex educational problems. Finally, we conclude with\na discussion on future directions, including personalized learning analytics,\nmultimodal data fusion, and the role of large language models (LLMs) in\neducational time series. The contributions of this paper include a detailed\ntaxonomy of educational data, a synthesis of time series techniques with\nspecific educational applications, and a forward-looking perspective on\nemerging trends and future research opportunities in educational analysis. The\nrelated papers and resources are available and regularly updated at the project\npage.\n","authors":["Shengzhong Mao","Chaoli Zhang","Yichi Song","Jindong Wang","Xiao-Jun Zeng","Zenglin Xu","Qingsong Wen"],"pdf_url":"https://arxiv.org/pdf/2408.13960v2.pdf","comment":"24 pages, 3 figures, 6 tables, project page: see\n https://github.com/ai-for-edu/time-series-analysis-for-education"},{"id":"http://arxiv.org/abs/2408.05892v3","updated":"2024-08-27T15:00:53Z","published":"2024-08-12T02:10:18Z","title":"Polyp SAM 2: Advancing Zero shot Polyp Segmentation in Colorectal Cancer\n Detection","summary":" Polyp segmentation plays a crucial role in the early detection and diagnosis\nof colorectal cancer. However, obtaining accurate segmentations often requires\nlabor-intensive annotations and specialized models. 
Recently, Meta AI Research\nreleased a general Segment Anything Model 2 (SAM 2), which has demonstrated\npromising performance in several segmentation tasks. In this manuscript, we\nevaluate the performance of SAM 2 in segmenting polyps under various prompted\nsettings. We hope this report will provide insights to advance the field of\npolyp segmentation and promote more interesting work in the future. This\nproject is publicly available at https://github.com/sajjad-sh33/Polyp-SAM-2.\n","authors":["Mobina Mansoori","Sajjad Shahabodini","Jamshid Abouei","Konstantinos N. Plataniotis","Arash Mohammadi"],"pdf_url":"https://arxiv.org/pdf/2408.05892v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12446v2","updated":"2024-08-27T14:59:25Z","published":"2024-08-22T14:41:49Z","title":"EX-DRL: Hedging Against Heavy Losses with EXtreme Distributional\n Reinforcement Learning","summary":" Recent advancements in Distributional Reinforcement Learning (DRL) for\nmodeling loss distributions have shown promise in developing hedging strategies\nin derivatives markets. A common approach in DRL involves learning the\nquantiles of loss distributions at specified levels using Quantile Regression\n(QR). This method is particularly effective in option hedging due to its direct\nquantile-based risk assessment, such as Value at Risk (VaR) and Conditional\nValue at Risk (CVaR). However, these risk measures depend on the accurate\nestimation of extreme quantiles in the loss distribution's tail, which can be\nimprecise in QR-based DRL due to the rarity and extremity of tail data, as\nhighlighted in the literature. To address this issue, we propose EXtreme DRL\n(EX-DRL), which enhances extreme quantile prediction by modeling the tail of\nthe loss distribution with a Generalized Pareto Distribution (GPD). This method\nintroduces supplementary data to mitigate the scarcity of extreme quantile\nobservations, thereby improving estimation accuracy through QR. Comprehensive\nexperiments on gamma hedging options demonstrate that EX-DRL improves existing\nQR-based models by providing more precise estimates of extreme quantiles,\nthereby improving the computation and reliability of risk metrics for complex\nfinancial risk management.\n","authors":["Parvin Malekzadeh","Zissis Poulos","Jacky Chen","Zeyu Wang","Konstantinos N. Plataniotis"],"pdf_url":"https://arxiv.org/pdf/2408.12446v2.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2408.15114v1","updated":"2024-08-27T14:54:33Z","published":"2024-08-27T14:54:33Z","title":"Few-Shot Unsupervised Implicit Neural Shape Representation Learning with\n Spatial Adversaries","summary":" Implicit Neural Representations have gained prominence as a powerful\nframework for capturing complex data modalities, encompassing a wide range from\n3D shapes to images and audio. Within the realm of 3D shape representation,\nNeural Signed Distance Functions (SDF) have demonstrated remarkable potential\nin faithfully encoding intricate shape geometry. However, learning SDFs from\nsparse 3D point clouds in the absence of ground truth supervision remains a\nvery challenging task. 
While recent methods rely on smoothness priors to\nregularize the learning, our method introduces a regularization term that\nleverages adversarial samples around the shape to improve the learned SDFs.\nThrough extensive experiments and evaluations, we illustrate the efficacy of\nour proposed method, highlighting its capacity to improve SDF learning with\nrespect to baselines and the state-of-the-art using synthetic and real data.\n","authors":["Amine Ouasfi","Adnane Boukhayma"],"pdf_url":"https://arxiv.org/pdf/2408.15114v1.pdf","comment":"ICML 2024"},{"id":"http://arxiv.org/abs/2403.20324v3","updated":"2024-08-27T14:53:41Z","published":"2024-03-29T17:51:50Z","title":"Localising the Seizure Onset Zone from Single-Pulse Electrical\n Stimulation Responses with a CNN Transformer","summary":" Epilepsy is one of the most common neurological disorders, often requiring\nsurgical intervention when medication fails to control seizures. For effective\nsurgical outcomes, precise localisation of the epileptogenic focus - often\napproximated through the Seizure Onset Zone (SOZ) - is critical yet remains a\nchallenge. Active probing through electrical stimulation is already standard\nclinical practice for identifying epileptogenic areas. Our study advances the\napplication of deep learning for SOZ localisation using Single-Pulse Electrical\nStimulation (SPES) responses, with two key contributions. Firstly, we implement\nan existing deep learning model to compare two SPES analysis paradigms:\ndivergent and convergent. These paradigms evaluate outward and inward effective\nconnections, respectively. We assess the generalisability of these models to\nunseen patients and electrode placements using held-out test sets. Our findings\nreveal a notable improvement in moving from a divergent (AUROC: 0.574) to a\nconvergent approach (AUROC: 0.666), marking the first application of the latter\nin this context. Secondly, we demonstrate the efficacy of CNN Transformers with\ncross-channel attention in handling heterogeneous electrode placements,\nincreasing the AUROC to 0.730. These findings represent a significant step in\nmodelling patient-specific intracranial EEG electrode placements in SPES.\nFuture work will explore integrating these models into clinical decision-making\nprocesses to bridge the gap between deep learning research and practical\nhealthcare applications.\n","authors":["Jamie Norris","Aswin Chari","Dorien van Blooijs","Gerald Cooray","Karl Friston","Martin Tisdall","Richard Rosch"],"pdf_url":"https://arxiv.org/pdf/2403.20324v3.pdf","comment":"21 pages, 6 figures, accepted at Machine Learning for Healthcare 2024"},{"id":"http://arxiv.org/abs/2311.07537v2","updated":"2024-08-27T14:34:26Z","published":"2023-11-13T18:23:46Z","title":"Estimating optical vegetation indices and biophysical variables for\n temperate forests with Sentinel-1 SAR data using machine learning techniques:\n A case study for Czechia","summary":" Current optical vegetation indices (VIs) for monitoring forest ecosystems are\nwell established and widely used in various applications, but can be limited by\natmospheric effects such as clouds. In contrast, synthetic aperture radar (SAR)\ndata can offer insightful and systematic forest monitoring with complete time\nseries (TS) due to signal penetration through clouds and day and night image\nacquisitions. This study aims to address the limitations of optical satellite\ndata by using SAR data as an alternative for estimating optical VIs for forests\nthrough machine learning (ML). 
While this approach is less direct and likely\nonly feasible through the power of ML, it raises the scientific question of\nwhether enough relevant information is contained in the SAR signal to\naccurately estimate VIs. This work covers the estimation of TS of four VIs\n(LAI, FAPAR, EVI and NDVI) using multitemporal Sentinel-1 SAR and ancillary\ndata. The study focused on both healthy and disturbed temperate forest areas in\nCzechia for the year 2021, while ground truth labels were generated from Sentinel-2\nmultispectral data. This was enabled by creating a paired multi-modal TS\ndataset in Google Earth Engine (GEE), including temporally and spatially\naligned Sentinel-1, Sentinel-2, DEM, weather and land cover datasets. The\ninclusion of DEM-derived auxiliary features and additional meteorological\ninformation further improved the results. In the comparison of ML models, the\ntraditional ML algorithms, RFR and XGBoost, slightly outperformed the AutoML\napproach, auto-sklearn, for all VIs, achieving high accuracies ($R^2$ between\n70-86%) and low errors (MAE of 0.055-0.29). In general, up to 240 measurements\nper year and a spatial resolution of 20 m can be achieved using estimated\nSAR-based VIs with high accuracy. A great advantage of the SAR-based VI is the\nability to detect abrupt forest changes with sub-weekly temporal accuracy.\n","authors":["Daniel Paluba","Bertrand Le Saux","Přemysl Stych"],"pdf_url":"https://arxiv.org/pdf/2311.07537v2.pdf","comment":"Revised version of the preprint, based on comments from the\n reviewers. Full research article. 23 pages, 10 figures, 7 tables"},{"id":"http://arxiv.org/abs/2408.15099v1","updated":"2024-08-27T14:31:54Z","published":"2024-08-27T14:31:54Z","title":"No Regrets: Investigating and Improving Regret Approximations for\n Curriculum Discovery","summary":" What data or environments to use for training to improve downstream\nperformance is a longstanding and very topical question in reinforcement\nlearning. In particular, Unsupervised Environment Design (UED) methods have\ngained recent attention as their adaptive curricula enable agents to be robust\nto in- and out-of-distribution tasks. We ask to what extent these methods are\nthemselves robust when applied to a novel setting, closely inspired by a\nreal-world robotics problem. Surprisingly, we find that the state-of-the-art\nUED methods either do not improve upon the na\\\"{i}ve baseline of Domain\nRandomisation (DR), or require substantial hyperparameter tuning to do so. Our\nanalysis shows that this is due to their underlying scoring functions failing\nto predict intuitive measures of ``learnability'', i.e., in finding the\nsettings that the agent sometimes solves, but not always. Based on this, we\ninstead directly train on levels with high learnability and find that this\nsimple and intuitive approach outperforms UED methods and DR in several\nbinary-outcome environments, including on our domain and the standard UED\ndomain of Minigrid. We further introduce a new adversarial evaluation procedure\nfor directly measuring robustness, closely mirroring the conditional value at\nrisk (CVaR). 
We open-source all our code and present visualisations of final\npolicies here: https://github.com/amacrutherford/sampling-for-learnability.\n","authors":["Alexander Rutherford","Michael Beukman","Timon Willi","Bruno Lacerda","Nick Hawes","Jakob Foerster"],"pdf_url":"https://arxiv.org/pdf/2408.15099v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15097v1","updated":"2024-08-27T14:30:06Z","published":"2024-08-27T14:30:06Z","title":"Data-Driven Nonlinear Deformation Design of 3D-Printable Shells","summary":" Designing and fabricating structures with specific mechanical properties\nrequires understanding the intricate relationship between design parameters and\nperformance. Understanding the design-performance relationship becomes\nincreasingly complicated for nonlinear deformations. Though successful at\nmodeling elastic deformations, simulation-based techniques struggle to model\nlarge elastoplastic deformations exhibiting plasticity and densification. We\npropose a neural network trained on experimental data to learn the\ndesign-performance relationship between 3D-printable shells and their\ncompressive force-displacement behavior. Trained on thousands of physical\nexperiments, our network aids in both forward and inverse design to generate\nshells exhibiting desired elastoplastic and hyperelastic deformations. We\nvalidate a subset of generated designs through fabrication and testing.\nFurthermore, we demonstrate the network's inverse design efficacy in generating\ncustom shells for several applications.\n","authors":["Samuel Silverman","Kelsey L. Snapp","Keith A. Brown","Emily Whiting"],"pdf_url":"https://arxiv.org/pdf/2408.15097v1.pdf","comment":"Submitted to 3D Printing and Additive Manufacturing"},{"id":"http://arxiv.org/abs/2408.15096v1","updated":"2024-08-27T14:26:56Z","published":"2024-08-27T14:26:56Z","title":"Post-processing fairness with minimal changes","summary":" In this paper, we introduce a novel post-processing algorithm that is both\nmodel-agnostic and does not require the sensitive attribute at test time. In\naddition, our algorithm is explicitly designed to enforce minimal changes\nbetween biased and debiased predictions; a property that, while highly\ndesirable, is rarely prioritized as an explicit objective in fairness\nliterature. Our approach leverages a multiplicative factor applied to the logit\nvalue of probability scores produced by a black-box classifier. We demonstrate\nthe efficacy of our method through empirical evaluations, comparing its\nperformance against other four debiasing algorithms on two widely used datasets\nin fairness research.\n","authors":["Federico Di Gennaro","Thibault Laugel","Vincent Grari","Xavier Renard","Marcin Detyniecki"],"pdf_url":"https://arxiv.org/pdf/2408.15096v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15094v1","updated":"2024-08-27T14:25:42Z","published":"2024-08-27T14:25:42Z","title":"Constrained Diffusion Models via Dual Training","summary":" Diffusion models have attained prominence for their ability to synthesize a\nprobability distribution for a given dataset via a diffusion process, enabling\nthe generation of new data points with high fidelity. However, diffusion\nprocesses are prone to generating biased data based on the training dataset. To\naddress this issue, we develop constrained diffusion models by imposing\ndiffusion constraints based on desired distributions that are informed by\nrequirements. 
Specifically, we cast the training of diffusion models under\nrequirements as a constrained distribution optimization problem that aims to\nreduce the distribution difference between original and generated data while\nobeying constraints on the distribution of generated data. We show that our\nconstrained diffusion models generate new data from a mixture data distribution\nthat achieves the optimal trade-off among objective and constraints. To train\nconstrained diffusion models, we develop a dual training algorithm and\ncharacterize the optimality of the trained constrained diffusion model. We\nempirically demonstrate the effectiveness of our constrained models in two\nconstrained generation tasks: (i) we consider a dataset with one or more\nunderrepresented classes where we train the model with constraints to ensure\nfairly sampling from all classes during inference; (ii) we fine-tune a\npre-trained diffusion model to sample from a new dataset while avoiding\noverfitting.\n","authors":["Shervin Khalafi","Dongsheng Ding","Alejandro Ribeiro"],"pdf_url":"https://arxiv.org/pdf/2408.15094v1.pdf","comment":"41 pages, 4 figures, 2 tables"},{"id":"http://arxiv.org/abs/2408.13843v2","updated":"2024-08-27T14:24:52Z","published":"2024-08-25T14:17:43Z","title":"Consistent machine learning for topology optimization with\n microstructure-dependent neural network material models","summary":" Additive manufacturing methods together with topology optimization have\nenabled the creation of multiscale structures with controlled spatially-varying\nmaterial microstructure. However, topology optimization or inverse design of\nsuch structures in the presence of nonlinearities remains a challenge due to\nthe expense of computational homogenization methods and the complexity of\ndifferentiably parameterizing the microstructural response. A solution to this\nchallenge lies in machine learning techniques that offer efficient,\ndifferentiable mappings between the material response and its microstructural\ndescriptors. This work presents a framework for designing multiscale\nheterogeneous structures with spatially varying microstructures by merging a\nhomogenization-based topology optimization strategy with a consistent machine\nlearning approach grounded in hyperelasticity theory. We leverage neural\narchitectures that adhere to critical physical principles such as\npolyconvexity, objectivity, material symmetry, and thermodynamic consistency to\nsupply the framework with a reliable constitutive model that is dependent on\nmaterial microstructural descriptors. Our findings highlight the potential of\nintegrating consistent machine learning models with density-based topology\noptimization for enhancing design optimization of heterogeneous hyperelastic\nstructures under finite deformations.\n","authors":["Harikrishnan Vijayakumaran","Jonathan B. Russ","Glaucio H. Paulino","Miguel A. Bessa"],"pdf_url":"https://arxiv.org/pdf/2408.13843v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10895v4","updated":"2024-08-27T14:23:51Z","published":"2023-07-20T14:18:44Z","title":"Variational Autoencoding of Dental Point Clouds","summary":" Digital dentistry has made significant advancements, yet numerous challenges\nremain. This paper introduces the FDI 16 dataset, an extensive collection of\ntooth meshes and point clouds. Additionally, we present a novel approach:\nVariational FoldingNet (VF-Net), a fully probabilistic variational autoencoder\nfor point clouds. 
Notably, prior latent variable models for point clouds lack a\none-to-one correspondence between input and output points. Instead, they rely\non optimizing Chamfer distances, a metric that lacks a normalized\ndistributional counterpart, rendering it unsuitable for probabilistic modeling.\nWe replace the explicit minimization of Chamfer distances with a suitable\nencoder, increasing computational efficiency while simplifying the\nprobabilistic extension. This allows for straightforward application in various\ntasks, including mesh generation, shape completion, and representation\nlearning. Empirically, we provide evidence of lower reconstruction error in\ndental reconstruction and interpolation, showcasing state-of-the-art\nperformance in dental sample generation while identifying valuable latent\nrepresentations\n","authors":["Johan Ziruo Ye","Thomas Ørkild","Peter Lempel Søndergaard","Søren Hauberg"],"pdf_url":"https://arxiv.org/pdf/2307.10895v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15089v1","updated":"2024-08-27T14:20:21Z","published":"2024-08-27T14:20:21Z","title":"SiHGNN: Leveraging Properties of Semantic Graphs for Efficient HGNN\n Acceleration","summary":" Heterogeneous Graph Neural Networks (HGNNs) have expanded graph\nrepresentation learning to heterogeneous graph fields. Recent studies have\ndemonstrated their superior performance across various applications, including\nmedical analysis and recommendation systems, often surpassing existing methods.\nHowever, GPUs often experience inefficiencies when executing HGNNs due to their\nunique and complex execution patterns. Compared to traditional Graph Neural\nNetworks, these patterns further exacerbate irregularities in memory access. To\ntackle these challenges, recent studies have focused on developing\ndomain-specific accelerators for HGNNs. Nonetheless, most of these efforts have\nconcentrated on optimizing the datapath or scheduling data accesses, while\nlargely overlooking the potential benefits that could be gained from leveraging\nthe inherent properties of the semantic graph, such as its topology, layout,\nand generation.\n In this work, we focus on leveraging the properties of semantic graphs to\nenhance HGNN performance. First, we analyze the Semantic Graph Build (SGB)\nstage and identify significant opportunities for data reuse during semantic\ngraph generation. Next, we uncover the phenomenon of buffer thrashing during\nthe Graph Feature Processing (GFP) stage, revealing potential optimization\nopportunities in semantic graph layout. Furthermore, we propose a lightweight\nhardware accelerator frontend for HGNNs, called SiHGNN. This accelerator\nfrontend incorporates a tree-based Semantic Graph Builder for efficient\nsemantic graph generation and features a novel Graph Restructurer for\noptimizing semantic graph layouts. Experimental results show that SiHGNN\nenables the state-of-the-art HGNN accelerator to achieve an average performance\nimprovement of 2.95$\\times$.\n","authors":["Runzhen Xue","Mingyu Yan","Dengke Han","Zhimin Tang","Xiaochun Ye","Dongrui Fan"],"pdf_url":"https://arxiv.org/pdf/2408.15089v1.pdf","comment":"12 pages, 18 figures. 
arXiv admin note: text overlap with\n arXiv:2404.04792"},{"id":"http://arxiv.org/abs/2408.14340v2","updated":"2024-08-27T14:09:44Z","published":"2024-08-26T15:13:14Z","title":"Foundation Models for Music: A Survey","summary":" In recent years, foundation models (FMs) such as large language models (LLMs)\nand latent diffusion models (LDMs) have profoundly impacted diverse sectors,\nincluding music. This comprehensive review examines state-of-the-art (SOTA)\npre-trained models and foundation models in music, spanning from representation\nlearning, generative learning and multimodal learning. We first contextualise\nthe significance of music in various industries and trace the evolution of AI\nin music. By delineating the modalities targeted by foundation models, we\ndiscover many of the music representations are underexplored in FM development.\nThen, emphasis is placed on the lack of versatility of previous methods on\ndiverse music applications, along with the potential of FMs in music\nunderstanding, generation and medical application. By comprehensively exploring\nthe details of the model pre-training paradigm, architectural choices,\ntokenisation, finetuning methodologies and controllability, we emphasise the\nimportant topics that should have been well explored, like instruction tuning\nand in-context learning, scaling law and emergent ability, as well as\nlong-sequence modelling etc. A dedicated section presents insights into music\nagents, accompanied by a thorough analysis of datasets and evaluations\nessential for pre-training and downstream tasks. Finally, by underscoring the\nvital importance of ethical considerations, we advocate that following research\non FM for music should focus more on such issues as interpretability,\ntransparency, human responsibility, and copyright issues. The paper offers\ninsights into future challenges and trends on FMs for music, aiming to shape\nthe trajectory of human-AI collaboration in the music realm.\n","authors":["Yinghao Ma","Anders Øland","Anton Ragni","Bleiz MacSen Del Sette","Charalampos Saitis","Chris Donahue","Chenghua Lin","Christos Plachouras","Emmanouil Benetos","Elio Quinton","Elona Shatri","Fabio Morreale","Ge Zhang","György Fazekas","Gus Xia","Huan Zhang","Ilaria Manco","Jiawen Huang","Julien Guinot","Liwei Lin","Luca Marinelli","Max W. Y. Lam","Megha Sharma","Qiuqiang Kong","Roger B. Dannenberg","Ruibin Yuan","Shangda Wu","Shih-Lun Wu","Shuqi Dai","Shun Lei","Shiyin Kang","Simon Dixon","Wenhu Chen","Wenhao Huang","Xingjian Du","Xingwei Qu","Xu Tan","Yizhi Li","Zeyue Tian","Zhiyong Wu","Zhizheng Wu","Ziyang Ma","Ziyu Wang"],"pdf_url":"https://arxiv.org/pdf/2408.14340v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15077v1","updated":"2024-08-27T14:05:48Z","published":"2024-08-27T14:05:48Z","title":"MMASD+: A Novel Dataset for Privacy-Preserving Behavior Analysis of\n Children with Autism Spectrum Disorder","summary":" Autism spectrum disorder (ASD) is characterized by significant challenges in\nsocial interaction and comprehending communication signals. Recently,\ntherapeutic interventions for ASD have increasingly utilized Deep learning\npowered-computer vision techniques to monitor individual progress over time.\nThese models are trained on private, non-public datasets from the autism\ncommunity, creating challenges in comparing results across different models due\nto privacy-preserving data-sharing issues. This work introduces MMASD+. 
MMASD+\nconsists of diverse data modalities, including 3D-Skeleton, 3D Body Mesh, and\nOptical Flow data. It integrates the capabilities of Yolov8 and Deep SORT\nalgorithms to distinguish between the therapist and children, addressing a\nsignificant barrier in the original dataset. Additionally, a Multimodal\nTransformer framework is proposed to predict 11 action types and the presence\nof ASD. This framework achieves an accuracy of 95.03% for predicting action\ntypes and 96.42% for predicting ASD presence, demonstrating over a 10%\nimprovement compared to models trained on single data modalities. These\nfindings highlight the advantages of integrating multiple data modalities\nwithin the Multimodal Transformer framework.\n","authors":["Pavan Uttej Ravva","Behdokht Kiafar","Pinar Kullu","Jicheng Li","Anjana Bhat","Roghayeh Leila Barmaki"],"pdf_url":"https://arxiv.org/pdf/2408.15077v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04580v2","updated":"2024-08-27T14:05:38Z","published":"2024-02-07T04:43:41Z","title":"A Comprehensive Survey of Cross-Domain Policy Transfer for Embodied\n Agents","summary":" The burgeoning fields of robot learning and embodied AI have triggered an\nincreasing demand for large quantities of data. However, collecting sufficient\nunbiased data from the target domain remains a challenge due to costly data\ncollection processes and stringent safety requirements. Consequently,\nresearchers often resort to data from easily accessible source domains, such as\nsimulation and laboratory environments, for cost-effective data acquisition and\nrapid model iteration. Nevertheless, the environments and embodiments of these\nsource domains can be quite different from their target domain counterparts,\nunderscoring the need for effective cross-domain policy transfer approaches. In\nthis paper, we conduct a systematic review of existing cross-domain policy\ntransfer methods. Through a nuanced categorization of domain gaps, we\nencapsulate the overarching insights and design considerations of each problem\nsetting. We also provide a high-level discussion about the key methodologies\nused in cross-domain policy transfer problems. Lastly, we summarize the open\nchallenges that lie beyond the capabilities of current paradigms and discuss\npotential future directions in this field.\n","authors":["Haoyi Niu","Jianming Hu","Guyue Zhou","Xianyuan Zhan"],"pdf_url":"https://arxiv.org/pdf/2402.04580v2.pdf","comment":"IJCAI 2024"},{"id":"http://arxiv.org/abs/2408.15076v1","updated":"2024-08-27T14:04:04Z","published":"2024-08-27T14:04:04Z","title":"MiWaves Reinforcement Learning Algorithm","summary":" The escalating prevalence of cannabis use poses a significant public health\nchallenge globally. In the U.S., cannabis use is more prevalent among emerging\nadults (EAs) (ages 18-25) than any other age group, with legalization in the\nmultiple states contributing to a public perception that cannabis is less risky\nthan in prior decades. To address this growing concern, we developed MiWaves, a\nreinforcement learning (RL) algorithm designed to optimize the delivery of\npersonalized intervention prompts to reduce cannabis use among EAs. MiWaves\nleverages domain expertise and prior data to tailor the likelihood of delivery\nof intervention messages. This paper presents a comprehensive overview of the\nalgorithm's design, including key decisions and experimental outcomes. 
The\nfinalized MiWaves RL algorithm was deployed in a clinical trial from March to\nMay 2024.\n","authors":["Susobhan Ghosh","Yongyi Guo","Pei-Yao Hung","Lara Coughlin","Erin Bonar","Inbal Nahum-Shani","Maureen Walton","Susan Murphy"],"pdf_url":"https://arxiv.org/pdf/2408.15076v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2402.17739"},{"id":"http://arxiv.org/abs/2408.06425v5","updated":"2024-08-27T14:03:15Z","published":"2024-08-12T18:04:59Z","title":"Bayesian Learning in a Nonlinear Multiscale State-Space Model","summary":" The ubiquity of multiscale interactions in complex systems is\nwell-recognized, with development and heredity serving as a prime example of\nhow processes at different temporal scales influence one another. This work\nintroduces a novel multiscale state-space model to explore the dynamic\ninterplay between systems interacting across different time scales, with\nfeedback between each scale. We propose a Bayesian learning framework to\nestimate unknown states by learning the unknown process noise covariances\nwithin this multiscale model. We develop a Particle Gibbs with Ancestor\nSampling (PGAS) algorithm for inference and demonstrate through simulations the\nefficacy of our approach.\n","authors":["Nayely Vélez-Cruz","Manfred D. Laubichler"],"pdf_url":"https://arxiv.org/pdf/2408.06425v5.pdf","comment":"Corrected a typo"},{"id":"http://arxiv.org/abs/2408.15073v1","updated":"2024-08-27T14:02:21Z","published":"2024-08-27T14:02:21Z","title":"Interactive dense pixel visualizations for time series and model\n attribution explanations","summary":" The field of Explainable Artificial Intelligence (XAI) for Deep Neural\nNetwork models has developed significantly, offering numerous techniques to\nextract explanations from models. However, evaluating explanations is often not\ntrivial, and differences in applied metrics can be subtle, especially with\nnon-intelligible data. Thus, there is a need for visualizations tailored to\nexplore explanations for domains with such data, e.g., time series. We propose\nDAVOTS, an interactive visual analytics approach to explore raw time series\ndata, activations of neural networks, and attributions in a dense-pixel\nvisualization to gain insights into the data, models' decisions, and\nexplanations. To further support users in exploring large datasets, we apply\nclustering approaches to the visualized data domains to highlight groups and\npresent ordering strategies for individual and combined data exploration to\nfacilitate finding patterns. We visualize a CNN trained on the FordA dataset to\ndemonstrate the approach.\n","authors":["Udo Schlegel","Daniel A. Keim"],"pdf_url":"https://arxiv.org/pdf/2408.15073v1.pdf","comment":"5 pages, 2 figures, accepted at MLVIS 2023"},{"id":"http://arxiv.org/abs/2408.15065v1","updated":"2024-08-27T13:48:15Z","published":"2024-08-27T13:48:15Z","title":"The Benefits of Balance: From Information Projections to Variance\n Reduction","summary":" Data balancing across multiple modalities/sources appears in various forms in\nseveral foundation models (e.g., CLIP and DINO) achieving universal\nrepresentation learning. We show that this iterative algorithm, usually used to\navoid representation collapse, enjoys an unsuspected benefit: reducing the\nvariance of estimators that are functionals of the empirical distribution over\nthese sources. 
We provide non-asymptotic bounds quantifying this variance\nreduction effect and relate them to the eigendecays of appropriately defined\nMarkov operators. We explain how various forms of data balancing in contrastive\nmultimodal learning and self-supervised clustering can be interpreted as\ninstances of this variance reduction scheme.\n","authors":["Lang Liu","Ronak Mehta","Soumik Pal","Zaid Harchaoui"],"pdf_url":"https://arxiv.org/pdf/2408.15065v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15057v1","updated":"2024-08-27T13:40:15Z","published":"2024-08-27T13:40:15Z","title":"Subgroup Analysis via Model-based Rule Forest","summary":" Machine learning models are often criticized for their black-box nature,\nraising concerns about their applicability in critical decision-making\nscenarios. Consequently, there is a growing demand for interpretable models in\nsuch contexts. In this study, we introduce Model-based Deep Rule Forests\n(mobDRF), an interpretable representation learning algorithm designed to\nextract transparent models from data. By leveraging IF-THEN rules with\nmulti-level logic expressions, mobDRF enhances the interpretability of existing\nmodels without compromising accuracy. We apply mobDRF to identify key risk\nfactors for cognitive decline in an elderly population, demonstrating its\neffectiveness in subgroup analysis and local model optimization. Our method\noffers a promising solution for developing trustworthy and interpretable\nmachine learning models, particularly valuable in fields like healthcare, where\nunderstanding differential effects across patient subgroups can lead to more\npersonalized and effective treatments.\n","authors":["I-Ling Cheng","Chan Hsu","Chantung Ku","Pei-Ju Lee","Yihuang Kang"],"pdf_url":"https://arxiv.org/pdf/2408.15057v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.09212v4","updated":"2024-08-27T13:40:13Z","published":"2024-07-12T12:20:39Z","title":"Generating $SROI^-$ Ontologies via Knowledge Graph Query Embedding\n Learning","summary":" Query embedding approaches answer complex logical queries over incomplete\nknowledge graphs (KGs) by computing and operating on low-dimensional vector\nrepresentations of entities, relations, and queries. However, current query\nembedding models heavily rely on excessively parameterized neural networks and\ncannot explain the knowledge learned from the graph. We propose a novel query\nembedding method, AConE, which explains the knowledge learned from the graph in\nthe form of $SROI^-$ description logic axioms while being more\nparameter-efficient than most existing approaches. AConE associates queries to\na $SROI^-$ description logic concept. Every $SROI^-$ concept is embedded as a\ncone in complex vector space, and each $SROI^-$ relation is embedded as a\ntransformation that rotates and scales cones. We show theoretically that AConE\ncan learn $SROI^-$ axioms, and defines an algebra whose operations correspond\none to one to $SROI^-$ description logic concept constructs. Our empirical\nstudy on multiple query datasets shows that AConE achieves superior results\nover previous baselines with fewer parameters. Notably on the WN18RR dataset,\nAConE achieves significant improvement over baseline models. 
We provide\ncomprehensive analyses showing that the capability to represent axioms\npositively impacts the results of query answering.\n","authors":["Yunjie He","Daniel Hernandez","Mojtaba Nayyeri","Bo Xiong","Yuqicheng Zhu","Evgeny Kharlamov","Steffen Staab"],"pdf_url":"https://arxiv.org/pdf/2407.09212v4.pdf","comment":"Accepted by ECAI 2024"},{"id":"http://arxiv.org/abs/2408.15055v1","updated":"2024-08-27T13:32:31Z","published":"2024-08-27T13:32:31Z","title":"Causal Rule Forest: Toward Interpretable and Precise Treatment Effect\n Estimation","summary":" Understanding and inferencing Heterogeneous Treatment Effects (HTE) and\nConditional Average Treatment Effects (CATE) are vital for developing\npersonalized treatment recommendations. Many state-of-the-art approaches\nachieve inspiring performance in estimating HTE on benchmark datasets or\nsimulation studies. However, the indirect predicting manner and complex model\narchitecture reduce the interpretability of these approaches. To mitigate the\ngap between predictive performance and heterogeneity interpretability, we\nintroduce the Causal Rule Forest (CRF), a novel approach to learning hidden\npatterns from data and transforming the patterns into interpretable multi-level\nBoolean rules. By training other interpretable causal inference models with\ndata representation learned by CRF, we can reduce the predictive errors of\nthese models in estimating HTE and CATE, while keeping their interpretability\nfor identifying subgroups for which a treatment is more effective. Our experiments\nunderscore the potential of CRF to advance personalized interventions and\npolicies, paving the way for future research to enhance its scalability and\napplication across complex causal inference challenges.\n","authors":["Chan Hsu","Jun-Ting Wu","Yihuang Kang"],"pdf_url":"https://arxiv.org/pdf/2408.15055v1.pdf","comment":"The 25th IEEE International Conference on Information Reuse and\n Integration for Data Science (IRI 2024)"},{"id":"http://arxiv.org/abs/2301.01188v4","updated":"2024-08-27T13:28:52Z","published":"2022-12-29T01:07:19Z","title":"Deep R Programming","summary":" Deep R Programming is a comprehensive and in-depth introductory course on one\nof the most popular languages for data science. It equips ambitious students,\nprofessionals, and researchers with the knowledge and skills to become\nindependent users of this potent environment so that they can tackle any\nproblem related to data wrangling and analytics, numerical computing,\nstatistics, and machine learning. This textbook is a non-profit project. Its\nonline and PDF versions are freely available at\n.\n","authors":["Marek Gagolewski"],"pdf_url":"https://arxiv.org/pdf/2301.01188v4.pdf","comment":"v1.0.1 (2024-08-27)"},{"id":"http://arxiv.org/abs/2403.00381v2","updated":"2024-08-27T13:13:54Z","published":"2024-03-01T09:09:37Z","title":"Structured Deep Neural Networks-Based Backstepping Trajectory Tracking\n Control for Lagrangian Systems","summary":" Deep neural networks (DNN) are increasingly being used to learn controllers\ndue to their excellent approximation capabilities. However, their black-box\nnature poses significant challenges to closed-loop stability guarantees and\nperformance analysis. In this paper, we introduce a structured DNN-based\ncontroller for the trajectory tracking control of Lagrangian systems using\nbackstepping techniques. 
By properly designing neural network structures, the\nproposed controller can ensure closed-loop stability for any compatible neural\nnetwork parameters. In addition, improved control performance can be achieved\nby further optimizing neural network parameters. Besides, we provide explicit\nupper bounds on tracking errors in terms of controller parameters, which allows\nus to achieve the desired tracking performance by properly selecting the\ncontroller parameters. Furthermore, when system models are unknown, we propose\nan improved Lagrangian neural network (LNN) structure to learn the system\ndynamics and design the controller. We show that in the presence of model\napproximation errors and external disturbances, the closed-loop stability and\ntracking control performance can still be guaranteed. The effectiveness of the\nproposed approach is demonstrated through simulations.\n","authors":["Jiajun Qian","Liang Xu","Xiaoqiang Ren","Xiaofan Wang"],"pdf_url":"https://arxiv.org/pdf/2403.00381v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15041v1","updated":"2024-08-27T13:10:26Z","published":"2024-08-27T13:10:26Z","title":"Earth Observation Satellite Scheduling with Graph Neural Networks","summary":" The Earth Observation Satellite Planning (EOSP) is a difficult optimization\nproblem with considerable practical interest. A set of requested observations\nmust be scheduled on an agile Earth observation satellite while respecting\nconstraints on their visibility window, as well as maneuver constraints that\nimpose varying delays between successive observations. In addition, the problem\nis largely oversubscribed: there are much more candidate observations than what\ncan possibly be achieved. Therefore, one must select the set of observations\nthat will be performed while maximizing their weighted cumulative benefit, and\npropose a feasible schedule for these observations. As previous work mostly\nfocused on heuristic and iterative search algorithms, this paper presents a new\ntechnique for selecting and scheduling observations based on Graph Neural\nNetworks (GNNs) and Deep Reinforcement Learning (DRL). GNNs are used to extract\nrelevant information from the graphs representing instances of the EOSP, and\nDRL drives the search for optimal schedules. Our simulations show that it is\nable to learn on small problem instances and generalize to larger real-world\ninstances, with very competitive performance compared to traditional\napproaches.\n","authors":["Antoine Jacquet","Guillaume Infantes","Nicolas Meuleau","Emmanuel Benazera","Stéphanie Roussel","Vincent Baudoui","Jonathan Guerra"],"pdf_url":"https://arxiv.org/pdf/2408.15041v1.pdf","comment":"Accepted at 17th European Workshop on Reinforcement Learning (EWRL\n 2024)"},{"id":"http://arxiv.org/abs/2405.17035v3","updated":"2024-08-27T13:05:33Z","published":"2024-05-27T10:42:13Z","title":"Glauber Generative Model: Discrete Diffusion Models via Binary\n Classification","summary":" We introduce the Glauber Generative Model (GGM), a new class of discrete\ndiffusion models, to obtain new samples from a distribution given samples from\na discrete space. GGM deploys a discrete Markov chain called the heat bath\ndynamics (or the Glauber dynamics) to denoise a sequence of noisy tokens to a\nsample from a joint distribution of discrete tokens. Our novel conceptual\nframework provides an exact reduction of the task of learning the denoising\nMarkov chain to solving a class of binary classification tasks. 
More\nspecifically, the model learns to classify a given token in a noisy sequence as\nsignal or noise. In contrast, prior works on discrete diffusion models either\nsolve regression problems to learn importance ratios, or minimize loss\nfunctions given by variational approximations. We apply GGM to language\nmodeling and image generation, where images are discretized using image\ntokenizers like VQGANs. We show that it outperforms existing discrete diffusion\nmodels in language generation, and demonstrates strong performance for image\ngeneration without using dataset-specific image tokenizers. We also show that\nour model is capable of performing well in zero-shot control settings like text\nand image infilling.\n","authors":["Harshit Varma","Dheeraj Nagaraj","Karthikeyan Shanmugam"],"pdf_url":"https://arxiv.org/pdf/2405.17035v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.09766v2","updated":"2024-08-27T13:01:56Z","published":"2024-02-15T07:35:52Z","title":"From Variability to Stability: Advancing RecSys Benchmarking Practices","summary":" In the rapidly evolving domain of Recommender Systems (RecSys), new\nalgorithms frequently claim state-of-the-art performance based on evaluations\nover a limited set of arbitrarily selected datasets. However, this approach may\nfail to holistically reflect their effectiveness due to the significant impact\nof dataset characteristics on algorithm performance. Addressing this\ndeficiency, this paper introduces a novel benchmarking methodology to\nfacilitate a fair and robust comparison of RecSys algorithms, thereby advancing\nevaluation practices. By utilizing a diverse set of $30$ open datasets,\nincluding two introduced in this work, and evaluating $11$ collaborative\nfiltering algorithms across $9$ metrics, we critically examine the influence of\ndataset characteristics on algorithm performance. We further investigate the\nfeasibility of aggregating outcomes from multiple datasets into a unified\nranking. Through rigorous experimental analysis, we validate the reliability of\nour methodology under the variability of datasets, offering a benchmarking\nstrategy that balances quality and computational demands. This methodology\nenables a fair yet effective means of evaluating RecSys algorithms, providing\nvaluable guidance for future research endeavors.\n","authors":["Valeriy Shevchenko","Nikita Belousov","Alexey Vasilev","Vladimir Zholobov","Artyom Sosedka","Natalia Semenova","Anna Volodkevich","Andrey Savchenko","Alexey Zaytsev"],"pdf_url":"https://arxiv.org/pdf/2402.09766v2.pdf","comment":"8 pages with 11 figures"},{"id":"http://arxiv.org/abs/2408.13628v2","updated":"2024-08-27T12:53:22Z","published":"2024-08-24T17:10:59Z","title":"Enhancing Uplift Modeling in Multi-Treatment Marketing Campaigns:\n Leveraging Score Ranking and Calibration Techniques","summary":" Uplift modeling is essential for optimizing marketing strategies by selecting\nindividuals likely to respond positively to specific marketing campaigns. This\nimportance escalates in multi-treatment marketing campaigns, where diverse\ntreatment is available and we may want to assign the customers to treatment\nthat can make the most impact. While there are existing approaches with\nconvenient frameworks like Causalml, there are potential spaces to enhance the\neffect of uplift modeling in multi treatment cases. 
This paper introduces a\nnovel approach to uplift modeling in multi-treatment campaigns, leveraging\nscore ranking and calibration techniques to improve overall performance of the\nmarketing campaign. We review existing uplift models, including Meta Learner\nframeworks (S, T, X), and their application in real-world scenarios.\nAdditionally, we delve into insights from multi-treatment studies to highlight\nthe complexities and potential advancements in the field. Our methodology\nincorporates Meta-Learner calibration and a scoring rank-based offer selection\nstrategy. Extensive experiment results with real-world datasets demonstrate the\npractical benefits and superior performance of our approach. The findings\nunderscore the critical role of integrating score ranking and calibration\ntechniques in refining the performance and reliability of uplift predictions,\nthereby advancing predictive modeling in marketing analytics and providing\nactionable insights for practitioners seeking to optimize their campaign\nstrategies.\n","authors":["Yoon Tae Park","Ting Xu","Mohamed Anany"],"pdf_url":"https://arxiv.org/pdf/2408.13628v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.10635v5","updated":"2024-08-27T12:12:42Z","published":"2024-03-26T15:36:47Z","title":"Compressed Federated Reinforcement Learning with a Generative Model","summary":" Reinforcement learning has recently gained unprecedented popularity, yet it\nstill grapples with sample inefficiency. Addressing this challenge, federated\nreinforcement learning (FedRL) has emerged, wherein agents collaboratively\nlearn a single policy by aggregating local estimations. However, this\naggregation step incurs significant communication costs. In this paper, we\npropose CompFedRL, a communication-efficient FedRL approach incorporating both\n\\textit{periodic aggregation} and (direct/error-feedback) compression\nmechanisms. Specifically, we consider compressed federated $Q$-learning with a\ngenerative model setup, where a central server learns an optimal $Q$-function\nby periodically aggregating compressed $Q$-estimates from local agents. For the\nfirst time, we characterize the impact of these two mechanisms (which have\nremained elusive) by providing a finite-time analysis of our algorithm,\ndemonstrating strong convergence behaviors when utilizing either direct or\nerror-feedback compression. Our bounds indicate improved solution accuracy\nconcerning the number of agents and other federated hyperparameters while\nsimultaneously reducing communication costs. To corroborate our theory, we also\nconduct in-depth numerical experiments to verify our findings, considering\nTop-$K$ and Sparsified-$K$ sparsification operators.\n","authors":["Ali Beikmohammadi","Sarit Khirirat","Sindri Magnússon"],"pdf_url":"https://arxiv.org/pdf/2404.10635v5.pdf","comment":"European Conference on Machine Learning and Principles and Practice\n of Knowledge Discovery in Databases (ECML-PKDD 2024)"},{"id":"http://arxiv.org/abs/2111.10847v3","updated":"2024-08-27T12:09:32Z","published":"2021-11-21T15:58:01Z","title":"Diffusion Tensor Estimation with Uncertainty Calibration","summary":" It is highly desirable to know how uncertain a model's predictions are,\nespecially for models that are complex and hard to understand as in deep\nlearning. Although there has been a growing interest in using deep learning\nmethods in diffusion-weighted MRI, prior works have not addressed the issue of\nmodel uncertainty. 
Here, we propose a deep learning method to estimate the\ndiffusion tensor and compute the estimation uncertainty. Data-dependent\nuncertainty is computed directly by the network and learned via loss\nattenuation. Model uncertainty is computed using Monte Carlo dropout. We also\npropose a new method for evaluating the quality of predicted uncertainties. We\ncompare the new method with the standard least-squares tensor estimation and\nbootstrap-based uncertainty computation techniques. Our experiments show that\nwhen the number of measurements is small the deep learning method is more\naccurate and its uncertainty predictions are better calibrated than the\nstandard methods. We show that the estimation uncertainties computed by the new\nmethod can highlight the model's biases, detect domain shift, and reflect the\nstrength of noise in the measurements. Our study shows the importance and\npractical value of modeling prediction uncertainties in deep learning-based\ndiffusion MRI analysis.\n","authors":["Davood Karimi","Simon K. Warfield","Ali Gholipour"],"pdf_url":"https://arxiv.org/pdf/2111.10847v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.09096v2","updated":"2024-08-27T11:50:50Z","published":"2024-07-12T08:48:16Z","title":"STD-PLM: Understanding Both Spatial and Temporal Properties of\n Spatial-Temporal Data with PLM","summary":" Spatial-temporal forecasting and imputation are important for real-world\nintelligent systems. Most existing methods are tailored for individual\nforecasting or imputation tasks but are not designed for both. Additionally,\nthey are less effective for zero-shot and few-shot learning. While pre-trained\nlanguage model (PLM) have exhibited strong pattern recognition and reasoning\nabilities across various tasks, including few-shot and zero-shot learning,\ntheir applications in spatial-temporal data understanding has been constrained\nby insufficient modeling of complex correlations such as the temporal\ncorrelations, spatial connectivity, non-pairwise and high-order\nspatial-temporal correlations within data. In this paper, we propose STD-PLM\nfor understanding both spatial and temporal properties of\n\\underline{S}patial-\\underline{T}emporal \\underline{D}ata with \\underline{PLM},\nwhich is capable of implementing both spatial-temporal forecasting and\nimputation tasks. STD-PLM understands spatial-temporal correlations via\nexplicitly designed spatial and temporal tokenizers. Topology-aware node\nembeddings are designed for PLM to comprehend and exploit the topology\nstructure of data in inductive manner. Furthermore, to mitigate the efficiency\nissues introduced by the PLM, we design a sandglass attention module (SGA)\ncombined with a specific constrained loss function, which significantly\nimproves the model's efficiency while ensuring performance. Extensive\nexperiments demonstrate that STD-PLM exhibits competitive performance and\ngeneralization capabilities across the forecasting and imputation tasks on\nvarious datasets. 
Moreover, STD-PLM achieves promising results on both few-shot\nand zero-shot tasks.\n","authors":["YiHeng Huang","Xiaowei Mao","Shengnan Guo","Yubin Chen","Junfeng Shen","Tiankuo Li","Youfang Lin","Huaiyu Wan"],"pdf_url":"https://arxiv.org/pdf/2407.09096v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.16263v4","updated":"2024-08-27T11:48:49Z","published":"2022-03-30T12:48:22Z","title":"Does Audio Deepfake Detection Generalize?","summary":" Current text-to-speech algorithms produce realistic fakes of human voices,\nmaking deepfake detection a much-needed area of research. While researchers\nhave presented various techniques for detecting audio spoofs, it is often\nunclear exactly why these architectures are successful: Preprocessing steps,\nhyperparameter settings, and the degree of fine-tuning are not consistent\nacross related work. Which factors contribute to success, and which are\naccidental? In this work, we address this problem: We systematize audio\nspoofing detection by re-implementing and uniformly evaluating architectures\nfrom related work. We identify overarching features for successful audio\ndeepfake detection, such as using cqtspec or logspec features instead of\nmelspec features, which improves performance by 37% EER on average, all other\nfactors constant. Additionally, we evaluate generalization capabilities: We\ncollect and publish a new dataset consisting of 37.9 hours of found audio\nrecordings of celebrities and politicians, of which 17.2 hours are deepfakes.\nWe find that related work performs poorly on such real-world data (performance\ndegradation of up to one thousand percent). This may suggest that the community\nhas tailored its solutions too closely to the prevailing ASVSpoof benchmark and\nthat deepfakes are much harder to detect outside the lab than previously\nthought.\n","authors":["Nicolas M. Müller","Pavel Czempin","Franziska Dieckmann","Adam Froghyar","Konstantin Böttinger"],"pdf_url":"https://arxiv.org/pdf/2203.16263v4.pdf","comment":"Interspeech 2022"},{"id":"http://arxiv.org/abs/2408.14976v1","updated":"2024-08-27T11:38:01Z","published":"2024-08-27T11:38:01Z","title":"Prior-free Balanced Replay: Uncertainty-guided Reservoir Sampling for\n Long-Tailed Continual Learning","summary":" Even in the era of large models, one of the well-known issues in continual\nlearning (CL) is catastrophic forgetting, which is significantly challenging\nwhen the continual data stream exhibits a long-tailed distribution, termed as\nLong-Tailed Continual Learning (LTCL). Existing LTCL solutions generally\nrequire the label distribution of the data stream to achieve re-balance\ntraining. However, obtaining such prior information is often infeasible in real\nscenarios since the model should learn without pre-identifying the majority and\nminority classes. To this end, we propose a novel Prior-free Balanced Replay\n(PBR) framework to learn from long-tailed data stream with less forgetting.\nConcretely, motivated by our experimental finding that the minority classes are\nmore likely to be forgotten due to the higher uncertainty, we newly design an\nuncertainty-guided reservoir sampling strategy to prioritize rehearsing\nminority data without using any prior information, which is based on the mutual\ndependence between the model and samples. Additionally, we incorporate two\nprior-free components to further reduce the forgetting issue: (1) Boundary\nconstraint is to preserve uncertain boundary supporting samples for continually\nre-estimating task boundaries. 
(2) Prototype constraint is to maintain the\nconsistency of learned class prototypes along with training. Our approach is\nevaluated on three standard long-tailed benchmarks, demonstrating superior\nperformance to existing CL methods and previous SOTA LTCL approach in both\ntask- and class-incremental learning settings, as well as ordered- and\nshuffled-LTCL settings.\n","authors":["Lei Liu","Li Liu","Yawen Cui"],"pdf_url":"https://arxiv.org/pdf/2408.14976v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.10672v2","updated":"2024-08-27T11:13:43Z","published":"2024-03-15T20:48:41Z","title":"Riemannian Flow Matching Policy for Robot Motion Learning","summary":" We introduce Riemannian Flow Matching Policies (RFMP), a novel model for\nlearning and synthesizing robot visuomotor policies. RFMP leverages the\nefficient training and inference capabilities of flow matching methods. By\ndesign, RFMP inherits the strengths of flow matching: the ability to encode\nhigh-dimensional multimodal distributions, commonly encountered in robotic\ntasks, and a very simple and fast inference process. We demonstrate the\napplicability of RFMP to both state-based and vision-conditioned robot motion\npolicies. Notably, as the robot state resides on a Riemannian manifold, RFMP\ninherently incorporates geometric awareness, which is crucial for realistic\nrobotic tasks. To evaluate RFMP, we conduct two proof-of-concept experiments,\ncomparing its performance against Diffusion Policies. Although both approaches\nsuccessfully learn the considered tasks, our results show that RFMP provides\nsmoother action trajectories with significantly lower inference times.\n","authors":["Max Braun","Noémie Jaquier","Leonel Rozo","Tamim Asfour"],"pdf_url":"https://arxiv.org/pdf/2403.10672v2.pdf","comment":"Accepted for publication at IROS'24. 8 pages, 5 figures, 4 tables"},{"id":"http://arxiv.org/abs/2408.14964v1","updated":"2024-08-27T11:10:39Z","published":"2024-08-27T11:10:39Z","title":"Cross-Modal Learning for Chemistry Property Prediction: Large Language\n Models Meet Graph Machine Learning","summary":" In the field of chemistry, the objective is to create novel molecules with\ndesired properties, facilitating accurate property predictions for applications\nsuch as material design and drug screening. However, existing graph deep\nlearning methods face limitations that curb their expressive power. To address\nthis, we explore the integration of vast molecular domain knowledge from Large\nLanguage Models (LLMs) with the complementary strengths of Graph Neural\nNetworks (GNNs) to enhance performance in property prediction tasks. We\nintroduce a Multi-Modal Fusion (MMF) framework that synergistically harnesses\nthe analytical prowess of GNNs and the linguistic generative and predictive\nabilities of LLMs, thereby improving accuracy and robustness in predicting\nmolecular properties. Our framework combines the effectiveness of GNNs in\nmodeling graph-structured data with the zero-shot and few-shot learning\ncapabilities of LLMs, enabling improved predictions while reducing the risk of\noverfitting. 
Furthermore, our approach effectively addresses distributional\nshifts, a common challenge in real-world applications, and showcases the\nefficacy of learning cross-modal representations, surpassing state-of-the-art\nbaselines on benchmark datasets for property prediction tasks.\n","authors":["Sakhinana Sagar Srinivas","Venkataramana Runkana"],"pdf_url":"https://arxiv.org/pdf/2408.14964v1.pdf","comment":"Paper Accepted at Workshop on Robustness of Few-shot and Zero-shot\n Learning in Foundation Models at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2406.17640v2","updated":"2024-08-27T11:00:47Z","published":"2024-06-25T15:24:06Z","title":"BayTTA: Uncertainty-aware medical image classification with optimized\n test-time augmentation using Bayesian model averaging","summary":" Test-time augmentation (TTA) is a well-known technique employed during the\ntesting phase of computer vision tasks. It involves aggregating multiple\naugmented versions of input data. Combining predictions using a simple average\nformulation is a common and straightforward approach after performing TTA. This\npaper introduces a novel framework for optimizing TTA, called BayTTA\n(Bayesian-based TTA), which is based on Bayesian Model Averaging (BMA). First,\nwe generate a prediction list associated with different variations of the input\ndata created through TTA. Then, we use BMA to combine predictions weighted by\nthe respective posterior probabilities. Such an approach allows one to take\ninto account model uncertainty, and thus to enhance the predictive performance\nof the related machine learning or deep learning model. We evaluate the\nperformance of BayTTA on various public data, including three medical image\ndatasets comprising skin cancer, breast cancer, and chest X-ray images and two\nwell-known gene editing datasets, CRISPOR and GUIDE-seq. Our experimental\nresults indicate that BayTTA can be effectively integrated into\nstate-of-the-art deep learning models used in medical image analysis as well as\ninto some popular pre-trained CNN models such as VGG-16, MobileNetV2,\nDenseNet201, ResNet152V2, and InceptionRes-NetV2, leading to the enhancement in\ntheir accuracy and robustness performance. The source code of the proposed\nBayTTA method is freely available at: \\underline\n{https://github.com/Z-Sherkat/BayTTA}.\n","authors":["Zeinab Sherkatghanad","Moloud Abdar","Mohammadreza Bakhtyari","Pawel Plawiak","Vladimir Makarenkov"],"pdf_url":"https://arxiv.org/pdf/2406.17640v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.03729v2","updated":"2024-08-27T10:57:01Z","published":"2024-06-06T04:05:12Z","title":"Enhancing Sign Language Detection through Mediapipe and Convolutional\n Neural Networks (CNN)","summary":" This research combines MediaPipe and CNNs for the efficient and accurate\ninterpretation of ASL dataset for the real-time detection of sign language. The\nsystem presented here captures and processes hands' gestures in real time. the\nintended purpose was to create a very easy, accurate, and fast way of entering\ncommands without the necessity of touching something.MediaPipe supports one of\nthe powerful frameworks in real-time hand tracking capabilities for the ability\nto capture and preprocess hand movements, which increases the accuracy of the\ngesture recognition system. 
The integration of CNN with MediaPipe\nalso results in more efficient real-time processing. The model was tested on\nAmerican Sign Language (ASL) datasets and achieved an accuracy of 99.12\\%. The\nresults were then compared to those of existing methods, using established\nevaluation techniques, to evaluate how well the model performed. The system\nwill have applications in the communication, education, and accessibility\ndomains. Improving systems such as the one described in this paper will assist\npeople with hearing impairment and make technology more accessible to them. We\ntested the recognition and translation performance on an ASL dataset and\nachieved better accuracy than previous models. The aim of the research is to\nidentify American Sign Language characters from hand images captured by a web\ncamera, based on MediaPipe and CNNs.\n","authors":["Aditya Raj Verma","Gagandeep Singh","Karnim Meghwal","Banawath Ramji","Praveen Kumar Dadheech"],"pdf_url":"https://arxiv.org/pdf/2406.03729v2.pdf","comment":"We have decided to withdraw our paper due to significant revisions\n and improvements that need to be made based on new findings. After further\n analysis, we believe these changes are necessary to ensure the accuracy and\n completeness of our work. We plan to resubmit the revised version in the\n future once the updates are complete"},{"id":"http://arxiv.org/abs/2408.14951v1","updated":"2024-08-27T10:54:51Z","published":"2024-08-27T10:54:51Z","title":"Domain-decoupled Physics-informed Neural Networks with Closed-form\n Gradients for Fast Model Learning of Dynamical Systems","summary":" Physics-informed neural networks (PINNs) are trained using physical equations\nand can also incorporate unmodeled effects by learning from data. PINNs for\ncontrol (PINCs) of dynamical systems are gaining interest due to their\nprediction speed compared to classical numerical integration methods for\nnonlinear state-space models, making them suitable for real-time control\napplications. We introduce the domain-decoupled physics-informed neural network\n(DD-PINN) to address current limitations of PINC in handling large and complex\nnonlinear dynamic systems. The time domain is decoupled from the feed-forward\nneural network to construct an Ansatz function, allowing for calculation of\ngradients in closed form. This approach significantly reduces training times,\nespecially for large dynamical systems, compared to PINC, which relies on\ngraph-based automatic differentiation. Additionally, the DD-PINN inherently\nfulfills the initial condition and supports higher-order excitation inputs,\nsimplifying the training process and enabling improved prediction accuracy.\nValidation on three systems - a nonlinear mass-spring-damper, a\nfive-mass-chain, and a two-link robot - demonstrates that the DD-PINN achieves\nsignificantly shorter training times. In cases where the PINC's prediction\ndiverges, the DD-PINN's prediction remains stable and accurate due to higher\nphysics loss reduction or use of a higher-order excitation input. 
The DD-PINN\nallows for fast and accurate learning of large dynamical systems previously out\nof reach for the PINC.\n","authors":["Henrik Krauss","Tim-Lukas Habich","Max Bartholdt","Thomas Seel","Moritz Schappler"],"pdf_url":"https://arxiv.org/pdf/2408.14951v1.pdf","comment":"Accepted to International Conference on Informatics in Control,\n Automation and Robotics (ICINCO) 2024"},{"id":"http://arxiv.org/abs/2408.14935v1","updated":"2024-08-27T10:17:22Z","published":"2024-08-27T10:17:22Z","title":"Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian\n Network Structures","summary":" We introduce an information theoretic criterion for Bayesian network\nstructure learning which we call quotient normalized maximum likelihood (qNML).\nIn contrast to the closely related factorized normalized maximum likelihood\ncriterion, qNML satisfies the property of score equivalence. It is also\ndecomposable and completely free of adjustable hyperparameters. For practical\ncomputations, we identify a remarkably accurate approximation proposed earlier\nby Szpankowski and Weinberger. Experiments on both simulated and real data\ndemonstrate that the new criterion leads to parsimonious models with good\npredictive accuracy.\n","authors":["Tomi Silander","Janne Leppä-aho","Elias Jääsaari","Teemu Roos"],"pdf_url":"https://arxiv.org/pdf/2408.14935v1.pdf","comment":"Accepted to AISTATS 2018"},{"id":"http://arxiv.org/abs/2406.15504v2","updated":"2024-08-27T10:07:27Z","published":"2024-06-19T16:43:56Z","title":"Dr.E Bridges Graphs with Large Language Models through Words","summary":" Significant efforts have been dedicated to integrating the powerful Large\nLanguage Models (LLMs) with diverse modalities, particularly focusing on the\nfusion of language, vision and audio data. However, the graph-structured data,\nwhich is inherently rich in structural and domain-specific knowledge, has not\nyet been gracefully adapted to LLMs. Existing methods either describe the graph\nwith raw text, suffering the loss of graph structural information, or feed\nGraph Neural Network (GNN) embeddings into LLMs at the cost of losing\nexplainable prompt semantics. To bridge this gap, we introduce an end-to-end\nmodality-aligning framework for LLM-graph alignment: Dual-Residual Vector\nQuantized-Variational AutoEncoder, namely Dr.E. Our approach is purposefully\ndesigned to facilitate token-level alignment with LLMs, enabling an effective\ntranslation of the intrinsic `language' of graphs into comprehensible natural\nlanguage. We also manage to enhance LLMs' more robust structural understanding\nof graphs by incorporating multiple views of the central nodes based on their\nsurrounding nodes at various distances. Our experimental evaluations on\nstandard graph tasks demonstrate competitive performance against other\nstate-of-the-art (SOTA) approaches. Additionally, our framework ensures certain\nvisual interpretability, efficiency, and robustness, marking the promising\nsuccessful endeavor to achieve token-level alignment between LLMs and GNNs. 
Our\ncode is available at: https://anonymous.4open.science/r/dre-817.\n","authors":["Zipeng Liu","Likang Wu","Ming He","Zhong Guan","Hongke Zhao","Nan Feng"],"pdf_url":"https://arxiv.org/pdf/2406.15504v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14928v1","updated":"2024-08-27T10:05:37Z","published":"2024-08-27T10:05:37Z","title":"Targetin the partition function of chemically disordered materials with\n a generative approach based on inverse variational autoencoders","summary":" Computing atomic-scale properties of chemically disordered materials requires\nan efficient exploration of their vast configuration space. Traditional\napproaches such as Monte Carlo or Special Quasirandom Structures either entail\nsampling an excessive amount of configurations or do not ensure that the\nconfiguration space has been properly covered. In this work, we propose a novel\napproach where generative machine learning is used to yield a representative\nset of configurations for accurate property evaluation and provide accurate\nestimations of atomic-scale properties with minimal computational cost. Our\nmethod employs a specific type of variational autoencoder with inverse roles\nfor the encoder and decoder, enabling the application of an unsupervised active\nlearning scheme that does not require any initial training database. The model\niteratively generates configuration batches, whose properties are computed with\nconventional atomic-scale methods. These results are then fed back into the\nmodel to estimate the partition function, repeating the process until\nconvergence. We illustrate our approach by computing point-defect formation\nenergies and concentrations in (U, Pu)O2 mixed-oxide fuels. In addition, the ML\nmodel provides valuable insights into the physical factors influencing the\ntarget property. Our method is generally applicable to explore other\nproperties, such as atomic-scale diffusion coefficients, in ideally or\nnon-ideally disordered materials like high-entropy alloys.\n","authors":["Maciej J. Karcz","Luca Messina","Eiji Kawasaki","Emeric Bourasseau"],"pdf_url":"https://arxiv.org/pdf/2408.14928v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14915v1","updated":"2024-08-27T09:44:01Z","published":"2024-08-27T09:44:01Z","title":"Can Transformers Do Enumerative Geometry?","summary":" How can Transformers model and learn enumerative geometry? What is a robust\nprocedure for using Transformers in abductive knowledge discovery within a\nmathematician-machine collaboration? In this work, we introduce a new paradigm\nin computational enumerative geometry in analyzing the $\\psi$-class\nintersection numbers on the moduli space of curves. By formulating the\nenumerative problem as a continuous optimization task, we develop a\nTransformer-based model for computing $\\psi$-class intersection numbers based\non the underlying quantum Airy structure. For a finite range of genera, our\nmodel is capable of regressing intersection numbers that span an extremely wide\nrange of values, from $10^{-45}$ to $10^{45}$. To provide a proper inductive\nbias for capturing the recursive behavior of intersection numbers, we propose a\nnew activation function, Dynamic Range Activator (DRA). Moreover, given the\nsevere heteroscedasticity of $\\psi$-class intersections and the required\nprecision, we quantify the uncertainty of the predictions using Conformal\nPrediction with a dynamic sliding window that is aware of the number of marked\npoints. 
Next, we go beyond merely computing intersection numbers and explore\nthe enumerative \"world-model\" of the Transformers. Through a series of causal\ninference and correlational interpretability analyses, we demonstrate that\nTransformers are actually modeling Virasoro constraints in a purely data-driven\nmanner. Additionally, we provide evidence for the comprehension of several\nvalues appearing in the large genus asymptotic of $\\psi$-class intersection\nnumbers through abductive hypothesis testing.\n","authors":["Baran Hashemi","Roderic G. Corominas","Alessandro Giacchetto"],"pdf_url":"https://arxiv.org/pdf/2408.14915v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14909v1","updated":"2024-08-27T09:35:49Z","published":"2024-08-27T09:35:49Z","title":"SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking\n State Space Models","summary":" Known as low energy consumption networks, spiking neural networks (SNNs) have\ngained a lot of attention within the past decades. While SNNs are increasing\ncompetitive with artificial neural networks (ANNs) for vision tasks, they are\nrarely used for long sequence tasks, despite their intrinsic temporal dynamics.\nIn this work, we develop spiking state space models (SpikingSSMs) for long\nsequence learning by leveraging on the sequence learning abilities of state\nspace models (SSMs). Inspired by dendritic neuron structure, we hierarchically\nintegrate neuronal dynamics with the original SSM block, meanwhile realizing\nsparse synaptic computation. Furthermore, to solve the conflict of event-driven\nneuronal dynamics with parallel computing, we propose a light-weight surrogate\ndynamic network which accurately predicts the after-reset membrane potential\nand compatible to learnable thresholds, enabling orders of acceleration in\ntraining speed compared with conventional iterative methods. On the long range\narena benchmark task, SpikingSSM achieves competitive performance to\nstate-of-the-art SSMs meanwhile realizing on average 90\\% of network sparsity.\nOn language modeling, our network significantly surpasses existing spiking\nlarge language models (spikingLLMs) on the WikiText-103 dataset with only a\nthird of the model size, demonstrating its potential as backbone architecture\nfor low computation cost LLMs.\n","authors":["Shuaijie Shen","Chao Wang","Renzhuo Huang","Yan Zhong","Qinghai Guo","Zhichao Lu","Jianguo Zhang","Luziwei Leng"],"pdf_url":"https://arxiv.org/pdf/2408.14909v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.10779v2","updated":"2024-08-27T09:28:35Z","published":"2024-05-17T13:40:59Z","title":"Baseline Results for Selected Nonlinear System Identification Benchmarks","summary":" Nonlinear system identification remains an important open challenge across\nresearch and academia. Large numbers of novel approaches are seen published\neach year, each presenting improvements or extensions to existing methods. It\nis natural, therefore, to consider how one might choose between these competing\nmodels. Benchmark datasets provide one clear way to approach this question.\nHowever, to make meaningful inference based on benchmark performance it is\nimportant to understand how well a new method performs comparatively to results\navailable with well-established methods. 
This paper presents a set of ten\nbaseline techniques and their relative performances on five popular benchmarks.\nThe aim of this contribution is to stimulate thought and discussion regarding\nobjective comparison of identification methodologies.\n","authors":["Max D. Champneys","Gerben I. Beintema","Roland Tóth","Maarten Schoukens","Timothy J. Rogers"],"pdf_url":"https://arxiv.org/pdf/2405.10779v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05655v2","updated":"2024-08-27T09:24:41Z","published":"2023-10-09T12:10:51Z","title":"Causal structure learning with momentum: Sampling distributions over\n Markov Equivalence Classes of DAGs","summary":" In the context of inferring a Bayesian network structure (directed acyclic\ngraph, DAG for short), we devise a non-reversible continuous time Markov chain,\nthe ``Causal Zig-Zag sampler'', that targets a probability distribution over\nclasses of observationally equivalent (Markov equivalent) DAGs. The classes are\nrepresented as completed partially directed acyclic graphs (CPDAGs). The\nnon-reversible Markov chain relies on the operators used in Chickering's Greedy\nEquivalence Search (GES) and is endowed with a momentum variable, which\nimproves mixing significantly as we show empirically. The possible target\ndistributions include posterior distributions based on a prior over DAGs and a\nMarkov equivalent likelihood. We offer an efficient implementation wherein we\ndevelop new algorithms for listing, counting, uniformly sampling, and applying\npossible moves of the GES operators, all of which significantly improve upon\nthe state-of-the-art run-time.\n","authors":["Moritz Schauer","Marcel Wienöbst"],"pdf_url":"https://arxiv.org/pdf/2310.05655v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2109.06458v3","updated":"2024-08-27T09:09:05Z","published":"2021-09-14T05:54:29Z","title":"A Note on Knowledge Distillation Loss Function for Object Classification","summary":" This research note provides a quick introduction to the knowledge\ndistillation loss function used in object classification. In particular, we\ndiscuss its connection to a previously proposed logits matching loss function.\nWe further treat knowledge distillation as a specific form of output\nregularization and demonstrate its connection to label smoothing and\nentropy-based regularization.\n","authors":["Defang Chen"],"pdf_url":"https://arxiv.org/pdf/2109.06458v3.pdf","comment":"Research Note, 4 pages"},{"id":"http://arxiv.org/abs/2408.14890v1","updated":"2024-08-27T09:06:29Z","published":"2024-08-27T09:06:29Z","title":"Development of Large Annotated Music Datasets using HMM-based Forced\n Viterbi Alignment","summary":" Datasets are essential for any machine learning task. Automatic Music\nTranscription (AMT) is one such task, where considerable amount of data is\nrequired depending on the way the solution is achieved. Considering the fact\nthat a music dataset, complete with audio and its time-aligned transcriptions\nwould require the effort of people with musical experience, it could be stated\nthat the task becomes even more challenging. Musical experience is required in\nplaying the musical instrument(s), and in annotating and verifying the\ntranscriptions. We propose a method that would help in streamlining this\nprocess, making the task of obtaining a dataset from a particular instrument\neasy and efficient. We use predefined guitar exercises and hidden Markov\nmodel(HMM) based forced viterbi alignment to accomplish this. The guitar\nexercises are designed to be simple. 
Since the note sequence are already\ndefined, HMM based forced viterbi alignment provides time-aligned\ntranscriptions of these audio files. The onsets of the transcriptions are\nmanually verified and the labels are accurate up to 10ms, averaging at 5ms. The\ncontributions of the proposed work is two fold, i) a well streamlined and\nefficient method for generating datasets for any instrument, especially\nmonophonic and, ii) an acoustic plectrum guitar dataset containing wave files\nand transcriptions in the form of label files. This method will aid as a\npreliminary step towards building concrete datasets for building AMT systems\nfor different instruments.\n","authors":["S. Johanan Joysingh","P. Vijayalakshmi","T. Nagarajan"],"pdf_url":"https://arxiv.org/pdf/2408.14890v1.pdf","comment":"submitted to TENCON 2019"},{"id":"http://arxiv.org/abs/2408.08448v3","updated":"2024-08-27T09:04:35Z","published":"2024-08-15T22:57:39Z","title":"Exploring Cross-model Neuronal Correlations in the Context of Predicting\n Model Performance and Generalizability","summary":" As Artificial Intelligence (AI) models are increasingly integrated into\ncritical systems, the need for a robust framework to establish the\ntrustworthiness of AI is increasingly paramount. While collaborative efforts\nhave established conceptual foundations for such a framework, there remains a\nsignificant gap in developing concrete, technically robust methods for\nassessing AI model quality and performance. A critical drawback in the\ntraditional methods for assessing the validity and generalizability of models\nis their dependence on internal developer datasets, rendering it challenging to\nindependently assess and verify their performance claims. This paper introduces\na novel approach for assessing a newly trained model's performance based on\nanother known model by calculating correlation between neural networks. The\nproposed method evaluates correlations by determining if, for each neuron in\none network, there exists a neuron in the other network that produces similar\noutput. This approach has implications for memory efficiency, allowing for the\nuse of smaller networks when high correlation exists between networks of\ndifferent sizes. Additionally, the method provides insights into robustness,\nsuggesting that if two highly correlated networks are compared and one\ndemonstrates robustness when operating in production environments, the other is\nlikely to exhibit similar robustness. This contribution advances the technical\ntoolkit for responsible AI, supporting more comprehensive and nuanced\nevaluations of AI models to ensure their safe and effective deployment. Code is\navailable at https://github.com/aheldis/Cross-model-correlation.git.\n","authors":["Haniyeh Ehsani Oskouie","Lionel Levine","Majid Sarrafzadeh"],"pdf_url":"https://arxiv.org/pdf/2408.08448v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14889v1","updated":"2024-08-27T09:04:08Z","published":"2024-08-27T09:04:08Z","title":"Towards turbine-location-aware multi-decadal wind power predictions with\n CMIP6","summary":" With the increasing amount of renewable energy in the grid, long-term wind\npower forecasting for multiple decades becomes more critical. In these\nlong-term forecasts, climate data is essential as it allows us to account for\nclimate change. Yet the resolution of climate models is often very coarse. 
In\nthis paper, we show that by including turbine locations when downscaling with\nGaussian Processes, we can generate valuable aggregate wind power predictions\ndespite the low resolution of the CMIP6 climate models. This work is a first\nstep towards multi-decadal turbine-location-aware wind power forecasting using\nglobal climate model output.\n","authors":["Nina Effenberger","Nicole Ludwig"],"pdf_url":"https://arxiv.org/pdf/2408.14889v1.pdf","comment":"4 pages, pre-print"},{"id":"http://arxiv.org/abs/2408.14887v1","updated":"2024-08-27T09:00:27Z","published":"2024-08-27T09:00:27Z","title":"Literary and Colloquial Dialect Identification for Tamil using Acoustic\n Features","summary":" The evolution and diversity of a language is evident from it's various\ndialects. If the various dialects are not addressed in technological\nadvancements like automatic speech recognition and speech synthesis, there is a\nchance that these dialects may disappear. Speech technology plays a role in\npreserving various dialects of a language from going extinct. In order to build\na full fledged automatic speech recognition system that addresses various\ndialects, an Automatic Dialect Identification (ADI) system acting as the front\nend is required. This is similar to how language identification systems act as\nfront ends to automatic speech recognition systems that handle multiple\nlanguages. The current work proposes a way to identify two popular and broadly\nclassified Tamil dialects, namely literary and colloquial Tamil. Acoustical\ncharacteristics rather than phonetics and phonotactics are used, alleviating\nthe requirement of language-dependant linguistic tools. Hence one major\nadvantage of the proposed method is that it does not require an annotated\ncorpus, hence it can be easily adapted to other languages. Gaussian Mixture\nModels (GMM) using Mel Frequency Cepstral Coefficient (MFCC) features are used\nto perform the classification task. The experiments yielded an error rate of\n12%. Vowel nasalization, as being the reason for this good performance, is\ndiscussed. The number of mixture models for the GMM is varied and the\nperformance is analysed.\n","authors":["M. Nanmalar","P. Vijayalakshmi","T. Nagarajan"],"pdf_url":"https://arxiv.org/pdf/2408.14887v1.pdf","comment":"submitted to TENCON 2019"},{"id":"http://arxiv.org/abs/2408.14875v1","updated":"2024-08-27T08:44:31Z","published":"2024-08-27T08:44:31Z","title":"Adversarial Attacks and Defenses in Multivariate Time-Series Forecasting\n for Smart and Connected Infrastructures","summary":" The emergence of deep learning models has revolutionized various industries\nover the last decade, leading to a surge in connected devices and\ninfrastructures. However, these models can be tricked into making incorrect\npredictions with high confidence, leading to disastrous failures and security\nconcerns. To this end, we explore the impact of adversarial attacks on\nmultivariate time-series forecasting and investigate methods to counter them.\nSpecifically, we employ untargeted white-box attacks, namely the Fast Gradient\nSign Method (FGSM) and the Basic Iterative Method (BIM), to poison the inputs\nto the training process, effectively misleading the model. We also illustrate\nthe subtle modifications to the inputs after the attack, which makes detecting\nthe attack using the naked eye quite difficult. Having demonstrated the\nfeasibility of these attacks, we develop robust models through adversarial\ntraining and model hardening. 
We are among the first to showcase the\ntransferability of these attacks and defenses by extrapolating our work from\nthe benchmark electricity data to a larger, 10-year real-world data used for\npredicting the time-to-failure of hard disks. Our experimental results confirm\nthat the attacks and defenses achieve the desired security thresholds, leading\nto a 72.41% and 94.81% decrease in RMSE for the electricity and hard disk\ndatasets respectively after implementing the adversarial defenses.\n","authors":["Pooja Krishan","Rohan Mohapatra","Saptarshi Sengupta"],"pdf_url":"https://arxiv.org/pdf/2408.14875v1.pdf","comment":"17 pages, 32 figures"},{"id":"http://arxiv.org/abs/2405.07488v2","updated":"2024-08-27T08:44:20Z","published":"2024-05-13T06:04:26Z","title":"Predictive Modeling of Flexible EHD Pumps using Kolmogorov-Arnold\n Networks","summary":" We present a novel approach to predicting the pressure and flow rate of\nflexible electrohydrodynamic pumps using the Kolmogorov-Arnold Network.\nInspired by the Kolmogorov-Arnold representation theorem, KAN replaces fixed\nactivation functions with learnable spline-based activation functions, enabling\nit to approximate complex nonlinear functions more effectively than traditional\nmodels like Multi-Layer Perceptron and Random Forest. We evaluated KAN on a\ndataset of flexible EHD pump parameters and compared its performance against\nRF, and MLP models. KAN achieved superior predictive accuracy, with Mean\nSquared Errors of 12.186 and 0.001 for pressure and flow rate predictions,\nrespectively. The symbolic formulas extracted from KAN provided insights into\nthe nonlinear relationships between input parameters and pump performance.\nThese findings demonstrate that KAN offers exceptional accuracy and\ninterpretability, making it a promising alternative for predictive modeling in\nelectrohydrodynamic pumping.\n","authors":["Yanhong Peng","Yuxin Wang","Fangchao Hu","Miao He","Zebing Mao","Xia Huang","Jun Ding"],"pdf_url":"https://arxiv.org/pdf/2405.07488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14871v1","updated":"2024-08-27T08:41:42Z","published":"2024-08-27T08:41:42Z","title":"Learning Robust Reward Machines from Noisy Labels","summary":" This paper presents PROB-IRM, an approach that learns robust reward machines\n(RMs) for reinforcement learning (RL) agents from noisy execution traces. The\nkey aspect of RM-driven RL is the exploitation of a finite-state machine that\ndecomposes the agent's task into different subtasks. PROB-IRM uses a\nstate-of-the-art inductive logic programming framework robust to noisy examples\nto learn RMs from noisy traces using the Bayesian posterior degree of beliefs,\nthus ensuring robustness against inconsistencies. Pivotal for the results is\nthe interleaving between RM learning and policy learning: a new RM is learned\nwhenever the RL agent generates a trace that is believed not to be accepted by\nthe current RM. To speed up the training of the RL agent, PROB-IRM employs a\nprobabilistic formulation of reward shaping that uses the posterior Bayesian\nbeliefs derived from the traces. Our experimental analysis shows that PROB-IRM\ncan learn (potentially imperfect) RMs from noisy traces and exploit them to\ntrain an RL agent to solve its tasks successfully. 
Despite the complexity of\nlearning the RM from noisy traces, agents trained with PROB-IRM perform\ncomparably to agents provided with handcrafted RMs.\n","authors":["Roko Parac","Lorenzo Nodari","Leo Ardon","Daniel Furelos-Blanco","Federico Cerutti","Alessandra Russo"],"pdf_url":"https://arxiv.org/pdf/2408.14871v1.pdf","comment":"Preprint accepted for publication to the 21st International\n Conference on Principles of Knowledge Representation and Reasoning (KR 2024)"},{"id":"http://arxiv.org/abs/2308.16818v3","updated":"2024-08-27T08:39:38Z","published":"2023-08-31T15:49:21Z","title":"Irregular Traffic Time Series Forecasting Based on Asynchronous\n Spatio-Temporal Graph Convolutional Network","summary":" Accurate traffic forecasting is crucial for the development of Intelligent\nTransportation Systems (ITS), playing a pivotal role in modern urban traffic\nmanagement. Traditional forecasting methods, however, struggle with the\nirregular traffic time series resulting from adaptive traffic signal controls,\npresenting challenges in asynchronous spatial dependency, irregular temporal\ndependency, and predicting variable-length sequences. To this end, we propose\nan Asynchronous Spatio-tEmporal graph convolutional nEtwoRk (ASeer) tailored\nfor irregular traffic time series forecasting. Specifically, we first propose\nan Asynchronous Graph Diffusion Network to capture the spatial dependency\nbetween asynchronously measured traffic states regulated by adaptive traffic\nsignals. After that, to capture the temporal dependency within irregular\ntraffic state sequences, a personalized time encoding is devised to embed the\ncontinuous time signals. Then, we propose a Transformable Time-aware\nConvolution Network, which adapts meta-filters for time-aware convolution on\nthe sequences with inconsistent temporal flow. Additionally, a\nSemi-Autoregressive Prediction Network, comprising a state evolution unit and a\nsemi-autoregressive predictor, is designed to predict variable-length traffic\nsequences effectively and efficiently. Extensive experiments on a newly\nestablished benchmark demonstrate the superiority of ASeer compared with twelve\ncompetitive baselines across six metrics.\n","authors":["Weijia Zhang","Le Zhang","Jindong Han","Hao Liu","Yanjie Fu","Jingbo Zhou","Yu Mei","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2308.16818v3.pdf","comment":"This work is published in the research track of KDD 2024"},{"id":"http://arxiv.org/abs/2408.14866v1","updated":"2024-08-27T08:38:48Z","published":"2024-08-27T08:38:48Z","title":"Advancing Adversarial Suffix Transfer Learning on Aligned Large Language\n Models","summary":" Large Language Models (LLMs) face safety concerns due to potential misuse\nby malicious users. Recent red-teaming efforts have identified adversarial\nsuffixes capable of jailbreaking LLMs using the gradient-based search algorithm\nGreedy Coordinate Gradient (GCG). However, GCG struggles with computational\ninefficiency, limiting further investigations regarding suffix transferability\nand scalability across models and data. In this work, we bridge the connection\nbetween search efficiency and suffix transferability. We propose a two-stage\ntransfer learning framework, DeGCG, which decouples the search process into\nbehavior-agnostic pre-searching and behavior-relevant post-searching.\nSpecifically, we employ direct first target token optimization in pre-searching\nto facilitate the search process. We apply our approach to cross-model,\ncross-data, and self-transfer scenarios. 
Furthermore, we introduce an\ninterleaved variant of our approach, i-DeGCG, which iteratively leverages\nself-transferability to accelerate the search process. Experiments on HarmBench\ndemonstrate the efficiency of our approach across various models and domains.\nNotably, our i-DeGCG outperforms the baseline on Llama2-chat-7b with ASRs of\n$43.9$ ($+22.2$) and $39.0$ ($+19.5$) on valid and test sets, respectively.\nFurther analysis on cross-model transfer indicates the pivotal role of first\ntarget token optimization in leveraging suffix transferability for efficient\nsearching.\n","authors":["Hongfu Liu","Yuxi Xie","Ye Wang","Michael Shieh"],"pdf_url":"https://arxiv.org/pdf/2408.14866v1.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2408.14865v1","updated":"2024-08-27T08:38:45Z","published":"2024-08-27T08:38:45Z","title":"Data downlink prioritization using image classification on-board a 6U\n CubeSat","summary":" Nanosatellites are proliferating as low-cost dedicated sensing systems with\nlean development cycles. Kyushu Institute of Technology and collaborators have\nlaunched a joint venture for a nanosatellite mission, VERTECS. The primary\nmission is to elucidate the formation history of stars by observing the\noptical-wavelength cosmic background radiation. The VERTECS satellite will be\nequipped with a small-aperture telescope and a high-precision attitude control\nsystem to capture the cosmic data for analysis on the ground. However,\nnanosatellites are limited by their onboard memory resources and downlink speed\ncapabilities. Additionally, due to a limited number of ground stations, the\nsatellite mission will face issues meeting the required data budget for mission\nsuccess. To alleviate this issue, we propose an on-orbit system to autonomously\nclassify and then compress desirable image data for data downlink\nprioritization and optimization. The system comprises a prototype Camera\nController Board (CCB) which carries a Raspberry Pi Compute Module 4 which is\nused for classification and compression. The system uses a lightweight\nConvolutional Neural Network (CNN) model to classify and determine the\ndesirability of captured image data. The model is designed to be lean and\nrobust to reduce the computational and memory load on the satellite. The model\nis trained and tested on a novel star field dataset consisting of data captured\nby the Sloan Digital Sky Survey (SDSS). The dataset is meant to simulate the\nexpected data produced by the 6U satellite. The compression step implements\nGZip, RICE or HCOMPRESS compression, which are standards for astronomical data.\nPreliminary testing on the proposed CNN model results in a classification\naccuracy of about 100\\% on the star field dataset, with compression ratios of\n3.99, 5.16 and 5.43 for GZip, RICE and HCOMPRESS that were achieved on tested\nFITS image data.\n","authors":["Keenan A. A. Chatar","Ezra Fielding","Kei Sano","Kentaro Kitamura"],"pdf_url":"https://arxiv.org/pdf/2408.14865v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2408.14864v1","updated":"2024-08-27T08:38:17Z","published":"2024-08-27T08:38:17Z","title":"Dynamic operator management in meta-heuristics using reinforcement\n learning: an application to permutation flowshop scheduling problems","summary":" This study develops a framework based on reinforcement learning to\ndynamically manage a large portfolio of search operators within\nmeta-heuristics. 
Using the idea of tabu search, the framework allows for\ncontinuous adaptation by temporarily excluding less efficient operators and\nupdating the portfolio composition during the search. A Q-learning-based\nadaptive operator selection mechanism is used to select the most suitable\noperator from the dynamically updated portfolio at each stage. Unlike\ntraditional approaches, the proposed framework requires no input from the\nexperts regarding the search operators, allowing domain-specific non-experts to\neffectively use the framework. The performance of the proposed framework is\nanalyzed through an application to the permutation flowshop scheduling problem.\nThe results demonstrate the superior performance of the proposed framework\nagainst state-of-the-art algorithms in terms of optimality gap and convergence\nspeed.\n","authors":["Maryam Karimi Mamaghan","Mehrdad Mohammadi","Wout Dullaert","Daniele Vigo","Amir Pirayesh"],"pdf_url":"https://arxiv.org/pdf/2408.14864v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03870v2","updated":"2024-08-27T08:31:04Z","published":"2024-03-06T17:23:28Z","title":"Learning to Decode Collaboratively with Multiple Language Models","summary":" We propose a method to teach multiple large language models (LLM) to\ncollaborate by interleaving their generations at the token level. We model the\ndecision of which LLM generates the next token as a latent variable. By\noptimizing the marginal likelihood of a training set under our latent variable\nmodel, the base LLM automatically learns when to generate itself and when to\ncall on one of the ``assistant'' language models to generate, all without\ndirect supervision. Token-level collaboration during decoding allows for a\nfusion of each model's expertise in a manner tailored to the specific task at\nhand. Our collaborative decoding is especially useful in cross-domain settings\nwhere a generalist base LLM learns to invoke domain expert models. On\ninstruction-following, domain-specific QA, and reasoning tasks, we show that\nthe performance of the joint system exceeds that of the individual models.\nThrough qualitative analysis of the learned latent decisions, we show models\ntrained with our method exhibit several interesting collaboration patterns,\ne.g., template-filling. Our code is available at\nhttps://github.com/clinicalml/co-llm.\n","authors":["Shannon Zejiang Shen","Hunter Lang","Bailin Wang","Yoon Kim","David Sontag"],"pdf_url":"https://arxiv.org/pdf/2403.03870v2.pdf","comment":"16 pages, 4 figures, 11 tables"},{"id":"http://arxiv.org/abs/2408.13766v2","updated":"2024-08-27T08:07:20Z","published":"2024-08-25T08:23:06Z","title":"Enhancing Robustness of Human Detection Algorithms in Maritime SAR\n through Augmented Aerial Images to Simulate Weather Conditions","summary":" 7,651 cases of Search and Rescue Missions (SAR) were reported by the United\nStates Coast Guard in 2024, with over 1322 SAR helicopters deployed in the 6\nfirst months alone. Through the utilizations of YOLO, we were able to run\ndifferent weather conditions and lighting from our augmented dataset for\ntraining. YOLO then utilizes CNNs to apply a series of convolutions and pooling\nlayers to the input image, where the convolution layers are able to extract the\nmain features of the image. Through this, our YOLO model is able to learn to\ndifferentiate different objects which may considerably improve its accuracy,\npossibly enhancing the efficiency of SAR operations through enhanced detection\naccuracy. 
This paper aims to improve the model's accuracy of human detection in\nmaritime SAR by evaluating a robust datasets containing various elevations and\ngeological locations, as well as through data augmentation which simulates\ndifferent weather and lighting. We observed that models trained on augmented\ndatasets outperformed their non-augmented counterparts in which the human\nrecall scores ranged from 0.891 to 0.911 with an improvement rate of 3.4\\% on\nthe YOLOv5l model. Results showed that these models demonstrate greater\nrobustness to real-world conditions in varying of weather, brightness, tint,\nand contrast.\n","authors":["Miguel Tjia","Artem Kim","Elaine Wynette Wijaya","Hanna Tefara","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.13766v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14010v2","updated":"2024-08-27T08:02:49Z","published":"2024-08-26T04:31:55Z","title":"Improving Water Quality Time-Series Prediction in Hong Kong using\n Sentinel-2 MSI Data and Google Earth Engine Cloud Computing","summary":" Effective water quality monitoring in coastal regions is crucial due to the\nprogressive deterioration caused by pollution and human activities. To address\nthis, this study develops time-series models to predict chlorophyll-a (Chl-a),\nsuspended solids (SS), and turbidity using Sentinel-2 satellite data and Google\nEarth Engine (GEE) in the coastal regions of Hong Kong. Leveraging Long\nShort-Term Memory (LSTM) Recurrent Neural Networks, the study incorporates\nextensive temporal datasets to enhance prediction accuracy. The models utilize\nspectral data from Sentinel-2, focusing on optically active components, and\ndemonstrate that selected variables closely align with the spectral\ncharacteristics of Chl-a and SS. The results indicate improved predictive\nperformance over previous methods, highlighting the potential for remote\nsensing technology in continuous and comprehensive water quality assessment.\n","authors":["Rohin Sood","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.14010v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14847v1","updated":"2024-08-27T07:58:08Z","published":"2024-08-27T07:58:08Z","title":"Intraoperative Glioma Segmentation with YOLO + SAM for Improved Accuracy\n in Tumor Resection","summary":" Gliomas, a common type of malignant brain tumor, present significant surgical\nchallenges due to their similarity to healthy tissue. Preoperative Magnetic\nResonance Imaging (MRI) images are often ineffective during surgery due to\nfactors such as brain shift, which alters the position of brain structures and\ntumors. This makes real-time intraoperative MRI (ioMRI) crucial, as it provides\nupdated imaging that accounts for these shifts, ensuring more accurate tumor\nlocalization and safer resections. This paper presents a deep learning pipeline\ncombining You Only Look Once Version 8 (YOLOv8) and Segment Anything Model\nVision Transformer-base (SAM ViT-b) to enhance glioma detection and\nsegmentation during ioMRI. Our model was trained using the Brain Tumor\nSegmentation 2021 (BraTS 2021) dataset, which includes standard magnetic\nresonance imaging (MRI) images, and noise-augmented MRI images that simulate\nioMRI images. Noised MRI images are harder for a deep learning pipeline to\nsegment, but they are more representative of surgical conditions. Achieving a\nDice Similarity Coefficient (DICE) score of 0.79, our model performs comparably\nto state-of-the-art segmentation models tested on noiseless data. 
This\nperformance demonstrates the model's potential to assist surgeons in maximizing\ntumor resection and improving surgical outcomes.\n","authors":["Samir Kassam","Angelo Markham","Katie Vo","Yashas Revanakara","Michael Lam","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.14847v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14843v1","updated":"2024-08-27T07:54:15Z","published":"2024-08-27T07:54:15Z","title":"Correntropy-Based Improper Likelihood Model for Robust\n Electrophysiological Source Imaging","summary":" Bayesian learning provides a unified skeleton to solve the\nelectrophysiological source imaging task. From this perspective, existing\nsource imaging algorithms utilize the Gaussian assumption for the observation\nnoise to build the likelihood function for Bayesian inference. However, the\nelectromagnetic measurements of brain activity are usually affected by\nmiscellaneous artifacts, leading to a potentially non-Gaussian distribution for\nthe observation noise. Hence the conventional Gaussian likelihood model is a\nsuboptimal choice for the real-world source imaging task. In this study, we aim\nto solve this problem by proposing a new likelihood model which is robust with\nrespect to non-Gaussian noises. Motivated by the robust maximum correntropy\ncriterion, we propose a new improper distribution model concerning the noise\nassumption. This new noise distribution is leveraged to structure a robust\nlikelihood function and integrated with hierarchical prior distributions to\nestimate source activities by variational inference. In particular, the score\nmatching is adopted to determine the hyperparameters for the improper\nlikelihood model. A comprehensive performance evaluation is performed to\ncompare the proposed noise assumption to the conventional Gaussian model.\nSimulation results show that, the proposed method can realize more precise\nsource reconstruction by designing known ground-truth. The real-world dataset\nalso demonstrates the superiority of our new method with the visual perception\ntask. This study provides a new backbone for Bayesian source imaging, which\nwould facilitate its application using real-world noisy brain signal.\n","authors":["Yuanhao Li","Badong Chen","Zhongxu Hu","Keita Suzuki","Wenjun Bai","Yasuharu Koike","Okito Yamashita"],"pdf_url":"https://arxiv.org/pdf/2408.14843v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14842v1","updated":"2024-08-27T07:54:01Z","published":"2024-08-27T07:54:01Z","title":"From Bias to Balance: Detecting Facial Expression Recognition Biases in\n Large Multimodal Foundation Models","summary":" This study addresses the racial biases in facial expression recognition (FER)\nsystems within Large Multimodal Foundation Models (LMFMs). Despite advances in\ndeep learning and the availability of diverse datasets, FER systems often\nexhibit higher error rates for individuals with darker skin tones. Existing\nresearch predominantly focuses on traditional FER models (CNNs, RNNs, ViTs),\nleaving a gap in understanding racial biases in LMFMs. We benchmark four\nleading LMFMs: GPT-4o, PaliGemma, Gemini, and CLIP to assess their performance\nin facial emotion detection across different racial demographics. A linear\nclassifier trained on CLIP embeddings obtains accuracies of 95.9\\% for RADIATE,\n90.3\\% for Tarr, and 99.5\\% for Chicago Face. Furthermore, we identify that\nAnger is misclassified as Disgust 2.1 times more often in Black Females than\nWhite Females. 
This study highlights the need for fairer FER systems and\nestablishes a foundation for developing unbiased, accurate FER technologies.\nVisit https://kvjvhub.github.io/FERRacialBias/ for further information\nregarding the biases within facial expression recognition.\n","authors":["Kaylee Chhua","Zhoujinyi Wen","Vedant Hathalia","Kevin Zhu","Sean O'Brien"],"pdf_url":"https://arxiv.org/pdf/2408.14842v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14840v1","updated":"2024-08-27T07:51:26Z","published":"2024-08-27T07:51:26Z","title":"CL4KGE: A Curriculum Learning Method for Knowledge Graph Embedding","summary":" Knowledge graph embedding (KGE) constitutes a foundational task, directed\ntowards learning representations for entities and relations within knowledge\ngraphs (KGs), with the objective of crafting representations comprehensive\nenough to approximate the logical and symbolic interconnections among entities.\nIn this paper, we define a metric Z-counts to measure the difficulty of\ntraining each triple ($<$head entity, relation, tail entity$>$) in KGs with\ntheoretical analysis. Based on this metric, we propose \\textbf{CL4KGE}, an\nefficient \\textbf{C}urriculum \\textbf{L}earning based training strategy for\n\\textbf{KGE}. This method includes a difficulty measurer and a training\nscheduler that aids in the training of KGE models. Our approach possesses the\nflexibility to act as a plugin within a wide range of KGE models, with the\nadded advantage of adaptability to the majority of KGs in existence. The\nproposed method has been evaluated on popular KGE models, and the results\ndemonstrate that it enhances the state-of-the-art methods. The use of Z-counts\nas a metric has enabled the identification of challenging triples in KGs, which\nhelps in devising effective training strategies.\n","authors":["Yang Liu","Chuan Zhou","Peng Zhang","Yanan Cao","Yongchao Liu","Zhao Li","Hongyang Chen"],"pdf_url":"https://arxiv.org/pdf/2408.14840v1.pdf","comment":"16 pages, 3 figures"},{"id":"http://arxiv.org/abs/2408.14837v1","updated":"2024-08-27T07:46:07Z","published":"2024-08-27T07:46:07Z","title":"Diffusion Models Are Real-Time Game Engines","summary":" We present GameNGen, the first game engine powered entirely by a neural model\nthat enables real-time interaction with a complex environment over long\ntrajectories at high quality. GameNGen can interactively simulate the classic\ngame DOOM at over 20 frames per second on a single TPU. Next frame prediction\nachieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are\nonly slightly better than random chance at distinguishing short clips of the\ngame from clips of the simulation. GameNGen is trained in two phases: (1) an\nRL-agent learns to play the game and the training sessions are recorded, and\n(2) a diffusion model is trained to produce the next frame, conditioned on the\nsequence of past frames and actions. 
Conditioning augmentations enable stable\nauto-regressive generation over long trajectories.\n","authors":["Dani Valevski","Yaniv Leviathan","Moab Arar","Shlomi Fruchter"],"pdf_url":"https://arxiv.org/pdf/2408.14837v1.pdf","comment":"Project page: https://gamengen.github.io/"},{"id":"http://arxiv.org/abs/2405.18723v3","updated":"2024-08-27T07:31:44Z","published":"2024-05-29T03:08:30Z","title":"Conformal Depression Prediction","summary":" While existing depression prediction methods based on deep learning show\npromise, their practical application is hindered by the lack of\ntrustworthiness, as these deep models are often deployed as black box models,\nleaving us uncertain on the confidence of their predictions. For high-risk\nclinical applications like depression prediction, uncertainty quantification is\nessential in decision-making. In this paper, we introduce conformal depression\nprediction (CDP), a depression prediction method with uncertainty\nquantification based on conformal prediction (CP), giving valid confidence\nintervals with theoretical coverage guarantees for the model predictions. CDP\nis a plug-and-play module that requires neither model retraining nor an\nassumption about the depression data distribution. As CDP provides only an\naverage coverage guarantee across all inputs rather than per-input performance\nguarantee, we further propose CDP-ACC, an improved conformal prediction with\napproximate conditional coverage. CDP-ACC firstly estimates the prediction\ndistribution through neighborhood relaxation, and then introduces a conformal\nscore function by constructing nested sequences, so as to provide a tighter\nprediction interval adaptive to specific input. We empirically demonstrate the\napplication of CDP in uncertainty-aware facial depression prediction, as well\nas the effectiveness and superiority of CDP-ACC on the AVEC 2013 and AVEC 2014\ndatasets. Our code is publicly available at https://github.com/PushineLee/CDP.\n","authors":["Yonghong Li","Xiuzhuang Zhou"],"pdf_url":"https://arxiv.org/pdf/2405.18723v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14831v1","updated":"2024-08-27T07:28:05Z","published":"2024-08-27T07:28:05Z","title":"DRL-Based Federated Self-Supervised Learning for Task Offloading and\n Resource Allocation in ISAC-Enabled Vehicle Edge Computing","summary":" Intelligent Transportation Systems (ITS) leverage Integrated Sensing and\nCommunications (ISAC) to enhance data exchange between vehicles and\ninfrastructure in the Internet of Vehicles (IoV). This integration inevitably\nincreases computing demands, risking real-time system stability. Vehicle Edge\nComputing (VEC) addresses this by offloading tasks to Road Side Unit (RSU),\nensuring timely services. Our previous work FLSimCo algorithm, which uses local\nresources for Federated Self-Supervised Learning (SSL), though vehicles often\ncan't complete all iterations task. Our improved algorithm offloads partial\ntask to RSU and optimizes energy consumption by adjusting transmission power,\nCPU frequency, and task assignment ratios, balancing local and RSU-based\ntraining. Meanwhile, setting an offloading threshold further prevents\ninefficiencies. Simulation results show that the enhanced algorithm reduces\nenergy consumption, improves offloading efficiency and the accuracy of\nFederated SSL.\n","authors":["Xueying Gu","Qiong Wu","Pingyi Fan","Nan Cheng","Wen Chen","Khaled B. 
Letaief"],"pdf_url":"https://arxiv.org/pdf/2408.14831v1.pdf","comment":"This paper has been submitted to Digital Communications and Networks.\n The source code has been released at:\n https://github.com/qiongwu86/Federated-SSL-task-offloading-and-resource-allocation"},{"id":"http://arxiv.org/abs/2408.13751v2","updated":"2024-08-27T07:26:20Z","published":"2024-08-25T07:32:58Z","title":"Improved identification of breakpoints in piecewise regression and its\n applications","summary":" Identifying breakpoints in piecewise regression is critical in enhancing the\nreliability and interpretability of data fitting. In this paper, we propose\nnovel algorithms based on the greedy algorithm to accurately and efficiently\nidentify breakpoints in piecewise polynomial regression. The algorithm updates\nthe breakpoints to minimize the error by exploring the neighborhood of each\nbreakpoint. It has a fast convergence rate and stability to find optimal\nbreakpoints. Moreover, it can determine the optimal number of breakpoints. The\ncomputational results for real and synthetic data show that its accuracy is\nbetter than any existing methods. The real-world datasets demonstrate that\nbreakpoints through the proposed algorithm provide valuable data information.\n","authors":["Taehyeong Kim","Hyungu Lee","Hayoung Choi"],"pdf_url":"https://arxiv.org/pdf/2408.13751v2.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2408.14825v1","updated":"2024-08-27T07:11:45Z","published":"2024-08-27T07:11:45Z","title":"From Rule-Based Models to Deep Learning Transformers Architectures for\n Natural Language Processing and Sign Language Translation Systems: Survey,\n Taxonomy and Performance Evaluation","summary":" With the growing Deaf and Hard of Hearing population worldwide and the\npersistent shortage of certified sign language interpreters, there is a\npressing need for an efficient, signs-driven, integrated end-to-end translation\nsystem, from sign to gloss to text and vice-versa. There has been a wealth of\nresearch on machine translations and related reviews. However, there are few\nworks on sign language machine translation considering the particularity of the\nlanguage being continuous and dynamic. This paper aims to address this void,\nproviding a retrospective analysis of the temporal evolution of sign language\nmachine translation algorithms and a taxonomy of the Transformers\narchitectures, the most used approach in language translation. We also present\nthe requirements of a real-time Quality-of-Service sign language ma-chine\ntranslation system underpinned by accurate deep learning algorithms. We propose\nfuture research directions for sign language translation systems.\n","authors":["Nada Shahin","Leila Ismail"],"pdf_url":"https://arxiv.org/pdf/2408.14825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14821v1","updated":"2024-08-27T07:03:51Z","published":"2024-08-27T07:03:51Z","title":"Data-driven Effective Modeling of Multiscale Stochastic Dynamical\n Systems","summary":" We present a numerical method for learning the dynamics of slow components of\nunknown multiscale stochastic dynamical systems. While the governing equations\nof the systems are unknown, bursts of observation data of the slow variables\nare available. By utilizing the observation data, our proposed method is\ncapable of constructing a generative stochastic model that can accurately\ncapture the effective dynamics of the slow variables in distribution. 
We\npresent a comprehensive set of numerical examples to demonstrate the\nperformance of the proposed method.\n","authors":["Yuan Chen","Dongbin Xiu"],"pdf_url":"https://arxiv.org/pdf/2408.14821v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2406.15747"},{"id":"http://arxiv.org/abs/2408.14817v1","updated":"2024-08-27T06:58:52Z","published":"2024-08-27T06:58:52Z","title":"A Comprehensive Benchmark of Machine and Deep Learning Across Diverse\n Tabular Datasets","summary":" The analysis of tabular datasets is highly prevalent both in scientific\nresearch and real-world applications of Machine Learning (ML). Unlike many\nother ML tasks, Deep Learning (DL) models often do not outperform traditional\nmethods in this area. Previous comparative benchmarks have shown that DL\nperformance is frequently equivalent or even inferior to models such as\nGradient Boosting Machines (GBMs). In this study, we introduce a comprehensive\nbenchmark aimed at better characterizing the types of datasets where DL models\nexcel. Although several important benchmarks for tabular datasets already\nexist, our contribution lies in the variety and depth of our comparison: we\nevaluate 111 datasets with 20 different models, including both regression and\nclassification tasks. These datasets vary in scale and include both those with\nand without categorical variables. Importantly, our benchmark contains a\nsufficient number of datasets where DL models perform best, allowing for a\nthorough analysis of the conditions under which DL models excel. Building on\nthe results of this benchmark, we train a model that predicts scenarios where\nDL models outperform alternative methods with 86.1% accuracy (AUC 0.78). We\npresent insights derived from this characterization and compare these findings\nto previous benchmarks.\n","authors":["Assaf Shmuel","Oren Glickman","Teddy Lazebnik"],"pdf_url":"https://arxiv.org/pdf/2408.14817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14806v1","updated":"2024-08-27T06:28:35Z","published":"2024-08-27T06:28:35Z","title":"Poly2Vec: Polymorphic Encoding of Geospatial Objects for Spatial\n Reasoning with Deep Neural Networks","summary":" Encoding geospatial data is crucial for enabling machine learning (ML) models\nto perform tasks that require spatial reasoning, such as identifying the\ntopological relationships between two different geospatial objects. However,\nexisting encoding methods are limited as they are typically customized to\nhandle only specific types of spatial data, which impedes their applicability\nacross different downstream tasks where multiple data types coexist. To address\nthis, we introduce Poly2Vec, an encoding framework that unifies the modeling of\ndifferent geospatial objects, including 2D points, polylines, and polygons,\nirrespective of the downstream task. We leverage the power of the 2D Fourier\ntransform to encode useful spatial properties, such as shape and location, from\ngeospatial objects into fixed-length vectors. These vectors are then inputted\ninto neural network models for spatial reasoning tasks.This unified approach\neliminates the need to develop and train separate models for each distinct\nspatial type. 
We evaluate Poly2Vec on both synthetic and real datasets of mixed\ngeometry types and verify its consistent performance across several downstream\nspatial reasoning tasks.\n","authors":["Maria Despoina Siampou","Jialiang Li","John Krumm","Cyrus Shahabi","Hua Lu"],"pdf_url":"https://arxiv.org/pdf/2408.14806v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14797v1","updated":"2024-08-27T06:07:18Z","published":"2024-08-27T06:07:18Z","title":"MaskCycleGAN-based Whisper to Normal Speech Conversion","summary":" Whisper to normal speech conversion is an active area of research. Various\narchitectures based on generative adversarial networks have been proposed in\nthe recent past. Especially, recent study shows that MaskCycleGAN, which is a\nmask guided, and cyclic consistency keeping, generative adversarial network,\nperforms really well for voice conversion from spectrogram representations. In\nthe current work we present a MaskCycleGAN approach for the conversion of\nwhispered speech to normal speech. We find that tuning the mask parameters, and\npre-processing the signal with a voice activity detector provides superior\nperformance when compared to the existing approach. The wTIMIT dataset is used\nfor evaluation. Objective metrics such as PESQ and G-Loss are used to evaluate\nthe converted speech, along with subjective evaluation using mean opinion\nscore. The results show that the proposed approach offers considerable\nbenefits.\n","authors":["K. Rohith Gupta","K. Ramnath","S. Johanan Joysingh","P. Vijayalakshmi","T. Nagarajan"],"pdf_url":"https://arxiv.org/pdf/2408.14797v1.pdf","comment":"submitted to TENCON 2024"},{"id":"http://arxiv.org/abs/2408.14788v1","updated":"2024-08-27T05:28:52Z","published":"2024-08-27T05:28:52Z","title":"Learning from Complementary Features","summary":" While precise data observation is essential for the learning processes of\npredictive models, it can be challenging owing to factors such as insufficient\nobservation accuracy, high collection costs, and privacy constraints. In this\npaper, we examines cases where some qualitative features are unavailable as\nprecise information indicating \"what it is,\" but rather as complementary\ninformation indicating \"what it is not.\" We refer to features defined by\nprecise information as ordinary features (OFs) and those defined by\ncomplementary information as complementary features (CFs). We then formulate a\nnew learning scenario termed Complementary Feature Learning (CFL), where\npredictive models are constructed using instances consisting of OFs and CFs.\nThe simplest formalization of CFL applies conventional supervised learning\ndirectly using the observed values of CFs. However, this approach does not\nresolve the ambiguity associated with CFs, making learning challenging and\ncomplicating the interpretation of the predictive model's specific predictions.\nTherefore, we derive an objective function from an information-theoretic\nperspective to estimate the OF values corresponding to CFs and to predict\noutput labels based on these estimations. Based on this objective function, we\npropose a theoretically guaranteed graph-based estimation method along with its\npractical approximation, for estimating OF values corresponding to CFs. 
The\nresults of numerical experiments conducted with real-world data demonstrate\nthat our proposed method effectively estimates OF values corresponding to CFs\nand predicts output labels.\n","authors":["Kosuke Sugiyama","Masato Uchida"],"pdf_url":"https://arxiv.org/pdf/2408.14788v1.pdf","comment":"16 pages, 7 figures"},{"id":"http://arxiv.org/abs/2402.02051v2","updated":"2024-08-27T05:26:14Z","published":"2024-02-03T06:01:21Z","title":"Nonlinear subspace clustering by functional link neural networks","summary":" Nonlinear subspace clustering based on a feed-forward neural network has been\ndemonstrated to provide better clustering accuracy than some advanced subspace\nclustering algorithms. While this approach demonstrates impressive outcomes, it\ninvolves a balance between effectiveness and computational cost. In this study,\nwe employ a functional link neural network to transform data samples into a\nnonlinear domain. Subsequently, we acquire a self-representation matrix through\na learning mechanism that builds upon the mapped samples. As the functional\nlink neural network is a single-layer neural network, our proposed method\nachieves high computational efficiency while ensuring desirable clustering\nperformance. By incorporating the local similarity regularization to enhance\nthe grouping effect, our proposed method further improves the quality of the\nclustering results. Additionally, we introduce a convex combination subspace\nclustering scheme, which combining a linear subspace clustering method with the\nfunctional link neural network subspace clustering approach. This combination\napproach allows for a dynamic balance between linear and nonlinear\nrepresentations. Extensive experiments confirm the advancement of our methods.\nThe source code will be released on https://lshi91.github.io/ soon.\n","authors":["Long Shi","Lei Cao","Zhongpu Chen","Badong Chen","Yu Zhao"],"pdf_url":"https://arxiv.org/pdf/2402.02051v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14785v1","updated":"2024-08-27T05:23:45Z","published":"2024-08-27T05:23:45Z","title":"Unsupervised-to-Online Reinforcement Learning","summary":" Offline-to-online reinforcement learning (RL), a framework that trains a\npolicy with offline RL and then further fine-tunes it with online RL, has been\nconsidered a promising recipe for data-driven decision-making. While sensible,\nthis framework has drawbacks: it requires domain-specific offline RL\npre-training for each task, and is often brittle in practice. In this work, we\npropose unsupervised-to-online RL (U2O RL), which replaces domain-specific\nsupervised offline RL with unsupervised offline RL, as a better alternative to\noffline-to-online RL. U2O RL not only enables reusing a single pre-trained\nmodel for multiple downstream tasks, but also learns better representations,\nwhich often result in even better performance and stability than supervised\noffline-to-online RL. To instantiate U2O RL in practice, we propose a general\nrecipe for U2O RL to bridge task-agnostic unsupervised offline skill-based\npolicy pre-training and supervised online fine-tuning. 
Throughout our\nexperiments in nine state-based and pixel-based environments, we empirically\ndemonstrate that U2O RL achieves strong performance that matches or even\noutperforms previous offline-to-online RL approaches, while being able to reuse\na single pre-trained model for a number of different downstream tasks.\n","authors":["Junsu Kim","Seohong Park","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2408.14785v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14763v2","updated":"2024-08-27T05:09:09Z","published":"2023-12-22T15:28:55Z","title":"Enhanced Latent Multi-view Subspace Clustering","summary":" Latent multi-view subspace clustering has been demonstrated to have desirable\nclustering performance. However, the original latent representation method\nvertically concatenates the data matrices from multiple views into a single\nmatrix along the direction of dimensionality to recover the latent\nrepresentation matrix, which may result in an incomplete information recovery.\nTo fully recover the latent space representation, we in this paper propose an\nEnhanced Latent Multi-view Subspace Clustering (ELMSC) method. The ELMSC method\ninvolves constructing an augmented data matrix that enhances the representation\nof multi-view data. Specifically, we stack the data matrices from various views\ninto the block-diagonal locations of the augmented matrix to exploit the\ncomplementary information. Meanwhile, the non-block-diagonal entries are\ncomposed based on the similarity between different views to capture the\nconsistent information. In addition, we enforce a sparse regularization for the\nnon-diagonal blocks of the augmented self-representation matrix to avoid\nredundant calculations of consistency information. Finally, a novel iterative\nalgorithm based on the framework of Alternating Direction Method of Multipliers\n(ADMM) is developed to solve the optimization problem for ELMSC. Extensive\nexperiments on real-world datasets demonstrate that our proposed ELMSC is able\nto achieve higher clustering performance than some state-of-art multi-view\nclustering methods.\n","authors":["Long Shi","Lei Cao","Jun Wang","Badong Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14763v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14780v1","updated":"2024-08-27T04:57:53Z","published":"2024-08-27T04:57:53Z","title":"GINN-KAN: Interpretability pipelining with applications in Physics\n Informed Neural Networks","summary":" Neural networks are powerful function approximators, yet their ``black-box\"\nnature often renders them opaque and difficult to interpret. While many\npost-hoc explanation methods exist, they typically fail to capture the\nunderlying reasoning processes of the networks. A truly interpretable neural\nnetwork would be trained similarly to conventional models using techniques such\nas backpropagation, but additionally provide insights into the learned\ninput-output relationships. In this work, we introduce the concept of\ninterpretability pipelineing, to incorporate multiple interpretability\ntechniques to outperform each individual technique. To this end, we first\nevaluate several architectures that promise such interpretability, with a\nparticular focus on two recent models selected for their potential to\nincorporate interpretability into standard neural network architectures while\nstill leveraging backpropagation: the Growing Interpretable Neural Network\n(GINN) and Kolmogorov Arnold Networks (KAN). 
We analyze the limitations and\nstrengths of each and introduce a novel interpretable neural network GINN-KAN\nthat synthesizes the advantages of both models. When tested on the Feynman\nsymbolic regression benchmark datasets, GINN-KAN outperforms both GINN and KAN.\nTo highlight the capabilities and the generalizability of this approach, we\nposition GINN-KAN as an alternative to conventional black-box networks in\nPhysics-Informed Neural Networks (PINNs). We expect this to have far-reaching\nimplications in the application of deep learning pipelines in the natural\nsciences. Our experiments with this interpretable PINN on 15 different partial\ndifferential equations demonstrate that GINN-KAN augmented PINNs outperform\nPINNs with black-box networks in solving differential equations and surpass the\ncapabilities of both GINN and KAN.\n","authors":["Nisal Ranasinghe","Yu Xia","Sachith Seneviratne","Saman Halgamuge"],"pdf_url":"https://arxiv.org/pdf/2408.14780v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14778v1","updated":"2024-08-27T04:56:45Z","published":"2024-08-27T04:56:45Z","title":"GPU-Accelerated Counterfactual Regret Minimization","summary":" Counterfactual regret minimization (CFR) is a family of algorithms of\nno-regret learning dynamics capable of solving large-scale imperfect\ninformation games. There has been a notable lack of work on making CFR more\ncomputationally efficient. We propose implementing this algorithm as a series\nof dense and sparse matrix and vector operations, thereby making it highly\nparallelizable for a graphical processing unit. Our experiments show that our\nimplementation performs up to about 352.5 times faster than OpenSpiel's Python\nimplementation and up to about 22.2 times faster than OpenSpiel's C++\nimplementation and the speedup becomes more pronounced as the size of the game\nbeing solved grows.\n","authors":["Juho Kim"],"pdf_url":"https://arxiv.org/pdf/2408.14778v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14777v1","updated":"2024-08-27T04:56:22Z","published":"2024-08-27T04:56:22Z","title":"Quartered Chirp Spectral Envelope for Whispered vs Normal Speech\n Classification","summary":" Whispered speech as an acceptable form of human-computer interaction is\ngaining traction. Systems that address multiple modes of speech require a\nrobust front-end speech classifier. Performance of whispered vs normal speech\nclassification drops in the presence of additive white Gaussian noise, since\nnormal speech takes on some of the characteristics of whispered speech. In this\nwork, we propose a new feature named the quartered chirp spectral envelope, a\ncombination of the chirp spectrum and the quartered spectral envelope, to\nclassify whispered and normal speech. The chirp spectrum can be fine-tuned to\nobtain customized features for a given task, and the quartered spectral\nenvelope has been proven to work especially well for the current task. The\nfeature is trained on a one dimensional convolutional neural network, that\ncaptures the trends in the spectral envelope. The proposed system performs\nbetter than the state of the art, in the presence of white noise.\n","authors":["S. Johanan Joysingh","P. Vijayalakshmi","T. 
Nagarajan"],"pdf_url":"https://arxiv.org/pdf/2408.14777v1.pdf","comment":"submitted to TENCON 2024"},{"id":"http://arxiv.org/abs/2408.13609v2","updated":"2024-08-27T04:49:46Z","published":"2024-08-24T15:43:02Z","title":"GNN: Graph Neural Network and Large Language Model for Data Discovery","summary":" Our algorithm GNN: Graph Neural Network and Large Language Model for Data\nDiscovery inherit the benefits of \\cite{hoang2024plod} (PLOD: Predictive\nLearning Optimal Data Discovery), \\cite{Hoang2024BODBO} (BOD: Blindly Optimal\nData Discovery) in terms of overcoming the challenges of having to predefine\nutility function and the human input for attribute ranking, which helps prevent\nthe time-consuming loop process. In addition to these previous works, our\nalgorithm GNN leverages the advantages of graph neural networks and large\nlanguage models to understand text type values that cannot be understood by\nPLOD and MOD, thus making the task of predicting outcomes more reliable. GNN\ncould be seen as an extension of PLOD in terms of understanding the text type\nvalue and the user's preferences, not only numerical values but also text\nvalues, making the promise of data science and analytics purposes.\n","authors":["Thomas Hoang"],"pdf_url":"https://arxiv.org/pdf/2408.13609v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.04814v4","updated":"2024-08-27T04:43:10Z","published":"2024-04-07T05:47:41Z","title":"Inference-Time Rule Eraser: Fair Recognition via Distilling and Removing\n Biased Rules","summary":" Machine learning models often make predictions based on biased features such\nas gender, race, and other social attributes, posing significant fairness\nrisks, especially in societal applications, such as hiring, banking, and\ncriminal justice. Traditional approaches to addressing this issue involve\nretraining or fine-tuning neural networks with fairness-aware optimization\nobjectives. However, these methods can be impractical due to significant\ncomputational resources, complex industrial tests, and the associated CO2\nfootprint. Additionally, regular users often fail to fine-tune models because\nthey lack access to model parameters In this paper, we introduce the\nInference-Time Rule Eraser (Eraser), a novel method designed to address\nfairness concerns by removing biased decision-making rules from deployed models\nduring inference without altering model weights. We begin by establishing a\ntheoretical foundation for modifying model outputs to eliminate biased rules\nthrough Bayesian analysis. Next, we present a specific implementation of Eraser\nthat involves two stages: (1) distilling the biased rules from the deployed\nmodel into an additional patch model, and (2) removing these biased rules from\nthe output of the deployed model during inference. Extensive experiments\nvalidate the effectiveness of our approach, showcasing its superior performance\nin addressing fairness concerns in AI systems.\n","authors":["Yi Zhang","Dongyuan Lu","Jitao Sang"],"pdf_url":"https://arxiv.org/pdf/2404.04814v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.15773v2","updated":"2024-08-27T04:41:40Z","published":"2024-07-22T16:25:41Z","title":"STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay","summary":" Test-time adaptation (TTA) aims to address the distribution shift between the\ntraining and test data with only unlabeled data at test time. 
Existing TTA\nmethods often focus on improving recognition performance specifically for test\ndata associated with classes in the training set. However, during the\nopen-world inference process, there are inevitably test data instances from\nunknown classes, commonly referred to as outliers. This paper pays attention to\nthe problem that conducts both sample recognition and outlier rejection during\ninference while outliers exist. To address this problem, we propose a new\napproach called STAble Memory rePlay (STAMP), which performs optimization over\na stable memory bank instead of the risky mini-batch. In particular, the memory\nbank is dynamically updated by selecting low-entropy and label-consistent\nsamples in a class-balanced manner. In addition, we develop a self-weighted\nentropy minimization strategy that assigns higher weight to low-entropy\nsamples. Extensive results demonstrate that STAMP outperforms existing TTA\nmethods in terms of both recognition and outlier detection performance. The\ncode is released at https://github.com/yuyongcan/STAMP.\n","authors":["Yongcan Yu","Lijun Sheng","Ran He","Jian Liang"],"pdf_url":"https://arxiv.org/pdf/2407.15773v2.pdf","comment":"Accepted by ECCV 2024; Fixed a bug in calculating OOD score of STAMP\n and updated the results"},{"id":"http://arxiv.org/abs/2408.14025v2","updated":"2024-08-27T04:36:52Z","published":"2024-08-26T05:31:46Z","title":"An Item Response Theory-based R Module for Algorithm Portfolio Analysis","summary":" Experimental evaluation is crucial in AI research, especially for assessing\nalgorithms across diverse tasks. Many studies often evaluate a limited set of\nalgorithms, failing to fully understand their strengths and weaknesses within a\ncomprehensive portfolio. This paper introduces an Item Response Theory (IRT)\nbased analysis tool for algorithm portfolio evaluation called AIRT-Module.\nTraditionally used in educational psychometrics, IRT models test question\ndifficulty and student ability using responses to test questions. Adapting IRT\nto algorithm evaluation, the AIRT-Module contains a Shiny web application and\nthe R package airt. AIRT-Module uses algorithm performance measures to compute\nanomalousness, consistency, and difficulty limits for an algorithm and the\ndifficulty of test instances. The strengths and weaknesses of algorithms are\nvisualised using the difficulty spectrum of the test instances. AIRT-Module\noffers a detailed understanding of algorithm capabilities across varied test\ninstances, thus enhancing comprehensive AI method assessment. It is available\nat https://sevvandi.shinyapps.io/AIRT/ .\n","authors":["Brodie Oldfield","Sevvandi Kandanaarachchi","Ziqi Xu","Mario Andrés Muñoz"],"pdf_url":"https://arxiv.org/pdf/2408.14025v2.pdf","comment":"10 Pages, 6 Figures. Submitted to SoftwareX"},{"id":"http://arxiv.org/abs/2408.14774v1","updated":"2024-08-27T04:31:58Z","published":"2024-08-27T04:31:58Z","title":"Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning","summary":" We introduce Instruct-SkillMix, an automated approach for creating diverse,\nhigh quality SFT data. The Instruct-SkillMix pipeline involves two stages, each\nleveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to\nextract core \"skills\" for instruction-following, either from existing datasets,\nor by directly prompting the model; (2) Data generation: uses the powerful LLM\nto generate (instruction, response) data that exhibit a randomly chosen pair of\nthese skills. 
Here, the use of random skill combinations promotes diversity and\ndifficulty.\n Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from\nInstruct-SkillMix leads to strong gains on instruction following benchmarks\nsuch as AlpacaEval 2.0, MT-Bench, and WildBench. With just $4$K examples,\nLLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0.\nTo our knowledge, this achieves state-of-the-art performance among all models\nthat have only undergone SFT (no RL methods) and competes with proprietary\nmodels such as Claude 3 Opus and LLaMA-3.1-405B-Instruct.\n Ablation studies also suggest plausible reasons for why creating open\ninstruction-tuning datasets via naive crowd-sourcing has proved difficult.\nIntroducing low quality answers (\"shirkers\") in $20\\%$ of Instruct-SkillMix\nexamples causes performance to plummet, sometimes catastrophically.\n The Instruct-SkillMix pipeline is flexible and is adaptable to other\nsettings.\n","authors":["Simran Kaur","Simon Park","Anirudh Goyal","Sanjeev Arora"],"pdf_url":"https://arxiv.org/pdf/2408.14774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14080v2","updated":"2024-08-27T04:14:14Z","published":"2024-08-26T08:02:57Z","title":"SONICS: Synthetic Or Not -- Identifying Counterfeit Songs","summary":" The recent surge in AI-generated songs presents exciting possibilities and\nchallenges. While these tools democratize music creation, they also necessitate\nthe ability to distinguish between human-composed and AI-generated songs for\nsafeguarding artistic integrity and content curation. Existing research and\ndatasets in fake song detection only focus on singing voice deepfake detection\n(SVDD), where the vocals are AI-generated but the instrumental music is sourced\nfrom real songs. However, this approach is inadequate for contemporary\nend-to-end AI-generated songs where all components (vocals, lyrics, music, and\nstyle) could be AI-generated. Additionally, existing datasets lack lyrics-music\ndiversity, long-duration songs, and open fake songs. To address these gaps, we\nintroduce SONICS, a novel dataset for end-to-end Synthetic Song Detection\n(SSD), comprising over 97k songs with over 49k synthetic songs from popular\nplatforms like Suno and Udio. Furthermore, we highlight the importance of\nmodeling long-range temporal dependencies in songs for effective authenticity\ndetection, an aspect overlooked in existing methods. To capture these patterns,\nwe propose a novel model, SpecTTTra, that is up to 3 times faster and 6 times\nmore memory efficient compared to popular CNN and Transformer-based models\nwhile maintaining competitive performance. 
Finally, we offer both AI-based and\nHuman evaluation benchmarks, addressing another deficiency in current research.\n","authors":["Md Awsafur Rahman","Zaber Ibn Abdul Hakim","Najibul Haque Sarker","Bishmoy Paul","Shaikh Anowarul Fattah"],"pdf_url":"https://arxiv.org/pdf/2408.14080v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13820v2","updated":"2024-08-27T03:58:09Z","published":"2023-10-20T21:14:07Z","title":"FERI: A Multitask-based Fairness Achieving Algorithm with Applications\n to Fair Organ Transplantation","summary":" Liver transplantation often faces fairness challenges across subgroups\ndefined by sensitive attributes such as age group, gender, and race/ethnicity.\nMachine learning models for outcome prediction can introduce additional biases.\nTherefore, we introduce Fairness through the Equitable Rate of Improvement in\nMultitask Learning (FERI) algorithm for fair predictions of graft failure risk\nin liver transplant patients. FERI constrains subgroup loss by balancing\nlearning rates and preventing subgroup dominance in the training process. Our\nresults show that FERI maintained high predictive accuracy with AUROC and AUPRC\ncomparable to baseline models. More importantly, FERI demonstrated an ability\nto improve fairness without sacrificing accuracy. Specifically, for the gender,\nFERI reduced the demographic parity disparity by 71.74%, and for the age group,\nit decreased the equalized odds disparity by 40.46%. Therefore, the FERI\nalgorithm advanced fairness-aware predictive modeling in healthcare and\nprovides an invaluable tool for equitable healthcare systems.\n","authors":["Can Li","Dejian Lai","Xiaoqian Jiang","Kai Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.13820v2.pdf","comment":"First Prize Student Award Paper, American Medical Informatics\n Association 2024 Informatics Summit"},{"id":"http://arxiv.org/abs/2405.19730v4","updated":"2024-08-27T03:45:18Z","published":"2024-05-30T06:21:34Z","title":"Research on the Spatial Data Intelligent Foundation Model","summary":" This report focuses on spatial data intelligent large models, delving into\nthe principles, methods, and cutting-edge applications of these models. It\nprovides an in-depth discussion on the definition, development history, current\nstatus, and trends of spatial data intelligent large models, as well as the\nchallenges they face. The report systematically elucidates the key technologies\nof spatial data intelligent large models and their applications in urban\nenvironments, aerospace remote sensing, geography, transportation, and other\nscenarios. 
Additionally, it summarizes the latest application cases of spatial\ndata intelligent large models in themes such as urban development, multimodal\nsystems, remote sensing, smart transportation, and resource environments.\nFinally, the report concludes with an overview and outlook on the development\nprospects of spatial data intelligent large models.\n","authors":["Shaohua Wang","Xing Xie","Yong Li","Danhuai Guo","Zhi Cai","Yu Liu","Yang Yue","Xiao Pan","Feng Lu","Huayi Wu","Zhipeng Gui","Zhiming Ding","Bolong Zheng","Fuzheng Zhang","Jingyuan Wang","Zhengchao Chen","Hao Lu","Jiayi Li","Peng Yue","Wenhao Yu","Yao Yao","Leilei Sun","Yong Zhang","Longbiao Chen","Xiaoping Du","Xiang Li","Xueying Zhang","Kun Qin","Zhaoya Gong","Weihua Dong","Xiaofeng Meng"],"pdf_url":"https://arxiv.org/pdf/2405.19730v4.pdf","comment":"V1 and V2 are in Chinese language, other versions are in English"},{"id":"http://arxiv.org/abs/2402.10260v2","updated":"2024-08-27T03:32:47Z","published":"2024-02-15T18:58:09Z","title":"A StrongREJECT for Empty Jailbreaks","summary":" Most jailbreak papers claim the jailbreaks they propose are highly effective,\noften boasting near-100% attack success rates. However, it is perhaps more\ncommon than not for jailbreak developers to substantially exaggerate the\neffectiveness of their jailbreaks. We suggest this problem arises because\njailbreak researchers lack a standard, high-quality benchmark for evaluating\njailbreak performance, leaving researchers to create their own. To create a\nbenchmark, researchers must choose a dataset of forbidden prompts to which a\nvictim model will respond, along with an evaluation method that scores the\nharmfulness of the victim model's responses. We show that existing benchmarks\nsuffer from significant shortcomings and introduce the StrongREJECT benchmark\nto address these issues. StrongREJECT's dataset contains prompts that victim\nmodels must answer with specific, harmful information, while its automated\nevaluator measures the extent to which a response gives useful information to\nforbidden prompts. In doing so, the StrongREJECT evaluator achieves\nstate-of-the-art agreement with human judgments of jailbreak effectiveness.\nNotably, we find that existing evaluation methods significantly overstate\njailbreak effectiveness compared to human judgments and the StrongREJECT\nevaluator. We describe a surprising and novel phenomenon that explains this\ndiscrepancy: jailbreaks bypassing a victim model's safety fine-tuning tend to\nreduce its capabilities. Together, our findings underscore the need for\nresearchers to use a high-quality benchmark, such as StrongREJECT, when\ndeveloping new jailbreak attacks. We release the StrongREJECT code and data at\nhttps://strong-reject.readthedocs.io/en/latest/.\n","authors":["Alexandra Souly","Qingyuan Lu","Dillon Bowen","Tu Trinh","Elvis Hsieh","Sana Pandey","Pieter Abbeel","Justin Svegliato","Scott Emmons","Olivia Watkins","Sam Toyer"],"pdf_url":"https://arxiv.org/pdf/2402.10260v2.pdf","comment":"Code and data at https://strong-reject.readthedocs.io/en/latest/"},{"id":"http://arxiv.org/abs/2408.14763v1","updated":"2024-08-27T03:30:18Z","published":"2024-08-27T03:30:18Z","title":"Channel-wise Influence: Estimating Data Influence for Multivariate Time\n Series","summary":" The influence function, a technique from robust statistics, measures the\nimpact on model parameters or related functions when training data is removed\nor modified. 
This effective and valuable post-hoc method allows for studying\nthe interpretability of machine learning models without requiring costly model\nretraining. It would provide extensions like increasing model performance,\nimproving model generalization, and offering interpretability. Recently,\nMultivariate Time Series (MTS) analysis has become an important yet challenging\ntask, attracting significant attention. However, there is no preceding research\non the influence functions of MTS to shed light on the effects of modifying the\nchannel of training MTS. Given that each channel in an MTS plays a crucial role\nin its analysis, it is essential to characterize the influence of different\nchannels. To fill this gap, we propose a channel-wise influence function, which\nis the first method that can estimate the influence of different channels in\nMTS, utilizing a first-order gradient approximation that leverages the more\ninformative average gradient of the data set. Additionally, we demonstrate how\nthis influence function can be used to estimate the impact of a channel in MTS.\nFinally, we validated the accuracy and effectiveness of our influence\nestimation function in critical MTS analysis tasks, such as MTS anomaly\ndetection and MTS forecasting. According to abundant experiments on real-world\ndataset, the original influence function performs worse than our method and\neven fail for the channel pruning problem, which demonstrate the superiority\nand necessity of channel-wise influence function in MTS analysis tasks.\n","authors":["Muyao Wang","Zeke Xie","Bo Chen"],"pdf_url":"https://arxiv.org/pdf/2408.14763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14762v1","updated":"2024-08-27T03:30:01Z","published":"2024-08-27T03:30:01Z","title":"Explainable Hierarchical Urban Representation Learning for Commuting\n Flow Prediction","summary":" Commuting flow prediction is an essential task for municipal operations in\nthe real world. Previous studies have revealed that it is feasible to estimate\nthe commuting origin-destination (OD) demand within a city using multiple\nauxiliary data. However, most existing methods are not suitable to deal with a\nsimilar task at a large scale, namely within a prefecture or the whole nation,\nowing to the increased number of geographical units that need to be maintained.\nIn addition, region representation learning is a universal approach for gaining\nurban knowledge for diverse metropolitan downstream tasks. Although many\nresearchers have developed comprehensive frameworks to describe urban units\nfrom multi-source data, they have not clarified the relationship between the\nselected geographical elements. Furthermore, metropolitan areas naturally\npreserve ranked structures, like cities and their inclusive districts, which\nmakes elucidating relations between cross-level urban units necessary.\nTherefore, we develop a heterogeneous graph-based model to generate meaningful\nregion embeddings at multiple spatial resolutions for predicting different\ntypes of inter-level OD flows. To demonstrate the effectiveness of the proposed\nmethod, extensive experiments were conducted using real-world aggregated mobile\nphone datasets collected from Shizuoka Prefecture, Japan. The results indicate\nthat our proposed model outperforms existing models in terms of a uniform urban\nstructure. 
We extend the understanding of predicted results using reasonable\nexplanations to enhance the credibility of the model.\n","authors":["Mingfei Cai","Yanbo Pang","Yoshihide Sekimoto"],"pdf_url":"https://arxiv.org/pdf/2408.14762v1.pdf","comment":"11 pages, 6 figures"},{"id":"http://arxiv.org/abs/2408.13448v2","updated":"2024-08-27T03:28:50Z","published":"2024-08-24T03:12:21Z","title":"ALIAS: DAG Learning with Efficient Unconstrained Policies","summary":" Recently, reinforcement learning (RL) has proved a promising alternative for\nconventional local heuristics in score-based approaches to learning directed\nacyclic causal graphs (DAGs) from observational data. However, the intricate\nacyclicity constraint still challenges the efficient exploration of the vast\nspace of DAGs in existing methods. In this study, we introduce ALIAS\n(reinforced dAg Learning wIthout Acyclicity conStraints), a novel approach to\ncausal discovery powered by the RL machinery. Our method features an efficient\npolicy for generating DAGs in just a single step with an optimal quadratic\ncomplexity, fueled by a novel parametrization of DAGs that directly translates\na continuous space to the space of all DAGs, bypassing the need for explicitly\nenforcing acyclicity constraints. This approach enables us to navigate the\nsearch space more effectively by utilizing policy gradient methods and\nestablished scoring functions. In addition, we provide compelling empirical\nevidence for the strong performance of ALIAS in comparison with\nstate-of-the-arts in causal discovery over increasingly difficult experiment\nconditions on both synthetic and real datasets.\n","authors":["Bao Duong","Hung Le","Thin Nguyen"],"pdf_url":"https://arxiv.org/pdf/2408.13448v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.03865v3","updated":"2024-08-27T03:25:58Z","published":"2023-11-07T10:28:17Z","title":"When Fairness Meets Privacy: Exploring Privacy Threats in Fair Binary\n Classifiers via Membership Inference Attacks","summary":" Previous studies have developed fairness methods for biased models that\nexhibit discriminatory behaviors towards specific subgroups. While these models\nhave shown promise in achieving fair predictions, recent research has\nidentified their potential vulnerability to score-based membership inference\nattacks (MIAs). In these attacks, adversaries can infer whether a particular\ndata sample was used during training by analyzing the model's prediction\nscores. However, our investigations reveal that these score-based MIAs are\nineffective when targeting fairness-enhanced models in binary classifications.\nThe attack models trained to launch the MIAs degrade into simplistic threshold\nmodels, resulting in lower attack performance. Meanwhile, we observe that\nfairness methods often lead to prediction performance degradation for the\nmajority subgroups of the training data. This raises the barrier to successful\nattacks and widens the prediction gaps between member and non-member data.\nBuilding upon these insights, we propose an efficient MIA method against\nfairness-enhanced models based on fairness discrepancy results (FD-MIA). It\nleverages the difference in the predictions from both the original and\nfairness-enhanced models and exploits the observed prediction gaps as attack\nclues. 
We also explore potential strategies for mitigating privacy leakages.\nExtensive experiments validate our findings and demonstrate the efficacy of the\nproposed method.\n","authors":["Huan Tian","Guangsheng Zhang","Bo Liu","Tianqing Zhu","Ming Ding","Wanlei Zhou"],"pdf_url":"https://arxiv.org/pdf/2311.03865v3.pdf","comment":"Accepted by IJCAI 2024"},{"id":"http://arxiv.org/abs/2408.14757v1","updated":"2024-08-27T03:17:52Z","published":"2024-08-27T03:17:52Z","title":"Learning effective pruning at initialization from iterative pruning","summary":" Pruning at initialization (PaI) reduces training costs by removing weights\nbefore training, which becomes increasingly crucial with the growing network\nsize. However, current PaI methods still have a large accuracy gap with\niterative pruning, especially at high sparsity levels. This raises an\nintriguing question: can we get inspiration from iterative pruning to improve\nthe PaI performance? In the lottery ticket hypothesis, the iterative rewind\npruning (IRP) finds subnetworks retroactively by rewinding the parameter to the\noriginal initialization in every pruning iteration, which means all the\nsubnetworks are based on the initial state. Here, we hypothesise the surviving\nsubnetworks are more important and bridge the initial feature and their\nsurviving score as the PaI criterion. We employ an end-to-end neural network\n(\\textbf{AutoS}parse) to learn this correlation, input the model's initial\nfeatures, output their score and then prune the lowest score parameters before\ntraining. To validate the accuracy and generalization of our method, we\nperformed PaI across various models. Results show that our approach outperforms\nexisting methods in high-sparsity settings. Notably, as the underlying logic of\nmodel pruning is consistent in different models, only one-time IRP on one model\nis needed (e.g., once IRP on ResNet-18/CIFAR-10, AutoS can be generalized to\nVGG-16/CIFAR-10, ResNet-18/TinyImageNet, et al.). As the first neural\nnetwork-based PaI method, we conduct extensive experiments to validate the\nfactors influencing this approach. These results reveal the learning tendencies\nof neural networks and provide new insights into our understanding and research\nof PaI from a practical perspective. Our code is available at:\nhttps://github.com/ChengYaofeng/AutoSparse.git.\n","authors":["Shengkai Liu","Yaofeng Cheng","Fusheng Zha","Wei Guo","Lining Sun","Zhenshan Bing","Chenguang Yang"],"pdf_url":"https://arxiv.org/pdf/2408.14757v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14756v1","updated":"2024-08-27T03:12:08Z","published":"2024-08-27T03:12:08Z","title":"Training-Free Time-Series Anomaly Detection: Leveraging Image Foundation\n Models","summary":" Recent advancements in time-series anomaly detection have relied on deep\nlearning models to handle the diverse behaviors of time-series data. However,\nthese models often suffer from unstable training and require extensive\nhyperparameter tuning, leading to practical limitations. Although foundation\nmodels present a potential solution, their use in time series is limited. To\novercome these issues, we propose an innovative image-based, training-free\ntime-series anomaly detection (ITF-TAD) approach. ITF-TAD converts time-series\ndata into images using wavelet transform and compresses them into a single\nrepresentation, leveraging image foundation models for anomaly detection. 
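In the spirit of the pruning-at-initialization entry above, the sketch below maps per-weight features at initialization (magnitude and initial gradient) through a small scoring network and masks the lowest-scoring weights before training. The two features and the tiny MLP are illustrative assumptions; in the paper the scorer is trained to imitate survival under iterative rewind pruning, whereas here it is left untrained as a placeholder.

```python
# Hedged sketch of a learned pruning-at-initialization criterion.
import torch
import torch.nn as nn

def initial_weight_features(model, loss_fn, x, y):
    """Per-weight features at initialization: |w| and |dL/dw| on one batch."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    feats = [torch.stack([p.abs().flatten(), g.abs().flatten()], dim=1)
             for p, g in zip(model.parameters(), grads)]
    return torch.cat(feats)                    # shape: (num_weights, 2)

# Placeholder scorer; the paper learns this mapping from iterative pruning runs.
scorer = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

def prune_mask(features, sparsity=0.9):
    """Keep the (1 - sparsity) fraction of weights with the highest scores."""
    scores = scorer(features).squeeze(1)
    k = max(1, int((1.0 - sparsity) * scores.numel()))
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[torch.topk(scores, k).indices] = True
    return keep                                # boolean mask over all weights
```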
This\napproach achieves high-performance anomaly detection without unstable neural\nnetwork training or hyperparameter tuning. Furthermore, ITF-TAD identifies\nanomalies across different frequencies, providing users with a detailed\nvisualization of anomalies and their corresponding frequencies. Comprehensive\nexperiments on five benchmark datasets, including univariate and multivariate\ntime series, demonstrate that ITF-TAD offers a practical and effective solution\nwith performance exceeding or comparable to that of deep models.\n","authors":["Nobuo Namura","Yuma Ichikawa"],"pdf_url":"https://arxiv.org/pdf/2408.14756v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14747v1","updated":"2024-08-27T02:52:15Z","published":"2024-08-27T02:52:15Z","title":"Benchmarking Reinforcement Learning Methods for Dexterous Robotic\n Manipulation with a Three-Fingered Gripper","summary":" Reinforcement Learning (RL) training is predominantly conducted in\ncost-effective and controlled simulation environments. However, the transfer of\nthese trained models to real-world tasks often presents unavoidable challenges.\nThis research explores the direct training of RL algorithms in controlled yet\nrealistic real-world settings for the execution of dexterous manipulation. The\nbenchmarking results of three RL algorithms trained on intricate in-hand\nmanipulation tasks within practical real-world contexts are presented. Our\nstudy not only demonstrates the practicality of RL training in authentic\nreal-world scenarios, facilitating direct real-world applications, but also\nprovides insights into the associated challenges and considerations.\nAdditionally, our experiences with the employed experimental methods are\nshared, with the aim of empowering and engaging fellow researchers and\npractitioners in this dynamic field of robotics.\n","authors":["Elizabeth Cutler","Yuning Xing","Tony Cui","Brendan Zhou","Koen van Rijnsoever","Ben Hart","David Valencia","Lee Violet C. Ong","Trevor Gee","Minas Liarokapis","Henry Williams"],"pdf_url":"https://arxiv.org/pdf/2408.14747v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13713v2","updated":"2024-08-27T02:39:56Z","published":"2024-08-25T03:26:00Z","title":"Verifiable cloud-based variational quantum algorithms","summary":" Variational quantum algorithms (VQAs) have shown potential for quantum\nadvantage with noisy intermediate-scale quantum (NISQ) devices for quantum\nmachine learning (QML). However, given the high cost and limited availability\nof quantum resources, delegating VQAs via cloud networks is a more practical\nsolution for clients with limited quantum capabilities. Recently, Shingu et\nal.[Physical Review A, 105, 022603 (2022)] proposed a variational secure cloud\nquantum computing protocol, utilizing ancilla-driven quantum computation (ADQC)\nfor cloud-based VQAs with minimal quantum resource consumption. However, their\nprotocol lacks verifiability, which exposes it to potential malicious behaviors\nby the server. Additionally, channel loss requires frequent re-delegation as\nthe size of the delegated variational circuit grows, complicating verification\ndue to increased circuit complexity. 
This paper introduces a new protocol to\naddress these challenges and enhance both verifiability and tolerance to\nchannel loss in cloud-based VQAs.\n","authors":["Junhong Yang","Banghai Wang","Junyu Quan","Qin Li"],"pdf_url":"https://arxiv.org/pdf/2408.13713v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14156v2","updated":"2024-08-27T02:31:50Z","published":"2024-06-20T09:53:56Z","title":"Tractable Equilibrium Computation in Markov Games through Risk Aversion","summary":" A significant roadblock to the development of principled multi-agent\nreinforcement learning is the fact that desired solution concepts like Nash\nequilibria may be intractable to compute. To overcome this obstacle, we take\ninspiration from behavioral economics and show that -- by imbuing agents with\nimportant features of human decision-making like risk aversion and bounded\nrationality -- a class of risk-averse quantal response equilibria (RQE) become\ntractable to compute in all $n$-player matrix and finite-horizon Markov games.\nIn particular, we show that they emerge as the endpoint of no-regret learning\nin suitably adjusted versions of the games. Crucially, the class of\ncomputationally tractable RQE is independent of the underlying game structure\nand only depends on agents' degree of risk-aversion and bounded rationality. To\nvalidate the richness of this class of solution concepts we show that it\ncaptures peoples' patterns of play in a number of 2-player matrix games\npreviously studied in experimental economics. Furthermore, we give a first\nanalysis of the sample complexity of computing these equilibria in\nfinite-horizon Markov games when one has access to a generative model and\nvalidate our findings on a simple multi-agent reinforcement learning benchmark.\n","authors":["Eric Mazumdar","Kishan Panaganti","Laixi Shi"],"pdf_url":"https://arxiv.org/pdf/2406.14156v2.pdf","comment":"preprint of multi-agent RL with risk-averse equilibria"},{"id":"http://arxiv.org/abs/2408.14738v1","updated":"2024-08-27T02:29:29Z","published":"2024-08-27T02:29:29Z","title":"Learning Differentially Private Diffusion Models via Stochastic\n Adversarial Distillation","summary":" While the success of deep learning relies on large amounts of training\ndatasets, data is often limited in privacy-sensitive domains. To address this\nchallenge, generative model learning with differential privacy has emerged as a\nsolution to train private generative models for desensitized data generation.\nHowever, the quality of the images generated by existing methods is limited due\nto the complexity of modeling data distribution. We build on the success of\ndiffusion models and introduce DP-SAD, which trains a private diffusion model\nby a stochastic adversarial distillation method. Specifically, we first train a\ndiffusion model as a teacher and then train a student by distillation, in which\nwe achieve differential privacy by adding noise to the gradients from other\nmodels to the student. For better generation quality, we introduce a\ndiscriminator to distinguish whether an image is from the teacher or the\nstudent, which forms the adversarial training. 
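The differentially private gradient step mentioned in the DP-SAD entry above can be sketched as clipping the gradients passed to the student and perturbing them with Gaussian noise. The clip norm and noise multiplier below are illustrative values; the paper's privacy accounting is not reproduced here.

```python
# Hedged sketch of clip-and-noise gradient privatization for distillation.
import torch

def clip_and_noise(grads, clip_norm=1.0, noise_multiplier=1.1):
    """Clip the overall gradient norm, then add calibrated Gaussian noise."""
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
    return [g * scale + torch.randn_like(g) * noise_multiplier * clip_norm
            for g in grads]

# Possible usage inside a training loop (assuming `student` is an nn.Module):
#   grads = torch.autograd.grad(loss, list(student.parameters()))
#   for p, g in zip(student.parameters(), clip_and_noise(grads)):
#       p.grad = g
#   optimizer.step()
```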
Extensive experiments and\nanalysis clearly demonstrate the effectiveness of our proposed method.\n","authors":["Bochao Liu","Pengju Wang","Shiming Ge"],"pdf_url":"https://arxiv.org/pdf/2408.14738v1.pdf","comment":"accepted by ECCV 2024"},{"id":"http://arxiv.org/abs/2408.14736v1","updated":"2024-08-27T02:28:27Z","published":"2024-08-27T02:28:27Z","title":"Bandwidth-Aware and Overlap-Weighted Compression for\n Communication-Efficient Federated Learning","summary":" Current data compression methods, such as sparsification in Federated\nAveraging (FedAvg), effectively enhance the communication efficiency of\nFederated Learning (FL). However, these methods encounter challenges such as\nthe straggler problem and diminished model performance due to heterogeneous\nbandwidth and non-IID (Independently and Identically Distributed) data. To\naddress these issues, we introduce a bandwidth-aware compression framework for\nFL, aimed at improving communication efficiency while mitigating the problems\nassociated with non-IID data. First, our strategy dynamically adjusts\ncompression ratios according to bandwidth, enabling clients to upload their\nmodels at a close pace, thus exploiting the otherwise wasted time to transmit\nmore data. Second, we identify the non-overlapped pattern of retained\nparameters after compression, which results in diminished client update signals\ndue to uniformly averaged weights. Based on this finding, we propose a\nparameter mask to adjust the client-averaging coefficients at the parameter\nlevel, thereby more closely approximating the original updates, and improving\nthe training convergence under heterogeneous environments. Our evaluations\nreveal that our method significantly boosts model accuracy, with a maximum\nimprovement of 13% over the uncompressed FedAvg. Moreover, it achieves a\n$3.37\\times$ speedup in reaching the target accuracy compared to FedAvg with a\nTop-K compressor, demonstrating its effectiveness in accelerating convergence\nwith compression. The integration of common compression techniques into our\nframework further establishes its potential as a versatile foundation for\nfuture cross-device, communication-efficient FL research, addressing critical\nchallenges in FL and advancing the field of distributed machine learning.\n","authors":["Zichen Tang","Junlin Huang","Rudan Yan","Yuxin Wang","Zhenheng Tang","Shaohuai Shi","Amelie Chi Zhou","Xiaowen Chu"],"pdf_url":"https://arxiv.org/pdf/2408.14736v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.05579v2","updated":"2024-08-27T02:23:23Z","published":"2023-12-09T13:53:35Z","title":"Conditional Stochastic Interpolation for Generative Learning","summary":" We propose a conditional stochastic interpolation (CSI) method for learning\nconditional distributions. CSI is based on estimating probability flow\nequations or stochastic differential equations that transport a reference\ndistribution to the target conditional distribution. This is achieved by first\nlearning the conditional drift and score functions based on CSI, which are then\nused to construct a deterministic process governed by an ordinary differential\nequation or a diffusion process for conditional sampling. In our proposed\napproach, we incorporate an adaptive diffusion term to address the instability\nissues arising in the diffusion process. 
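Two ideas from the bandwidth-aware federated learning entry above lend themselves to a short sketch: each client's top-k sparsification ratio scales with its bandwidth, and the server averages each parameter by how many clients actually retained it rather than by the full client count. Both functions below are simplified stand-ins for the paper's method.

```python
# Hedged sketch of bandwidth-aware sparsification and overlap-weighted averaging.
import numpy as np

def top_k_sparsify(update, ratio):
    """Keep the `ratio` fraction of entries with the largest magnitude."""
    k = max(1, int(ratio * update.size))
    mask = np.zeros_like(update, dtype=bool)
    mask[np.argpartition(np.abs(update), -k)[-k:]] = True
    return update * mask, mask

def aggregate(updates_and_masks):
    """Overlap-weighted averaging: divide by per-parameter retain counts."""
    summed = sum(u for u, _ in updates_and_masks)
    counts = sum(m.astype(float) for _, m in updates_and_masks)
    return summed / np.maximum(counts, 1.0)

bandwidths = np.array([1.0, 4.0, 8.0])                 # relative client bandwidths
ratios = 0.05 * bandwidths / bandwidths.max()          # faster links upload more
rng = np.random.default_rng(0)
client_updates = [rng.normal(size=1000) for _ in bandwidths]
packed = [top_k_sparsify(u, r) for u, r in zip(client_updates, ratios)]
global_update = aggregate(packed)
```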
We derive explicit expressions of the\nconditional drift and score functions in terms of conditional expectations,\nwhich naturally lead to an nonparametric regression approach to estimating\nthese functions. Furthermore, we establish nonasymptotic error bounds for\nlearning the target conditional distribution. We illustrate the application of\nCSI on image generation using a benchmark image dataset.\n","authors":["Ding Huang","Jian Huang","Ting Li","Guohao Shen"],"pdf_url":"https://arxiv.org/pdf/2312.05579v2.pdf","comment":"57 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.13452v2","updated":"2024-08-27T02:19:31Z","published":"2024-08-24T03:43:35Z","title":"Data Augmentation for Continual RL via Adversarial Gradient Episodic\n Memory","summary":" Data efficiency of learning, which plays a key role in the Reinforcement\nLearning (RL) training process, becomes even more important in continual RL\nwith sequential environments. In continual RL, the learner interacts with\nnon-stationary, sequential tasks and is required to learn new tasks without\nforgetting previous knowledge. However, there is little work on implementing\ndata augmentation for continual RL. In this paper, we investigate the efficacy\nof data augmentation for continual RL. Specifically, we provide benchmarking\ndata augmentations for continual RL, by (1) summarising existing data\naugmentation methods and (2) including a new augmentation method for continual\nRL: Adversarial Augmentation with Gradient Episodic Memory (Adv-GEM). Extensive\nexperiments show that data augmentations, such as random amplitude scaling,\nstate-switch, mixup, adversarial augmentation, and Adv-GEM, can improve\nexisting continual RL algorithms in terms of their average performance,\ncatastrophic forgetting, and forward transfer, on robot control tasks. All data\naugmentation methods are implemented as plug-in modules for trivial integration\ninto continual RL methods.\n","authors":["Sihao Wu","Xingyu Zhao","Xiaowei Huang"],"pdf_url":"https://arxiv.org/pdf/2408.13452v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14734v1","updated":"2024-08-27T02:03:22Z","published":"2024-08-27T02:03:22Z","title":"General-Kindred Physics-Informed Neural Network to the Solutions of\n Singularly Perturbed Differential Equations","summary":" Physics-Informed Neural Networks (PINNs) have become a promising research\ndirection in the field of solving Partial Differential Equations (PDEs).\nDealing with singular perturbation problems continues to be a difficult\nchallenge in the field of PINN. The solution of singular perturbation problems\noften exhibits sharp boundary layers and steep gradients, and traditional PINN\ncannot achieve approximation of boundary layers. In this manuscript, we propose\nthe General-Kindred Physics-Informed Neural Network (GKPINN) for solving\nSingular Perturbation Differential Equations (SPDEs). This approach utilizes\nasymptotic analysis to acquire prior knowledge of the boundary layer from the\nequation and establishes a novel network to assist PINN in approximating the\nboundary layer. It is compared with traditional PINN by solving examples of\none-dimensional, two-dimensional, and time-varying SPDE equations. The research\nfindings underscore the exceptional performance of our novel approach, GKPINN,\nwhich delivers a remarkable enhancement in reducing the $L_2$ error by two to\nfour orders of magnitude compared to the established PINN methodology. 
This\nsignificant improvement is accompanied by a substantial acceleration in\nconvergence rates, without compromising the high precision that is critical for\nour applications. Furthermore, GKPINN still performs well in extreme cases with\nperturbation parameters of ${1\\times10}^{-38}$, demonstrating its excellent\ngeneralization ability.\n","authors":["Sen Wang","Peizhi Zhao","Qinglong Ma","Tao Song"],"pdf_url":"https://arxiv.org/pdf/2408.14734v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14728v1","updated":"2024-08-27T01:41:21Z","published":"2024-08-27T01:41:21Z","title":"TART: Boosting Clean Accuracy Through Tangent Direction Guided\n Adversarial Training","summary":" Adversarial training has been shown to be successful in enhancing the\nrobustness of deep neural networks against adversarial attacks. However, this\nrobustness is accompanied by a significant decline in accuracy on clean data.\nIn this paper, we propose a novel method, called Tangent Direction Guided\nAdversarial Training (TART), that leverages the tangent space of the data\nmanifold to ameliorate the existing adversarial defense algorithms. We argue\nthat training with adversarial examples having large normal components\nsignificantly alters the decision boundary and hurts accuracy. TART mitigates\nthis issue by estimating the tangent direction of adversarial examples and\nallocating an adaptive perturbation limit according to the norm of their\ntangential component. To the best of our knowledge, our paper is the first work\nto consider the concept of tangent space and direction in the context of\nadversarial defense. We validate the effectiveness of TART through extensive\nexperiments on both simulated and benchmark datasets. The results demonstrate\nthat TART consistently boosts clean accuracy while retaining a high level of\nrobustness against adversarial attacks. Our findings suggest that incorporating\nthe geometric properties of data can lead to more effective and efficient\nadversarial training methods.\n","authors":["Bongsoo Yi","Rongjie Lai","Yao Li"],"pdf_url":"https://arxiv.org/pdf/2408.14728v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.06635v6","updated":"2024-08-27T01:27:29Z","published":"2023-12-11T18:51:59Z","title":"Gated Linear Attention Transformers with Hardware-Efficient Training","summary":" Transformers with linear attention allow for efficient parallel training but\ncan simultaneously be formulated as an RNN with 2D (matrix-valued) hidden\nstates, thus enjoying linear-time inference complexity. However, linear\nattention generally underperforms ordinary softmax attention. Moreover, current\nimplementations of linear attention lack I/O-awareness and are thus slower than\nhighly optimized implementations of softmax attention. This work describes a\nhardware-efficient algorithm for linear attention that trades off memory\nmovement against parallelizability. The resulting implementation, dubbed\nFLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a\nstandalone layer even on short sequence lengths (e.g., 1K). We then generalize\nthis algorithm to a more expressive variant of linear attention with\ndata-dependent gates. 
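The recurrent view of (gated) linear attention described in the entry above can be written in a few lines: the layer keeps a 2D, matrix-valued state that is updated once per token, so inference cost is linear in sequence length. The elementwise sigmoid gate below is a simplified stand-in for the data-dependent gating in the paper, and the loop is a reference implementation rather than the hardware-efficient chunked algorithm.

```python
# Hedged sketch of the recurrent form of gated linear attention.
import torch

def gated_linear_attention(q, k, v, g):
    """q, k, g: (T, d_k); v: (T, d_v). Returns outputs of shape (T, d_v)."""
    d_k, d_v = q.shape[1], v.shape[1]
    S = torch.zeros(d_k, d_v)                       # matrix-valued hidden state
    outputs = []
    for t in range(q.shape[0]):
        gate = torch.sigmoid(g[t]).unsqueeze(1)     # (d_k, 1): decays the old state
        S = gate * S + k[t].unsqueeze(1) * v[t].unsqueeze(0)   # rank-1 update
        outputs.append(q[t] @ S)                    # read out with the query
    return torch.stack(outputs)

T, d_k, d_v = 8, 4, 6
out = gated_linear_attention(torch.randn(T, d_k), torch.randn(T, d_k),
                             torch.randn(T, d_v), torch.randn(T, d_k))
```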
When used as a replacement for the standard attention\nlayer in Transformers, the resulting gated linear attention (GLA) Transformer\nis found to perform competitively against the LLaMA-architecture Transformer\n(Touvron et al., 2023) as well recent linear-time-inference baselines such as\nRetNet (Sun et al., 2023a) and Mamba (Gu & Dao, 2023) on moderate-scale\nlanguage modeling experiments. GLA Transformer is especially effective at\nlength generalization, enabling a model trained on 2K to generalize to\nsequences longer than 20K without significant perplexity degradations. For\ntraining speed, the GLA Transformer has higher throughput than a\nsimilarly-sized Mamba model.\n","authors":["Songlin Yang","Bailin Wang","Yikang Shen","Rameswar Panda","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2312.06635v6.pdf","comment":"minor update"},{"id":"http://arxiv.org/abs/2404.13621v5","updated":"2024-08-27T01:23:50Z","published":"2024-04-21T11:21:27Z","title":"Attack on Scene Flow using Point Clouds","summary":" Deep neural networks have made significant advancements in accurately\nestimating scene flow using point clouds, which is vital for many applications\nlike video analysis, action recognition, and navigation. The robustness of\nthese techniques, however, remains a concern, particularly in the face of\nadversarial attacks that have been proven to deceive state-of-the-art deep\nneural networks in many domains. Surprisingly, the robustness of scene flow\nnetworks against such attacks has not been thoroughly investigated. To address\nthis problem, the proposed approach aims to bridge this gap by introducing\nadversarial white-box attacks specifically tailored for scene flow networks.\nExperimental results show that the generated adversarial examples obtain up to\n33.7 relative degradation in average end-point error on the KITTI and\nFlyingThings3D datasets. The study also reveals the significant impact that\nattacks targeting point clouds in only one dimension or color channel have on\naverage end-point error. Analyzing the success and failure of these attacks on\nthe scene flow networks and their 2D optical flow network variants shows a\nhigher vulnerability for the optical flow networks. Code is available at\nhttps://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git.\n","authors":["Haniyeh Ehsani Oskouie","Mohammad-Shahram Moin","Shohreh Kasaei"],"pdf_url":"https://arxiv.org/pdf/2404.13621v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.16338v2","updated":"2024-08-27T01:17:34Z","published":"2023-09-28T10:51:12Z","title":"Anti-Matthew FL: Bridging the Performance Gap in Federated Learning to\n Counteract the Matthew Effect","summary":" Federated learning (FL) stands as a paradigmatic approach that facilitates\nmodel training across heterogeneous and diverse datasets originating from\nvarious data providers. However, conventional FLs fall short of achieving\nconsistent performance, potentially leading to performance degradation for\nclients who are disadvantaged in data resources. Influenced by the Matthew\neffect, deploying a performance-imbalanced global model in applications further\nimpedes the generation of high-quality data from disadvantaged clients,\nexacerbating the disparities in data resources among clients. In this work, we\npropose anti-Matthew fairness for the global model at the client level,\nrequiring equal accuracy and equal decision bias across clients. 
To balance the\ntrade-off between achieving anti-Matthew fairness and performance optimality,\nwe formalize the anti-Matthew effect federated learning (anti-Matthew FL) as a\nmulti-constrained multi-objectives optimization (MCMOO) problem and propose a\nthree-stage multi-gradient descent algorithm to obtain the Pareto optimality.\nWe theoretically analyze the convergence and time complexity of our proposed\nalgorithms. Additionally, through extensive experimentation, we demonstrate\nthat our proposed anti-Matthew FL outperforms other state-of-the-art FL\nalgorithms in achieving a high-performance global model while effectively\nbridging performance gaps among clients. We hope this work provides valuable\ninsights into the manifestation of the Matthew effect in FL and other\ndecentralized learning scenarios and can contribute to designing fairer\nlearning mechanisms, ultimately fostering societal welfare.\n","authors":["Jiashi Gao","Xin Yao","Xuetao Wei"],"pdf_url":"https://arxiv.org/pdf/2309.16338v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14721v1","updated":"2024-08-27T01:04:14Z","published":"2024-08-27T01:04:14Z","title":"PAT: Pruning-Aware Tuning for Large Language Models","summary":" Large language models (LLMs) excel in language tasks, especially with\nsupervised fine-tuning after pre-training. However, their substantial memory\nand computational requirements hinder practical applications. Structural\npruning, which reduces less significant weight dimensions, is one solution.\nYet, traditional post-hoc pruning often leads to significant performance loss,\nwith limited recovery from further fine-tuning due to reduced capacity. Since\nthe model fine-tuning refines the general and chaotic knowledge in pre-trained\nmodels, we aim to incorporate structural pruning with the fine-tuning, and\npropose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy\nwhile preserving the model performance to the maximum extend. Specifically, we\ninsert the innovative Hybrid Sparsification Modules (HSMs) between the\nAttention and FFN components to accordingly sparsify the upstream and\ndownstream linear modules. The HSM comprises a lightweight operator and a\nglobally shared trainable mask. The lightweight operator maintains a training\noverhead comparable to that of LoRA, while the trainable mask unifies the\nchannels to be sparsified, ensuring structural pruning. Additionally, we\npropose the Identity Loss which decouples the transformation and scaling\nproperties of the HSMs to enhance training robustness. Extensive experiments\ndemonstrate that PAT excels in both performance and efficiency. For example,\nour Llama2-7b model with a 25\\% pruning ratio achieves 1.33$\\times$ speedup\nwhile outperforming the LoRA-finetuned model by up to 1.26\\% in accuracy with a\nsimilar training cost. Code:\nhttps://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning\n","authors":["Yijiang Liu","Huanrui Yang","Youxin Chen","Rongyu Zhang","Miao Wang","Yuan Du","Li Du"],"pdf_url":"https://arxiv.org/pdf/2408.14721v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.12023v4","updated":"2024-08-27T00:48:35Z","published":"2023-11-20T18:57:41Z","title":"LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient\n Language Model Finetuning","summary":" We propose a simple approach for memory-efficient adaptation of pretrained\nlanguage models. 
Our approach uses an iterative algorithm to decompose each\npretrained matrix into a high-precision low-rank component and a\nmemory-efficient quantized component. During finetuning, the quantized\ncomponent remains fixed and only the low-rank component is updated. We present\nan integer linear programming formulation of the quantization component which\nenables dynamic configuration of quantization parameters (e.g., bit-width,\nblock size) for each matrix given an overall target memory budget. We further\nexplore a data-aware version of the algorithm which uses an approximation of\nthe Fisher information matrix to weight the reconstruction objective during\nmatrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and\n70B) demonstrate that our low-rank plus quantized matrix decomposition approach\n(LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables\naggressive quantization to sub-3 bits with only minor performance degradations.\nWhen finetuned on a language modeling calibration dataset, LQ-LoRA can also be\nused for model compression; in this setting our 2.75-bit LLaMA-2-70B model\n(which has 2.85 bits on average when including the low-rank components and\nrequires 27GB of GPU memory) performs respectably compared to the 16-bit\nbaseline.\n","authors":["Han Guo","Philip Greengard","Eric P. Xing","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2311.12023v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.10960v2","updated":"2024-08-27T00:27:12Z","published":"2024-07-15T17:55:42Z","title":"Fast Matrix Multiplications for Lookup Table-Quantized LLMs","summary":" The deployment of large language models (LLMs) is often constrained by memory\nbandwidth, where the primary bottleneck is the cost of transferring model\nparameters from the GPU's global memory to its registers. When coupled with\ncustom kernels that fuse the dequantization and matmul operations, weight-only\nquantization can thus enable faster inference by reducing the amount of memory\nmovement. However, developing high-performance kernels for weight-quantized\nLLMs presents substantial challenges, especially when the weights are\ncompressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform,\nlookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup\ntable engine for LUT-quantized LLMs, which uses offline restructuring of the\nquantized weight matrix to minimize bit manipulations associated with\nunpacking, and vectorization and duplication of the lookup table to mitigate\nshared memory bandwidth constraints. At batch sizes < 32 and quantization group\nsize of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster\nthan existing GEMM kernels. As an application of FLUTE, we explore a simple\nextension to lookup table-based NormalFloat quantization and apply it to\nquantize LLaMA3 to various configurations, obtaining competitive quantization\nperformance against strong baselines while obtaining an end-to-end throughput\nincrease of 1.5 to 2 times.\n","authors":["Han Guo","William Brandon","Radostin Cholakov","Jonathan Ragan-Kelley","Eric P. Xing","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2407.10960v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15449v1","updated":"2024-08-27T23:58:51Z","published":"2024-08-27T23:58:51Z","title":"Graph Attention Inference of Network Topology in Multi-Agent Systems","summary":" Accurately identifying the underlying graph structures of multi-agent systems\nremains a difficult challenge. 
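The arithmetic behind lookup-table weight quantization, as described in the FLUTE entry above, reduces to a table lookup per integer code, a per-group rescale, and an ordinary matmul. The reference version below only shows that math with assumed shapes and a NormalFloat-like table; the actual kernel fuses these steps on the GPU.

```python
# Hedged sketch of lookup-table (LUT) dequantization followed by a matmul.
import numpy as np

def dequantize(codes, lut, scales, group_size=128):
    """codes: int array (out, in); lut: (2**bits,); scales: (out, in // group_size)."""
    w = lut[codes]                                     # table lookup
    out_dim, in_dim = w.shape
    w = w.reshape(out_dim, in_dim // group_size, group_size)
    w = w * scales[..., None]                          # per-group rescale
    return w.reshape(out_dim, in_dim)

rng = np.random.default_rng(0)
bits, out_dim, in_dim, group = 3, 16, 256, 128
lut = np.sort(rng.normal(size=2**bits))               # e.g. NormalFloat-like levels
codes = rng.integers(0, 2**bits, size=(out_dim, in_dim))
scales = rng.uniform(0.5, 1.5, size=(out_dim, in_dim // group))
x = rng.normal(size=in_dim)
y = dequantize(codes, lut, scales, group) @ x          # weight-only quantized matmul
```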
Our work introduces a novel machine\nlearning-based solution that leverages the attention mechanism to predict\nfuture states of multi-agent systems by learning node representations. The\ngraph structure is then inferred from the strength of the attention values.\nThis approach is applied to both linear consensus dynamics and the non-linear\ndynamics of Kuramoto oscillators, resulting in implicit learning the graph by\nlearning good agent representations. Our results demonstrate that the presented\ndata-driven graph attention machine learning model can identify the network\ntopology in multi-agent systems, even when the underlying dynamic model is not\nknown, as evidenced by the F1 scores achieved in the link prediction.\n","authors":["Akshay Kolli","Reza Azadeh","Kshitj Jerath"],"pdf_url":"https://arxiv.org/pdf/2408.15449v1.pdf","comment":"Accepted for publication at Modeling and Estimation Control\n Conference 2024; 6 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.07877v2","updated":"2024-08-27T22:55:03Z","published":"2024-08-15T01:33:06Z","title":"IReCa: Intrinsic Reward-enhanced Context-aware Reinforcement Learning\n for Human-AI Coordination","summary":" In human-AI coordination scenarios, human agents usually exhibit asymmetric\nbehaviors that are significantly sparse and unpredictable compared to those of\nAI agents. These characteristics introduce two primary challenges to human-AI\ncoordination: the effectiveness of obtaining sparse rewards and the efficiency\nof training the AI agents. To tackle these challenges, we propose an Intrinsic\nReward-enhanced Context-aware (IReCa) reinforcement learning (RL) algorithm,\nwhich leverages intrinsic rewards to facilitate the acquisition of sparse\nrewards and utilizes environmental context to enhance training efficiency. Our\nIReCa RL algorithm introduces three unique features: (i) it encourages the\nexploration of sparse rewards by incorporating intrinsic rewards that\nsupplement traditional extrinsic rewards from the environment; (ii) it improves\nthe acquisition of sparse rewards by prioritizing the corresponding sparse\nstate-action pairs; and (iii) it enhances the training efficiency by optimizing\nthe exploration and exploitation through innovative context-aware weights of\nextrinsic and intrinsic rewards. Extensive simulations executed in the\nOvercooked layouts demonstrate that our IReCa RL algorithm can increase the\naccumulated rewards by approximately 20% and reduce the epochs required for\nconvergence by approximately 67% compared to state-of-the-art baselines.\n","authors":["Xin Hao","Bahareh Nakisa","Mohmmad Naim Rastgoo","Richard Dazeley"],"pdf_url":"https://arxiv.org/pdf/2408.07877v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.15826v2","updated":"2024-08-27T22:18:07Z","published":"2024-03-23T12:53:51Z","title":"Scaling Learning based Policy Optimization for Temporal Logic Tasks by\n Controller Network Dropout","summary":" This paper introduces a model-based approach for training feedback\ncontrollers for an autonomous agent operating in a highly nonlinear (albeit\ndeterministic) environment. We desire the trained policy to ensure that the\nagent satisfies specific task objectives and safety constraints, both expressed\nin Discrete-Time Signal Temporal Logic (DT-STL). One advantage for\nreformulation of a task via formal frameworks, like DT-STL, is that it permits\nquantitative satisfaction semantics. 
In other words, given a trajectory and a\nDT-STL formula, we can compute the {\\em robustness}, which can be interpreted\nas an approximate signed distance between the trajectory and the set of\ntrajectories satisfying the formula. We utilize feedback control, and we assume\na feed forward neural network for learning the feedback controller. We show how\nthis learning problem is similar to training recurrent neural networks (RNNs),\nwhere the number of recurrent units is proportional to the temporal horizon of\nthe agent's task objectives. This poses a challenge: RNNs are susceptible to\nvanishing and exploding gradients, and na\\\"{i}ve gradient descent-based\nstrategies to solve long-horizon task objectives thus suffer from the same\nproblems. To tackle this challenge, we introduce a novel gradient approximation\nalgorithm based on the idea of dropout or gradient sampling. One of the main\ncontributions is the notion of {\\em controller network dropout}, where we\napproximate the NN controller in several time-steps in the task horizon by the\ncontrol input obtained using the controller in a previous training step. We\nshow that our control synthesis methodology, can be quite helpful for\nstochastic gradient descent to converge with less numerical issues, enabling\nscalable backpropagation over long time horizons and trajectories over high\ndimensional state spaces.\n","authors":["Navid Hashemi","Bardh Hoxha","Danil Prokhorov","Georgios Fainekos","Jyotirmoy Deshmukh"],"pdf_url":"https://arxiv.org/pdf/2403.15826v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15421v1","updated":"2024-08-27T21:54:26Z","published":"2024-08-27T21:54:26Z","title":"Simultaneous Training of First- and Second-Order Optimizers in\n Population-Based Reinforcement Learning","summary":" The tuning of hyperparameters in reinforcement learning (RL) is critical, as\nthese parameters significantly impact an agent's performance and learning\nefficiency. Dynamic adjustment of hyperparameters during the training process\ncan significantly enhance both the performance and stability of learning.\nPopulation-based training (PBT) provides a method to achieve this by\ncontinuously tuning hyperparameters throughout the training. This ongoing\nadjustment enables models to adapt to different learning stages, resulting in\nfaster convergence and overall improved performance. In this paper, we propose\nan enhancement to PBT by simultaneously utilizing both first- and second-order\noptimizers within a single population. We conducted a series of experiments\nusing the TD3 algorithm across various MuJoCo environments. Our results, for\nthe first time, empirically demonstrate the potential of incorporating\nsecond-order optimizers within PBT-based RL. Specifically, the combination of\nthe K-FAC optimizer with Adam led to up to a 10% improvement in overall\nperformance compared to PBT using only Adam. 
Additionally, in environments\nwhere Adam occasionally fails, such as the Swimmer environment, the mixed\npopulation with K-FAC exhibited more reliable learning outcomes, offering a\nsignificant advantage in training stability without a substantial increase in\ncomputational time.\n","authors":["Felix Pfeiffer","Shahram Eivazi"],"pdf_url":"https://arxiv.org/pdf/2408.15421v1.pdf","comment":"8 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.15418v1","updated":"2024-08-27T21:47:09Z","published":"2024-08-27T21:47:09Z","title":"Understanding GNNs for Boolean Satisfiability through Approximation\n Algorithms","summary":" The paper deals with the interpretability of Graph Neural Networks in the\ncontext of Boolean Satisfiability. The goal is to demystify the internal\nworkings of these models and provide insightful perspectives into their\ndecision-making processes. This is done by uncovering connections to two\napproximation algorithms studied in the domain of Boolean Satisfiability:\nBelief Propagation and Semidefinite Programming Relaxations. Revealing these\nconnections has empowered us to introduce a suite of impactful enhancements.\nThe first significant enhancement is a curriculum training procedure, which\nincrementally increases the problem complexity in the training set, together\nwith increasing the number of message passing iterations of the Graph Neural\nNetwork. We show that the curriculum, together with several other\noptimizations, reduces the training time by more than an order of magnitude\ncompared to the baseline without the curriculum. Furthermore, we apply\ndecimation and sampling of initial embeddings, which significantly increase the\npercentage of solved problems.\n","authors":["Jan Hůla","David Mojžíšek","Mikoláš Janota"],"pdf_url":"https://arxiv.org/pdf/2408.15418v1.pdf","comment":"CIKM 2024"},{"id":"http://arxiv.org/abs/2408.15417v1","updated":"2024-08-27T21:46:47Z","published":"2024-08-27T21:46:47Z","title":"Implicit Geometry of Next-token Prediction: From Language Sparsity\n Patterns to Model Representations","summary":" Next-token prediction (NTP) over large text corpora has become the go-to\nparadigm to train large language models. Yet, it remains unclear how NTP\ninfluences the mapping of linguistic patterns to geometric properties of the\nresulting model representations. We frame training of large language models as\nsoft-label classification over sparse probabilistic label vectors, coupled with\nan analytical approximation that allows unrestricted generation of context\nembeddings. This approach links NTP training to rank-constrained, nuclear-norm\nregularized optimization in the logit domain, offering a framework for\nanalyzing the geometry of word and context embeddings. In large embedding\nspaces, we find that NTP implicitly favors learning logits with a sparse plus\nlow-rank structure. While the sparse component captures the co-occurrence\nfrequency of context-word pairs, the orthogonal low-rank component, which\nbecomes dominant as training progresses, depends solely on the sparsity pattern\nof the co-occurrence matrix. Consequently, when projected onto an appropriate\nsubspace, representations of contexts that are followed by the same set of\nnext-tokens collapse, a phenomenon we term subspace-collapse. We validate our\nfindings on synthetic and small-scale real language datasets. 
Finally, we\noutline potential research directions aimed at deepening the understanding of\nNTP's influence on the learning of linguistic patterns and regularities.\n","authors":["Yize Zhao","Tina Behnia","Vala Vakilian","Christos Thrampoulidis"],"pdf_url":"https://arxiv.org/pdf/2408.15417v1.pdf","comment":"Accepted at COLM 2024"},{"id":"http://arxiv.org/abs/2310.07819v3","updated":"2024-08-27T21:37:57Z","published":"2023-10-11T19:00:40Z","title":"Faithfulness Measurable Masked Language Models","summary":" A common approach to explaining NLP models is to use importance measures that\nexpress which tokens are important for a prediction. Unfortunately, such\nexplanations are often wrong despite being persuasive. Therefore, it is\nessential to measure their faithfulness. One such metric is if tokens are truly\nimportant, then masking them should result in worse model performance. However,\ntoken masking introduces out-of-distribution issues, and existing solutions\nthat address this are computationally expensive and employ proxy models.\nFurthermore, other metrics are very limited in scope. This work proposes an\ninherently faithfulness measurable model that addresses these challenges. This\nis achieved using a novel fine-tuning method that incorporates masking, such\nthat masking tokens become in-distribution by design. This differs from\nexisting approaches, which are completely model-agnostic but are inapplicable\nin practice. We demonstrate the generality of our approach by applying it to 16\ndifferent datasets and validate it using statistical in-distribution tests. The\nfaithfulness is then measured with 9 different importance measures. Because\nmasking is in-distribution, importance measures that themselves use masking\nbecome consistently more faithful. Additionally, because the model makes\nfaithfulness cheap to measure, we can optimize explanations towards maximal\nfaithfulness; thus, our model becomes indirectly inherently explainable.\n","authors":["Andreas Madsen","Siva Reddy","Sarath Chandar"],"pdf_url":"https://arxiv.org/pdf/2310.07819v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11986v2","updated":"2024-08-27T21:25:39Z","published":"2023-07-22T05:34:18Z","title":"Expert Knowledge-Aware Image Difference Graph Representation Learning\n for Difference-Aware Medical Visual Question Answering","summary":" To contribute to automating the medical vision-language model, we propose a\nnovel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair\nof main and reference images, this task attempts to answer several questions on\nboth diseases and, more importantly, the differences between them. This is\nconsistent with the radiologist's diagnosis practice that compares the current\nimage with the reference before concluding the report. We collect a new\ndataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs\nof main and reference images. Compared to existing medical VQA datasets, our\nquestions are tailored to the Assessment-Diagnosis-Intervention-Evaluation\ntreatment procedure used by clinical professionals. Meanwhile, we also propose\na novel expert knowledge-aware graph representation learning model to address\nthis task. The proposed baseline model leverages expert knowledge such as\nanatomical structure prior, semantic, and spatial knowledge to construct a\nmulti-relationship graph, representing the image differences between two images\nfor the image difference VQA task. 
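The masking-based faithfulness check discussed in the faithfulness-measurable masked LM entry above can be sketched with a toy scorer: if an importance measure is faithful, masking the tokens it ranks highest should degrade the model's score the most. The bag-of-words "model" and leave-one-out importance below are illustrative stand-ins for a fine-tuned masked language model and the paper's importance measures.

```python
# Hedged sketch of a masking-based faithfulness check with a toy scorer.
import numpy as np

WEIGHTS = {"excellent": 2.0, "good": 1.0, "bad": -1.5, "terrible": -2.5}
MASK = "[MASK]"

def score(tokens):
    """Toy sentiment score; a real setup would query the fine-tuned model."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

def leave_one_out_importance(tokens):
    base = score(tokens)
    return [base - score(tokens[:i] + [MASK] + tokens[i + 1:])
            for i in range(len(tokens))]

def faithfulness_curve(tokens, importance):
    """Score after masking the top-k most important tokens, for growing k."""
    order = np.argsort(importance)[::-1]
    curve, masked = [], list(tokens)
    for idx in order:
        masked[idx] = MASK
        curve.append(score(masked))
    return curve

sent = "the food was excellent but the service was terrible".split()
imp = leave_one_out_importance(sent)
print(faithfulness_curve(sent, imp))   # a faithful measure degrades the score early
```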
The dataset and code can be found at\nhttps://github.com/Holipori/MIMIC-Diff-VQA. We believe this work would further\npush forward the medical vision language model.\n","authors":["Xinyue Hu","Lin Gu","Qiyuan An","Mengliang Zhang","Liangchen Liu","Kazuma Kobayashi","Tatsuya Harada","Ronald M. Summers","Yingying Zhu"],"pdf_url":"https://arxiv.org/pdf/2307.11986v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15408v1","updated":"2024-08-27T21:18:41Z","published":"2024-08-27T21:18:41Z","title":"Divergence-free neural operators for stress field modeling in\n polycrystalline materials","summary":" The purpose of the current work is the development and comparison of Fourier\nneural operators (FNOs) for surrogate modeling of the quasi-static mechanical\nresponse of polycrystalline materials. Three types of such FNOs are considered\nhere: a physics-guided FNO (PgFNO), a physics-informed FNO (PiFNO), and a\nphysics-encoded FNO (PeFNO). These are trained and compared with the help of\nstress field data from a reference model for heterogeneous elastic materials\nwith a periodic grain microstructure. Whereas PgFNO training is based solely on\nthese data, that of the PiFNO and PeFNO is in addition constrained by the\nrequirement that stress fields satisfy mechanical equilibrium, i.e., be\ndivergence-free. The difference between the PiFNO and PeFNO lies in how this\nconstraint is taken into account; in the PiFNO, it is included in the loss\nfunction, whereas in the PeFNO, it is \"encoded\" in the operator architecture.\nIn the current work, this encoding is based on a stress potential and Fourier\ntransforms. As a result, only the training of the PiFNO is constrained by\nmechanical equilibrium; in contrast, mechanical equilibrium constrains both the\ntraining and output of the PeFNO. Due in particular to this, stress fields\ncalculated by the trained PeFNO are significantly more accurate than those\ncalculated by the trained PiFNO in the example cases considered.\n","authors":["Mohammad S. Khorrami","Pawan Goyal","Jaber R. Mianroodi","Bob Svendsen","Peter Benner","Dierk Raabe"],"pdf_url":"https://arxiv.org/pdf/2408.15408v1.pdf","comment":"17 pages, 11 figures"},{"id":"http://arxiv.org/abs/2408.15404v1","updated":"2024-08-27T20:57:26Z","published":"2024-08-27T20:57:26Z","title":"Evaluating Credit VIX (CDS IV) Prediction Methods with Incremental Batch\n Learning","summary":" This paper presents the experimental process and results of SVM, Gradient\nBoosting, and an Attention-GRU Hybrid model in predicting the Implied\nVolatility of rolled-over five-year spread contracts of credit default swaps\n(CDS) on European corporate debt during the quarter following mid-May '24, as\nrepresented by the iTraxx/Cboe Europe Main 1-Month Volatility Index (BP\nVolatility). The analysis employs a feature matrix inspired by Merton's\ndeterminants of default probability. Our comparative assessment aims to\nidentify strengths in SOTA and classical machine learning methods for financial\nrisk prediction\n","authors":["Robert Taylor"],"pdf_url":"https://arxiv.org/pdf/2408.15404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.11552v5","updated":"2024-08-27T20:56:53Z","published":"2023-02-22T18:48:46Z","title":"Reduce, Reuse, Recycle: Compositional Generation with Energy-Based\n Diffusion Models and MCMC","summary":" Since their introduction, diffusion models have quickly become the prevailing\napproach to generative modeling in many domains. 
They can be interpreted as\nlearning the gradients of a time-varying sequence of log-probability density\nfunctions. This interpretation has motivated classifier-based and\nclassifier-free guidance as methods for post-hoc control of diffusion models.\nIn this work, we build upon these ideas using the score-based interpretation of\ndiffusion models, and explore alternative ways to condition, modify, and reuse\ndiffusion models for tasks involving compositional generation and guidance. In\nparticular, we investigate why certain types of composition fail using current\ntechniques and present a number of solutions. We conclude that the sampler (not\nthe model) is responsible for this failure and propose new samplers, inspired\nby MCMC, which enable successful compositional generation. Further, we propose\nan energy-based parameterization of diffusion models which enables the use of\nnew compositional operators and more sophisticated, Metropolis-corrected\nsamplers. Intriguingly we find these samplers lead to notable improvements in\ncompositional generation across a wide set of problems such as\nclassifier-guided ImageNet modeling and compositional text-to-image generation.\n","authors":["Yilun Du","Conor Durkan","Robin Strudel","Joshua B. Tenenbaum","Sander Dieleman","Rob Fergus","Jascha Sohl-Dickstein","Arnaud Doucet","Will Grathwohl"],"pdf_url":"https://arxiv.org/pdf/2302.11552v5.pdf","comment":"ICML 2023, Project Webpage:\n https://energy-based-model.github.io/reduce-reuse-recycle/"},{"id":"http://arxiv.org/abs/2408.15400v1","updated":"2024-08-27T20:51:48Z","published":"2024-08-27T20:51:48Z","title":"Exploring the origins of switching dynamics in a multifunctional\n reservoir computer","summary":" The concept of multifunctionality has enabled reservoir computers (RCs), a\ntype of dynamical system that is typically realised as an artificial neural\nnetwork, to reconstruct multiple attractors simultaneously using the same set\nof trained weights. However there are many additional phenomena that arise when\ntraining a RC to reconstruct more than one attractor. Previous studies have\nfound that, in certain cases, if the RC fails to reconstruct a coexistence of\nattractors then it exhibits a form of metastability whereby, without any\nexternal input, the state of the RC switches between different modes of\nbehaviour that resemble properties of the attractors it failed to reconstruct.\nIn this paper we explore the origins of these switching dynamics in a\nparadigmatic setting via the `seeing double' problem.\n","authors":["Andrew Flynn","Andreas Amann"],"pdf_url":"https://arxiv.org/pdf/2408.15400v1.pdf","comment":"Preprint submitted to Frontiers in Network Physiology"},{"id":"http://arxiv.org/abs/2408.15399v1","updated":"2024-08-27T20:51:06Z","published":"2024-08-27T20:51:06Z","title":"A Statistical Framework for Data-dependent Retrieval-Augmented Models","summary":" Modern ML systems increasingly augment input instances with additional\nrelevant information to enhance final prediction. Despite growing interest in\nsuch retrieval-augmented models, their fundamental properties and training are\nnot well understood. We propose a statistical framework to study such models\nwith two components: 1) a {\\em retriever} to identify the relevant information\nout of a large corpus via a data-dependent metric; and 2) a {\\em predictor}\nthat consumes the input instances along with the retrieved information to make\nthe final predictions. 
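The two-component structure just described for retrieval-augmented models can be illustrated directly: a retriever scores corpus items against the input with a data-dependent metric (cosine similarity here), and a predictor consumes the input concatenated with the retrieved information. The linear predictor and random features below are illustrative stand-ins, not the framework's trained components.

```python
# Hedged sketch of a retriever + predictor retrieval-augmented model.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(500, 16))                    # retrievable side information

def retrieve(x, corpus, k=1):
    """Return the top-k corpus rows by cosine similarity to the query x."""
    sims = corpus @ x / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(x) + 1e-12)
    return corpus[np.argsort(sims)[-k:]]

def predict(x, w):
    """Predictor consumes [input ; retrieved info] to make the final prediction."""
    augmented = np.concatenate([x, retrieve(x, corpus, k=1).ravel()])
    return augmented @ w

w = rng.normal(size=32)                                # 16 input dims + 16 retrieved dims
print(predict(rng.normal(size=16), w))
```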
We present a principled method for end-to-end training\nof both components and draw connections with various training approaches in the\nliterature. Furthermore, we establish excess risk bounds for\nretrieval-augmented models while delineating the contributions of both\nretriever and predictor towards the model performance. We validate the utility\nof our proposed training methods along with the key takeaways from our\nstatistical analysis on open domain question answering task where retrieval\naugmentation is important.\n","authors":["Soumya Basu","Ankit Singh Rawat","Manzil Zaheer"],"pdf_url":"https://arxiv.org/pdf/2408.15399v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15398v1","updated":"2024-08-27T20:49:11Z","published":"2024-08-27T20:49:11Z","title":"Evaluating Pre-Training Bias on Severe Acute Respiratory Syndrome\n Dataset","summary":" Machine learning (ML) is a growing field of computer science that has found\nmany practical applications in several domains, including Health. However, as\ndata grows in size and availability, and the number of models that aim to aid\nor replace human decisions, it raises the concern that these models can be\nsusceptible to bias, which can lead to harm to specific individuals by basing\nits decisions on protected attributes such as gender, religion, sexual\norientation, ethnicity, and others. Visualization techniques might generate\ninsights and help summarize large datasets, enabling data scientists to\nunderstand the data better before training a model by evaluating pre-training\nmetrics applied to the datasets before training, which might contribute to\nidentifying potential harm before any effort is put into training and deploying\nthe models. This work uses the severe acute respiratory syndrome dataset from\nOpenDataSUS to visualize three pre-training bias metrics and their distribution\nacross different regions in Brazil. A random forest model is trained in each\nregion and applied to the others. The aim is to compare the bias for the\ndifferent regions, focusing on their protected attributes and comparing the\nmodel's performance with the metric values.\n","authors":["Diego Dimer Rodrigues"],"pdf_url":"https://arxiv.org/pdf/2408.15398v1.pdf","comment":"short paper for eurovis, 5 pages"},{"id":"http://arxiv.org/abs/2408.15395v1","updated":"2024-08-27T20:39:09Z","published":"2024-08-27T20:39:09Z","title":"SCAN-Edge: Finding MobileNet-speed Hybrid Networks for Diverse Edge\n Devices via Hardware-Aware Evolutionary Search","summary":" Designing low-latency and high-efficiency hybrid networks for a variety of\nlow-cost commodity edge devices is both costly and tedious, leading to the\nadoption of hardware-aware neural architecture search (NAS) for finding optimal\narchitectures. However, unifying NAS for a wide range of edge devices presents\nchallenges due to the variety of hardware designs, supported operations, and\ncompilation optimizations. Existing methods often fix the search space of\narchitecture choices (e.g., activation, convolution, or self-attention) and\nestimate latency using hardware-agnostic proxies (e.g., FLOPs), which fail to\nachieve proclaimed latency across various edge devices. To address this issue,\nwe propose SCAN-Edge, a unified NAS framework that jointly searches for\nself-attention, convolution, and activation to accommodate the wide variety of\nedge devices, including CPU-, GPU-, and hardware accelerator-based systems. 
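Pre-training bias metrics of the kind discussed in the entry above are computed on the dataset itself, before any model is trained. The sketch below shows two common examples, class imbalance of a protected attribute and the difference in positive-label proportions between its groups; these stand in for, and are not necessarily, the three metrics used in the paper.

```python
# Hedged sketch of two dataset-level (pre-training) bias metrics.
import numpy as np

def class_imbalance(protected):
    """(n_a - n_d) / (n_a + n_d) for a binary protected attribute."""
    n_a = np.sum(protected == 1)
    n_d = np.sum(protected == 0)
    return (n_a - n_d) / (n_a + n_d)

def label_proportion_difference(protected, labels):
    """Difference in positive-outcome rates between the two groups."""
    return labels[protected == 1].mean() - labels[protected == 0].mean()

rng = np.random.default_rng(0)
protected = rng.integers(0, 2, size=5000)              # e.g. a binarized attribute
labels = rng.binomial(1, np.where(protected == 1, 0.55, 0.45))
print(class_imbalance(protected), label_proportion_difference(protected, labels))
```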
To\nhandle the large search space, SCAN-Edge relies on with a hardware-aware\nevolutionary algorithm that improves the quality of the search space to\naccelerate the sampling process. Experiments on large-scale datasets\ndemonstrate that our hybrid networks match the actual MobileNetV2 latency for\n224x224 input resolution on various commodity edge devices.\n","authors":["Hung-Yueh Chiang","Diana Marculescu"],"pdf_url":"https://arxiv.org/pdf/2408.15395v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15393v1","updated":"2024-08-27T20:33:16Z","published":"2024-08-27T20:33:16Z","title":"Stability Analysis of Physics-Informed Neural Networks for Stiff Linear\n Differential Equations","summary":" We present a stability analysis of Physics-Informed Neural Networks (PINNs)\ncoupled with random projections, for the numerical solution of (stiff) linear\ndifferential equations. For our analysis, we consider systems of linear ODEs,\nand linear parabolic PDEs. We prove that properly designed PINNs offer\nconsistent and asymptotically stable numerical schemes, thus convergent\nschemes. In particular, we prove that multi-collocation random projection PINNs\nguarantee asymptotic stability for very high stiffness and that\nsingle-collocation PINNs are $A$-stable. To assess the performance of the PINNs\nin terms of both numerical approximation accuracy and computational cost, we\ncompare it with other implicit schemes and in particular backward Euler, the\nmidpoint, trapezoidal (Crank-Nikolson), the 2-stage Gauss scheme and the 2 and\n3 stages Radau schemes. We show that the proposed PINNs outperform the above\ntraditional schemes, in both numerical approximation accuracy and importantly\ncomputational cost, for a wide range of step sizes.\n","authors":["Gianluca Fabiani","Erik Bollt","Constantinos Siettos","Athanasios N. Yannacopoulos"],"pdf_url":"https://arxiv.org/pdf/2408.15393v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15388v1","updated":"2024-08-27T20:14:42Z","published":"2024-08-27T20:14:42Z","title":"Panoptic Perception for Autonomous Driving: A Survey","summary":" Panoptic perception represents a forefront advancement in autonomous driving\ntechnology, unifying multiple perception tasks into a singular, cohesive\nframework to facilitate a thorough understanding of the vehicle's surroundings.\nThis survey reviews typical panoptic perception models for their unique inputs\nand architectures and compares them to performance, responsiveness, and\nresource utilization. It also delves into the prevailing challenges faced in\npanoptic perception and explores potential trajectories for future research.\nOur goal is to furnish researchers in autonomous driving with a detailed\nsynopsis of panoptic perception, positioning this survey as a pivotal reference\nin the ever-evolving landscape of autonomous driving technologies.\n","authors":["Yunge Li","Lanyu Xu"],"pdf_url":"https://arxiv.org/pdf/2408.15388v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.01959v2","updated":"2024-08-27T19:57:45Z","published":"2024-08-04T08:26:58Z","title":"Dataset Scale and Societal Consistency Mediate Facial Impression Bias in\n Vision-Language AI","summary":" Multimodal AI models capable of associating images and text hold promise for\nnumerous domains, ranging from automated image captioning to accessibility\napplications for blind and low-vision users. However, uncertainty about bias\nhas in some cases limited their adoption and availability. 
In the present work,\nwe study 43 CLIP vision-language models to determine whether they learn\nhuman-like facial impression biases, and we find evidence that such biases are\nreflected across three distinct CLIP model families. We show for the first time\nthat the degree to which a bias is shared across a society predicts the\ndegree to which it is reflected in a CLIP model. Human-like impressions of\nvisually unobservable attributes, like trustworthiness and sexuality, emerge\nonly in models trained on the largest dataset, indicating that a better fit to\nuncurated cultural data results in the reproduction of increasingly subtle\nsocial biases. Moreover, we use a hierarchical clustering approach to show that\ndataset size predicts the extent to which the underlying structure of facial\nimpression bias resembles that of facial impression bias in humans. Finally, we\nshow that Stable Diffusion models employing CLIP as a text encoder learn facial\nimpression biases, and that these biases intersect with racial biases in Stable\nDiffusion XL-Turbo. While pretrained CLIP models may prove useful for\nscientific studies of bias, they will also require significant dataset curation\nwhen intended for use as general-purpose models in a zero-shot setting.\n","authors":["Robert Wolfe","Aayushi Dangol","Alexis Hiniker","Bill Howe"],"pdf_url":"https://arxiv.org/pdf/2408.01959v2.pdf","comment":"Accepted at Artificial Intelligence, Ethics, and Society 2024"},{"id":"http://arxiv.org/abs/2401.03717v3","updated":"2024-08-27T19:45:07Z","published":"2024-01-08T08:00:04Z","title":"Universal Time-Series Representation Learning: A Survey","summary":" Time-series data exists in every corner of real-world systems and services,\nranging from satellites in the sky to wearable devices on human bodies.\nLearning representations by extracting and inferring valuable information from\nthese time series is crucial for understanding the complex dynamics of\nparticular phenomena and enabling informed decisions. With the learned\nrepresentations, we can perform numerous downstream analyses more effectively.\nAmong several approaches, deep learning has demonstrated remarkable performance\nin extracting hidden patterns and features from time-series data without manual\nfeature engineering. This survey first presents a novel taxonomy based on three\nfundamental elements in designing state-of-the-art universal representation\nlearning methods for time series. According to the proposed taxonomy, we\ncomprehensively review existing studies and discuss their intuitions and\ninsights into how these methods enhance the quality of learned representations.\nFinally, as a guideline for future studies, we summarize commonly used\nexperimental setups and datasets and discuss several promising research\ndirections. An up-to-date corresponding resource is available at\nhttps://github.com/itouchz/awesome-deep-time-series-representations.\n","authors":["Patara Trirat","Yooju Shin","Junhyeok Kang","Youngeun Nam","Jihye Na","Minyoung Bae","Joeun Kim","Byunghyun Kim","Jae-Gil Lee"],"pdf_url":"https://arxiv.org/pdf/2401.03717v3.pdf","comment":"41 pages, 7 figures"},{"id":"http://arxiv.org/abs/2408.13683v2","updated":"2024-08-27T19:27:07Z","published":"2024-08-24T22:40:31Z","title":"Submodular Maximization Approaches for Equitable Client Selection in\n Federated Learning","summary":" In a conventional Federated Learning framework, client selection for training\ntypically involves the random sampling of a subset of clients in each\niteration. 
However, this random selection often leads to disparate performance\namong clients, raising concerns regarding fairness, particularly in\napplications where equitable outcomes are crucial, such as in medical or\nfinancial machine learning tasks. This disparity typically becomes more\npronounced with the advent of performance-centric client sampling techniques.\nThis paper introduces two novel methods, namely SUBTRUNC and UNIONFL, designed\nto address the limitations of random client selection. Both approaches utilize\nsubmodular function maximization to achieve more balanced models. By modifying\nthe facility location problem, they aim to mitigate the fairness concerns\nassociated with random selection. SUBTRUNC leverages client loss information to\ndiversify solutions, while UNIONFL relies on historical client selection data\nto ensure a more equitable performance of the final model. Moreover, these\nalgorithms are accompanied by robust theoretical guarantees regarding\nconvergence under reasonable assumptions. The efficacy of these methods is\ndemonstrated through extensive evaluations across heterogeneous scenarios,\nrevealing significant improvements in fairness as measured by a client\ndissimilarity metric.\n","authors":["Andrés Catalino Castillo Jiménez","Ege C. Kaya","Lintao Ye","Abolfazl Hashemi"],"pdf_url":"https://arxiv.org/pdf/2408.13683v2.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2408.15374v1","updated":"2024-08-27T19:22:06Z","published":"2024-08-27T19:22:06Z","title":"CycleGAN with Better Cycles","summary":" CycleGAN provides a framework to train image-to-image translation with\nunpaired datasets using cycle consistency loss [4]. While results are great in\nmany applications, the pixel-level cycle consistency can potentially be\nproblematic and cause unrealistic images in certain cases. In this project, we\npropose three simple modifications to cycle consistency, and show that such an\napproach achieves better results with fewer artifacts.\n","authors":["Tongzhou Wang","Yihan Lin"],"pdf_url":"https://arxiv.org/pdf/2408.15374v1.pdf","comment":"Technical Report 2018"},{"id":"http://arxiv.org/abs/2408.15373v1","updated":"2024-08-27T19:13:15Z","published":"2024-08-27T19:13:15Z","title":"Handling Geometric Domain Shifts in Semantic Segmentation of Surgical\n RGB and Hyperspectral Images","summary":" Robust semantic segmentation of intraoperative image data holds promise for\nenabling automatic surgical scene understanding and autonomous robotic surgery.\nWhile model development and validation are primarily conducted on idealistic\nscenes, geometric domain shifts, such as occlusions of the situs, are common in\nreal-world open surgeries. To close this gap, we (1) present the first analysis\nof state-of-the-art (SOA) semantic segmentation models when faced with\ngeometric out-of-distribution (OOD) data, and (2) propose an augmentation\ntechnique called \"Organ Transplantation\" to enhance generalizability. Our\ncomprehensive validation on six different OOD datasets, comprising 600 RGB and\nhyperspectral imaging (HSI) cubes from 33 pigs, each annotated with 19 classes,\nreveals a large performance drop in SOA organ segmentation models on geometric\nOOD data. This performance decline is observed not only in conventional RGB\ndata (with a dice similarity coefficient (DSC) drop of 46 %) but also in HSI\ndata (with a DSC drop of 45 %), despite the richer spectral information\ncontent. The performance decline increases with the spatial granularity of the\ninput data. 
Our augmentation technique improves SOA model performance by up to\n67 % for RGB data and 90 % for HSI data, achieving performance at the level of\nin-distribution performance on real OOD test data. Given the simplicity and\neffectiveness of our augmentation method, it is a valuable tool for addressing\ngeometric domain shifts in surgical scene segmentation, regardless of the\nunderlying model. Our code and pre-trained models are publicly available at\nhttps://github.com/IMSY-DKFZ/htc.\n","authors":["Silvia Seidlitz","Jan Sellner","Alexander Studier-Fischer","Alessandro Motta","Berkin Özdemir","Beat P. Müller-Stich","Felix Nickel","Lena Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2408.15373v1.pdf","comment":"Silvia Seidlitz and Jan Sellner contributed equally"},{"id":"http://arxiv.org/abs/2408.15371v1","updated":"2024-08-27T19:10:21Z","published":"2024-08-27T19:10:21Z","title":"Temporal Graph Neural Network-Powered Paper Recommendation on Dynamic\n Citation Networks","summary":" Due to the rapid growth of scientific publications, identifying all related\nreference articles in the literature has become increasingly challenging yet\nhighly demanding. Existing methods primarily assess candidate publications from\na static perspective, focusing on the content of articles and their structural\ninformation, such as citation relationships. There is a lack of research\nregarding how to account for the evolving impact among papers on their\nembeddings. Toward this goal, this paper introduces a temporal dimension to\npaper recommendation strategies. The core idea is to continuously update a\npaper's embedding when new citation relationships appear, enhancing its\nrelevance for future recommendations. Whenever a citation relationship is added\nto the literature upon the publication of a paper, the embeddings of the two\nrelated papers are updated through a Temporal Graph Neural Network (TGN). A\nlearnable memory update module based on a Recurrent Neural Network (RNN) is\nutilized to study the evolution of the embedding of a paper in order to predict\nits reference impact in a future timestamp. Such a TGN-based model learns a\npattern of how people's views of the paper may evolve, aiming to guide paper\nrecommendations more precisely. Extensive experiments on an open citation\nnetwork dataset, including 313,278 articles from\nhttps://paperswithcode.com/about PaperWithCode, have demonstrated the\neffectiveness of the proposed approach.\n","authors":["Junhao Shen","Mohammad Ausaf Ali Haqqani","Beichen Hu","Cheng Huang","Xihao Xie","Tsengdar Lee","Jia Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.15371v1.pdf","comment":"10 pages, 4 figures, accepted by SDU@AAAI-2024. The AAAI Workshop on\n Scientific Document Understanding (2024)"},{"id":"http://arxiv.org/abs/2408.13912v2","updated":"2024-08-27T19:06:57Z","published":"2024-08-25T18:27:20Z","title":"Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs","summary":" In this paper, we introduce Splatt3R, a pose-free, feed-forward method for\nin-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given\nuncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without\nrequiring any camera parameters or depth information. For generalizability, we\nbuild Splatt3R upon a ``foundation'' 3D geometry reconstruction method, MASt3R,\nby extending it to deal with both 3D structure and appearance. 
Specifically,\nunlike the original MASt3R which reconstructs only 3D point clouds, we predict\nthe additional Gaussian attributes required to construct a Gaussian primitive\nfor each point. Hence, unlike other novel view synthesis methods, Splatt3R is\nfirst trained by optimizing the 3D point cloud's geometry loss, and then a\nnovel view synthesis objective. By doing this, we avoid the local minima\npresent in training 3D Gaussian Splats from stereo views. We also propose a\nnovel loss masking strategy that we empirically find is critical for strong\nperformance on extrapolated viewpoints. We train Splatt3R on the ScanNet++\ndataset and demonstrate excellent generalisation to uncalibrated, in-the-wild\nimages. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and\nthe resultant splats can be rendered in real-time.\n","authors":["Brandon Smart","Chuanxia Zheng","Iro Laina","Victor Adrian Prisacariu"],"pdf_url":"https://arxiv.org/pdf/2408.13912v2.pdf","comment":"Our project page can be found at: https://splatt3r.active.vision/"},{"id":"http://arxiv.org/abs/2408.15368v1","updated":"2024-08-27T19:04:32Z","published":"2024-08-27T19:04:32Z","title":"Optimization Solution Functions as Deterministic Policies for Offline\n Reinforcement Learning","summary":" Offline reinforcement learning (RL) is a promising approach for many control\napplications but faces challenges such as limited data coverage and value\nfunction overestimation. In this paper, we propose an implicit actor-critic\n(iAC) framework that employs optimization solution functions as a deterministic\npolicy (actor) and a monotone function over the optimal value of optimization\nas a critic. By encoding optimality in the actor policy, we show that the\nlearned policies are robust to the suboptimality of the learned actor\nparameters via the exponentially decaying sensitivity (EDS) property. We obtain\nperformance guarantees for the proposed iAC framework and show its benefits\nover general function approximation schemes. Finally, we validate the proposed\nframework on two real-world applications and show a significant improvement\nover state-of-the-art (SOTA) offline RL methods.\n","authors":["Vanshaj Khattar","Ming Jin"],"pdf_url":"https://arxiv.org/pdf/2408.15368v1.pdf","comment":"American Control Conference 2024"},{"id":"http://arxiv.org/abs/2301.06267v5","updated":"2024-08-27T19:00:47Z","published":"2023-01-16T05:40:42Z","title":"Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with\n Multimodal Models","summary":" The ability to quickly learn a new task with minimal instruction - known as\nfew-shot learning - is a central aspect of intelligent agents. Classical\nfew-shot benchmarks make use of few-shot samples from a single modality, but\nsuch samples may not be sufficient to characterize an entire concept class. In\ncontrast, humans use cross-modal information to learn new concepts efficiently.\nIn this work, we demonstrate that one can indeed build a better ${\\bf visual}$\ndog classifier by ${\\bf read}$ing about dogs and ${\\bf listen}$ing to them\nbark. To do so, we exploit the fact that recent multimodal foundation models\nsuch as CLIP learn cross-modal encoders that map different modalities to the\nsame representation space. Specifically, we propose a simple strategy for ${\\bf\ncross-modal}$ ${\\bf adaptation}$: we treat examples from different modalities\nas additional few-shot examples. 
For example, by simply repurposing class names\nas an additional training sample, we trivially turn any n-shot learning problem\ninto a (n+1)-shot problem. This allows us to produce SOTA results with\nembarrassingly simple linear classifiers. We show that our approach can be\ncombined with existing methods such as prefix tuning, adapters, and classifier\nensembling. Finally, to explore other modalities beyond vision and language, we\nconstruct the first (to our knowledge) audiovisual few-shot benchmark and use\ncross-modal training to improve the performance of both image and audio\nclassification.\n","authors":["Zhiqiu Lin","Samuel Yu","Zhiyi Kuang","Deepak Pathak","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2301.06267v5.pdf","comment":"Published at CVPR 2023. Project site:\n https://linzhiqiu.github.io/papers/cross_modal/"},{"id":"http://arxiv.org/abs/2403.13724v2","updated":"2024-08-27T18:42:55Z","published":"2024-03-20T16:33:06Z","title":"Probabilistic Forecasting with Stochastic Interpolants and Föllmer\n Processes","summary":" We propose a framework for probabilistic forecasting of dynamical systems\nbased on generative modeling. Given observations of the system state over time,\nwe formulate the forecasting problem as sampling from the conditional\ndistribution of the future system state given its current state. To this end,\nwe leverage the framework of stochastic interpolants, which facilitates the\nconstruction of a generative model between an arbitrary base distribution and\nthe target. We design a fictitious, non-physical stochastic dynamics that takes\nas initial condition the current system state and produces as output a sample\nfrom the target conditional distribution in finite time and without bias. This\nprocess therefore maps a point mass centered at the current state onto a\nprobabilistic ensemble of forecasts. We prove that the drift coefficient\nentering the stochastic differential equation (SDE) achieving this task is\nnon-singular, and that it can be learned efficiently by square loss regression\nover the time-series data. We show that the drift and the diffusion\ncoefficients of this SDE can be adjusted after training, and that a specific\nchoice that minimizes the impact of the estimation error gives a F\\\"ollmer\nprocess. We highlight the utility of our approach on several complex,\nhigh-dimensional forecasting problems, including stochastically forced\nNavier-Stokes and video prediction on the KTH and CLEVRER datasets.\n","authors":["Yifan Chen","Mark Goldstein","Mengjian Hua","Michael S. Albergo","Nicholas M. Boffi","Eric Vanden-Eijnden"],"pdf_url":"https://arxiv.org/pdf/2403.13724v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12091v2","updated":"2024-08-27T18:34:24Z","published":"2024-08-22T03:00:21Z","title":"Unsupervised discovery of the shared and private geometry in multi-view\n data","summary":" Modern applications often leverage multiple views of a subject of study.\nWithin neuroscience, there is growing interest in large-scale simultaneous\nrecordings across multiple brain regions. Understanding the relationship\nbetween views (e.g., the neural activity in each region recorded) can reveal\nfundamental principles about the characteristics of each representation and\nabout the system. 
However, existing methods to characterize such relationships\neither lack the expressivity required to capture complex nonlinearities,\ndescribe only sources of variance that are shared between views, or discard\ngeometric information that is crucial to interpreting the data. Here, we\ndevelop a nonlinear neural network-based method that, given paired samples of\nhigh-dimensional views, disentangles low-dimensional shared and private latent\nvariables underlying these views while preserving intrinsic data geometry.\nAcross multiple simulated and real datasets, we demonstrate that our method\noutperforms competing methods. Using simulated populations of lateral\ngeniculate nucleus (LGN) and V1 neurons, we demonstrate our model's ability to\ndiscover interpretable shared and private structure across different noise\nconditions. On a dataset of unrotated and corresponding but randomly rotated\nMNIST digits, we recover private latents for the rotated view that encode\nrotation angle regardless of digit class, and place the angle representation\non a 1-d manifold, while shared latents encode digit class but not rotation\nangle. Applying our method to simultaneous Neuropixels recordings of\nhippocampus and prefrontal cortex while mice run on a linear track, we discover\na low-dimensional shared latent space that encodes the animal's position. We\npropose our approach as a general-purpose method for finding succinct and\ninterpretable descriptions of paired data sets in terms of disentangled shared\nand private latent variables.\n","authors":["Sai Koukuntla","Joshua B. Julian","Jesse C. Kaminsky","Manuel Schottdorf","David W. Tank","Carlos D. Brody","Adam S. Charles"],"pdf_url":"https://arxiv.org/pdf/2408.12091v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15357v1","updated":"2024-08-27T18:29:47Z","published":"2024-08-27T18:29:47Z","title":"On the effectiveness of smartphone IMU sensors and Deep Learning in the\n detection of cardiorespiratory conditions","summary":" This research introduces an innovative method for the early screening of\ncardiorespiratory diseases based on an acquisition protocol, which leverages\ncommodity smartphones' Inertial Measurement Units (IMUs) and deep learning\ntechniques. We collected, in a clinical setting, a dataset featuring recordings\nof breathing kinematics obtained by accelerometer and gyroscope readings from\nfive distinct body regions. We propose an end-to-end deep learning pipeline for\nearly cardiorespiratory disease screening, incorporating a preprocessing step\nsegmenting the data into individual breathing cycles, and a recurrent\nbidirectional module capturing features from diverse body regions. We employed\nleave-one-out cross-validation with Bayesian optimization for hyperparameter\ntuning and model selection. The experimental results consistently demonstrated\nthe superior performance of a bidirectional Long Short-Term Memory (Bi-LSTM) as\na feature encoder architecture, yielding an average sensitivity of $0.81 \\pm\n0.02$, specificity of $0.82 \\pm 0.05$, F1 score of $0.81 \\pm 0.02$, and\naccuracy of $80.2\\% \\pm 3.9$ across diverse seed variations. We also assessed\ngeneralization capabilities on a skewed distribution, comprising exclusively\nhealthy patients not used in training, revealing a true negative rate of $74.8\n\\% \\pm 4.5$. 
The sustained accuracy of predictions over time during breathing\ncycles within a single patient underscores the efficacy of the preprocessing\nstrategy, highlighting the model's ability to discern significant patterns\nthroughout distinct phases of the respiratory cycle. This investigation\nunderscores the potential usefulness of widely available smartphones as devices\nfor timely cardiorespiratory disease screening in the general population, in\nat-home settings, offering crucial assistance to public health efforts\n(especially during pandemic outbreaks, such as the recent COVID-19 pandemic).\n","authors":["Lorenzo Simone","Luca Miglior","Vincenzo Gervasi","Luca Moroni","Emanuele Vignali","Emanuele Gasparotti","Simona Celi"],"pdf_url":"https://arxiv.org/pdf/2408.15357v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.15356v1","updated":"2024-08-27T18:28:31Z","published":"2024-08-27T18:28:31Z","title":"Optimal level set estimation for non-parametric tournament and\n crowdsourcing problems","summary":" Motivated by crowdsourcing, we consider a problem where we partially observe\nthe correctness of the answers of $n$ experts on $d$ questions. In this paper,\nwe assume that both the experts and the questions can be ordered, namely that\nthe matrix $M$ containing the probability that expert $i$ answers question $j$\ncorrectly is bi-isotonic up to a permutation of its rows and columns. When\n$n=d$, this also encompasses the strongly stochastic transitive (SST) model\nfrom the tournament literature. Here, we focus on the relevant problem of\ndeciphering small entries of $M$ from large entries of $M$, which is key in\ncrowdsourcing for efficient allocation of workers to questions. More precisely,\nwe aim at recovering a (or several) level set $p$ of the matrix up to a\nprecision $h$, namely recovering resp. the sets of positions $(i,j)$ in $M$\nsuch that $M_{ij}>p+h$ and $M_{i,j}