Skip to content

Raven-1027/DailyArXiv

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

544 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Daily Papers

The project automatically fetches the latest papers from arXiv based on keywords.

The subheadings in the README file represent the search keywords.

Only the most recent articles for each keyword are retained, up to a maximum of 100 papers.

You can click the 'Watch' button to receive daily email notifications.

Last update: 2026-02-14

strong correlation

Title Date Abstract Comment
AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild 2026-02-12
Show

Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users frequently fail to articulate precise directives containing full task details at the onset, and their expressions are typically ambiguous. Consequently, agents are required to converge on the user's true intent via active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user-issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single-turn execution while overlooking the alignment capability of the agent. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap theory, we propose a taxonomy of four clarity levels: Detailed, Standard, Incomplete, and Ambiguous. We construct a rigorous dataset of 240 ecologically valid tasks across 25 applications, subject to strict review protocols. Furthermore, targeting evaluation in dynamic environments, we develop MUSE (Mobile User Satisfaction Evaluator), an automated framework utilizing an MLLM-as-a-judge multi-agent architecture. MUSE performs fine-grained auditing across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality. Empirical results on AmbiBench reveal the performance boundaries of SoTA agents across different clarity levels, quantify the gains derived from active interaction, and validate the strong correlation between MUSE and human judgment. This work redefines evaluation standards, laying the foundation for next-generation agents capable of truly understanding user intent.

21 pages, 7 figures
GameDevBench: Evaluating Agentic Capabilities Through Game Development 2026-02-11
Show

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times the amount of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.

Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets 2026-02-10
Show

Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0%$ of samples, while South American and African countries are severely under-represented with only $1.8%$ and $3.8%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data ($ρ= 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.

41 pages, 20 figures
Toward Ultra-Long-Horizon Sequential Model Editing 2026-02-10
Show

Model editing has emerged as a practical approach for mitigating factual errors and outdated knowledge in large language models (LLMs). Among existing methods, the Locate-and-Edit (L&E) paradigm is the dominant framework: it locates MLP parameters implicated in expressing a target fact, and then performs a localized update to rewrite that fact. However, long sequences of edits often trigger abrupt model collapse in L&E beyond a critical point. We empirically identify a strong correlation between collapse and explosive growth of edited MLP weight norms, and formally prove that commonly used L&E update rules can induce exponential norm growth across sequential edits in the absence of explicit norm control. To address this issue, we propose Norm-Anchor Scaling NAS, a plug-and-play norm-constrained strategy. Across extensive experiments, NAS delays the collapse point of representative L&E algorithms by more than 4 times and yields a 72.2% average relative gain in editing performance, requiring only a single additional line of code and incurring negligible computational overhead.

Scalable Delphi: Large Language Models for Structured Risk Estimation 2026-02-09
Show

Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard - the Delphi method - produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We investigate whether Large Language Models (LLMs) can serve as scalable proxies for structured expert elicitation. We propose Scalable Delphi, adapting the classical protocol for LLMs with diverse expert personas, iterative refinement, and rationale sharing. Because target quantities are typically unobservable, we develop an evaluation framework based on necessary conditions: calibration against verifiable proxies, sensitivity to evidence, and alignment with human expert judgment. We evaluate in the domain of AI-augmented cybersecurity risk, using three capability benchmarks and independent human elicitation studies. LLM panels achieve strong correlations with benchmark ground truth (Pearson r=0.87-0.95), improve systematically as evidence is added, and align with human expert panels - in one comparison, closer to a human panel than the two human panels are to each other. This demonstrates that LLM-based elicitation can extend structured expert judgment to settings where traditional methods are infeasible, reducing elicitation time from months to minutes.

Linearization Explains Fine-Tuning in Large Language Models 2026-02-09
Show

Parameter-Efficient Fine-Tuning (PEFT) is a popular class of techniques that strive to adapt large models in a scalable and resource-efficient manner. Yet, the mechanisms underlying their training performance and generalization remain underexplored. In this paper, we provide several insights into such fine-tuning through the lens of linearization. Fine-tuned models are often implicitly encouraged to remain close to the pretrained model. By making this explicit, using an Euclidean distance inductive bias in parameter space, we show that fine-tuning dynamics become equivalent to learning with the positive-definite neural tangent kernel (NTK). We specifically analyze how close the fully linear and the linearized fine-tuning optimizations are, based on the strength of the regularization. This allows us to be pragmatic about how good a model linearization is when fine-tuning large language models (LLMs). When linearization is a good model, our findings reveal a strong correlation between the eigenvalue spectrum of the NTK and the performance of model adaptation. Motivated by this, we give spectral perturbation bounds on the NTK induced by the choice of layers selected for fine-tuning. We empirically validate our theory on Low Rank Adaptation (LoRA) on LLMs. These insights not only characterize fine-tuning but also have the potential to enhance PEFT techniques, paving the way to better informed and more nimble adaptation in LLMs.

SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents 2026-02-08
Show

Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.

Content-Driven Frame-Level Bit Prediction for Rate Control in Versatile Video Coding 2026-02-05
Show

Rate control allocates bits efficiently across frames to meet a target bitrate while maintaining quality. Conventional two-pass rate control (2pRC) in Versatile Video Coding (VVC) relies on analytical rate-QP models, which often fail to capture nonlinear spatial-temporal variations, causing quality instability and high complexity due to multiple trial encodes. This paper proposes a content-adaptive framework that predicts frame-level bit consumption using lightweight features from the Video Complexity Analyzer (VCA) and quantization parameters within a Random Forest regression. On ultra-high-definition sequences encoded with VVenC, the model achieves strong correlation with ground truth, yielding R2 values of 0.93, 0.88, and 0.77 for I-, P-, and B-frames, respectively. Integrated into a rate-control loop, it achieves comparable coding efficiency to 2pRC while reducing total encoding time by 33.3%. The results show that VCA-driven bit prediction provides a computationally efficient and accurate alternative to conventional rate-QP models.

2026 ...

2026 IEEE International Symposium on Circuits and Systems (ISCAS)

Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation 2026-02-04
Show

Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.

The p...

The paper is published at Pattern Recognition Journal

Fast, Unsupervised Framework for Registration Quality Assessment of Multi-stain Histological Whole Slide Pairs 2026-02-03
Show

High-fidelity registration of histopathological whole slide images (WSIs), such as hematoxylin & eosin (H&E) and immunohistochemistry (IHC), is vital for integrated molecular analysis but challenging to evaluate without ground-truth (GT) annotations. Existing WSI-level assessments -- using annotated landmarks or intensity-based similarity metrics -- are often time-consuming, unreliable, and computationally intensive, limiting large-scale applicability. This study proposes a fast, unsupervised framework that jointly employs down-sampled tissue masks- and deformations-based metrics for registration quality assessment (RQA) of registered H&E and IHC WSI pairs. The masks-based metrics measure global structural correspondence, while the deformations-based metrics evaluate local smoothness, continuity, and transformation realism. Validation across multiple IHC markers and multi-expert assessments demonstrate a strong correlation between automated metrics and human evaluations. In the absence of GT, this framework offers reliable, real-time RQA with high fidelity and minimal computational resources, making it suitable for large-scale quality control in digital pathology.

Accep...

Accepted to IEEE ISBI 2026

DOME: Improving Signal-to-Noise in Stochastic Gradient Descent via Sharp-Direction Subspace Filtering 2026-02-02
Show

Stochastic gradients for deep neural networks exhibit strong correlations along the optimization trajectory, and are often aligned with a small set of Hessian eigenvectors associated with outlier eigenvalues. Recent work shows that projecting gradients away from this Hessian outlier subspace has little impact on optimization, despite capturing a large fraction of gradient variability. Since computing the Hessian is intractable in practice, we introduce a principled first-order characterization of the nuisance subspace based on the covariance of stochastic gradients, and propose an efficient method to estimate it online. We show that removing this subspace also has little impact on optimization, and yields practical benefits for applications sensitive to gradient signal-to-noise ratio such as gradient compression.

CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset 2026-02-02
Show

We introduce CUS-QA, a benchmark for evaluation of open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. We evaluate state-of-the-art LLMs through prompting and add human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results show that even the best open-weight LLMs achieve only over 40% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics show strong correlation with human judgment, while traditional string-overlap metrics perform surprisingly well due to the prevalence of named entities in answers.

Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models 2026-02-02
Show

Multimodal reward models are crucial for aligning multimodal large language models with human preferences. Recent works have incorporated reasoning capabilities into these models, achieving promising results. However, training these models suffers from two critical challenges: (1) the inherent noise in preference datasets, which degrades model performance, and (2) the inefficiency of conventional training methods, which ignore the differences in sample difficulty. In this paper, we identify a strong correlation between response entropy and accuracy, indicating that entropy can serve as a reliable and unsupervised proxy for annotation noise and sample difficulty. Based on this insight, we propose a novel Entropy-Guided Training (EGT) approach for multimodal reasoning reward models, which combines two strategies: (1) entropy-guided data curation to mitigate the impact of unreliable samples, and (2) an entropy-guided training strategy that progressively introduces more complex examples. Extensive experiments across three benchmarks show that the EGT-trained model consistently outperforms state-of-the-art multimodal reward models.

Latent Debate: A Surrogate Framework for Interpreting LLM Thinking 2026-02-01
Show

Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike the current work of self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model that has highly consistent predictions with the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns, such as a high degree of latent debates in the middle layers is linked to a higher risk of hallucinations. These findings position latent debate as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during the inference steps.

Preprint
VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis 2026-02-01
Show

We introduce and define a novel task-Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we constructed a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which, through an innovative programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we designed a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in terms of audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.

Accep...

Accepted by ICASSP 2026

Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment 2026-01-31
Show

Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.

Accep...

Accepted by AAAI 2026

FOCUS: DLLMs Know How to Tame Their Compute Bound 2026-01-30
Show

Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS -- an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands-lab/FOCUS.

22 pages, 15 figures
Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems 2026-01-29
Show

Individual agents in multi-agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly-built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that our ECL enables small models like Qwen 3-4B to outperform a history-agnostic baseline 8x its size (Qwen 3-30B) by accurately identifying reliable peers. ECL also boosts frontier models to near-perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation in trust modeling accuracy and final answer quality.

Codes...

Codes and data are available at https://github.com/skyriver-2000/epistemic-context-learning

DA-SPS: A Dual-stage Network based on Singular Spectrum Analysis, Patching-strategy and Spearman-correlation for Multivariate Time-series Prediction 2026-01-29
Show

Multivariate time-series forecasting, as a typical problem in the field of time series prediction, has a wide range of applications in weather forecasting, traffic flow prediction, and other scenarios. However, existing works do not effectively consider the impact of extraneous variables on the prediction of the target variable. On the other hand, they fail to fully extract complex sequence information based on various time patterns of the sequences. To address these drawbacks, we propose a DA-SPS model, which adopts different modules for feature extraction based on the information characteristics of different variables. DA-SPS mainly consists of two stages: the target variable processing stage (TVPS) and the extraneous variables processing stage (EVPS). In TVPS, the model first uses Singular Spectrum Analysis (SSA) to process the target variable sequence and then uses Long Short-Term Memory (LSTM) and P-Conv-LSTM which deploys a patching strategy to extract features from trend and seasonality components, respectively. In EVPS, the model filters extraneous variables that have a strong correlation with the target variate by using Spearman correlation analysis and further analyses them using the L-Attention module which consists of LSTM and attention mechanism. Finally, the results obtained by TVPS and EVPS are combined through weighted summation and linear mapping to produce the final prediction. The results on four public datasets demonstrate that the DA-SPS model outperforms existing state-of-the-art methods. Additionally, its performance in real-world scenarios is further validated using a private dataset collected by ourselves, which contains the test items' information on laptop motherboards.

12 pa...

12 pages, 7 figures, 6 tables, submitted to IEEE Transactions on Emerging Topics in Computational Intelligence

Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation 2026-01-27
Show

Traditional metrics like BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for a reference-free, multi-dimensional evaluation by treating the source and candidate as independent knowledge bases. CEF generates verifiable questions from each text and performs a cross-examination to derive three interpretable scores: Coverage, Conformity, and Consistency. Validated across translation, summarization and clinical note-generation, our framework identifies critical errors, such as content omissions and factual contradictions, missed by standard metrics. A key contribution is a systematic robustness analysis to select a stable judge model. Crucially, the strong correlation between our reference-free and with-reference modes validates CEF's reliability without gold references. Furthermore, human expert validation demonstrates that CEF mismatching questions align with meaning-altering semantic errors higher than with non-semantic errors, particularly excelling at identifying entity-based and relational distortions.

Link Representation Learning for Probabilistic Travel Time Estimation 2026-01-26
Show

Travel time estimation is a key task in navigation apps and web mapping services. Existing deterministic and probabilistic methods, based on the assumption of trip independence, predominantly focus on modeling individual trips while overlooking trip correlations. However, real-world conditions frequently introduce strong correlations between trips, influenced by external and internal factors such as weather and the tendencies of drivers. To address this, we propose a deep hierarchical joint probabilistic model ProbETA for travel time estimation, capturing both inter-trip and intra-trip correlations. The joint distribution of travel times across multiple trips is modeled as a low-rank multivariate Gaussian, parameterized by learnable link representations estimated using the empirical Bayes approach. We also introduce a data augmentation method based on trip sub-sampling, allowing for fine-grained gradient backpropagation when learning link representations. During inference, our model estimates the probability distribution of travel time for a queried trip, conditional on spatiotemporally adjacent completed trips. Evaluation on two real-world GPS trajectory datasets demonstrates that ProbETA outperforms state-of-the-art deterministic and probabilistic baselines, with Mean Absolute Percentage Error decreasing by over 12.60%. Moreover, the learned link representations align with the physical network geometry, potentially making them applicable for other tasks.

Controlling Language Difficulty in Dialogues with Linguistic Features 2026-01-26
Show

Large language models (LLMs) have emerged as powerful tools for supporting second language acquisition, particularly in simulating interactive dialogues for speaking practice. However, adapting the language difficulty of LLM-generated responses to match learners' proficiency levels remains a challenge. This work addresses this issue by proposing a framework for controlling language proficiency in educational dialogue systems. Our approach leverages three categories of linguistic features, readability features (e.g., Flesch-Kincaid Grade Level), syntactic features (e.g., syntactic tree depth), and lexical features (e.g., simple word ratio), to quantify and regulate text complexity. We demonstrate that training LLMs on linguistically annotated dialogue data enables precise modulation of language proficiency, outperforming prompt-based methods in both flexibility and stability. To evaluate this, we introduce Dilaprix, a novel metric integrating the aforementioned features, which shows strong correlation with expert judgments of language difficulty. Empirical results reveal that our approach achieves superior controllability of language proficiency while maintaining high dialogue quality.

15 pages,9 figures
Are We Evaluating the Edit Locality of LLM Model Editing Properly? 2026-01-24
Show

Model editing has recently emerged as a popular paradigm for efficiently updating knowledge in LLMs. A central desideratum of updating knowledge is to balance editing efficacy, i.e., the successful injection of target knowledge, and specificity (also known as edit locality), i.e., the preservation of existing non-target knowledge. However, we find that existing specificity evaluation protocols are inadequate for this purpose. We systematically elaborated on the three fundamental issues it faces. Beyond the conceptual issues, we further empirically demonstrate that existing specificity metrics are weakly correlated with the strength of specificity regularizers. We also find that current metrics lack sufficient sensitivity, rendering them ineffective at distinguishing the specificity performance of different methods. Finally, we propose a constructive evaluation protocol. Under this protocol, the conflict between open-ended LLMs and the assumption of determined answers is eliminated, query-independent fluency biases are avoided, and the evaluation strictness can be smoothly adjusted within a near-continuous space. Experiments across various LLMs, datasets, and editing methods show that metrics derived from the proposed protocol are more sensitive to changes in the strength of specificity regularizers and exhibit strong correlation with them, enabling more fine-grained discrimination of different methods' knowledge preservation capabilities.

VMask: Tunable Label Privacy Protection for Vertical Federated Learning via Layer Masking 2026-01-22
Show

Though vertical federated learning (VFL) is generally considered to be privacy-preserving, recent studies have shown that VFL system is vulnerable to label inference attacks originating from various attack surfaces. Among these attacks, the model completion (MC) attack is currently the most powerful one. Existing defense methods against it either sacrifice model accuracy or incur impractical computational overhead. In this paper, we propose VMask, a novel label privacy protection framework designed to defend against MC attack from the perspective of layer masking. Our key insight is to disrupt the strong correlation between input data and intermediate outputs by applying the secret sharing (SS) technique to mask layer parameters in the attacker's model. We devise a strategy for selecting critical layers to mask, reducing the overhead that would arise from naively applying SS to the entire model. Moreover, VMask is the first framework to offer a tunable privacy budget to defenders, allowing for flexible control over the levels of label privacy according to actual requirements. We built a VFL system, implemented VMask on it, and extensively evaluated it using five model architectures and 13 datasets with different modalities, comparing it to 12 other defense methods. The results demonstrate that VMask achieves the best privacy-utility trade-off, successfully thwarting the MC attack (reducing the label inference accuracy to a random guessing level) while preserving model performance (e.g., in Transformer-based model, the averaged drop of VFL model accuracy is only 0.09%). VMask's runtime is up to 60,846 times faster than cryptography-based methods, and it only marginally exceeds that of standard VFL by 1.8 times in a large Transformer-based model, which is generally acceptable.

Accep...

Accepted by Frontiers of Computer Science (FCS)

Ego4OOD: Rethinking Egocentric Video Domain Generalization via Covariate Shift Scoring 2026-01-21
Show

Egocentric video action recognition under domain shifts remains challenging due to large intra-class spatio-temporal variability, long-tailed feature distributions, and strong correlations between actions and environments. Existing benchmarks for egocentric domain generalization often conflate covariate shifts with concept shifts, making it difficult to reliably evaluate a model's ability to generalize across input distributions. To address this limitation, we introduce Ego4OOD, a domain generalization benchmark derived from Ego4D that emphasizes measurable covariate diversity while reducing concept shift through semantically coherent, moment-level action categories. Ego4OOD spans eight geographically distinct domains and is accompanied by a clustering-based covariate shift metric that provides a quantitative proxy for domain difficulty. We further leverage a one-vs-all binary training objective that decomposes multi-class action recognition into independent binary classification tasks. This formulation is particularly well-suited for covariate shift by reducing interference between visually similar classes under feature distribution shift. Using this formulation, we show that a lightweight two-layer fully connected network achieves performance competitive with state-of-the-art egocentric domain generalization methods on both Argo1M and Ego4OOD, despite using fewer parameters and no additional modalities. Our empirical analysis demonstrates a clear relationship between measured covariate shift and recognition performance, highlighting the importance of controlled benchmarks and quantitative domain characterization for studying out-of-distribution generalization in egocentric video.

VegaChat: A Robust Framework for LLM-Based Chart Generation and Assessment 2026-01-21
Show

Natural-language-to-visualization (NL2VIS) systems based on large language models (LLMs) have substantially improved the accessibility of data visualization. However, their further adoption is hindered by two coupled challenges: (i) the absence of standardized evaluation metrics makes it difficult to assess progress in the field and compare different approaches; and (ii) natural language descriptions are inherently underspecified, so multiple visualizations may be valid for the same query. To address these issues, we introduce VegaChat, a framework for generating, validating, and assessing declarative visualizations from natural language. We propose two complementary metrics: Spec Score, a deterministic metric that measures specification-level similarity without invoking an LLM, and Vision Score, a library-agnostic, image-based metric that leverages a multimodal LLM to assess chart similarity and prompt compliance. We evaluate VegaChat on the NLV Corpus and on the annotated subset of ChartLLM. VegaChat achieves near-zero rates of invalid or empty visualizations, while Spec Score and Vision Score exhibit strong correlation with human judgments (Pearson 0.65 and 0.71, respectively), indicating that the proposed metrics support consistent, cross-library comparison. The code and evaluation artifacts are available at https://zenodo.org/records/17062309.

8 pages, 9 figures
Whom We Trust, What We Fear: COVID-19 Fear and the Politics of Information 2026-01-21
Show

The COVID-19 pandemic triggered not only a global health crisis but also an infodemic, an overload of information from diverse sources influencing public perception and emotional responses. In this context, fear emerged as a central emotional reaction, shaped by both media exposure and demographic factors. In this study, we analyzed the relationship between individuals' self-reported levels of fear about COVID-19 and the information sources they rely on, across nine source categories, including medical experts, government institutions, media, and personal networks. In particular, we defined a score that ranks fear levels based on self-reported concerns about the pandemic, collected through the Delphi CTIS survey in the United States between May 2021 and June 2022. We found that both fear levels and information source usage closely follow COVID-19 infection trends, exhibit strong correlations within each group (fear levels across sources are strongly correlated, as are patterns of source usage), and vary significantly across demographic groups, particularly by age and education. Applying causal inference methods, we found that among age, education, and information source, the latter is the most influential factor affecting individuals' fear levels. We further quantified the impact of different information sources on fear by estimating the average treatment effect, indicating how each source alters fear relative to a control. Furthermore, we demonstrated that information source preferences can reliably match the political orientation of U.S. states. These findings highlight the importance of information ecosystem dynamics in shaping emotional and behavioral responses during large-scale crises.

New techniques to investigate the AGN-SF connection with integral field spectroscopy 2026-01-20
Show

Understanding the connection between active galactic nuclei and star-formation (the AGN-SF connection) is one of the longest standing problems in modern astrophysics. In the age of large Integral Field Unit (IFU) surveys, studies of the AGN-SF connection greatly benefit from spatially resolving AGN and SF contributions to study the two processes independently. Using IFU data for 54 local active galaxies from the S7 sample, we present a new method to separate emission from AGN activity and SF using mixing sequences observed in the [NII]$λ6584$/H$α$ vs. [OIII]$λ5007$/H$β$ Baldwin-Phillips-Terlevich (BPT) diagram. We use the new decomposition method to calculate the H$α$ star-formation rate and AGN [OIII] luminosity for the galaxies. Our new method is robust to outliers in the line-ratio distribution and can be applied to large galaxy samples with little manual intervention. We infer star-formation histories (SFHs) using pPXF, conducting detailed recovery tests to determine the quantities that can be considered robust. We test the correlation between the AGN Eddington ratio, using the proxy L[OIII]/$σ_*^4$, and star-formation properties. We find a moderately strong correlation between the Eddington ratio and the star-formation rate (SFR). We also observe marginally significant correlations between the AGN Eddington ratio and the light-weighted stellar age under 100 Myr. Our results point to higher AGN accretion being associated with young nuclear star formation under 100 Myr, consistent with timelines presented in previous studies. The correlations found in this paper are relatively weak; extending our methods to larger samples, including radio-quiet galaxies, will help better constrain the physical mechanisms and timescales of the AGN-SF connection.

26 pages, 23 figures
QASTAnet: A DNN-based Quality Metric for Spatial Audio 2026-01-19
Show

In the development of spatial audio technologies, reliable and shared methods for evaluating audio quality are essential. Listening tests are currently the standard but remain costly in terms of time and resources. Several models predicting subjective scores have been proposed, but they do not generalize well to real-world signals. In this paper, we propose QASTAnet (Quality Assessment for SpaTial Audio network), a new metric based on a deep neural network, specialized on spatial audio (ambisonics and binaural). As training data is scarce, we aim for the model to be trainable with a small amount of data. To do so, we propose to rely on expert modeling of the low-level auditory system and use a neurnal network to model the high-level cognitive function of the quality judgement. We compare its performance to two reference metrics on a wide range of content types (speech, music, ambiance, anechoic, reverberated) and focusing on codec artifacts. Results demonstrate that QASTAnet overcomes the aforementioned limitations of the existing methods. The strong correlation between the proposed metric prediction and subjective scores makes it a good candidate for comparing codecs in their development.

Minor...

Minor typo corrections, figure updates for clarity, add two references, no modifications of the results and their interpretation

Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization 2026-01-19
Show

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enables resource-efficient personalization or specialization, but it comes at the expense of additional hyperparameter tuning. Although LoRA makes fine-tuning efficient, it is highly sensitive to the choice of hyperparameters, and exhaustive hyperparameter search is still computationally very demanding. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into Bayesian Optimization (BO) to efficiently search for LoRA hyperparameters. To leverage the informed knowledge of LLMs, we repurpose LLMs as a discrete-to-continuous mapping to link the hyperparameters and their domain knowledge with a continuous vector space, where BO is conducted. We design and control the mapping by language prompting, where we provide a domain-aware textual prompt describing the relationships among hyperparameters and their respective roles; thereby, we explicitly inject domain knowledge about LoRA into the LLM in natural language. Also, we model the residual information that is hard to linguistically describe in the prompt with an additional learnable token. This aids BO to sample more high-performing hyperparameters. In addition, by leveraging the observation of the strong correlation between the respective performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation with a data subset. This further increases the efficiency of our method. We demonstrate that our hyperparameter found with only about 30 iterations achieves more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations.

Segment and Matte Anything in a Unified Model 2026-01-17
Show

Segment Anything (SAM) has recently pushed the boundaries of segmentation by demonstrating zero-shot generalization and flexible prompting after training on over one billion masks. Despite this, its mask prediction accuracy often falls short of the precision required in real-world applications. While several refinement modules have been proposed to boost SAM's segmentation quality, achieving highly accurate object delineation within a single, unified framework remains an open challenge. Furthermore, interactive image matting, which aims to generate fine-grained alpha mattes guided by diverse user hints, has not yet been explored in the context of SAM. Insights from recent studies highlight strong correlations between segmentation and matting, suggesting the feasibility of a unified model capable of both tasks. In this paper, we introduce Segment And Matte Anything (SAMA), a lightweight extension of SAM that delivers high-quality interactive image segmentation and matting with minimal extra parameters. Our Multi-View Localization Encoder (MVLE) captures detailed features from local views, while the Localization Adapter (Local-Adapter) refines mask outputs by recovering subtle boundary details. We also incorporate two prediction heads for each task into the architecture to generate segmentation and matting masks, simultaneously. Trained on a diverse dataset aggregated from publicly available sources, SAMA achieves state-of-the-art performance across multiple segmentation and matting benchmarks, showcasing its adaptability and effectiveness in a wide range of downstream tasks.

AAAI 2026
Relational Linearity is a Predictor of Hallucinations 2026-01-16
Show

Hallucination is a central failure mode in large language models (LLMs). We focus on hallucinations of answers to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities that are unknown to the model. Surprisingly, we find that medium-size models like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. We hypothesize that an important factor in causing these hallucinations is the linearity of the relation: linear relations tend to be stored more abstractly, making it difficult for the LLM to assess its knowledge; the facts of nonlinear relations tend to be stored more directly, making knowledge assessment easier. To investigate this hypothesis, we create SyntHal, a dataset of 6000 synthetic entities for six relations. In our experiments with four models, we determine, for each relation, the hallucination rate on SyntHal and also measure its linearity, using $Δ\cos$. We find a strong correlation ($r \in [.78,.82]$) between relational linearity and hallucination rate, providing evidence for our hypothesis that the underlying storage of triples of a relation is a factor in how well a model can self-assess its knowledge. This finding has implications for how to manage hallucination behavior and suggests new research directions for improving the representation of factual knowledge in LLMs.

11 pa...

11 pages, 4 figures, 8 tables

MindGuard: Intrinsic Decision Inspection for Securing LLM Agents Against Metadata Poisoning 2026-01-15
Show

The Model Context Protocol (MCP) is increasingly adopted to standardize the interaction between LLM agents and external tools. However, this trend introduces a new threat: Tool Poisoning Attacks (TPA), where tool metadata is poisoned to induce the agent to perform unauthorized operations. Existing defenses that primarily focus on behavior-level analysis are fundamentally ineffective against TPA, as poisoned tools need not be executed, leaving no behavioral trace to monitor. Thus, we propose MindGuard, a decision-level guardrail for LLM agents, providing provenance tracking of call decisions, policy-agnostic detection, and poisoning source attribution against TPA. While fully explaining LLM decision remains challenging, our empirical findings uncover a strong correlation between LLM attention mechanisms and tool invocation decisions. Therefore, we choose attention as an empirical signal for decision tracking and formalize this as the Decision Dependence Graph (DDG), which models the LLM's reasoning process as a weighted, directed graph where vertices represent logical concepts and edges quantify the attention-based dependencies. We further design robust DDG construction and graph-based anomaly analysis mechanisms that efficiently detect and attribute TPA attacks. Extensive experiments on real-world datasets demonstrate that MindGuard achieves 94%-99% average precision in detecting poisoned invocations, 95%-100% attribution accuracy, with processing times under one second and no additional token cost. Moreover, DDG can be viewed as an adaptation of the classical Program Dependence Graph (PDG), providing a solid foundation for applying traditional security policies at the decision level.

Kernel Limit for a Class of Recurrent Neural Networks Trained on Ergodic Data Sequences 2026-01-14
Show

Mathematical methods are developed to characterize the asymptotics of recurrent neural networks (RNN) as the number of hidden units, data samples in the sequence, hidden state updates, and training steps simultaneously grow to infinity. In the case of an RNN with a simplified weight matrix, we prove the convergence of the RNN to the solution of an infinite-dimensional ODE coupled with the fixed point of a random algebraic equation. The analysis requires addressing several challenges which are unique to RNNs. In typical mean-field applications (e.g., feedforward neural networks), discrete updates are of magnitude $\mathcal{O}(1/N)$ and the number of updates is $\mathcal{O}(N)$. Therefore, the system can be represented as an Euler approximation of an appropriate ODE/PDE, which it will converge to as $N \rightarrow \infty$. However, the RNN hidden layer updates are $\mathcal{O}(1)$. Therefore, RNNs cannot be represented as a discretization of an ODE/PDE and standard mean-field techniques cannot be applied. Instead, we develop a fixed point analysis for the evolution of the RNN memory states, with convergence estimates in terms of the number of update steps and the number of hidden units. The RNN hidden layer is studied as a function in a Sobolev space, whose evolution is governed by the data sequence (a Markov chain), the parameter updates, and its dependence on the RNN hidden layer at the previous time step. Due to the strong correlation between updates, a Poisson equation must be used to bound the fluctuations of the RNN around its limit equation. These mathematical methods give rise to the neural tangent kernel (NTK) limits for RNNs trained on data sequences as the number of data samples and size of the neural network grow to infinity.

Revis...

Revision in response to reviewers' comments. The mean-field random function has been replaced by a mean-field term. Some typos fixed. Minor title change

A Highly Efficient Diversity-based Input Selection for DNN Improvement Using VLMs 2026-01-12
Show

Maintaining or improving the performance of Deep Neural Networks (DNNs) through fine-tuning requires labeling newly collected inputs, a process that is often costly and time-consuming. To alleviate this problem, input selection approaches have been developed in recent years to identify small, yet highly informative subsets for labeling. Diversity-based selection is one of the most effective approaches for this purpose. However, they are often computationally intensive and lack scalability for large input sets, limiting their practical applicability. To address this challenge, we introduce Concept-Based Diversity (CBD), a highly efficient metric for image inputs that leverages Vision-Language Models (VLM). Our results show that CBD exhibits a strong correlation with Geometric Diversity (GD), an established diversity metric, while requiring only a fraction of its computation time. Building on this finding, we propose a hybrid input selection approach that combines CBD with Margin, a simple uncertainty metric. We conduct a comprehensive evaluation across a diverse set of DNN models, input sets, selection budgets, and five most effective state-of-the-art selection baselines. The results demonstrate that the CBD-based selection consistently outperforms all baselines at guiding input selection to improve the DNN model. Furthermore, the CBD-based selection approach remains highly efficient, requiring selection times close to those of simple uncertainty-based methods such as Margin, even on larger input sets like ImageNet. These results confirm not only the effectiveness and computational advantage of the CBD-based approach, particularly compared to hybrid baselines, but also its scalability in repetitive and extensive input selection scenarios.

Predicting Region of Interest in Human Visual Search Based on Statistical Texture and Gabor Features 2026-01-12
Show

Understanding human visual search behavior is a fundamental problem in vision science and computer vision, with direct implications for modeling how observers allocate attention in location-unknown search tasks. In this study, we investigate the relationship between Gabor-based features and gray-level co-occurrence matrix (GLCM) based texture features in modeling early-stage visual search behavior. Two feature-combination pipelines are proposed to integrate Gabor and GLCM features for narrowing the region of possible human fixations. The pipelines are evaluated using simulated digital breast tomosynthesis images. Results show qualitative agreement among fixation candidates predicted by the proposed pipelines and a threshold-based model observer. A strong correlation is observed between GLCM mean and Gabor feature responses, indicating that these features encode related image information despite their different formulations. Eye-tracking data from human observers further suggest consistency between predicted fixation regions and early-stage gaze behavior. These findings highlight the value of combining structural and texture-based features for modeling visual search and support the development of perceptually informed observer models.

10 pages, 6 fgures
SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models 2026-01-12
Show

Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing LALM studies on noise lack quantitative analysis and rely mainly on intuition and empirical observation, thus failing to understand practical robustness. To address this issue, we introduce Signal Embedding Energy (SEE), a method for quantifying the impact of noise intensity on LALM inputs, enabling the differentiation of LALM robustness in real-world deployments. SEE introduces a perspective based on structured activation subspaces derived from the model's internal representations, which more accurately captures its perception of noise than raw audio features. Across experiments, SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. Surprisingly, traditional audio denoising methods are only marginally effective for LALMs, and, in some cases, even increase SEE and impair performance. This suggests a mismatch between speech-centric denoising objectives and the noise sensitivity of modern LALMs. Therefore, we propose a mitigation strategy derived from SEE to denoise LALM inputs, outperforming existing denoising methods. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.

HERE: Hierarchical Active Exploration of Radiance Field with Epistemic Uncertainty Minimization 2026-01-12
Show

We present HERE, an active 3D scene reconstruction framework based on neural radiance fields, enabling high-fidelity implicit mapping. Our approach centers around an active learning strategy for camera trajectory generation, driven by accurate identification of unseen regions, which supports efficient data acquisition and precise scene reconstruction. The key to our approach is epistemic uncertainty quantification based on evidential deep learning, which directly captures data insufficiency and exhibits a strong correlation with reconstruction errors. This allows our framework to more reliably identify unexplored or poorly reconstructed regions compared to existing methods, leading to more informed and targeted exploration. Additionally, we design a hierarchical exploration strategy that leverages learned epistemic uncertainty, where local planning extracts target viewpoints from high-uncertainty voxels based on visibility for trajectory generation, and global planning uses uncertainty to guide large-scale coverage for efficient and comprehensive reconstruction. The effectiveness of the proposed method in active 3D reconstruction is demonstrated by achieving higher reconstruction completeness compared to previous approaches on photorealistic simulated scenes across varying scales, while a hardware demonstration further validates its real-world applicability.

Accep...

Accepted to IEEE RA-L. The first two authors contributed equally

Boosting Latent Diffusion Models via Disentangled Representation Alignment 2026-01-09
Show

Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, enabling encoding of attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning through aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers SiTs; experiments show Send-VAE significantly speeds up training and achieves a state-of-the-art FID of 1.21 and 1.75 with and without classifier-free guidance on ImageNet 256x256.

SPAM: Style Prompt Adherence Metric for Prompt-based TTS 2026-01-09
Show

Prompt-based text-to-speech (TTS) aims to generate speech that adheres to fine-grained style cues provided in a text prompt. However, most prior works depend on neither plausible nor faithful measures to evaluate prompt adherence. That is, they cannot ensure whether the evaluation is grounded on the prompt and is similar to a human. Thus, we present a new automatic metric, the Style Prompt Adherence Metric, which explicitly satisfies both plausibility and faithfulness. Inspired by the CLAP, our approach factorizes speech into acoustic attributes and aligns them with the style prompt. Also, we trained the scorer with a supervised contrastive loss, which could provide a clearer distinction between different semantics. We conducted two experiments on two perspectives. The plausibility experiment showed that SPAM achieved a strong correlation with the mean opinion score (MOS). Also, the faithfulness experiment demonstrated that SPAM is successfully grounded to the given style prompt, as it can discriminate different semantics of the prompt. We believe that SPAM can provide a viable automatic solution for evaluating style prompt adherence of synthesized speech.

Expert Preference-based Evaluation of Automated Related Work Generation 2026-01-09
Show

Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Although large language models (LLMs) show promising potential in this task, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific criteria and the ability to discern expert preferences. Conventional task-agnostic automatic evaluation metrics and LLM-as-a-judge systems, primarily designed for mainstream NLP tasks, are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support realistic human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Our framework decomposes the evaluation into smaller fine-grained dimensions. This localized evaluation is further augmented with contrastive examples to provide detailed contextual guidance for the evaluation dimensions. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the assessment of human experts. We also observe that generations from state-of-the-art (SoTA) LLMs struggle to satisfy validation constraints of a suitable related work section.

Proje...

Project page: https://ukplab.github.io/arxiv2025-expert-eval-rw/

Global law of conjugate kernel random matrices with heavy-tailed weights 2026-01-06
Show

We study the asymptotic spectral distribution of the conjugate kernel random matrix $YY^\top$, where $Y= f(WX)$ arises from a two-layer neural network model. We consider the setting where $W$ and $X$ are random rectangular matrices with i.i.d.\ entries, where the entries of $W$ follow a heavy-tailed distribution, while those of $X$ have light tails. Our assumptions on $W$ include a broad class of heavy-tailed distributions, such as symmetric $α$-stable laws with $α\in ]0,2[$ and sparse matrices with $\mathcal{O}(1)$ nonzero entries per row. The activation function $f$, applied entrywise, is bounded, smooth, odd, and nonlinear. We compute the limiting eigenvalue distribution of $YY^\top$ through its moments and show that heavy-tailed weights induce strong correlations between the entries of $Y$, resulting in richer and fundamentally different spectral behavior compared to the light-tailed case.

48 pages, 7 figures
PatentMind: A Multi-Aspect Reasoning Graph for Patent Similarity Evaluation 2026-01-05
Show

Patent similarity evaluation plays a critical role in intellectual property analysis. However, existing methods often overlook the intricate structure of patent documents, which integrate technical specifications, legal boundaries, and application contexts. We introduce PatentMind, a novel framework for patent similarity assessment based on a Multi-Aspect Reasoning Graph (MARG). PatentMind decomposes patents into their three dimensions of technical features, application domains, and claim scopes, then dimension-specific similarity scores are calculated over the MARG. These scores are dynamically weighted through a context-aware reasoning process, which integrates contextual signals to emulate expert-level judgment. To support evaluation, we construct a human-annotated benchmark PatentSimBench, comprising 500 patent pairs. Experimental results demonstrate that the PatentMind-generated scores show a strong correlation ($r=0.938$) with expert annotations, significantly outperforming embedding-based models, patent-specific models, and advanced prompt engineering methods. Beyond computational linguistics, our framework provides a structured and semantically grounded foundation for real-world decision-making, particularly for tasks such as infringement risk assessment, underscoring its broader impact on both patent analytics and evaluation.

Predicting Early and Complete Drug Release from Long-Acting Injectables Using Explainable Machine Learning 2026-01-05
Show

Polymer-based long-acting injectables (LAIs) have transformed the treatment of chronic diseases by enabling controlled drug delivery, thus reducing dosing frequency and extending therapeutic duration. Achieving controlled drug release from LAIs requires extensive optimization of the complex underlying physicochemical properties. Machine learning (ML) can accelerate LAI development by modeling the complex relationships between LAI properties and drug release. However, recent ML studies have provided limited information on key properties that modulate drug release, due to the lack of custom modeling and analysis tailored to LAI data. This paper presents a novel data transformation and explainable ML approach to synthesize actionable information from 321 LAI formulations by predicting early drug release at 24, 48, and 72 hours, classification of release profile types, and prediction of complete release profiles. These three experiments investigate the contribution and control of LAI material characteristics in early and complete drug release profiles. A strong correlation (>0.65) is observed between the true and predicted drug release in 72 hours, while a 0.87 F1-score is obtained in classifying release profile types. A time-independent ML framework predicts delayed biphasic and triphasic curves with better performance than current time-dependent approaches. Shapley additive explanations reveal the relative influence of material characteristics during early and for complete release which fill several gaps in previous in-vitro and ML-based studies. The novel approach and findings can provide a quantitative strategy and recommendations for scientists to optimize the drug-release dynamics of LAI. The source code for the model implementation is publicly available.

Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization 2026-01-05
Show

Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.

computational chemistry

Title Date Abstract Comment
Multiconfiguration Pair-Density Functional Theory Calculations of Ground and Excited States of Complex Chemical Systems with Quantum Computers 2026-02-11
Show

Accurately describing strong electron correlation in complex systems remains a prominent challenge in computational chemistry as near-term quantum algorithms treating total correlation often require prohibitively deep circuits. Here we present a hybrid strategy combining the Variational Quantum Eigensolver with Multiconfiguration Pair-Density Functional Theory to efficiently decouple correlation effects. This approach confines static correlation to a compact multireference quantum state while recovering dynamic correlation through a classical on-top density functional using reduced-density information. By enabling self-consistent orbital optimization, the method significantly reduces quantum resource overheads without sacrificing physical rigor. We demonstrate chemical accuracy on standard benchmarks by reproducing C$_2$ equilibrium bond lengths and benzene excitation energies with mean absolute errors of 0.006 Å and 0.048 eV respectively. Most notably, for the strongly correlated Cr$_2$ dimer requiring a large complete active space (48e, 42o), the framework yields a bound potential-energy curve and recovers qualitative dissociation behavior despite realistic hardware noise. These results establish that separating correlation types provides a practical route to reliable predictions on near-term quantum hardware.

Two-dimensional RMSD projections for reaction path visualization and validation 2026-02-09
Show

Transition state or minimum energy path finding methods constitute a routine component of the computational chemistry toolkit. Standard analysis involves trajectories conventionally plotted in terms of the relative energy to the initial state against a cumulative displacement variable, or the image number. These dimensional reductions obscure structural rearrangements in high dimensions and are often history dependent. This precludes the ability to compare optimization histories of different methods beyond the number of calculations, time taken, and final saddle geometry. We present a method mapping trajectories onto a two-dimension projection defined by a permutation corrected root mean square deviation from the reactant and product configurations. Energy is represented as an interpolated color-mapped surface constructed from all optimization steps using a gradient aware derivative Gaussian Process. This representation highlights optimization trajectories, identifies endpoint basins, and diagnoses convergence concerns invisible in one-dimensional profiles. We demonstrate the framework on a cycloaddition reaction, showing that a machine-learned potential saddle and density functional theory reference lie on comparable energy contours despite geometric displacements, along with the ratification of the visualization for more complex reactions, a grignard rearrangement, and a bicyclobutadiene rearrangement.

6 pages, 2 figures
Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds 2026-02-07
Show

Geometric data and purpose-built generative models on them have become ubiquitous in high-impact deep learning application domains, ranging from protein backbone generation and computational chemistry to geospatial data. Current geometric generative models remain computationally expensive at inference -- requiring many steps of complex numerical simulation -- as they are derived from dynamical measure transport frameworks such as diffusion and flow-matching on Riemannian manifolds. In this paper, we propose Generalised Flow Maps (GFM), a new class of few-step generative models that generalises the Flow Map framework in Euclidean spaces to arbitrary Riemannian manifolds. We instantiate GFMs with three self-distillation-based training methods: Generalised Lagrangian Flow Maps, Generalised Eulerian Flow Maps, and Generalised Progressive Flow Maps. We theoretically show that GFMs, under specific design decisions, unify and elevate existing Euclidean few-step generative models, such as consistency models, shortcut models, and meanflows, to the Riemannian setting. We benchmark GFMs against other geometric generative models on a suite of geometric datasets, including geospatial data, RNA torsion angles, and hyperbolic manifolds, and achieve state-of-the-art sample quality for single- and few-step evaluations, and superior or competitive log-likelihoods using the implicit probability flow.

Under review
Amortized Sampling with Transferable Normalizing Flows 2026-02-04
Show

Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest towards overcoming this limitation through learning sampling algorithms. Despite performing competitively with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We demonstrate that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 285 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based fine-tuning procedure to achieve competitive performance to established methods such as sequential Monte Carlo. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and objectives.

Prese...

Presented at NeurIPS 2025

El Agente Quntur: A research collaborator agent for quantum chemistry 2026-02-04
Show

Quantum chemistry is a foundational enabling tool for the fields of chemistry, materials science, computational biology and others. Despite of its power, the practical application of quantum chemistry simulations remains in the hands of qualified experts due to methodological complexity, software heterogeneity, and the need for informed interpretation of results. To bridge the accessibility gap for these tools and expand their reach to chemists with broader backgrounds, we introduce El Agente Quntur, a hierarchical, multi-agent AI system designed to operate not merely as an automation tool but as a research collaborator for computational quantum chemistry. Quntur was designed following three main strategies: i) elimination of hard-coded procedural policies in favour of reasoning-driven decisions, ii) construction of general and composable actions that facilitate generalization and efficiency, and iii) implementation of guided deep research to integrate abstract quantum-chemical reasoning across subdisciplines and a detailed understanding of the software's internal logic and syntax. Although instantiated in ORCA, these design principles are applicable to research agents more generally and easily expandable to additional quantum chemistry packages and beyond. Quntur supports the full range of calculations available in ORCA 6.0 and reasons over software documentation and scientific literature to plan, execute, adapt, and analyze in silico chemistry experiments following best practices. We discuss the advances and current bottlenecks in agentic systems operating at the research level in computational chemistry, and outline a roadmap toward a fully autonomous end-to-end computational chemistry research agent.

QUASAR: A Universal Autonomous System for Atomistic Simulation and a Benchmark of Its Capabilities 2026-01-30
Show

The integration of large language models (LLMs) into materials science offers a transformative opportunity to streamline computational workflows, yet current agentic systems remain constrained by rigid tool-calling approaches and narrowly scoped agents. In this work, we introduce QUASAR, a universal autonomous system for atomistic simulation designed to facilitate production-grade scientific discovery. QUASAR autonomously orchestrates complex multi-scale workflows across diverse methods, including density functional theory, machine learning potentials, molecular dynamics, and Monte Carlo simulations. The system incorporates robust mechanisms for adaptive planning, context-efficient memory management, and hybrid knowledge retrieval to navigate real-world research scenarios without human intervention. We benchmark QUASAR against a series of three-tiered tasks, progressing from routine tasks to frontier research challenges such as photocatalyst screening and novel material assessment. These results suggest that QUASAR can function as a general atomistic reasoning system rather than a task-specific automation framework. They also provide initial evidence supporting the potential deployment of agentic AI as a component of computational chemistry research workflows, while identifying areas requiring further development.

12 pages, 2 figures
Have ASkotch: A Neat Solution for Large-scale Kernel Ridge Regression 2026-01-27
Show

Kernel ridge regression (KRR) is a fundamental computational tool, appearing in problems that range from computational chemistry to health analytics, with a particular interest due to its starring role in Gaussian process regression. However, full KRR solvers are challenging to scale to large datasets: both direct (i.e., Cholesky decomposition) and iterative methods (i.e., PCG) incur prohibitive computational and storage costs. The standard approach to scale KRR to large datasets chooses a set of inducing points and solves an approximate version of the problem, inducing points KRR. However, the resulting solution tends to have worse predictive performance than the full KRR solution. In this work, we introduce a new solver, ASkotch, for full KRR that provides better solutions faster than state-of-the-art solvers for full and inducing points KRR. ASkotch is a scalable, accelerated, iterative method for full KRR that provably obtains linear convergence. Under appropriate conditions, we show that ASkotch obtains condition-number-free linear convergence. This convergence analysis rests on the theory of ridge leverage scores and determinantal point processes. ASkotch outperforms state-of-the-art KRR solvers on a testbed of 23 large-scale KRR regression and classification tasks derived from a wide range of application domains, demonstrating the superiority of full KRR over inducing points KRR. Our work opens up the possibility of as-yet-unimagined applications of full KRR across a number of disciplines.

63 pa...

63 pages (including appendices), 17 figures, 6 tables

Global Optimization of Atomic Clusters via Physically-Constrained Tensor Train Decomposition 2026-01-26
Show

The global optimization of atomic clusters represents a fundamental challenge in computational chemistry and materials science due to the exponential growth of local minima with system size (i.e., the curse of dimensionality). We introduce a novel framework that overcomes this limitation by exploiting the low-rank structure of potential energy surfaces through Tensor Train (TT) decomposition. Our approach combines two complementary TT-based strategies: the algebraic TTOpt method, which utilizes maximum volume sampling, and the probabilistic PROTES method, which employs generative sampling. A key innovation is the development of physically-constrained encoding schemes that incorporate molecular constraints directly into the discretization process. We demonstrate the efficacy of our method by identifying global minima of Lennard-Jones clusters containing up to 45 atoms. Furthermore, we establish its practical applicability to real-world systems by optimizing 20-atom carbon clusters using a machine-learned Moment Tensor Potential, achieving geometries consistent with quantum-accurate simulations. This work establishes TT-decomposition as a powerful tool for molecular structure prediction and provides a general framework adaptable to a wide range of high-dimensional optimization problems in computational material science.

18 pages, 6 figures
Endowing Molecular Language with Geometry Perception via Modality Compensation for High-Throughput Quantum Hamiltonian Prediction 2026-01-22
Show

The quantum Hamiltonian is a fundamental property that governs a molecule's electronic structure and behavior, and its calculation and prediction are paramount in computational chemistry and materials science. Accurate prediction is highly reliant on extensive training data, including precise molecular geometries and the Hamiltonian matrices, which are expensive to acquire via either experimental or computational methods. Towards a fast yet accurate method for Hamiltonian prediction, we first introduce a geometry information-aware molecular language model to bypass the use of expensive molecular geometries by only using the readily available molecular language -- simplified molecular input line entry system (SMILES). Our method employs multimodal alignment to bridge the relationship between SMILES strings and their corresponding molecular geometries. Recognizing that the molecular language inherently lacks explicit geometric information, we propose a geometry modality compensation strategy to imbue molecular language representations with essential geometric features, thereby enabling accurate predictions using SMILES. In addition, given the high cost of acquiring Hamiltonian data, we devise a weakly supervised strategy to fine-tune the molecular language model, thus improving the data efficiency. Theoretically, we prove that the prediction generalization error without explicit molecular geometry can be bounded through our modality compensation scheme. Empirically, our method achieves superior computational efficiency, providing up to 100x speedup over conventional quantum mechanical methods while maintaining comparable prediction accuracy. We further demonstrate the practical case study of our approach in the screening of electrolyte formulations.

Riemannian Denoising Model for Molecular Structure Optimization with Chemical Accuracy 2026-01-19
Show

We introduce a framework for molecular structure optimization using denoising model on a physics-informed Riemannian manifold (R-DM). Unlike conventional approaches operating in Euclidean space, our method leverages a Riemannian metric that better aligns with molecular energy change, enabling more robust modeling of potential energy surfaces. By incorporating internal coordinates reflective of energetic properties, R-DM achieves chemical accuracy with an energy error below 1 kcal/mol. Comparative evaluations on QM9, QM7-X, and GEOM datasets demonstrate improvements in both structural and energetic accuracy, surpassing conventional Euclidean-based denoising models. This approach highlights the potential of physics-informed coordinates for tackling complex molecular optimization problems, with implications for tasks in computational chemistry and materials science.

Canonicalization of Batched Einstein Summations for Tuning Retrieval 2026-01-18
Show

We present an algorithm for normalizing \emph{Batched Einstein Summation} expressions by mapping mathematically equivalent formulations to a unique normal form. Batches of einsums with the same Einstein notation that exhibit substantial data reuse appear frequently in finite element methods (FEM), numerical linear algebra, and computational chemistry. To effectively exploit this temporal locality for high performance, we consider groups of einsums in batched form. Representations of equivalent batched einsums may differ due to index renaming, permutations within the batch, and, due to the commutativity and associativity of multiplication operation. The lack of a canonical representation hinders the reuse of optimization and tuning knowledge in software systems. To this end, we develop a novel encoding of batched einsums as colored graphs and apply graph canonicalization to derive a normal form. In addition to the canonicalization algorithm, we propose a representation of einsums using functional array operands and provide a strategy to transfer transformations operating on the normal form to \emph{functional batched einsums} that exhibit the same normal form; crucial for fusing surrounding computations for memory bound einsums. We evaluate our approach against JAX, and observe a geomean speedup of $4.7\times$ for einsums from the TCCG benchmark suite and an FEM solver.

BLIPs: Bayesian Learned Interatomic Potentials 2026-01-16
Show

Machine Learning Interatomic Potentials (MLIPs) are becoming a central tool in simulation-based chemistry. However, like most deep learning models, MLIPs struggle to make accurate predictions on out-of-distribution data or when trained in a data-scarce regime, both common scenarios in simulation-based chemistry. Moreover, MLIPs do not provide uncertainty estimates by construction, which are fundamental to guide active learning pipelines and to ensure the accuracy of simulation results compared to quantum calculations. To address this shortcoming, we propose BLIPs: Bayesian Learned Interatomic Potentials. BLIP is a scalable, architecture-agnostic variational Bayesian framework for training or fine-tuning MLIPs, built on an adaptive version of Variational Dropout. BLIP delivers well-calibrated uncertainty estimates and minimal computational overhead for energy and forces prediction at inference time, while integrating seamlessly with (equivariant) message-passing architectures. Empirical results on simulation-based computational chemistry tasks demonstrate improved predictive accuracy with respect to standard MLIPs, and trustworthy uncertainty estimates, especially in data-scarse or heavy out-of-distribution regimes. Moreover, fine-tuning pretrained MLIPs with BLIP yields consistent performance gains and calibrated uncertainties.

A large language model-type architecture for high-dimensional molecular potential energy surfaces 2025-12-26
Show

Computing high-dimensional potential energy surfaces for molecular systems and materials is considered to be a great challenge in computational chemistry with potential impact in a range of areas including the fundamental prediction of reaction rates. In this paper, we design and discuss an algorithm that has similarities to large language models in generative AI and natural language processing. Specifically, we represent a molecular system as a graph which contains a set of nodes, edges, faces, etc. Interactions between these sets, which represent molecular subsystems in our case, are used to construct the potential energy surface for a reasonably sized chemical system with 51 nuclear dimensions. For this purpose, a family of neural networks that pertain to the graph-theoretically obtained subsystems get the job done for this 51 nuclear dimensional system. We then ask if this same family of lower-dimensional graph-based neural networks can be transformed to provide accurate predictions for a 186-dimensional potential energy surface. We find that our algorithm does provide accurate results for this larger-dimensional problem with sub-kcal/mol accuracy for the higher-dimensional potential energy surface problem. Indeed, as a result of these developments, here we produce the first efforts towards a full-dimensional potential energy surface for the protonated 21-water cluster (186 nuclear dimensions) at CCSD level accuracy.

31 pages, 35 figures
Shoot from the HIP: Hessian Interatomic Potentials without derivatives 2025-12-14
Show

Fundamental tasks in computational chemistry, from transition state search to vibrational analysis, rely on molecular Hessians, which are the second derivatives of the potential energy. Yet, Hessians are computationally expensive to calculate and scale poorly with system size, with both quantum mechanical methods and neural networks. In this work, we demonstrate that Hessians can be predicted directly from a deep learning model, without relying on automatic differentiation or finite differences. We observe that one can construct SE(3)-equivariant, symmetric Hessians from irreducible representations (irrep) features up to degree $l$=2 computed during message passing in graph neural networks. This makes HIP Hessians one to two orders of magnitude faster, more accurate, more memory efficient, easier to train, and enables more favorable scaling with system size. We validate our predictions across a wide range of downstream tasks, demonstrating consistently superior performance for transition state search, accelerated geometry optimization, zero-point energy corrections, and vibrational analysis benchmarks. We open-source the HIP codebase and model weights to enable further development of the direct prediction of Hessians at https://github.com/BurgerAndreas/hip

https...

https://github.com/BurgerAndreas/hip

Unveiling Latent Knowledge in Chemistry Language Models through Sparse Autoencoders 2025-12-08
Show

Since the advent of machine learning, interpretability has remained a persistent challenge, becoming increasingly urgent as generative models support high-stakes applications in drug and material discovery. Recent advances in large language model (LLM) architectures have yielded chemistry language models (CLMs) with impressive capabilities in molecular property prediction and molecular generation. However, how these models internally represent chemical knowledge remains poorly understood. In this work, we extend sparse autoencoder techniques to uncover and examine interpretable features within CLMs. Applying our methodology to the Foundation Models for Materials (FM4M) SMI-TED chemistry foundation model, we extract semantically meaningful latent features and analyse their activation patterns across diverse molecular datasets. Our findings reveal that these models encode a rich landscape of chemical concepts. We identify correlations between specific latent features and distinct domains of chemical knowledge, including structural motifs, physicochemical properties, and pharmacological drug classes. Our approach provides a generalisable framework for uncovering latent knowledge in chemistry-focused AI systems. This work has implications for both foundational understanding and practical deployment; with the potential to accelerate computational chemistry research.

OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction 2025-12-07
Show

Accurately predicting experimentally-realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry called crystal structure prediction (CSP). Efficiently solving this problem has implications ranging from pharmaceuticals to organic semiconductors, as crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale 100M parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To efficiently scale OXtal, we abandon explicit equivariant architectures imposing inductive bias arising from crystal symmetries in favor of data augmentation strategies. We further propose a novel crystallization-inspired lattice-free training scheme, Stoichiometric Stochastic Shell Sampling ($S^4$), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization -- thus enabling more scalable architectural choices at all-atom resolution. By leveraging a large dataset of 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio machine learning CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal recovers experimental structures with conformer $\text{RMSD}_1<0.5$ Å and attains over 80% packing similarity rate, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.

Learning Biomolecular Motion: The Physics-Informed Machine Learning Paradigm 2025-11-10
Show

The convergence of statistical learning and molecular physics is transforming our approach to modeling biomolecular systems. Physics-informed machine learning (PIML) offers a systematic framework that integrates data-driven inference with physical constraints, resulting in models that are accurate, mechanistic, generalizable, and able to extrapolate beyond observed domains. This review surveys recent advances in physics-informed neural networks and operator learning, differentiable molecular simulation, and hybrid physics-ML potentials, with emphasis on long-timescale kinetics, rare events, and free-energy estimation. We frame these approaches as solutions to the "biomolecular closure problem", recovering unresolved interactions beyond classical force fields while preserving thermodynamic consistency and mechanistic interpretability. We examine theoretical foundations, tools and frameworks, computational trade-offs, and unresolved issues, including model expressiveness and stability. We outline prospective research avenues at the intersection of machine learning, statistical physics, and computational chemistry, contending that future advancements will depend on mechanistic inductive biases, and integrated differentiable physical learning frameworks for biomolecular simulation and discovery.

31 pa...

31 pages, 4 figures, 3 tables. Review article

An Analytic Theory of Quantum Imaginary Time Evolution 2025-10-26
Show

Quantum imaginary time evolution (QITE) algorithm is one of the most promising variational quantum algorithms (VQAs), bridging the current era of Noisy Intermediate-Scale Quantum devices and the future of fully fault-tolerant quantum computing. Although practical demonstrations of QITE and its potential advantages over the general VQA trained with vanilla gradient descent (GD) in certain tasks have been reported, a first-principle, theoretical understanding of QITE remains limited. Here, we aim to develop an analytic theory for the dynamics of QITE. First, we show that QITE can be interpreted as a form of a general VQA trained with Quantum Natural Gradient Descent (QNGD), where the inverse quantum Fisher information matrix serves as the learning-rate tensor. This equivalence is established not only at the level of gradient update rules, but also through the action principle: the variational principle can be directly connected to the geometric geodesic distance in the quantum Fisher information metric, up to an integration constant. Second, for wide quantum neural networks, we employ the quantum neural tangent kernel framework to construct an analytic model for QITE. We prove that QITE always converges faster than GD-based VQA, though this advantage is suppressed by the exponential growth of Hilbert space dimension. This helps explain certain experimental results in quantum computational chemistry. Our theory encompasses linear, quadratic, and more general loss functions. We validate the analytic results through numerical simulations. Our findings establish a theoretical foundation for QITE dynamics and provide analytic insights for the first-principle design of variational quantum algorithms.

35 pages, 8 figures
The dark side of the forces: assessing non-conservative force models for atomistic machine learning 2025-10-25
Show

The use of machine learning to estimate the energy of a group of atoms, and the forces that drive them to more stable configurations, has revolutionized the fields of computational chemistry and materials discovery. In this domain, rigorous enforcement of symmetry and conservation laws has traditionally been considered essential. For this reason, interatomic forces are usually computed as the derivatives of the potential energy, ensuring energy conservation. Several recent works have questioned this physically constrained approach, suggesting that directly predicting the forces yields a better trade-off between accuracy and computational efficiency, and that energy conservation can be learned during training. This work investigates the applicability of such non-conservative models in microscopic simulations. We identify and demonstrate several fundamental issues, from ill-defined convergence of geometry optimization to instability in various types of molecular dynamics. Given the difficulty in monitoring and correcting the lack of energy conservation, direct forces should be used with great care. We show that the best approach to exploit the acceleration they afford is to use them in conjunction with conservative forces. A model can be pre-trained efficiently on direct forces, then fine-tuned using backpropagation. At evaluation time, both force types can be used together to avoid unphysical effects while still benefitting almost entirely from the computational efficiency of direct forces.

ICML 2025 (Oral)
Fast Non-Log-Concave Sampling under Nonconvex Equality and Inequality Constraints with Landing 2025-10-24
Show

Sampling from constrained statistical distributions is a fundamental task in various fields including Bayesian statistics, computational chemistry, and statistical physics. This article considers the cases where the constrained distribution is described by an unconstrained density, as well as additional equality and/or inequality constraints, which often make the constraint set nonconvex. Existing methods for nonconvex constraint set $Σ\subset \mathbb{R}^d$ defined by equality or inequality constraints commonly rely on costly projection steps. Moreover, they cannot handle equality and inequality constraints simultaneously as each method only specialized in one case. In addition, rigorous and quantitative convergence guarantee is often lacking. In this paper, we introduce Overdamped Langevin with LAnding (OLLA), a new framework that can design overdamped Langevin dynamics accommodating both equality and inequality constraints. The proposed dynamics also deterministically corrects trajectories along the normal direction of the constraint surface, thus obviating the need for explicit projections. We show that, under suitable regularity conditions on the target density and $Σ$, OLLA converges exponentially fast in $W_2$ distance to the constrained target density $ρ_Σ(x) \propto \exp(-f(x))dσ_Σ$. Lastly, through experiments, we demonstrate the efficiency of OLLA compared to projection-based constrained Langevin algorithms and their slack variable variants, highlighting its favorable computational cost and reasonable empirical mixing.

62 pages
Revisiting Orbital Minimization Method for Neural Operator Decomposition 2025-10-24
Show

Spectral decomposition of linear operators plays a central role in many areas of machine learning and scientific computing. Recent work has explored training neural networks to approximate eigenfunctions of such operators, enabling scalable approaches to representation learning, dynamical systems, and partial differential equations (PDEs). In this paper, we revisit a classical optimization framework from the computational physics literature known as the \emph{orbital minimization method} (OMM), originally proposed in the 1990s for solving eigenvalue problems in computational chemistry. We provide a simple linear-algebraic proof of the consistency of the OMM objective, and reveal connections between this method and several ideas that have appeared independently across different domains. Our primary goal is to justify its broader applicability in modern learning pipelines. We adapt this framework to train neural networks to decompose positive semidefinite operators, and demonstrate its practical advantages across a range of benchmark tasks. Our results highlight how revisiting classical numerical methods through the lens of modern theory and computation can provide not only a principled approach for deploying neural networks in numerical simulation, but also effective and scalable tools for machine learning.

25 pa...

25 pages, 8 figures. To appear at NeurIPS 2025

Superior Molecular Representations from Intermediate Encoder Layers 2025-10-15
Show

Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we first analyze the information flow in five diverse molecular encoders and find that intermediate layers retain more general-purpose features, whereas the final-layer specializes and compresses information. We then perform an empirical layer-wise evaluation across 22 property prediction tasks. We find that using frozen embeddings from optimal intermediate layers improves downstream performance by an average of 5.4%, up to 28.6%, compared to the final-layer. Furthermore, finetuning encoders truncated at intermediate depths achieves even greater average improvements of 8.5%, with increases as high as 40.8%, obtaining new state-of-the-art results on several benchmarks. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code will be made publicly available.

Adapting Quantum Machine Learning for Energy Dissociation of Bonds 2025-10-08
Show

Accurate prediction of bond dissociation energies (BDEs) underpins mechanistic insight and the rational design of molecules and materials. We present a systematic, reproducible benchmark comparing quantum and classical machine learning models for BDE prediction using a chemically curated feature set encompassing atomic properties (atomic numbers, hybridization), bond characteristics (bond order, type), and local environmental descriptors. Our quantum framework, implemented in Qiskit Aer on six qubits, employs ZZFeatureMap encodings with variational ansatz (RealAmplitudes) across multiple architectures Variational Quantum Regressors (VQR), Quantum Support Vector Regressors (QSVR), Quantum Neural Networks (QNN), Quantum Convolutional Neural Networks (QCNN), and Quantum Random Forests (QRF). These are rigorously benchmarked against strong classical baselines, including Support Vector Regression (SVR), Random Forests (RF), and Multi-Layer Perceptrons (MLP). Comprehensive evaluation spanning absolute and relative error metrics, threshold accuracies, and error distributions shows that top-performing quantum models (QCNN, QRF) match the predictive accuracy and robustness of classical ensembles and deep networks, particularly within the chemically prevalent mid-range BDE regime. These findings establish a transparent baseline for quantum-enhanced molecular property prediction and outline a practical foundation for advancing quantum computational chemistry toward near chemical accuracy.

QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry 2025-10-04
Show

Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this blank, we propose QCBench, a Quantitative Chemistry oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields, which contains analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry and quantum chemistry. To systematically evaluate the mathematical reasoning abilities of large language models (LLMs), they are categorized into three tiers: easy, medium, and difficult. Each problem, rooted in realistic chemical scenarios, is structured to prevent heuristic shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 24 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy. Code for QCBench is available at https://github.com/jiaqingxie/QCBench.

Revis...

Revision at Journal of Chemical Information and Modeling

Machine learning for accuracy in density functional approximations 2025-10-02
Show

Machine learning techniques have found their way into computational chemistry as indispensable tools to accelerate atomistic simulations and materials design. In addition, machine learning approaches hold the potential to boost the predictive power of computationally efficient electronic structure methods, such as density functional theory, to chemical accuracy and to correct for fundamental errors in density functional approaches. Here, recent progress in applying machine learning to improve the accuracy of density functional and related approximations is reviewed. Promises and challenges in devising machine learning models transferable between different chemistries and materials classes are discussed with the help of examples applying promising models to systems far outside their training sets.

Round-trip Reinforcement Learning: Self-Consistent Training for Better Chemical LLMs 2025-10-01
Show

Large Language Models (LLMs) are emerging as versatile foundation models for computational chemistry, handling bidirectional tasks like reaction prediction and retrosynthesis. However, these models often lack round-trip consistency. For instance, a state-of-the-art chemical LLM may successfully caption a molecule, yet be unable to accurately reconstruct the original structure from its own generated text. This inconsistency suggests that models are learning unidirectional memorization rather than flexible mastery. Indeed, recent work has demonstrated a strong correlation between a model's round-trip consistency and its performance on the primary tasks. This strong correlation reframes consistency into a direct target for model improvement. We therefore introduce Round-Trip Reinforcement Learning (RTRL), a novel framework that trains a model to improve its consistency by using the success of a round-trip transformation as a reward signal. We further propose an iterative variant where forward and reverse mappings alternately train each other in a self-improvement loop, a process that is highly data-efficient and notably effective with the massive amount of unlabelled data common in chemistry. Experiments demonstrate that RTRL significantly \textbf{boosts performance and consistency} over strong baselines across supervised, self-supervised, and synthetic data regimes. This work shows that round-trip consistency is not just a desirable property but a trainable objective, offering a new path toward more robust and reliable foundation models.

19 pages
A general optimization framework for mapping local transition-state networks 2025-09-30
Show

Understanding how complex systems transition between states requires mapping the energy landscape that governs these changes. Local transition-state networks reveal the barrier architecture that explains observed behaviour and enables mechanism-based prediction across computational chemistry, biology, and physics, yet current practice either prescribes endpoints or randomly samples only a few saddles around an initial guess. We present a general optimization framework that systematically expands local coverage by coupling a multi-objective explorer with a bilayer minimum-mode kernel. The inner layer uses Hessian-vector products to recover the lowest-curvature subspace (smallest k eigenpairs), the outer layer optimizes on a reflected force to reach index-1 saddles, then a two-sided descent certifies connectivity. The GPU-based pipeline is portable across autodiff backends and eigensolvers and, on large atomistic-spin tests, matches explicit-Hessian accuracy while cutting peak memory and wall time by orders of magnitude. Applied to a DFT-parameterized Néel-type skyrmionic model, it recovers known routes and reveals previously unreported mechanisms, including meron-antimeron-mediated Néel-type skyrmionic duplication, annihilation, and chiral-droplet formation, enabling up to 32 pathways between biskyrmion (Q=2) and biantiskyrmion (Q=-2). The same core transfers to Cartesian atoms, automatically mapping canonical rearrangements of a Ni(111) heptamer, underscoring the framework's generality.

Euclidean Fast Attention -- Machine Learning Global Atomic Representations at Linear Cost 2025-09-29
Show

Long-range correlations are essential across numerous machine learning tasks, especially for data embedded in Euclidean space, where the relative positions and orientations of distant components are often critical for accurate predictions. Self-attention offers a compelling mechanism for capturing these global effects, but its quadratic complexity presents a significant practical limitation. This problem is particularly pronounced in computational chemistry, where the stringent efficiency requirements of machine learning force fields (MLFFs) often preclude accurately modeling long-range interactions. To address this, we introduce Euclidean fast attention (EFA), a linear-scaling attention-like mechanism designed for Euclidean data, which can be easily incorporated into existing model architectures. A core component of EFA are novel Euclidean rotary positional encodings (ERoPE), which enable efficient encoding of spatial information while respecting essential physical symmetries. We empirically demonstrate that EFA effectively captures diverse long-range effects, enabling EFA-equipped MLFFs to describe challenging chemical interactions for which conventional MLFFs yield incorrect results.

A Transformer Model for Predicting Chemical Products from Generic SMARTS Templates with Data Augmentation 2025-09-24
Show

The accurate prediction of chemical reaction outcomes is a major challenge in computational chemistry. Current models rely heavily on either highly specific reaction templates or template-free methods, both of which present limitations. To address these, this work proposes the Broad Reaction Set (BRS), a set featuring 20 generic reaction templates written in SMARTS, a pattern-based notation designed to describe substructures and reactivity. Additionally, we introduce ProPreT5, a T5-based model specifically adapted for chemistry and, to the best of our knowledge, the first language model capable of directly handling and applying SMARTS reaction templates. To further improve generalization, we propose the first augmentation strategy for SMARTS, which injects structural diversity at the pattern level. Trained on augmented templates, ProPreT5 demonstrates strong predictive performance and generalization to unseen reactions. Together, these contributions provide a novel and practical alternative to current methods, advancing the field of template-based reaction prediction.

ICTAI 2025
SMILES-Inspired Transfer Learning for Quantum Operators in Generative Quantum Eigensolver 2025-09-24
Show

Given the inherent limitations of traditional Variational Quantum Eigensolver(VQE) algorithms, the integration of deep generative models into hybrid quantum-classical frameworks, specifically the Generative Quantum Eigensolver(GQE), represents a promising innovative approach. However, taking the Unitary Coupled Cluster with Singles and Doubles(UCCSD) ansatz which is widely used in quantum chemistry as an example, different molecular systems require constructions of distinct quantum operators. Considering the similarity of different molecules, the construction of quantum operators utilizing the similarity can reduce the computational cost significantly. Inspired by the SMILES representation method in computational chemistry, we developed a text-based representation approach for UCCSD quantum operators by leveraging the inherent representational similarities between different molecular systems. This framework explores text pattern similarities in quantum operators and employs text similarity metrics to establish a transfer learning framework. Our approach with a naive baseline setting demonstrates knowledge transfer between different molecular systems for ground-state energy calculations within the GQE paradigm. This discovery offers significant benefits for hybrid quantum-classical computation of molecular ground-state energies, substantially reducing computational resource requirements.

7 pages, 5 figures
Unsupervised Atomic Data Mining via Multi-Kernel Graph Autoencoders for Machine Learning Force Fields 2025-09-15
Show

Constructing a chemically diverse dataset while avoiding sampling bias is critical to training efficient and generalizable force fields. However, in computational chemistry and materials science, many common dataset generation techniques are prone to oversampling regions of the potential energy surface. Furthermore, these regions can be difficult to identify and isolate from each other or may not align well with human intuition, making it challenging to systematically remove bias in the dataset. While traditional clustering and pruning (down-sampling) approaches can be useful for this, they can often lead to information loss or a failure to properly identify distinct regions of the potential energy surface due to difficulties associated with the high dimensionality of atomic descriptors. In this work, we introduce the Multi-kernel Edge Attention-based Graph Autoencoder (MEAGraph) model, an unsupervised approach for analyzing atomic datasets. MEAGraph combines multiple linear kernel transformations with attention-based message passing to capture geometric sensitivity and enable effective dataset pruning without relying on labels or extensive training. Demonstrated applications on niobium, tantalum, and iron datasets show that MEAGraph efficiently groups similar atomic environments, allowing for the use of basic pruning techniques for removing sampling bias. This approach provides an effective method for representation learning and clustering that can be used for data analysis, outlier detection, and dataset optimization.

BioMD: All-atom Generative Model for Biomolecular Dynamics Simulation 2025-09-02
Show

Molecular dynamics (MD) simulations are essential tools in computational chemistry and drug discovery, offering crucial insights into dynamic molecular behavior. However, their utility is significantly limited by substantial computational costs, which severely restrict accessible timescales for many biologically relevant processes. Despite the encouraging performance of existing machine learning (ML) methods, they struggle to generate extended biomolecular system trajectories, primarily due to the lack of MD datasets and the large computational demands of modeling long historical trajectories. Here, we introduce BioMD, the first all-atom generative model to simulate long-timescale protein-ligand dynamics using a hierarchical framework of forecasting and interpolation. We demonstrate the effectiveness and versatility of BioMD on the DD-13M (ligand unbinding) and MISATO datasets. For both datasets, BioMD generates highly realistic conformations, showing high physical plausibility and low reconstruction errors. Besides, BioMD successfully generates ligand unbinding paths for 97.1% of the protein-ligand systems within ten attempts, demonstrating its ability to explore critical unbinding pathways. Collectively, these results establish BioMD as a tool for simulating complex biomolecular processes, offering broad applicability for computational chemistry and drug discovery.

Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model 2025-08-26
Show

Recent advancements in computational chemistry have increasingly focused on synthesizing molecules based on textual instructions. Integrating graph generation with these instructions is complex, leading most current methods to use molecular sequences with pre-trained large language models. In response to this challenge, we propose a novel framework, named $\textbf{UTGDiff (Unified Text-Graph Diffusion Model)}$, which utilizes language models for discrete graph diffusion to generate molecular graphs from instructions. UTGDiff features a unified text-graph transformer as the denoising network, derived from pre-trained language models and minimally modified to process graph data through attention bias. Our experimental results demonstrate that UTGDiff consistently outperforms sequence-based baselines in tasks involving instruction-based molecule generation and editing, achieving superior performance with fewer parameters given an equivalent level of pretraining corpus. Our code is availble at https://github.com/ran1812/UTGDiff.

ECAI 2025
Experiences with Model Context Protocol Servers for Science and High Performance Computing 2025-08-25
Show

Large language model (LLM)-powered agents are increasingly used to plan and execute scientific workflows, yet most research cyberinfrastructure (CI) exposes heterogeneous APIs and implements security models that present barriers for use by agents. We report on our experience using the Model Context Protocol (MCP) as a unifying interface that makes research capabilities discoverable, invokable, and composable. Our approach is pragmatic: we implement thin MCP servers over mature services, including Globus Transfer, Compute, and Search; status APIs exposed by computing facilities; Octopus event fabric; and domain-specific tools such as Garden and Galaxy. We use case studies in computational chemistry, bioinformatics, quantum chemistry, and filesystem monitoring to illustrate how this MCP-oriented architecture can be used in practice. We distill lessons learned and outline open challenges in evaluation and trust for agent-led science.

11 pa...

11 pages, including a 4-page appendix

Combinatorial Parameterized Algorithms for Chemical Descriptors based on Molecular Graph Sparsity 2025-08-11
Show

We present efficient combinatorial parameterized algorithms for several classical graph-based counting problems in computational chemistry, including (i) Kekule structures, (ii) the Hosoya index, (iii) the Merrifield-Simmons index, and (iv) Graph entropy based on matchings and independent sets. All these problems were known to be #P-complete. Building on the intuition that molecular graphs are often sparse and tree-like, we provide fixed-parameter tractable (FPT) algorithms using treewidth as our parameter. We also provide extensive experimental results over the entire PubChem database of chemical compounds, containing more than 113 million real-world molecules. In our experiments, we observe that the molecules are indeed sparse and tree-like, with more than 99.9% of them having a treewidth of at most 5. This justifies our choice of parameter. Our experiments also illustrate considerable improvements over the previous approaches. Based on these results, we argue that parameterized algorithms, especially based on treewidth, should be adopted as the default approach for problems in computational chemistry that are defined over molecular graphs.

El Agente: An Autonomous Agent for Quantum Chemistry 2025-08-08
Show

Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university-level course exercises and two case studies, demonstrating robust problem-solving performance (averaging >87% task success) and adaptive error handling through in situ debugging. It also supports longer-term, multi-step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.

MolSnap: Snap-Fast Molecular Generation with Latent Variational Mean Flow 2025-08-07
Show

Molecular generation conditioned on textual descriptions is a fundamental task in computational chemistry and drug discovery. Existing methods often struggle to simultaneously ensure high-quality, diverse generation and fast inference. In this work, we propose a novel causality-aware framework that addresses these challenges through two key innovations. First, we introduce a Causality-Aware Transformer (CAT) that jointly encodes molecular graph tokens and text instructions while enforcing causal dependencies during generation. Second, we develop a Variational Mean Flow (VMF) framework that generalizes existing flow-based methods by modeling the latent space as a mixture of Gaussians, enhancing expressiveness beyond unimodal priors. VMF enables efficient one-step inference while maintaining strong generation quality and diversity. Extensive experiments on four standard molecular benchmarks demonstrate that our model outperforms state-of-the-art baselines, achieving higher novelty (up to 74.5%), diversity (up to 70.3%), and 100% validity across all datasets. Moreover, VMF requires only one number of function evaluation (NFE) during conditional generation and up to five NFEs for unconditional generation, offering substantial computational efficiency over diffusion-based methods.

Bayesian Hierarchical Models for Quantitative Estimates for Performance metrics applied to Saddle Search Algorithms 2025-07-22
Show

Rigorous performance evaluation is essential for developing robust algorithms for high-throughput computational chemistry. Traditional benchmarking, however, often struggles to account for system-specific variability, making it difficult to form actionable conclusions. We present a Bayesian hierarchical modeling framework that rigorously quantifies performance metrics and their uncertainty, enabling a nuanced comparison of algorithmic strategies. We apply this framework to analyze the Dimer method, comparing Conjugate Gradient (CG) and L-BFGS rotation optimizers, with and without the removal of external rotations, across a benchmark of 500 molecular systems. Our analysis confirms that CG offers higher overall robustness than L-BFGS in this context. While the theoretically-motivated removal of external rotations led to higher computational cost (>40% more energy and force calls) for most systems in this set, our models also reveal a subtle interplay, hinting that this feature may improve the reliability of the L-BFGS optimizer. Rather than identifying a single superior method, our findings support the design of adaptive "chain of methods" workflows. This work showcases how a robust statistical paradigm can move beyond simple performance rankings to inform the intelligent, context-dependent application of computational chemistry methods.

37 pa...

37 pages, 30 figures, 7 tables

Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations 2025-07-22
Show

We have developed Aitomia - a platform powered by AI to assist in performing AI-driven atomistic and quantum chemical (QC) simulations. This evolving intelligent assistant platform is equipped with chatbots and AI agents to help experts and guide non-experts in setting up and running atomistic simulations, monitoring their computational status, analyzing simulation results, and summarizing them for the user in both textual and graphical forms. We achieve these goals by exploiting large language models that leverage the versatility of our MLatom ecosystem, supporting AI-enhanced computational chemistry tasks ranging from ground-state to excited-state calculations, including geometry optimizations, thermochemistry, and spectral calculations. The multi-agent implementation enables autonomous executions of the complex computational workflows, such as the computation of the reaction enthalpies. Aitomia is the first intelligent assistant publicly accessible online on a cloud computing platform for atomistic simulations of broad scope (Aitomistic Hub at https://aitomistic.xyz). It may also be deployed locally as described at http://mlatom.com/aitomia. Aitomia is expected to lower the barrier to performing atomistic simulations, thereby democratizing simulations and accelerating research and development in relevant fields.

Acquiring and Adapting Priors for Novel Tasks via Neural Meta-Architectures 2025-07-15
Show

The ability to transfer knowledge from prior experiences to novel tasks stands as a pivotal capability of intelligent agents, including both humans and computational models. This principle forms the basis of transfer learning, where large pre-trained neural networks are fine-tuned to adapt to downstream tasks. Transfer learning has demonstrated tremendous success, both in terms of task adaptation speed and performance. However there are several domains where, due to lack of data, training such large pre-trained models or foundational models is not a possibility - computational chemistry, computational immunology, and medical imaging are examples. To address these challenges, our work focuses on designing architectures to enable efficient acquisition of priors when large amounts of data are unavailable. In particular, we demonstrate that we can use neural memory to enable adaptation on non-stationary distributions with only a few samples. Then we demonstrate that our hypernetwork designs (a network that generates another network) can acquire more generalizable priors than standard networks when trained with Model Agnostic Meta-Learning (MAML). Subsequently, we apply hypernetworks to 3D scene generation, demonstrating that they can acquire priors efficiently on just a handful of training scenes, thereby leading to faster text-to-3D generation. We then extend our hypernetwork framework to perform 3D segmentation on novel scenes with limited data by efficiently transferring priors from earlier viewed scenes. Finally, we repurpose an existing molecular generative method as a pre-training framework that facilitates improved molecular property prediction, addressing critical challenges in computational immunology.

arXiv...

arXiv admin note: text overlap with arXiv:2310.17075

Efficient Molecular Conformer Generation with SO(3)-Averaged Flow Matching and Reflow 2025-07-13
Show

Fast and accurate generation of molecular conformers is desired for downstream computational chemistry and drug discovery tasks. Currently, training and sampling state-of-the-art diffusion or flow-based models for conformer generation require significant computational resources. In this work, we build upon flow-matching and propose two mechanisms for accelerating training and inference of generative models for 3D molecular conformer generation. For fast training, we introduce the SO(3)-Averaged Flow training objective, which leads to faster convergence to better generation quality compared to conditional optimal transport flow or Kabsch-aligned flow. We demonstrate that models trained using SO(3)-Averaged Flow can reach state-of-the-art conformer generation quality. For fast inference, we show that the reflow and distillation methods of flow-based models enable few-steps or even one-step molecular conformer generation with high quality. The training techniques proposed in this work show a path towards highly efficient molecular conformer generation with flow-based models.

ICML 2025 poster
MolVision: Molecular Property Prediction with Vision Language Models 2025-07-04
Show

Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adaptation of vision encoder for molecular images in conjunction with LoRA further improves the performance. The code and data is available at : $\href{https://molvision.github.io/MolVision/}{https://molvision.github.io/MolVision/}$.

ILVES: Accurate and efficient bond length and angle constraints in molecular dynamics 2025-06-16
Show

All-atom, force field-based molecular dynamics simulations are essential tools in computational chemistry, enabling the prediction and analysis of biomolecular systems with atomic-level resolution. However, as system sizes and simulation timescales increase, so does the associated computational cost. To extend simulated time using the same resources, a common strategy is to constrain the fastest degrees of freedom, such as bond lengths, allowing for larger integration time steps without compromising accuracy. The de facto state-of-the-art algorithms for this purpose (SHAKE, LINCS, and P-LINCS) are integrated into most molecular dynamics packages and widely adopted across the field. Despite their impact, these methods exhibit limitations: all converge slowly when high numerical accuracy is required, and the LINCS and P-LINCS algorithms cannot handle general angular constraints, limiting further increases in time step. In this article, we introduce ILVES, a family of parallel algorithms that converge so rapidly that it is now practical to solve bond length and associated angular constraint equations as accurately as the hardware will allow. We have integrated ILVES into Gromacs and our analysis demonstrates that it is superior to the state-of-the-art when constraining bond lengths. Due to its better convergence properties, we also show that if the time step is increased up to 3.5 fs by enforcing angular constraints, ILVES enables a 1.65x increase in simulated time using the same computational resources and wall-clock time, an outcome unattainable with current methods. This advance can significantly reduce the computational cost of most all-atom molecular dynamics simulations while improving their accuracy and extending access to larger systems and longer timescales.

MACS: Multi-Agent Reinforcement Learning for Optimization of Crystal Structures 2025-06-04
Show

Geometry optimization of atomic structures is a common and crucial task in computational chemistry and materials design. Following the learning to optimize paradigm, we propose a new multi-agent reinforcement learning method called Multi-Agent Crystal Structure optimization (MACS) to address periodic crystal structure optimization. MACS treats geometry optimization as a partially observable Markov game in which atoms are agents that adjust their positions to collectively discover a stable configuration. We train MACS across various compositions of reported crystalline materials to obtain a policy that successfully optimizes structures from the training compositions as well as structures of larger sizes and unseen compositions, confirming its excellent scalability and zero-shot transferability. We benchmark our approach against a broad range of state-of-the-art optimization methods and demonstrate that MACS optimizes periodic crystal structures significantly faster, with fewer energy calculations, and the lowest failure rate.

ChemGraph: An Agentic Framework for Computational Chemistry Workflows 2025-06-03
Show

Atomistic simulations are essential tools in chemistry and materials science, accelerating the discovery of novel catalysts, energy storage materials, and pharmaceuticals. However, running these simulations remains challenging due to the wide range of computational methods, diverse software ecosystems, and the need for expert knowledge and manual effort for the setup, execution, and validation stages. In this work, we present ChemGraph, an agentic framework powered by artificial intelligence and state-of-the-art simulation tools to streamline and automate computational chemistry and materials science workflows. ChemGraph leverages graph neural network-based foundation models for accurate yet computationally efficient calculations and large language models (LLMs) for natural language understanding, task planning, and scientific reasoning to provide an intuitive and interactive interface. Users can perform tasks such as molecular structure generation, single-point energy, geometry optimization, vibrational analysis, and thermochemistry calculations with methods ranging from tight-binding and machine learning interatomic potentials to density functional theory or wave function theory-based methods. We evaluate ChemGraph across 13 benchmark tasks and demonstrate that smaller LLMs (GPT-4o-mini, Claude-3.5-haiku, Qwen2.5-14B) perform well on simple workflows, while more complex tasks benefit from using larger models like GPT-4o. Importantly, we show that decomposing complex tasks into smaller subtasks through a multi-agent framework enables smaller LLM models to match or exceed GPT-4o's performance in specific scenarios.

Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching 2025-05-28
Show

We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first on-policy approach that allows significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and shares the same theoretical guarantees as Adjoint Matching, being able to train without the need for corrective measures that push samples towards the target distribution. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both cartesian and torsional coordinates. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage further research in developing highly scalable sampling methods, we plan to open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry.

Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity 2025-05-22
Show

Hamiltonian matrix prediction is pivotal in computational chemistry, serving as the foundation for determining a wide range of molecular properties. While SE(3) equivariant graph neural networks have achieved remarkable success in this domain, their substantial computational cost--driven by high-order tensor product (TP) operations--restricts their scalability to large molecular systems with extensive basis sets. To address this challenge, we introduce SPHNet, an efficient and scalable equivariant network, that incorporates adaptive SParsity into Hamiltonian prediction. SPHNet employs two innovative sparse gates to selectively constrain non-critical interaction combinations, significantly reducing tensor product computations while maintaining accuracy. To optimize the sparse representation, we develop a Three-phase Sparsity Scheduler, ensuring stable convergence and achieving high performance at sparsity rates of up to 70%. Extensive evaluations on QH9 and PubchemQH datasets demonstrate that SPHNet achieves state-of-the-art accuracy while providing up to a 7x speedup over existing models. Beyond Hamiltonian prediction, the proposed sparsification techniques also hold significant potential for improving the efficiency and scalability of other SE(3) equivariant networks, further broadening their applicability and impact. Our code can be found at https://github.com/microsoft/SPHNet.

Physics-Informed Weakly Supervised Learning for Interatomic Potentials 2025-05-05
Show

Machine learning plays an increasingly important role in computational chemistry and materials science, complementing computationally intensive ab initio and first-principles methods. Despite their utility, machine-learning models often lack generalization capability and robustness during atomistic simulations, yielding unphysical energy and force predictions that hinder their real-world applications. We address this challenge by introducing a physics-informed, weakly supervised approach for training machine-learned interatomic potentials (MLIPs). We introduce two novel loss functions, extrapolating the potential energy via a Taylor expansion and using the concept of conservative forces. Our approach improves the accuracy of MLIPs applied to training tasks with sparse training data sets and reduces the need for pre-training computationally demanding models with large data sets. Particularly, we perform extensive experiments demonstrating reduced energy and force errors -- often lower by a factor of two -- for various baseline models and benchmark data sets. Moreover, we demonstrate improved robustness during MD simulations of the MLIP models trained with the proposed weakly supervised loss. Finally, our approach improves the fine-tuning of foundation models on sparse, highly accurate ab initio data. An implementation of our method and scripts for executing experiments are available at https://github.com/nec-research/PICPS-ML4Sci.

29 pa...

29 pages, 4 figures, 31 Tables. Accepted for publication in ICML2025

How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension 2025-04-20
Show

Benchmarking the capabilities and limitations of large language models (LLMs) in graph-related tasks is becoming an increasingly popular and crucial area of research. Recent studies have shown that LLMs exhibit a preliminary ability to understand graph structures and node features. However, the potential of LLMs in graph pattern mining remains largely unexplored. This is a key component in fields such as computational chemistry, biology, and social network analysis. To bridge this gap, this work introduces a comprehensive benchmark to assess LLMs' capabilities in graph pattern tasks. We have developed a benchmark that evaluates whether LLMs can understand graph patterns based on either terminological or topological descriptions. Additionally, our benchmark tests the LLMs' capacity to autonomously discover graph patterns from data. The benchmark encompasses both synthetic and real datasets, and a variety of models, with a total of 11 tasks and 7 models. Our experimental framework is designed for easy expansion to accommodate new models and datasets. Our findings reveal that: (1) LLMs have preliminary abilities to understand graph patterns, with O1-mini outperforming in the majority of tasks; (2) Formatting input data to align with the knowledge acquired during pretraining can enhance performance; (3) The strategies employed by LLMs may differ from those used in conventional algorithms.

The p...

The paper is published in ICLR 2025

Embrace rejection: Kernel matrix approximation by accelerated randomly pivoted Cholesky 2025-04-06
Show

Randomly pivoted Cholesky (RPCholesky) is an algorithm for constructing a low-rank approximation of a positive-semidefinite matrix using a small number of columns. This paper develops an accelerated version of RPCholesky that employs block matrix computations and rejection sampling to efficiently simulate the execution of the original algorithm. For the task of approximating a kernel matrix, the accelerated algorithm can run over $40\times$ faster. The paper contains implementation details, theoretical guarantees, experiments on benchmark data sets, and an application to computational chemistry.

32 pa...

32 pages, 4 figures; v3 new introduction to section 4, reorganization

Equivariant Masked Position Prediction for Efficient Molecular Representation 2025-03-11
Show

Graph neural networks (GNNs) have shown considerable promise in computational chemistry. However, the limited availability of molecular data raises concerns regarding GNNs' ability to effectively capture the fundamental principles of physics and chemistry, which constrains their generalization capabilities. To address this challenge, we introduce a novel self-supervised approach termed Equivariant Masked Position Prediction (EMPP), grounded in intramolecular potential and force theory. Unlike conventional attribute masking techniques, EMPP formulates a nuanced position prediction task that is more well-defined and enhances the learning of quantum mechanical features. EMPP also bypasses the approximation of the Gaussian mixture distribution commonly used in denoising methods, allowing for more accurate acquisition of physical properties. Experimental results indicate that EMPP significantly enhances performance of advanced molecular architectures, surpassing state-of-the-art self-supervised approaches. Our code is released in https://github.com/ajy112/EMPP

24 pages, 6 figures
Machine Learned Force Fields: Fundamentals, its reach, and challenges 2025-03-07
Show

Highly accurate force fields are a mandatory requirement to generate predictive simulations. In this regard, Machine Learning Force Fields (MLFFs) have emerged as a revolutionary approach in computational chemistry and materials science, combining the accuracy of quantum mechanical methods with computational efficiency orders of magnitude superior to ab-initio methods. This chapter provides an introduction of the fundamentals of learning and how it is applied to construct MLFFs, detailing key methodologies such as neural network potentials and kernel-based models. Emphasis is placed on the construction of SchNet model, as one of the most elemental neural network-based force fields that are nowadays the basis of modern architectures. Additionally, the GDML framework is described in detail as an example of how the elegant formulation of kernel methods can be used to construct mathematically robust and physics-inspired MLFFs. The ongoing advancements in MLFF development continue to expand their applicability, enabling precise simulations of large and complex systems that were previously beyond reach. This chapter concludes by highlighting the transformative impact of MLFFs on scientific research, underscoring their role in driving future discoveries in the fields of chemistry, physics, and materials science.

9 figures

About

Daily ArXiv Papers.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%