Summaries auto-generated from HuggingFace's Daily Papers using Gemini and GitHub Actions. All credit goes to the research and HuggingFace communities. 🔉 You can get audio summaries via OpenAI's text-to-speech API on Telegram. Note: Authors may be listed by their HuggingFace IDs. Additionally, summaries are generated by an LLM and may contain mistakes. You can see the prompt used here.
Title | Authors | Summary |
---|---|---|
The GAN is dead; long live the GAN! A Modern GAN Baseline (Read more on arXiv or HuggingFace) | jamestompkin, kuleshov, Skylion007, Eva1209 | i) The paper introduces R3GAN, a new baseline for Generative Adversarial Networks (GANs) that achieves state-of-the-art results without relying on ad-hoc tricks common in previous GAN architectures. ii) The main research objective is to develop a more principled and stable GAN baseline by addressing mode dropping and non-convergence issues in existing GAN training. iii) The key methodology involves proposing a novel regularized relativistic GAN loss (RpGAN + R1 + R2) and modernizing the network backbone using ResNet design principles and grouped convolutions. iv) The primary results show that R3GAN surpasses StyleGAN2 on FFHQ-256, achieving an FID score of 7.05 compared to StyleGAN2's 7.52, and matches or exceeds state-of-the-art GANs and diffusion models on various datasets. v) The principal implication for AI practitioners is that R3GAN provides a robust and efficient baseline for image generation tasks, demonstrating that GANs remain competitive with modern architectures and can be trained reliably without complex, ad-hoc techniques. A minimal sketch of the RpGAN + R1 + R2 objective appears after this table. |
An Empirical Study of Autoregressive Pre-training from Videos (Read more on arXiv or HuggingFace) | Ilija Radosavovic, jitendra1995, yossig, rravishankar, brjathu | This paper empirically studies autoregressive pre-training of transformer models on videos for visual representation learning. The main research question is how effective autoregressive pre-training on videos is for learning visual representations across various downstream tasks. The key methodology involves training a series of autoregressive video models, called Toto, to predict future tokens in videos and images, using a diverse dataset of over 1 trillion visual tokens and evaluating these models on downstream tasks. The primary result is that autoregressive pre-training leads to competitive performance across all benchmarks, with the Toto-1b model achieving 75.3% top-1 accuracy on ImageNet classification. The principal implication for AI practitioners is that autoregressive pre-training on videos is a viable method for learning visual representations, achieving strong performance on various tasks despite minimal inductive biases. |
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives (Read more on arXiv or HuggingFace) | ZwwWayne, Chonghao, THUdyh, ldkong, shaoyuanxie | DriveBench, a benchmark dataset, evaluates the reliability of Vision-Language Models (VLMs) in autonomous driving across various tasks and conditions. The main research question is: Are existing VLMs capable of providing reliable explanations grounded on visual cues for driving? The methodology involves evaluating 12 VLMs on a dataset with 19,200 frames and 20,498 QA pairs across 17 settings (clean, corrupted, and text-only inputs), using metrics like accuracy, traditional language metrics, and GPT scores. Primary results indicate that under clean image inputs, the GPT-4 model achieved a GPT score of 75.75 in the planning task, but VLMs often generated plausible yet fabricated responses under degraded or missing visual inputs. The principal implication for AI practitioners is that current VLMs are not yet reliable for autonomous driving applications due to their tendency to provide fabricated responses under degraded visual conditions, emphasizing the need for improved datasets and evaluation protocols. |
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis (Read more on arXiv or HuggingFace) | Yingyu Liang, Xiaoyu Li, Zhenmei, JamesSand, keyekun | Visual Autoregressive (VAR) models' computational complexity and efficiency for image generation are analyzed in this paper. The main research question is whether the computations of VAR models can be performed faster than O(n⁴) time. The key methodology involves analyzing the computation of VAR models under the Strong Exponential Time Hypothesis (SETH) and using low-rank approximations to develop efficient algorithms. A primary result is that when the hidden dimension d = O(log n) and the bound of the entries of the input matrices R = o(√log n), there is an algorithm that approximates the VAR model up to 1/poly(n) additive error in O(n^{2+o(1)}) time. The principal implication for AI practitioners is that VAR models can be computed in almost quadratic time under specific conditions, offering a more efficient approach to image generation than previous O(n⁴) methods. |
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model (Read more on arXiv or HuggingFace) | Radu Timofte, Chris Biemann, Carolin Holtermann, Florian Schneider, Gregor Geigle | Centurio is a 100-language large vision-language model (LVLM) that offers state-of-the-art performance across 14 tasks and 56 languages. The main research question is what are the optimal training strategies for developing massively multilingual LVLMs, focusing on the number of training languages, data distribution across languages, and techniques for improving multilingual text-in-image understanding. The key methodology involves a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically varying the training data composition and evaluating performance. A primary result is that including up to 100 training languages simultaneously with as little as 25-50% of non-English data greatly improves multilingual performance while retaining strong English performance, with negligible performance degradation compared to fewer languages. The principal implication for AI practitioners is that massively multilingual LVLMs can be effectively trained with a balanced mix of English and multilingual data, even for low-resource languages, and incorporating synthetic OCR data can significantly enhance multilingual text-in-image understanding. |
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models (Read more on arXiv or HuggingFace) | Ece Elif Adak, tcTHEBESTMAN, fatihburakkaragoz, temretiras, sbozates | The paper introduces new resources and models for natural language processing (NLP) of historical Turkish, a previously underexplored area. The main research objective is to develop foundational resources and models for NLP tasks in historical Turkish, including named entity recognition (NER), dependency parsing, and part-of-speech (POS) tagging. The key methodology involves creating and annotating datasets (HisTR, OTA-BOUN), compiling a clean text corpus (Ottoman Text Corpus - OTC), and fine-tuning transformer-based language models (BERTurk, mBERT, TURNA) on these resources. Primary results indicate that the BERTurk model fine-tuned on both MilliyetNER and HisTR achieved a 90.07 F1 score on the HisTR development set for NER. The principal implication for AI practitioners is that fine-tuning language-specific pre-trained models on domain-specific datasets is a viable approach for historical Turkish NLP, but challenges remain in adapting to out-of-domain data. |
Entropy-Guided Attention for Private LLMs (Read more on arXiv or HuggingFace) | Brandon Reagen, nandan523 | This paper introduces an information-theoretic framework to optimize transformer architectures for privacy-preserving language model inference. The main research question is how the removal of nonlinearities in decoder-only language models impacts their training dynamics and expressiveness, particularly in the context of private inference (PI). The key methodology involves using Shannon's entropy to analyze the dual role of nonlinearities in maintaining training stability and attention head diversity, and exploring PI-friendly alternatives like weight normalization and entropy regularization. A primary result is that the proposed entropy-guided attention mechanism with a Softmax-only model reduces communication overhead by 3.94x and improves end-to-end PI latency by 1.72x, compared to a baseline GPT-2 model with GELU and LayerNorm. The principal implication for AI practitioners is that entropy-guided attention can enable more efficient and scalable privacy-preserving inference for large language models by reducing reliance on computationally expensive nonlinear operations. |
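
R3GAN's central change is its training objective: a relativistic pairing loss regularized with R1 and R2 gradient penalties. Below is a minimal PyTorch sketch of that objective as described in the summary above; it is an illustration, not the authors' released code, and `D`, `real`, `fake`, and `gamma` are placeholder names.

```python
import torch
import torch.nn.functional as F

def r3gan_d_loss(D, real, fake, gamma=1.0):
    """Discriminator side of RpGAN + R1 + R2 (illustrative sketch)."""
    real = real.detach().requires_grad_(True)
    fake = fake.detach().requires_grad_(True)
    d_real, d_fake = D(real), D(fake)
    # Relativistic pairing loss: push D(real) above D(fake).
    rp = F.softplus(d_fake - d_real).mean()
    # R1: gradient penalty on real samples; R2: the same penalty on fakes.
    grad_real = torch.autograd.grad(d_real.sum(), real, create_graph=True)[0]
    grad_fake = torch.autograd.grad(d_fake.sum(), fake, create_graph=True)[0]
    r1 = grad_real.pow(2).flatten(1).sum(1).mean()
    r2 = grad_fake.pow(2).flatten(1).sum(1).mean()
    return rp + 0.5 * gamma * (r1 + r2)

def r3gan_g_loss(D, real, fake):
    """Generator side: the symmetric relativistic objective."""
    return F.softplus(D(real) - D(fake)).mean()
```
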
Title | Authors | Summary |
---|---|---|
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (Read more on arXiv or HuggingFace) | Youran Sun, Yifei Liu, Xinyu Guan, J-shang, lynazhang | rStar-Math demonstrates that small language models (SLMs) can achieve advanced math reasoning through self-evolved deep thinking. The main research question is whether SLMs can rival or surpass the mathematical reasoning capabilities of larger models such as those from OpenAI without distillation from superior models. The key methodology involves a novel code-augmented Chain-of-Thought data synthesis method, Monte Carlo Tree Search (MCTS) for test-time search guided by an SLM-based process reward model, and a four-round self-evolution recipe to iteratively improve the policy SLM and process preference model (PPM). The primary result is that rStar-Math improves the accuracy of the Qwen2.5-Math-7B model on the MATH benchmark from 58.8% to 90.0% with 64 search trajectories. The principal implication for AI practitioners is that they can leverage rStar-Math's self-evolutionary framework to enhance the mathematical reasoning capabilities of SLMs without relying on larger, more resource-intensive models. |
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics (Read more on arXiv or HuggingFace) | Xinzhe Ni, Yiyao Yu, Yifan Wang, fun6668, AntimageTHU | URSA-7B is a new model for multimodal mathematical reasoning that uses chain-of-thought (CoT) supervision to improve performance. The main research question is how to enhance the CoT reasoning capabilities of Multimodal Large Language Models (MLLMs) in mathematical problem-solving using a new dataset and training method. The key methodology involves a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification to create a high-quality CoT reasoning instruction fine-tuning dataset, MMathCoT-1M, and a dual-view process supervision data synthesis to train a reward model, URSA-RM-7B. The primary results show that URSA-7B achieves state-of-the-art performance on multiple multimodal mathematical benchmarks, with a 97.1 pass@64 accuracy on the GPS task of MathVista. The principal implication for AI practitioners is that using high-quality CoT datasets and advanced process supervision can significantly enhance MLLMs' mathematical reasoning capabilities, offering a pathway to improve performance in tasks requiring complex, multi-step reasoning. |
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought (Read more on arXiv or HuggingFace) | Kanishk Gandhi, Charlie Snell, Violet Xiang, nlile, Asap7772 | This paper introduces Meta Chain-of-Thought (Meta-CoT), a framework for enhancing reasoning in large language models (LLMs) by explicitly modeling the underlying thought processes involved in reaching a solution. The main research question is how to enable LLMs to perform complex reasoning analogous to System 2 cognitive processes by integrating search, verification, and iterative refinement into their operational framework. The key methodology involves process supervision, synthetic data generation via search algorithms (e.g., Monte Carlo Tree Search, A*), and reinforcement learning to train models on linearized search traces. Primary results indicate that models trained with Meta-CoT, specifically when using a backtracking strategy at a rate of 50% for incorrect steps, can achieve up to 94% accuracy on hard math problems, compared to 78% for standard Chain-of-Thought models. The principal implication for AI practitioners is that incorporating Meta-CoT into model training can significantly improve the ability of LLMs to solve complex reasoning tasks, suggesting that future model development should focus on integrating explicit search and verification mechanisms. |
Agent Laboratory: Using LLM Agents as Research Assistants (Read more on arXiv or HuggingFace) | Jialian Wu, Ximeng Sun, Ze Wang, Yusheng Su, Samuel Schmidgall | Agent Laboratory is an autonomous LLM-based framework designed to conduct the entire research process, from literature review to experimentation and report writing, with optional human feedback. The main research question is whether this framework can accelerate scientific discovery, reduce research costs, and improve research quality. The key methodology involves a three-stage process: literature review using the arXiv API, experimentation using specialized agents and tools like mle-solver for code generation, and report writing with a module called paper-solver for iterative report generation and refinement. The primary results show that Agent Laboratory driven by o1-preview generates the best research outcomes, and human involvement at each stage improves the overall quality of research, with an 84% decrease in research expenses compared to previous autonomous research methods. The principal implication for AI practitioners is that Agent Laboratory can enable researchers to allocate more effort toward creative ideation rather than low-level coding and writing, potentially accelerating scientific discovery in machine learning. |
LLM4SR: A Survey on Large Language Models for Scientific Research (Read more on arXiv or HuggingFace) | Xinya Du, Wei Yang, Ziming Luo, Ason-jay, ZonglinY | LLM4SR is a survey that systematically explores the application of large language models (LLMs) across the scientific research lifecycle. The main research question is how LLMs are being integrated into various stages of scientific research, including hypothesis discovery, experiment planning and implementation, scientific writing, and peer review. The key methodology used involves a comprehensive review and analysis of existing literature, focusing on task-specific methodologies, evaluation benchmarks, and the unique roles LLMs play in each research stage. The primary results indicate that LLMs have been used to generate novel hypotheses, with one study showing LLMs generating hypotheses in chemistry and materials science that were published in high-impact journals such as Nature or Science after the LLM's training cutoff date; however, the paper does not explicitly state quantitative results across all stages. The principal implication for AI practitioners is that LLMs present significant opportunities for enhancing and automating various aspects of the scientific research process, but challenges remain in areas such as ensuring the validity of generated hypotheses and addressing ethical considerations. |
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (Read more on arXiv or HuggingFace) | Xueyu Hu, Congkai Xie, Zishu Wei, Yuhang Liu, pengxiang | InfiGUIAgent is a multimodal GUI agent designed for task automation on computing devices, trained through a two-stage supervised fine-tuning pipeline. The main research objective is to develop a GUI agent with enhanced reasoning capabilities and reduced reliance on textual annotations. The key methodology involves two-stage supervised fine-tuning (SFT), with Stage 1 focusing on fundamental skills like GUI understanding and grounding using diverse datasets, and Stage 2 integrating hierarchical reasoning and expectation-reflection reasoning skills into synthesized data. Primary results show that InfiGUIAgent-2B achieves 76.3% accuracy on the ScreenSpot benchmark, surpassing several strong baselines. For AI practitioners, the principal implication is that a two-stage SFT approach incorporating hierarchical and expectation-reflection reasoning can significantly enhance GUI agents' performance on benchmarks without reliance on additional GUI metadata, suggesting a path towards more robust and autonomous GUI automation. |
GeAR: Generation Augmented Retrieval (Read more on arXiv or HuggingFace) | Hao Sun, Yuefeng Zhan, Jianfeng Liu, Shaohan Huang, noobimp | GeAR: Generation Augmented Retrieval introduces a novel method to enhance document retrieval with fine-grained information localization. The main research question is whether integrating information localization capabilities into existing retrievers is possible without sacrificing their retrieval capabilities. The key methodology involves constructing (query-document-information) triples and employing a text decoder to generate relevant fine-grained information from fused query and document representations, optimized with contrastive learning. The primary results show that GeAR achieves competitive performance on retrieval tasks, with a recall rate of 0.963 at rank 5 on the PAQ dataset, and effectively localizes information within documents. The principal implication for AI practitioners is that GeAR provides a flexible framework capable of handling both document retrieval and fine-grained unit localization simultaneously, offering new insights into the interpretation of retrieval results. A minimal sketch of such a joint contrastive-plus-generation objective appears after this table. |
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation (Read more on arXiv or HuggingFace) | Chee Seng Chan, Jiankang Deng, Jia Wei Sii, Jing Yang, Kam Woh Ng | This paper introduces Chirpy3D, a novel framework for fine-grained, creative 3D bird generation using continuous part latents. The main research objective is to enable the generation of detailed and creative 3D objects by lifting 2D fine-grained understanding into 3D space and enabling part-level control. The key methodology involves fine-tuning a multi-view diffusion model (MVDream) with 2D images, modeling part latents as continuous Gaussian distributions, and introducing a self-supervised feature consistency loss. Primary results show that Chirpy3D effectively reconstructs 3D subjects, with a cosine similarity score of 0.724 for part composition, and generates novel species with diverse parts. The principal implication for AI practitioners is that Chirpy3D offers a new approach for generating high-quality, creative 3D assets with fine-grained control, which is directly applicable to improve creative freedom and output detail in 3D content creation. |
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images (Read more on arXiv or HuggingFace) | Varun Jampani, James M. Rehg, Aaryaman Vasishta, Zixuan Huang, mboss | SPAR3D is a two-stage model for reconstructing 3D objects from single images. The main research question is how to combine the strengths of regression-based and diffusion-based methods for single-image 3D object reconstruction while avoiding their limitations. The key methodology involves a two-stage approach: first, a point diffusion model generates a sparse 3D point cloud, and second, a meshing stage uses the point cloud and the input image to create a detailed mesh. On the GSO dataset, SPAR3D achieves a Chamfer Distance (CD) of 0.120, outperforming prior methods. The principal implication for AI practitioners is that SPAR3D offers a computationally efficient approach to generate high-quality 3D meshes from single images, with an inference speed of 0.7 seconds per object, and enables interactive user edits. |
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization (Read more on arXiv or HuggingFace) | Rajarshi Roy, Danush Khanna, Suranjana Trivedy, Amitava Das, amanchadha | i) This paper introduces DPO-Kernels, an enhanced framework for direct preference optimization (DPO) that integrates kernel methods and alternative divergence measures to improve alignment of large language models with human preferences. ii) The main research objective is to address the limitations of standard DPO in aligning models with diverse human values and preferences by proposing a more expressive and adaptable framework. iii) The key methodology involves integrating kernelized representations (using polynomial, RBF, Mahalanobis, and spectral kernels), a hybrid loss function combining probability-based and embedding-based signals, and alternative divergence measures (Jensen-Shannon, Hellinger, Rényi, Bhattacharyya, Wasserstein, and f-divergences), along with data-driven selection of kernel-divergence pairs and a Hierarchical Mixture of Kernels (HMK). iv) Evaluations on 12 datasets show that DPO-Kernels, particularly HMK, achieve state-of-the-art generalization in factuality, safety, reasoning, and instruction-following tasks, with HMK demonstrating a performance improvement of up to 9.2% over baseline DPO. v) The principal implication for AI practitioners is that DPO-Kernels provide a more robust and flexible framework for preference alignment in large language models, but they must carefully consider the 3-4x higher computational costs associated with HMK. |
EpiCoder: Encompassing Diversity and Complexity in Code Generation (Read more on arXiv or HuggingFace) | Xiao Liu, Jie Wu, Yaoxiang Wang, CharonBony, Ringo1110 | EpiCoder is a novel feature tree-based code synthesis framework designed to enhance the diversity and complexity of code generation. The main research question is how to generate more nuanced, diverse, and complex code instruction data that aligns with real-world programming scenarios. The key methodology involves a feature tree-based synthesis inspired by Abstract Syntax Trees (AST) that models semantic relationships between code elements, iteratively refined to enhance feature diversity. The primary results show that EpiCoder-Qwen-7B achieves state-of-the-art performance on function-level code generation benchmarks, with an 81.7% average pass rate on HumanEval and MBPP. The principal implication for AI practitioners is that using EpiCoder's feature tree-based framework can significantly improve the quality and diversity of synthesized code data, leading to more robust and adaptable code generation models. |
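
GeAR (above) trains a retriever with a contrastive objective over (query, document) pairs while a text decoder learns to generate the fine-grained information unit from the fused query-document representation. The sketch below illustrates such a joint loss under stated assumptions; tensor names, shapes, and the temperature are illustrative, not GeAR's actual interface.

```python
import torch
import torch.nn.functional as F

def gear_style_loss(q_emb, d_emb, gen_logits, info_token_ids, tau=0.05):
    """q_emb, d_emb: [B, H] pooled query/document embeddings;
    gen_logits: [B, T, V] decoder logits; info_token_ids: [B, T] target
    tokens of the fine-grained information unit (illustrative shapes)."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    sim = q @ d.t() / tau                        # in-batch similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    contrastive = F.cross_entropy(sim, labels)   # retrieval objective
    generation = F.cross_entropy(                # information-localization objective
        gen_logits.flatten(0, 1), info_token_ids.flatten()
    )
    return contrastive + generation
```
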
Title | Authors | Summary |
---|---|---|
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models (Read more on arXiv or HuggingFace) | chuyi777 | REINFORCE++ is a novel variant of the REINFORCE algorithm designed to enhance the alignment of large language models (LLMs) with human preferences. The main research objective is to develop a more efficient and stable reinforcement learning from human feedback (RLHF) algorithm by simplifying the REINFORCE framework and removing the need for a critic network. Key methodologies include a token-level Kullback-Leibler (KL) penalty, Proximal Policy Optimization (PPO)-clip integration, mini-batch updates, and reward normalization. Primary results demonstrate that REINFORCE++ achieves comparable or superior performance to PPO and Group Relative Policy Optimization (GRPO), with a specific quantitative finding showing a reduction in training time from 60 hours (for PPO) to 42 hours on an NVIDIA H100 with the Llama 3 8B model. The principal implication for AI practitioners is that REINFORCE++ provides a simpler and more computationally efficient method for aligning LLMs, making it a valuable alternative to more complex RLHF approaches like PPO. A minimal sketch of the token-level KL penalty and PPO-clip loss appears after this table. |
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models (Read more on arXiv or HuggingFace) | Lefan Wang, Weihan Wang, Zhuoyi Yang, LiquidAmmonia, wenyi | MotionBench: A comprehensive benchmark for evaluating fine-grained video motion understanding in vision-language models (VLMs). The research objective was to assess the capability of VLMs in understanding fine-grained video motion and to improve VLM performance in this area. The key methodology involved creating a new benchmark, MotionBench, with diverse video sources and question types focusing on motion-level perception, along with proposing a novel Through-Encoder (TE) Fusion method for enhancing video feature representation. The primary results indicated that existing VLMs perform poorly in understanding fine-grained motions, achieving accuracies below 60% on MotionBench; TE Fusion yielded improvements in motion understanding. The paper does not clearly specify the improvement magnitude. The principal implication is that MotionBench provides a valuable resource for evaluating and improving video understanding VLMs, highlighting a significant deficiency in current models' ability to handle fine-grained motion and offering a novel architectural approach to address this limitation. |
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos (Read more on arXiv or HuggingFace) | Shilin Xu, Zilong Huang, Tao Zhang, Xiangtai Li, HarborYuan | Sa2VA is a unified model for dense grounded understanding of images and videos, integrating SAM-2 and LLaVA-like models. The research objective was to create a model capable of handling a wide range of image and video tasks, including referring segmentation and conversation, within a single framework. The methodology involved a one-shot visual instruction tuning approach, unifying text, image, and video into a shared LLM token space. Sa2VA achieved state-of-the-art results on multiple benchmarks, exceeding GLaMM-7B by 2.1, 3.6, and 4.5 cIoU on RefCOCO, RefCOCO+, and RefCOCOg respectively. For AI practitioners, this work provides a unified, highly effective architecture and demonstrates that integrating powerful visual foundation models with LLMs is highly effective for a broad range of vision-language tasks, offering a superior approach to the design of multi-modal models. |
Cosmos World Foundation Model Platform for Physical AI (Read more on arXiv or HuggingFace) | Yogesh Balaji, Maciej Bala, Arslan Ali, Niket Agarwal, NVIDIA | The Cosmos World Foundation Model Platform facilitates Physical AI development by providing pre-trained world models and tools for customization. The research objective was to create a platform for building and fine-tuning world foundation models (WFMs) for Physical AI applications. The methodology involved developing video data curation, pre-trained WFMs using diffusion and autoregressive models, video tokenizers, and post-training techniques. Results showed Cosmos Tokenizer achieved a 4dB PSNR improvement over existing tokenizers on the DAVIS dataset at 8× spatial compression. The platform's open-source nature and model availability empower AI practitioners to build and deploy customized WFMs for their specific Physical AI systems, potentially accelerating development in various applications. |
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (Read more on arXiv or HuggingFace) | Yang Feng, Zhe Yang, Qingkai Fang, Shaolei Zhang | LLaVA-Mini introduces an efficient large multimodal model using a single vision token to represent images and videos. The research objective was to develop efficient large multimodal models (LMMs) by minimizing the number of vision tokens while maintaining performance. The key methodology involved modality pre-fusion to fuse visual information into text tokens before feeding them into the LLM backbone, along with a compression module to reduce vision token quantity. Results show LLaVA-Mini outperforms LLaVA-v1.5 with only one vision token instead of 576, achieving a 77% reduction in FLOPs. This research demonstrates the feasibility of building highly efficient LMMs with significantly reduced computational costs, potentially leading to faster inference times and wider accessibility for real-time multimodal applications. |
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control (Read more on arXiv or HuggingFace) | Zhiyang Dou, Jiahao Lu, Rui Yan, Zekai Gu, pengHTYX | Diffusion as Shader (DaS) is a 3D-aware video diffusion model that enables versatile control over video generation by utilizing 3D tracking videos as conditional inputs. The main research objective is to develop a unified framework for video generation that supports multiple control tasks, such as mesh-to-video generation, camera control, motion transfer, and object manipulation. The key methodology involves using 3D tracking videos, which represent the motion trajectories of 3D points, as control inputs to a video diffusion model that acts as a shader to compute shaded appearances. The primary results demonstrate that DaS outperforms baseline methods on camera control, achieving a rotation error of 10.40 degrees and a translation error of 5.97 degrees on large camera movements, compared to 39.86 and 67.05 for MotionCtrl. For AI practitioners, the principal implication is that leveraging 3D tracking videos as control signals enables more precise and temporally consistent control over video generation compared to methods that rely solely on 2D control signals. |
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting (Read more on arXiv or HuggingFace) | Jihyong Oh, Won-Sik Cheong, Jun Young Jeong, Joonsoo Kim, Sangwoon Kwak | MoDec-GS is a memory-efficient 3D Gaussian splatting framework for reconstructing novel views from dynamic videos with complex motions. The research objective was to develop a method for efficiently representing and rendering dynamic scenes with complex motions, addressing limitations in existing methods regarding storage and representation of complex movements. MoDec-GS uses Global-to-Local Motion Decomposition (GLMD) and Temporal Interval Adjustment (TIA) to model complex motions effectively and efficiently. The results demonstrate a 70% average reduction in model size compared to state-of-the-art methods while maintaining or improving rendering quality; specifically, on the iPhone dataset, MoDec-GS achieved a 0.7dB PSNR gain and a 94% storage reduction compared to the second-best method. This work provides a highly compact and efficient approach for dynamic scene representation relevant to AI practitioners working on real-time video processing and novel view synthesis. |
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides (Read more on arXiv or HuggingFace) | Hongyu Lin, Jia Zheng, Hao Kong, Xinyan Guan, Forceless | PPTAgent is a novel two-stage, edit-based framework for automatic presentation generation that leverages reference presentations and LLMs. The research aimed to improve presentation generation by addressing the limitations of existing text-to-slide methods. PPTAgent utilizes a two-stage process: presentation analysis (clustering slides and extracting schemas) and presentation generation (iterative editing of reference slides). Experiments showed that PPTAgent significantly outperformed baselines across three dimensions (Content, Design, Coherence), achieving an average score of 3.67 and a 97.8% success rate. This work provides a new approach for AI practitioners to generate high-quality presentations, improving efficiency and visual effectiveness in communication. |
MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control (Read more on arXiv or HuggingFace) | Guoying Zhao, Huai-Qian Khor, Xingxun Jiang, Tuomas Varanka, Mengting Wei | MagicFace: High-fidelity facial expression editing using action unit (AU) variations as conditions within a Stable Diffusion framework. The research objective was to develop a method for high-fidelity facial expression editing that is both interpretable and controllable by adjusting AU variations. The methodology involved a diffusion model conditioned on AU variations, an ID encoder for identity preservation, and an Attribute Controller for maintaining background and pose consistency. The model was trained on a dataset of 30,000 image pairs. The primary result showed that MagicFace achieved a mean squared error (MSE) of 0.261 for AU intensity, outperforming other methods. The main implication for AI practitioners is the demonstration of precise and controllable facial expression editing using AU variations within a diffusion model framework; this offers improvements for generating photorealistic facial expressions for applications like virtual characters and avatars. |
Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers (Read more on arXiv or HuggingFace) | Zexin Yan, Bohao Peng, Bin Xia, Yaoyang Liu, julianjuaner | Magic Mirror: A novel framework for generating high-fidelity identity-preserved videos using video diffusion transformers. The research objective is to develop a method for generating high-quality, identity-preserved videos with dynamic motion, addressing the challenge of maintaining consistent identity while producing natural motion in existing text-to-video generation models. The methodology involves a dual-branch facial feature extractor, a lightweight cross-modal adapter with Conditioned Adaptive Normalization (CAN) for efficient identity integration, and a two-stage training strategy. The primary results demonstrate that Magic Mirror outperforms existing methods, achieving an average ID similarity of 0.911 while maintaining high video quality metrics and dynamic motion. The overall preference score from a user study was 7.315. The paper does not explicitly specify if the user study is statistically significant. The most impactful finding is the successful integration of identity preservation into a video diffusion transformer architecture without person-specific fine-tuning, giving AI practitioners working with video diffusion models a more efficient and scalable route to personalized video generation. |
Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback (Read more on arXiv or HuggingFace) | Tao Chen, Botian Shi, Xiangchao Yan, Jiakang Yuan, BoZhang | DOLPHIN is a closed-loop open-ended auto-research framework automating the scientific research process. The research aims to create a fully automated scientific research system capable of generating research ideas, performing experiments, and iteratively refining ideas based on results. DOLPHIN employs LLMs for idea generation and code generation, incorporating an exception-traceback-guided debugging process. Experiments across three benchmark datasets demonstrated DOLPHIN generating methods comparable to state-of-the-art in some tasks; for example, a 2.9% improvement in ModelNet40 accuracy over the baseline. This work provides a significant advancement for AI practitioners in automating the scientific research process, though the paper lacks information regarding certain experimental setup details. |
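
REINFORCE++ (above) is described as REINFORCE with a token-level KL penalty, PPO-clip, mini-batch updates, and reward normalization, but no critic. The sketch below is a simplified, assumption-laden illustration of such a loss; how the sequence reward is broadcast across tokens and how the KL term is estimated are simplifications of this sketch, and all names are placeholders.

```python
import torch

def reinforce_pp_loss(logp, logp_old, logp_ref, reward, kl_coef=0.01, clip=0.2):
    """logp, logp_old, logp_ref: [B, T] per-token log-probs under the current,
    behavior, and reference policies; reward: [B] scalar sequence rewards."""
    # Token-level KL penalty folded into the (broadcast) sequence reward.
    kl = logp_old - logp_ref                        # per-token KL estimate
    shaped = reward.unsqueeze(1) - kl_coef * kl     # [B, T] shaped returns
    # Reward normalization across the batch, used directly as the advantage
    # (no learned value baseline, i.e. no critic network).
    adv = (shaped - shaped.mean()) / (shaped.std() + 1e-8)
    # PPO-style clipped surrogate applied per token.
    ratio = (logp - logp_old).exp()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
    return -torch.min(unclipped, clipped).mean()
```
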
Title | Authors | Summary |
---|---|---|
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution (Read more on arXiv or HuggingFace) | yingtai, zhenheny, chenzhao, yinhongliu, SherryX | STAR introduces a novel approach for real-world video super-resolution using text-to-video models. The research objective was to enhance spatio-temporal quality in restored videos by addressing artifacts from complex degradations and mitigating fidelity loss from powerful generative models. The methodology involved a Local Information Enhancement Module (LIEM) and a Dynamic Frequency (DF) Loss. Results showed STAR outperforming state-of-the-art methods, achieving a 0.5422 DOVER score on the UDM10 dataset. This research highlights the significant potential of integrating text-to-video models and specifically designed loss functions for improving the fidelity and temporal consistency of real-world video super-resolution. |
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning (Read more on arXiv or HuggingFace) | lindahua, yhcao, KennyUTC, yuhangzang, BeichenZhang | i) BoostStep improves large language models' mathematical reasoning by enhancing single-step reasoning through step-level in-context learning. ii) The main objective is to address the granularity mismatch and negative-effect noise in in-context learning examples to improve the reasoning quality within each step of a multi-step mathematical problem-solving process. iii) The key methodology is step-level in-context learning with a "first-try" strategy, which aligns the granularity between retrieving and reasoning on a step-by-step basis using an example problem bank constructed with step-level granularity. iv) Quantitatively, BoostStep improves GPT-4o's performance on various mathematical benchmarks by 3.6% and Qwen2.5-Math-72B by 2.0%. v) For AI practitioners, BoostStep provides a method to enhance the mathematical reasoning ability of large language models without additional training, demonstrating the importance of fine-grained, step-level guidance in complex problem-solving. |
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction (Read more on arXiv or HuggingFace) | myownskyW7, lindahua, yhcao, yuhangzang, Mar2Ding | Dispider is a novel system designed for active real-time interaction with streaming video using large language models (LLMs). The main research objective is to enable video LLMs to process and respond to streaming video input continuously and in real-time, unlike existing offline models. The key methodology is a disentangled architecture that separates perception, decision, and reaction into asynchronous modules operating in parallel, with a lightweight proactive streaming video processing module and an asynchronous interaction module. Primary results show that Dispider outperforms VideoLLM-online in the Proactive Output task with a score of 25.3, and achieves a leading performance of 55.6 on the EgoSchema benchmark. The principal implication for AI practitioners is that Dispider's disentangled and asynchronous design enables more efficient and responsive real-time video interaction, making it ideal for long-duration video streams and maintaining strong performance in conventional video QA tasks. |
Test-time Computing: from System-1 Thinking to System-2 Thinking (Read more on arXiv or HuggingFace) | Jia Xu, Kaixin Wu, Hai Ye, douvleplus, Yisam | This paper surveys test-time computing methods, focusing on their role in enabling the transition from System-1 to System-2 thinking in AI models. The main research question is how test-time computing can enhance the robustness, generalization, and reasoning ability of AI models, particularly large language models (LLMs). The methodology involves a comprehensive review and categorization of existing literature on test-time computing techniques, including test-time adaptation and test-time reasoning, applied to both System-1 and System-2 models. A primary result highlighted is that self-consistency Chain-of-Thought prompting can improve accuracy by 18% over vanilla Chain-of-Thought in math reasoning tasks. The principal implication for AI practitioners is that leveraging test-time computing strategies can significantly enhance model performance on downstream tasks, particularly in complex reasoning scenarios, without the need for retraining. A minimal sketch of self-consistency decoding appears after this table. |
Personalized Graph-Based Retrieval for Large Language Models (Read more on arXiv or HuggingFace) | Franck-Dernoncourt, namyongp, Ojasmitha17, Tobilee, StevenAu | Personalized Graph-Based Retrieval for Large Language Models introduces a framework called PGraphRAG to enhance personalized text generation. The main research question is how to improve the performance of large language models (LLMs) in generating personalized text, especially in cold-start scenarios with sparse user data. The key methodology is PGraphRAG, a framework that leverages user-centric knowledge graphs to augment prompts with user-relevant context during the retrieval process. Primary results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, with a +32.1% improvement in ROUGE-1 for Hotel Experience Generation using the LLaMA-3.1-8B model. The principal implication for AI practitioners is that integrating structured user knowledge via PGraphRAG enhances the ability of LLMs to generate personalized and contextually appropriate text, particularly when user history is limited. |
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring (Read more on arXiv or HuggingFace) | willieneis, oliu-io, upup-ashton-wang, Johannes, oliu-io | METAGENE-1: A 7-billion parameter autoregressive transformer model is pretrained on a novel metagenomic dataset for pandemic monitoring. The research aimed to pretrain a foundation model on diverse metagenomic DNA and RNA sequences from human wastewater samples. Byte-pair encoding (BPE) tokenization was used for the dataset, and the model was pretrained using a decoder-style architecture. METAGENE-1 achieved state-of-the-art results on pathogen detection benchmarks, with a 92.96% average MCC score across four datasets. The successful pretraining of a large-scale metagenomic language model demonstrates the potential of this technology for applications in public health and opens up avenues for AI practitioners to develop and deploy similar models for diverse genomic tasks. |
TransPixar: Advancing Text-to-Video Generation with Transparency (Read more on arXiv or HuggingFace) | Yijun Li, yingcongchen, HeZhang, zhifeichen097, wileewang | TransPixar introduces a method for generating RGBA videos from text prompts, addressing the challenge of producing transparent visual effects in text-to-video models. The research objective was to extend pretrained video models to generate RGBA videos while preserving original RGB capabilities. The methodology involved incorporating alpha-specific tokens and using LoRA-based fine-tuning within a diffusion transformer architecture, optimizing attention mechanisms to align RGB and alpha channels. A user study revealed a significant preference for TransPixar's RGBA alignment (93.3%) over a comparable method (6.7%). This work demonstrates that high-quality RGBA video generation is achievable with limited training data using a modified DiT architecture, offering a practical advancement for creating realistic video effects with transparency for applications such as VFX. |
Ingredients: Blending Custom Photos with Video Diffusion Transformers (Read more on arXiv or HuggingFace) | Di Qiu, MichaelFan, Changqian, Debang, onion | This paper introduces Ingredients, a framework for customizing video generation by incorporating multiple specific identity (ID) photos with video diffusion Transformers. The main research question is how to achieve multi-ID customization in video generation while preserving high-fidelity identity, enhancing content flexibility, and ensuring natural video generation. The key methodology involves a facial extractor for versatile facial feature capture, a multi-scale projector to map embeddings into the contextual space of image query in video diffusion Transformers, and an ID router for dynamically combining and allocating multiple ID embeddings to corresponding space-time regions, trained through a multi-stage protocol. The primary results show that the proposed Ingredients method achieved a face similarity score of 77.1% in multi-ID video generation, significantly outperforming baselines. The principal implication for AI practitioners is that Ingredients provides a framework for multi-ID customization in video generation based on diffusion Transformers that requires no per-identity fine-tuning at inference, preserving multiple IDs while supporting precise textual control signals. |
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation (Read more on arXiv or HuggingFace) | Ruijie Zhu, Hao Zhang, Bo Li, Zerong Wang, Ziyang Song | DepthMaster is a single-step diffusion model designed for improved monocular depth estimation by adapting generative features to this discriminative task. The main research question is how to adapt generative features in diffusion models to enhance the performance of discriminative depth estimation while maintaining efficiency. The key methodology involves a Feature Alignment module to incorporate high-quality semantic features into the denoising network and a Fourier Enhancement module to balance low-frequency structure and high-frequency details in a single forward pass, using a two-stage training strategy. The primary results show that DepthMaster achieves state-of-the-art zero-shot performance, with an 8.2% AbsRel on the KITTI dataset. The principal implication for AI practitioners is that DepthMaster provides an effective way to leverage diffusion models for depth estimation with improved generalization and detail preservation, which is particularly beneficial for applications such as autonomous driving. |
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation (Read more on arXiv or HuggingFace) | Yaniv Taigman, Shelly Sheynin, Amit Zohar, Yuval Kirstain, GuyYariv | Through-The-Mask proposes a two-stage image-to-video generation framework using mask-based motion trajectories. The research objective was to improve the accuracy and consistency of object motion in generated videos, especially in multi-object scenarios. The methodology involved generating mask-based motion trajectories as an intermediate representation, conditioned on the input image, segmentation mask, and text prompt, followed by video generation conditioned on this representation. Results demonstrated state-of-the-art performance on several benchmarks, including a FVD score of 925.39 (U-Net) on the SA-V-128 benchmark. This work provides AI practitioners with a novel two-stage framework for I2V generation that significantly improves motion realism and consistency, particularly in complex scenes. |
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking (Read more on arXiv or HuggingFace) | Yijin Li, Xiaoyu Shi, Zhaoyang Huang, Weikang Bian, wangfuyun | GS-DiT advances video generation by enabling 4D video control using pseudo 4D Gaussian fields and efficient dense 3D point tracking. The main research objective is to enable precise 4D control in video generation, such as multi-camera shooting and dolly zoom, without requiring expensive multi-view videos. The key methodology involves constructing a pseudo 4D Gaussian field with a novel dense 3D point tracking method (D3D-PT) and finetuning a pretrained Diffusion Transformer (DiT) to generate videos guided by the rendered videos from this field. The primary result is that D3D-PT outperforms SpatialTracker in accuracy and accelerates dense 3D point tracking by two orders of magnitude, achieving a 3D-AJ score of 9.0 on the TAPVid-3D minival split. The principal implication for AI practitioners is that GS-DiT enables 4D controllable video generation from monocular videos, broadening the applicability of advanced cinematic techniques in AI-driven video content creation. |
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models (Read more on arXiv or HuggingFace) | Weiqiang Wang, Huijia Zhu, Yaojie Lu, Shuhen Zhou, Yanjiang Liu | AUTO-RT is a reinforcement learning framework for automatically exploring and optimizing attack strategies to uncover security vulnerabilities in large language models (LLMs). The main research objective is to develop an automated red-teaming approach that can efficiently identify complex vulnerabilities in LLMs without relying on predefined safety flaws or fixed attack strategies. The key methodology involves two mechanisms: Early-terminated Exploration, which focuses on high-potential attack strategies, and a Progressive Reward Tracking algorithm that uses intermediate downgrade models to refine the search trajectory. The primary result is that AUTO-RT achieved a 16.63% higher success rate in detecting vulnerabilities compared to existing methods. The principal implication for AI practitioners is that they can use AUTO-RT to improve the efficiency of discovering vulnerabilities in LLMs, enabling more robust and secure language model development. |
Samba-ASR: State-of-the-Art Speech Recognition Leveraging Structured State-Space Models (Read more on arXiv or HuggingFace) | Kartik-angadi, kruthika, SyedAbdul | Samba-ASR is a novel speech recognition model utilizing state-space models (SSMs) for improved accuracy and efficiency. The main research objective is to develop an Automatic Speech Recognition (ASR) model that outperforms existing transformer-based models by leveraging the Mamba architecture. The key methodology involves replacing transformer encoders with Mamba's state-space modeling in both the encoder and decoder, using a Mamba-cross-connection mechanism, and training on a combined dataset of LibriSpeech, GigaSpeech, and SPGISpeech. The primary result is that Samba-ASR achieved a Word Error Rate (WER) of 3.65% on average across multiple benchmark datasets, including a 1.17% WER on LibriSpeech Clean. For AI practitioners, Samba-ASR offers a new state-of-the-art model for speech recognition, demonstrating that SSMs can surpass transformers in accuracy and efficiency, particularly for long audio sequences. |
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use (Read more on arXiv or HuggingFace) | Yufei Xu, Xuesong Yao, Zhengyin Du, Junjie Ye, maverick1994 | ToolHop is a new benchmark for evaluating large language models (LLMs) on multi-hop tool use, focusing on their ability to decompose complex queries and utilize multiple tools sequentially. The main research objective is to assess LLMs' capabilities in understanding, reasoning, and function-calling within a multi-hop tool-use context. The key methodology involves a query-driven data construction process that includes tool creation, document refinement, and code generation, resulting in 995 multi-hop queries and 3,912 associated tools. The primary result is that the leading model, GPT-4o, achieved an accuracy of only 49.04% in the mandatory tool use scenario, highlighting significant limitations in current LLMs' multi-hop tool-use abilities. The principal implication for AI practitioners is that there is substantial room for improvement in developing LLMs that can effectively handle complex multi-hop reasoning and tool-use tasks, as evidenced by the leading model's relatively low performance. |
Scaling Laws for Floating Point Quantization Training (Read more on arXiv or HuggingFace) | Kan Wu, Weidong Han, Ruobing Xie, Shuaipeng Li, Xingwu Sun | This paper explores scaling laws for floating-point quantization training in large language models (LLMs) to optimize low-precision training. The main research question is how do factors like data size, model size, exponent bits, mantissa bits, and block size of scaling factors affect the performance of LLMs under floating-point quantization training. The key methodology involves training 366 LLMs with various configurations and analyzing the relationships between these factors and model loss to formulate a unified scaling law. The primary result is a unified scaling law that accurately predicts LLM performance under different floating-point quantization settings, with the optimal floating-point quantization precision being directly proportional to computational power. The principal implication for AI practitioners is that they can use the derived scaling law to optimize the trade-off between computational cost and performance when training LLMs with floating-point quantization, particularly that the best cost-performance precision lies between 4-8 bits within a wide computational power range. |
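
Self-consistency Chain-of-Thought, cited in the test-time computing survey above, samples several reasoning paths and takes a majority vote over their final answers. A minimal sketch follows; `generate` is a hypothetical sampling callable standing in for any LLM API, and the "Answer:" convention is an assumption of this sketch.

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=16, temperature=0.7):
    """Sample n chains of thought and majority-vote their final answers."""
    answers = []
    for _ in range(n_samples):
        # Each call is assumed to return text ending in "Answer: <value>".
        completion = generate(prompt, temperature=temperature)
        if "Answer:" in completion:
            answers.append(completion.rsplit("Answer:", 1)[-1].strip())
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```
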
Title | Authors | Summary |
---|---|---|
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation (Read more on arXiv or HuggingFace) | jzzzzk, Shengcong, lyuukuu, pathcn, SiyuanH | i) ENERVERSE is a comprehensive framework for embodied future space generation designed for robotic manipulation tasks, integrating a novel chunk-wise autoregressive diffusion model with a Free Anchor View (FAV) space and a 4D Gaussian Splatting (4DGS) data engine pipeline. ii) The main research objective is to develop a method for generating embodied future spaces that enhances a robot's ability to perform long-range manipulation tasks by improving predictive capabilities and spatial understanding. iii) The key methodology involves a chunk-wise autoregressive diffusion model with a sparse contextual memory mechanism, a FAV-based 4D future space generation method, and a data flywheel pipeline integrating 4DGS optimization with multi-view video generation. iv) The proposed method achieved a state-of-the-art average success rate of 88.5 on the LIBERO benchmark with a Three Third View configuration. v) For AI practitioners, the principal implication is that integrating ENERVERSE's future space generation prior into policy learning can significantly enhance the performance of robotic systems, particularly in complex, long-range manipulation tasks, by leveraging enhanced spatial understanding and a robust data generation pipeline. |
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction (Read more on arXiv or HuggingFace) | hertin, shenyunhang, yifanzhang114, xiongwang, linhaojia13 | VITA-1.5 is a multimodal large language model designed for real-time vision and speech interaction. The main research objective is to develop a model that integrates vision, language, and speech modalities without compromising performance due to modality differences. The key methodology involves a three-stage training process: vision-language training, audio input tuning, and audio output tuning, progressively incorporating each modality. The primary results show that VITA-1.5 achieves a Character Error Rate (CER) of 2.2 on the aishell-1 Mandarin speech recognition benchmark and maintains comparable performance to state-of-the-art models in vision tasks after audio training. The principal implication for AI practitioners is that VITA-1.5 provides an effective framework for building multimodal AI systems with near real-time vision and speech interaction capabilities, eliminating the need for separate ASR and TTS modules. |
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM (Read more on arXiv or HuggingFace) | jrwen, whenfra, yifanli, JohnCage, Richard1999 | Virgo is a multimodal slow-thinking system developed by fine-tuning a capable MLLM with a small amount of textual long-form thought data. The main research question is whether slow-thinking ability can be transferred across modalities through fine-tuning with text-based long-thought data and if this ability is comparable to that distilled from multimodal slow-thinking systems. The key methodology involves fine-tuning Qwen2-VL-72B-Instruct with textual and visual long-thought instruction datasets, including data distilled from other slow-thinking models. The primary result is that Virgo-72B, fine-tuned with 5K textual instructions, achieved 48.4% accuracy on MathVerse, which is comparable to or surpasses commercial reasoning systems. The principal implication for AI practitioners is that fine-tuning MLLMs with textual long-form thought data can effectively transfer slow-thinking capacities, suggesting a simpler approach to developing such systems. |
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation (Read more on arXiv or HuggingFace) | Jiajun Xu, Yuanming Yang, Jiale Cheng, Yu Huang, xujz0703 | i) The paper introduces VisionReward, a fine-grained, multi-dimensional reward model for aligning visual generation models with human preferences, and a Multi-Objective Preference Optimization (MPO) algorithm for stable model tuning. ii) The main research objective is to develop a reward model that accurately and interpretably predicts human preferences in both image and video generation, addressing the limitations of existing reward models and optimization methods. iii) The key methodology involves decomposing human preferences into multiple dimensions, represented by a series of judgment questions, linearly weighted and summed to produce an interpretable score, and using a multi-objective preference learning algorithm to address confounding factors in preference data. iv) The primary results show that VisionReward surpasses existing methods in video preference prediction, outperforming VideoScore by 17.2% in accuracy. v) The principal implication for AI practitioners is that they can use VisionReward to better align image and video generation models with human preferences, leading to more satisfactory outputs in visual content creation. A minimal sketch of the weighted checklist scoring scheme appears after this table. |
Graph Generative Pre-trained Transformer (Read more on arXiv or HuggingFace) | XiaolinXu, y6q9, RArchered, Spony, xchen16 | 1. Summary: The paper introduces the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that generates graphs as sequences of nodes and edges, utilizing a transformer decoder for next-token prediction, and explores fine-tuning for goal-oriented generation and property prediction. 2. Main research question or objective: The main objective is to develop an efficient graph generative model that leverages a novel sequence-based representation and auto-regressive transformer architecture. 3. Key methodology used: The key methodology involves representing graphs as sequences, training a transformer decoder on these sequences using next-token prediction, and applying fine-tuning strategies such as rejection sampling and reinforcement learning for downstream tasks. 4. Primary results: G2PT achieves superior performance on generic graph and molecule datasets; for instance, on the MOSES dataset, G2PT achieves a validity score of 97.2 and an FCD score of 1.02. 5. Principal implication for AI practitioners: AI practitioners can utilize G2PT as a versatile framework for graph generation and property prediction tasks, benefiting from its strong adaptability and superior performance demonstrated across multiple datasets. |
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models (Read more on arXiv or HuggingFace) | anoperson, Franck-Dernoncourt, ryanrossi, ntnghia1811, Hieuman | LUSIFER is a zero-shot approach that enhances multilingual embeddings of English-centric large language models (LLMs) without requiring multilingual training data. The main research objective is to adapt LLM-based embedding models for multilingual tasks without requiring explicit multilingual supervision. The key methodology involves integrating a multilingual encoder (XLM-R) with an English-centric LLM (Mistral-7B) using a connector with minimal trainable parameters, trained in two stages: alignment and representation finetuning. The primary result is that LUSIFER achieved a state-of-the-art average score of 62.63 across 14 languages on five embedding tasks, outperforming the previous best baseline by 3.19 points. For AI practitioners, LUSIFER offers an effective method to enhance multilingual performance of English-centric LLM embedding models without the need for multilingual training data or architectural modifications, significantly improving performance in medium and low-resource languages. |
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery (Read more on arXiv or HuggingFace) | Louise Li, Lyle Goodyear, ngoodman, michaelyli, obiwan96 | BoxingGym is a benchmark for evaluating AI agents on scientific reasoning tasks. Main research question or objective: How well can current language models perform automated experimental design and model discovery in a variety of scientific domains? Key methodology used: The authors introduce BoxingGym, a benchmark with 10 environments based on real-world scientific models, where agents interact by proposing experiments, observing outcomes, and refining models, evaluated using expected information gain (EIG) and a communication-based model discovery metric. Primary results: GPT-4o struggles with both experimental design and model discovery, with an average standardized prediction error of 0.74 on the hyperbolic discounting choice task after 10 experiments. Augmenting the agent with an explicit statistical model does not reliably improve these results. Principal implication for AI practitioners: The benchmark highlights significant limitations of current large language models (LLMs) in performing scientific reasoning, suggesting a need for developing new methods for automated experimental design and model discovery. |
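BoxingGym scores proposed experiments by expected information gain (EIG). As a point of reference, here is a minimal, self-contained sketch of how EIG can be computed for a toy discrete Bayesian model; the coin-bias setup, the grid discretization, and every name below are illustrative assumptions, not taken from the benchmark's environments.

```python
import numpy as np
from math import comb

# Toy illustration of expected information gain (EIG), the design-scoring metric named above.
thetas = np.linspace(0.01, 0.99, 99)             # discretized unknown parameter (a coin's bias)
prior = np.full(len(thetas), 1.0 / len(thetas))  # uniform prior

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def eig(n_flips: int) -> float:
    """EIG of observing n Bernoulli(theta) flips: H(prior) - E_y[H(posterior | y)]."""
    expected_posterior_entropy = 0.0
    for y in range(n_flips + 1):
        likelihood = comb(n_flips, y) * thetas**y * (1 - thetas) ** (n_flips - y)  # p(y | theta)
        marginal = float(np.sum(likelihood * prior))                               # p(y)
        posterior = likelihood * prior / marginal
        expected_posterior_entropy += marginal * entropy(posterior)
    return entropy(prior) - expected_posterior_entropy

print(eig(1), eig(10))  # a larger experiment is expected to be more informative
```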
Title | Authors | Summary |
---|---|---|
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining (Read more on arXiv or HuggingFace) | Yongliang Shen, Jiashuo Sun, Xin Li, Hang Zhang, Wenqi Zhang | A high-quality multimodal textbook corpus, constructed from 2.5 years of instructional videos, is introduced for vision-language model (VLM) pretraining. The research aimed to create a more coherent, knowledge-rich interleaved corpus than existing web-crawled datasets. The methodology involved LLM-based video collection and filtering, followed by progressive extraction and refinement of visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos. Experiments demonstrated significantly improved pretraining performance, with VLMs achieving an average gain of +4.6% across seven benchmarks in 0-4 shot settings (e.g., +20% improvement on ScienceQA). The resulting textbook dataset offers superior interleaved context awareness, beneficial for improving VLM knowledge and reasoning capabilities. |
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control (Read more on arXiv or HuggingFace) | Xiang Bai, Sihui Ji, Xi Chen, Hao Luo, Yuanpeng Tu | VideoAnydoor is a zero-shot video object insertion framework achieving high-fidelity detail preservation and precise motion control. The research objective was to develop a method for accurately preserving object identity and precisely controlling object motion during video insertion. The methodology involved an end-to-end framework utilizing an ID extractor, a pixel warper for fine-grained motion control, and a reweighted reconstruction loss. Quantitative results showed VideoAnydoor outperforming existing methods, achieving a 37.7 PSNR score, exceeding previous state-of-the-art techniques. This work provides AI practitioners with a robust, end-to-end framework for high-fidelity video object insertion and precise motion control, applicable to various downstream tasks without task-specific fine-tuning. |
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings (Read more on arXiv or HuggingFace) | Dayiheng Liu, Bo Zheng, Bowen Yu, Jiaxi Yang, Shanghaoran Quan | CODEELO is a benchmark for evaluating large language models (LLMs) on competition-level code generation using human-comparable Elo ratings. The main research objective is to develop a standardized benchmark that addresses limitations of existing benchmarks, such as the unavailability of private test cases and misaligned execution environments, to effectively assess LLMs' coding abilities at a competitive level. The key methodology involves submitting LLM-generated code to the CodeForces platform for judging and calculating Elo ratings based on the performance, aligned with the platform's system but with lower variance (an illustrative Elo expected-score sketch follows this table). The primary results show that the o1-mini model achieved the highest Elo rating of 1578, surpassing nearly 90% of human participants, while most other models struggled, with many falling in the lowest 20th percentile of human competitors. The principal implication for AI practitioners is that enhancing the length of the chain-of-thought (CoT) presents a promising avenue for improving LLMs' reasoning abilities in code generation, as evidenced by the strong performance of o1-mini and QwQ-32B-Preview. |
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM (Read more on arXiv or HuggingFace) | Boqiang Zhang, Zesen Cheng, Wentong Li, Hang Zhang, Yuqian Yuan | VideoRefer Suite introduces a benchmark and model for fine-grained spatial-temporal video understanding. The research objective was to improve Video LLMs' ability to understand fine-grained spatial and temporal details in videos. A multi-agent data engine created a large-scale object-level video instruction dataset (VideoRefer-700K), and a VideoRefer model with a versatile spatial-temporal object encoder was developed. VideoRefer achieved a 3.46 average score on the VideoRefer-BenchD benchmark (a multi-dimensional evaluation of description generation), exceeding existing methods. This work provides a valuable resource (dataset, model, benchmark) for advancing Video LLM capabilities, particularly in applications requiring fine-grained object-level understanding. |
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models (Read more on arXiv or HuggingFace) | Xinggang Wang, Jingfeng Yao | Latent diffusion models with high-dimensional visual tokenizers exhibit an optimization dilemma: improved reconstruction quality comes at the cost of degraded generation performance. The research objective is to address the optimization dilemma in latent diffusion models by improving the training efficiency and generative performance of high-dimensional visual tokenizers. The key methodology is to align the latent space of the visual tokenizer with pre-trained vision foundation models during training, using a novel vision foundation model alignment loss (VF Loss). The primary result shows a significant improvement in training speed; achieving an FID score of 2.11 in just 64 epochs—a 21x speedup compared to the original DiT. Additionally, the integrated system achieved state-of-the-art performance on ImageNet 256x256 generation with an FID score of 1.35. The principal implication for AI practitioners is that the proposed VA-VAE and LightningDiT framework offers a practical solution to a common problem in latent diffusion models, enabling faster convergence and improved generation performance with higher-dimensional tokenizers. |
ProgCo: Program Helps Self-Correction of Large Language Models (Read more on arXiv or HuggingFace) | Wenbo Su, Jiaheng Liu, Weixun Wang, Yanan Wu, Xiaoshuai Song | ProgCo improves large language model (LLM) self-correction by integrating program-driven verification and refinement. The research aimed to enhance LLM self-correction, particularly for complex reasoning tasks, where existing methods often fail. ProgCo uses self-generated and self-executed verification pseudo-programs to achieve more robust verification, followed by dual refinement of both responses and programs. Experiments showed ProgCo achieved significant improvements, for example, a 5.8% accuracy increase on the MATH dataset with one round of self-correction. This work suggests that incorporating program-driven techniques can significantly improve LLM self-correction capabilities, impacting development of more reliable and robust AI systems. |
A3: Android Agent Arena for Mobile GUI Agents (Read more on arXiv or HuggingFace) | Guozhi Wang, Liang Liu, Jiayu Zhang, Hanhao Li, Yuxiang Chai | Android Agent Arena (A3) introduces a novel evaluation platform for mobile GUI agents. The research aims to address limitations of existing datasets and benchmarks by providing a comprehensive, interactive evaluation platform for mobile GUI agents operating in real-world scenarios. A3 employs a dynamic evaluation approach incorporating 201 tasks across 21 widely used third-party apps and leverages business-level LLMs for automated task evaluation. Results showed GPT-4o achieved 84% accuracy in LLM-based evaluation of task completion. A3 offers AI practitioners a more realistic and scalable evaluation framework for assessing the performance of mobile GUI agents. |
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models (Read more on arXiv or HuggingFace) | Md Hasebul Hasan, Md Tanvir Parvez, Md Tanvir Hassan, Mahir Labib Dihan, eunus | MAPEVAL is a benchmark for evaluating geo-spatial reasoning in foundation models. The main research objective is to assess foundation models' ability to handle diverse and complex map-based user queries requiring geo-spatial reasoning. The key methodology used is a new benchmark called MAPEVAL, comprising 700 unique multiple-choice questions across three task types (textual, API-based, and visual) that test spatial relationships, map infographics, travel planning, and navigation. The primary result is that Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro performed competitively, but Claude-3.5-Sonnet agents outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21% respectively in the MAPEVAL-API task. The principal implication for AI practitioners is that MAPEVAL provides a critical tool for advancing general-purpose foundation models with stronger geo-spatial understanding, as evidenced by the significant performance gaps observed even among the most advanced models. |
Dynamic Scaling of Unit Tests for Code Reward Modeling (Read more on arXiv or HuggingFace) | Sijia Luo, Jifan Yu, Jing Zhang, Xiaokang Zhang, KAKA22 | This paper investigates improving code generation accuracy by scaling the number of unit tests used for reward modeling. The research objective was to determine if increasing unit test quantity enhances reward signal quality, leading to better code selection. A unit test-based majority voting framework was employed, coupled with a novel unit test generator (CodeRM-8B) and dynamic scaling based on problem difficulty (a minimal test-based candidate-selection sketch follows this table). Results show a positive correlation between unit test quantity and reward signal quality, with a specific finding of an 18.43% performance gain for Llama3-8B on HumanEval Plus. This research indicates that scaling unit tests, particularly using CodeRM-8B and dynamic scaling, can significantly enhance code generation performance in LLMs, providing a practical method for improving model accuracy. |
MLLM-as-a-Judge for Image Safety without Human Labeling (Read more on arXiv or HuggingFace) | Felix Juefei-Xu, Xiaowen Lin, Shiyu Zhao, Shuming Hu, Zhenting Wang | This paper investigates zero-shot image safety judgment using pre-trained Multimodal Large Language Models (MLLMs). The main objective is to determine if unsafe images can be detected without human labeling, solely by querying MLLMs using a predefined safety constitution. The proposed method, CLUE, involves objectifying safety rules, assessing rule-image relevance, using debiased token probabilities for judgment, and employing cascaded chain-of-thought reasoning. Experiments demonstrate high effectiveness, achieving 95.9% recall and 94.8% accuracy with InternVL2-76B on a complex safety constitution. This work suggests a scalable, human-labeling-free approach for image safety assessment, potentially significantly reducing costs associated with existing methods. |
MapQaTor: A System for Efficient Annotation of Map Query Datasets (Read more on arXiv or HuggingFace) | Md Rizwan Parvez, Mohammed Eunus Ali, mahirlabibdihan | MapQATOR is a web application designed to efficiently create reproducible map-based question-answering datasets for evaluating large language models' geospatial reasoning capabilities. The research objective was to develop a system for streamlined annotation of map-based QA datasets, overcoming challenges in creating reliable geospatial QA data. The methodology involved building a plug-and-play web application integrating with multiple map APIs, incorporating data visualization tools, and utilizing a caching mechanism to ensure data consistency. Results demonstrated a 30x speedup in annotation compared to manual methods. The principal implication for AI practitioners is that MapQATOR significantly accelerates the creation of high-quality, reproducible geospatial datasets crucial for training and benchmarking LLMs on complex reasoning tasks. |
Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing (Read more on arXiv or HuggingFace) | Jiajun Zhu, Yuehao Wang, Ruisi Cai, Peihao Wang, pragsri8 | Structured State Space Models (SSMs) are investigated for their limitations in capturing long-range dependencies. The research aims to understand and mitigate bottlenecks in SSMs, focusing on recency bias and over-smoothing. A novel polarization technique, modifying state transition matrices, is proposed and empirically evaluated. Results show that polarization consistently improves associative recall accuracy of long-range tokens (e.g., a 93.43% average accuracy in one experiment), unlocking the benefits of deeper architectures in SSMs. This work highlights the inherent limitations of SSMs regarding recency and over-smoothing, directly impacting their scalability and robustness for long sequence processing and suggesting design modifications for improved performance. |
SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration (Read more on arXiv or HuggingFace) | Ceyuan Yang, Yang Zhao, Meng Wei, Zhijie Lin, Jianyi Wang | SeedVR: a novel diffusion transformer for generic video restoration. The research objective was to develop a diffusion transformer model capable of handling real-world video restoration with arbitrary length and resolution. The key methodology involved a shifted window attention mechanism within a diffusion transformer, a causal video variational autoencoder (CVVAE) for efficient compression, and a multi-stage progressive training strategy. SeedVR demonstrated impressive restoration capabilities; for example, it outperformed existing methods on several benchmark datasets, achieving a 10.508 DOVER score on the SPMCS dataset. The most impactful finding, relevant for AI practitioners, is SeedVR's superior efficiency compared to existing diffusion-based video restoration approaches, achieving over 2x faster inference speed despite having a larger parameter count. The details regarding the comparison of training time are unclear. |
SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization (Read more on arXiv or HuggingFace) | Haozhou Sun, Zihan Jia, Zhenbang Xu, Haodong Chen, Yongle Huang | SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization proposes a novel semi-supervised learning framework for fine-grained action recognition. The research objective is to develop a robust method for fine-grained action recognition using limited labeled data, addressing challenges inherent in existing large language models. The methodology incorporates dual-level temporal element modeling, moderate temporal perturbation as a strong augmentation strategy, and adaptive regulation to stabilize the learning process. SeFAR achieves state-of-the-art performance on fine-grained datasets, outperforming other methods by margins such as 7.8% to 8.4% increase in accuracy on FineDiving depending on the labeling rate. This research demonstrates a significant improvement in semi-supervised fine-grained action recognition and provides AI practitioners with a novel framework applicable to vision-based tasks involving nuanced temporal dynamics and limited data. |
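For context on the Elo ratings referenced in the CodeElo entry above, the sketch below shows the standard Elo expected-score and update rule, purely to make the rating scale concrete. CodeElo aligns with Codeforces' own rating system, which differs in detail, and the 1578-vs-1400 comparison is an arbitrary illustration rather than a figure from the paper.

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against an opponent rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """One standard Elo update for player A; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return r_a + k * (score_a - elo_expected_score(r_a, r_b))

# Arbitrary illustration: a 1578-rated entrant vs. a 1400-rated opponent under standard Elo.
print(round(elo_expected_score(1578, 1400), 2))  # ~0.74 expected score
```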
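The Dynamic Scaling of Unit Tests entry above selects code candidates using generated unit tests. Below is a minimal stand-in that simply picks the candidate passing the most tests; the paper's actual framework uses majority voting over test outcomes with a learned test generator (CodeRM-8B) and difficulty-based scaling, none of which is modeled here, and all names in the sketch are hypothetical.

```python
from typing import Callable, List, Tuple

def select_by_unit_tests(candidates: List[Callable], tests: List[Tuple[tuple, object]]) -> Callable:
    """Return the candidate program that passes the most unit tests.
    Each test is (args, expected_output); a crashing candidate simply fails that test."""
    def tests_passed(fn: Callable) -> int:
        passed = 0
        for args, expected in tests:
            try:
                if fn(*args) == expected:
                    passed += 1
            except Exception:
                pass
        return passed
    return max(candidates, key=tests_passed)

# Toy usage: two hypothetical LLM candidates for an absolute-value function.
cand_a = lambda x: x if x >= 0 else -x
cand_b = lambda x: x  # buggy for negative inputs
tests = [((3,), 3), ((-2,), 2), ((0,), 0)]
assert select_by_unit_tests([cand_a, cand_b], tests) is cand_a
```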
Title | Authors | Summary |
---|---|---|
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (Read more on arXiv or HuggingFace) | Yian Wang, Chuanyang Jin, Kanzhi Cheng, heroding77, QiushiSun | OS-Genesis is a novel pipeline that automates the generation of high-quality trajectory data for training GUI agents without human supervision or predefined tasks. The main research question is how to automatically construct diverse and high-quality GUI agent trajectories to improve their performance on complex computer tasks. The key methodology is a reverse task synthesis process involving interaction-driven exploration of GUI environments to collect state-action triplets, followed by the generation of low-level and high-level instructions using an annotation model and a trajectory reward model to ensure data quality. The primary result is that agents trained with OS-Genesis showed significant performance improvements on online benchmarks, such as achieving a 17.41% success rate on AndroidWorld compared to 9.82% for the self-instruction baseline. The principal implication for AI practitioners is that OS-Genesis provides an effective method for generating high-quality training data for GUI agents, which can significantly improve their ability to automate complex real-world computer tasks, particularly in dynamic environments. |
Xmodel-2 Technical Report (Read more on arXiv or HuggingFace) | Jiang Ling, Qu Zhijiu, Lin Qingquan, Liu Yang, valeriaWong | Xmodel-2 is a 1.2 billion-parameter language model designed for reasoning tasks, emphasizing efficiency and performance. The main research question is how to optimize a language model for complex reasoning while maintaining low training costs and efficiency. The key methodology involves using the Warmup-Stable-Decay (WSD) learning rate scheduler, optimizing data ratios during the decay phase of training, and employing an architecture that allows different model scales to share a unified set of hyperparameters. The primary results show that Xmodel-2 achieves state-of-the-art performance among 1B-parameter models in complex reasoning tasks, with an average score of 39.62 on complex reasoning benchmarks (GSM8K, MATH, BBH, MMLU, HumanEval, and MBPP). The principal implication for AI practitioners is that Xmodel-2 provides a strong, efficient model for reasoning tasks, demonstrating the effectiveness of the WSD learning rate scheduler and data ratio optimization in enhancing model performance. |
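Xmodel-2's training relies on a Warmup-Stable-Decay (WSD) learning-rate schedule. The function below is a generic sketch of that schedule shape (linear warmup, flat plateau, decay to a floor); the warmup/decay fractions, the linear decay form, and all parameter names are illustrative assumptions, not Xmodel-2's reported settings.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.05, decay_frac: float = 0.1, min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay schedule: linear warmup to peak_lr, a long flat plateau, then decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < stable_end:
        return peak_lr
    frac = min(1.0, (step - stable_end) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * frac

# Sample the schedule every 1,000 steps of a hypothetical 10,000-step run.
print([round(wsd_lr(s, 10_000, 3e-4), 6) for s in range(0, 10_001, 1_000)])
```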
Title | Authors | Summary |
---|---|---|
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization (Read more on arXiv or HuggingFace) | Tao Yuan, Yuxin Song, Yifan Sun, Xiu-Shen Wei, axxkaya | The paper introduces Explanatory Instructions, a method for defining computer vision (CV) tasks through natural language descriptions of transformations between input and output images, to improve zero-shot generalization. The main research question is whether Explanatory Instructions can enable vision-language models (VLMs) to genuinely understand and generalize to unseen CV tasks. The key methodology involves constructing a dataset (DECVT) with 12 million triplets of "image input → explanatory instruction → output" and training an auto-regressive-based VLM on these instructions. The primary results show that the trained model achieved instruction-level zero-shot capabilities and promising task-level zero-shot capabilities on certain tasks; for instance, it achieved an F1 score of 20.69 on the zero-shot Canny-to-Image task using the MultiGen-20M dataset. The principal implication for AI practitioners is that Explanatory Instructions can enhance VLMs' ability to perform novel vision tasks without explicit training, although the model's task-level zero-shot generalization ability remains unstable and requires further development. |
On the Compositional Generalization of Multimodal LLMs for Medical Imaging (Read more on arXiv or HuggingFace) | Yonglin Deng, Weihong Wang, Rongsheng Wang, Junying Chen, Zhenyang Cai | This paper investigates the compositional generalization (CG) capabilities of Multimodal Large Language Models (MLLMs) for medical imaging. The main research question is whether MLLMs can leverage CG to understand unseen medical images by recombining learned elements (Modality, Anatomical area, and Task). The key methodology involved constructing a dataset called Med-MAT from 106 medical datasets, defining the MAT-Triplet, and evaluating MLLMs' ability to generalize to unseen combinations of these elements through multi-task training and controlled variable experiments. A primary result is that MLLMs trained on multiple tasks achieved 96% accuracy on subset 02 in the in-distribution dataset, significantly outperforming single-task training and demonstrating the effectiveness of CG. The principal implication for AI practitioners is that leveraging CG in MLLMs by training with diverse datasets sharing MAT-Triplets can significantly enhance the models' ability to understand and generalize to unseen medical images, which has a direct impact on the development of robust medical imaging applications. |
Bringing Objects to Life: 4D generation from 3D objects (Read more on arXiv or HuggingFace) | Gal Chechik, Dvir Samuel, Ori Malca, Ohad Rahamim | This paper introduces 3to4D, a novel method for generating 4D content from static 3D objects and text prompts. The main research question is how to animate user-provided 3D objects while maintaining their identity and adhering to textual prompts that describe the desired motion. The key methodology involves first converting a 3D mesh into a static 4D Neural Radiance Field (NeRF), then animating it using an Image-to-Video diffusion model conditioned on the initial object and text prompt, with an incremental viewpoint selection protocol and masked Score Distillation Sampling (SDS) loss for improved motion realism. The primary results show that 3to4D outperforms baseline methods, achieving a threefold improvement in identity preservation measured using LPIPS scores (15.0 ±0.1 for 3to4D vs. 44.3 ± 0.2 for the best-performing baseline). The principal implication for AI practitioners is that 3to4D provides a method for creating custom 4D animations from existing 3D assets, leveraging text prompts to guide the desired motion while preserving the original object's visual characteristics. |
Efficiently Serving LLM Reasoning Programs with Certaindex (Read more on arXiv or HuggingFace) | Zhongdongming Dai, Zheyu Fu, Siqi Zhu, Junda Chen, Yichao Fu | Dynasor is a system designed to optimize inference-time compute for Large Language Model (LLM) reasoning queries by dynamically allocating resources based on model certainty. The main research question is how to efficiently serve LLM reasoning programs that refine outputs by exploring multiple solution paths. The key methodology involves tracking and scheduling requests within reasoning queries using certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide compute allocation dynamically. Dynasor reduces compute by up to 50% in batch processing and sustains 3.3x higher query rates or 4.7x tighter latency SLOs in online serving compared to prior state-of-the-art systems. The principal implication for AI practitioners is that Dynasor enables more efficient deployment of LLM reasoning algorithms in real-world applications by optimizing resource use and improving response times. |
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization (Read more on arXiv or HuggingFace) | Rafael Valle, Ambuj Mehrish, Zhifeng Kong, Navonil Majumder, Chia-Yu Hung | TangoFlux is a text-to-audio model that uses flow matching and CLAP-ranked preference optimization for fast and high-quality audio generation. The main research objective is to develop an efficient text-to-audio (TTA) generative model that addresses the challenges of aligning TTA models due to the difficulty of creating preference pairs. The key methodology used is CLAP-Ranked Preference Optimization (CRPO), which iteratively generates and optimizes preference data using a CLAP model as a proxy reward model. The primary results show that TangoFlux achieves state-of-the-art performance with a CLAP score of 0.480 and an FD score of 75.1 in just 3.7 seconds using 515M parameters. The principal implication for AI practitioners is that TangoFlux provides a fast and efficient method for generating high-quality audio with fewer trainable parameters, which can be particularly useful in scenarios where inference time and computational resources are constrained. |
Edicho: Consistent Image Editing in the Wild (Read more on arXiv or HuggingFace) | Ceyuan Yang, Qiuyu Wang, Yinghao Xu, Hao Ouyang, Qingyan Bai | The paper introduces Edicho, a training-free method for consistent image editing across multiple images using diffusion models. The main research question is how to achieve consistent image editing across diverse in-the-wild images without requiring training. The key methodology involves leveraging pre-estimated explicit image correspondence to guide a modified attention mechanism and classifier-free guidance during the denoising process of diffusion models. The primary results show that Edicho achieves a text alignment score of 0.3228 and an editing consistency score of 0.9355 in global image editing tasks, outperforming existing methods. For AI practitioners, Edicho offers a plug-and-play solution for consistent image editing that can be integrated with existing diffusion-based editing models, enabling applications like generating consistent image sets and 3D reconstruction of edits. |
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs (Read more on arXiv or HuggingFace) | Jianhui Pang, Zhiwei He, Tian Liang, Jiahao Xu, Xingyu Chen | This paper investigates the phenomenon of "overthinking" in o1-like large language models (LLMs), where these models expend excessive computational resources on simple tasks. The main research question is how to quantify and mitigate overthinking in o1-like LLMs during inference. The key methodology involves analyzing solution distributions and proposing outcome and process efficiency metrics, alongside self-training strategies to optimize response generation. A primary result is that the o1-like model QwQ-32B-Preview used 1,953% more tokens than conventional models for the simple query "what is the answer of 2 plus 3?". The principal implication for AI practitioners is the need to optimize inference efficiency in o1-like LLMs by addressing overthinking, potentially reducing computational overhead without compromising accuracy using methods like self-training with response simplification. |
Facilitating large language model Russian adaptation with Learned Embedding Propagation (Read more on arXiv or HuggingFace) | Daniil Chernyshev, RefalMachine | This paper introduces Learned Embedding Propagation (LEP) as a cost-effective method for adapting large language models (LLMs) to new languages, specifically Russian, without full retraining. The main research objective is to address the limitations of language adaptation posed by restricted access to high-quality instruction-tuning data and the computational expense of full LLM retraining. The key methodology involves training a new tokenization vocabulary, initializing new embeddings by averaging existing ones, and then propagating these embeddings to an instruction-tuned model using linear transformations derived from fine-tuned variants. The primary results show that LEP applied to LLaMa-3-8B and Mistral-7B achieves competitive performance levels, with the LEP-Extended variant of OpenChat 3.5 achieving a Micro-Avg score of 0.632 on the Darumeru benchmark after calibration. For AI practitioners, the principal implication is that LEP offers a viable and efficient alternative to traditional language-specific instruction-tuning, significantly reducing the costs associated with language adaptation while maintaining or surpassing existing performance benchmarks. |
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System (Read more on arXiv or HuggingFace) | Mengshu Sun, Lin Yuan, Kangwei Liu, Xiangyuan Ru, Yujie Luo | OneKE is a dockerized, schema-guided, large language model (LLM) agent-based knowledge extraction system designed for diverse data types and domains. The main research objective is to develop a comprehensive system that can extract knowledge from various data sources following complex schemas and handle debugging/error correction effectively. The key methodology involves a multi-agent design with a configurable knowledge base, utilizing Schema, Extraction, and Reflection Agents to process data, extract information, and refine results, respectively. The primary results show that using the Case Retrieval method, the Extraction Agent achieved significant performance improvements on both CrossNER and NYT-11-HRL datasets, with F1 scores increasing substantially compared to the vanilla method. The principal implication for AI practitioners is that OneKE provides a flexible and adaptable framework for knowledge extraction tasks, supporting various LLMs and data formats without requiring fine-tuning, while the Case Repository enables continuous improvement through error correction. |
Slow Perception: Let's Perceive Geometric Figures Step-by-step (Read more on arXiv or HuggingFace) | Liang Zhao, Jia Wang, Yumeng Li, Youyang Yin, Haoran Wei | The paper introduces "Slow Perception," a novel approach for parsing geometric figures in images by mimicking human-like gradual perception. Main research question or objective: How to improve the accuracy of geometric figure parsing in images by Large Vision Language Models (LVLMs)? Key methodology used: The authors propose a two-stage "Slow Perception" (SP) framework: a) perception decomposition, breaking down complex figures into basic units (points and lines); and b) perception flow, using a "perceptual ruler" to trace lines stroke-by-stroke, avoiding "long visual jumps." Primary results: SP improves the F1-score of geometric parsing by 6.1% over the baseline when using a perceptual ruler length of 4 in the test set. Slow perception also exhibits an inference time scaling law, where shorter perceptual ruler lengths lead to longer inference times but improved performance. Principal implication for AI practitioners: AI practitioners can leverage the slow perception framework to enhance the accuracy of geometric figure parsing, particularly in applications requiring precise spatial reasoning, and this framework may offer a new pathway to better performance in other visual tasks. |
PERSE: Personalized 3D Generative Avatars from A Single Portrait (Read more on arXiv or HuggingFace) | Hanbyul Joo, Inhee Lee, Hyunsoo Cha | PERSE is a method for creating animatable 3D avatars from a single portrait image with controllable facial attributes. The main research question is how to build a 3D personalized generative avatar from a single reference portrait image that allows for continuous and disentangled control over various facial attributes while preserving the individual's identity. The key methodology involves synthesizing large-scale 2D video datasets with facial attribute editing, and training a 3D Gaussian Splatting-based avatar model with a novel latent space regularization technique using interpolated 2D faces as supervision. The primary result is that PERSE generates high-quality avatars with an FID score of 214.46 on interpolated renderings. The principal implication for AI practitioners is that PERSE provides a novel approach for creating personalized 3D avatars with controllable attributes from a single image, offering a valuable tool for applications in VR/AR environments. |
Training Software Engineering Agents and Verifiers with SWE-Gym (Read more on arXiv or HuggingFace) | Navdeep Jaitly, Graham Neubig, Xingyao Wang, alsuhr, Jiayi-Pan | SWE-Gym is a new benchmark for evaluating software engineering agents on real-world coding tasks. The main research objective is to develop and assess a training environment, SWE-Gym, for improving the performance of language model-based software engineering agents. The key methodology involves fine-tuning language models on agent trajectories sampled from SWE-Gym and employing verifiers trained on these trajectories for inference-time scaling. Primary results show that fine-tuning on SWE-Gym improves agents' performance, achieving a 32.0% resolve rate on the SWE-Bench Verified test set. The principal implication for AI practitioners is that SWE-Gym can be used to train and improve software engineering agents through scalable learning methods. |
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation (Read more on arXiv or HuggingFace) | Xiao-Ping Zhang, Arman Cohan, Yilun Zhao, Zhaojian Yu | The paper introduces HumanEval Pro and MBPP Pro, benchmarks for evaluating large language models (LLMs) on self-invoking code generation tasks. The main research question is how well LLMs can generate code that solves a complex problem by invoking their own solution to a related, simpler base problem. The key methodology involves generating new, more complex versions of existing benchmarks (HumanEval and MBPP) by creating self-invoking problems that require using the solution of a base problem and evaluating over twenty LLMs using metrics like pass@1. The primary result is that most LLMs experience a significant performance drop on self-invoking tasks compared to traditional code generation; for example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. The principal implication for AI practitioners is that current LLMs, while proficient in generating code for isolated tasks, still struggle with more complex, multi-step reasoning required for self-invoking code generation, highlighting a crucial area for further development in code-generating models. |
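To make the "self-invoking" setup and the pass@1 metric above concrete, here is a toy base/self-invoking problem pair (invented for illustration, not drawn from HumanEval Pro or MBPP Pro) together with the standard unbiased pass@k estimator, of which pass@1 is the special case k = 1.

```python
from math import comb

# Invented toy pair showing the *shape* of a self-invoking problem; not taken from the benchmarks.
def count_vowels(s: str) -> int:
    """Base problem: count the vowels in a string."""
    return sum(ch in "aeiouAEIOU" for ch in s)

def count_vowels_per_word(words: list) -> list:
    """Self-invoking problem: the harder task is solved by calling the base solution."""
    return [count_vowels(w) for w in words]

assert count_vowels_per_word(["hello", "sky"]) == [2, 0]

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: the chance that at least one of k samples,
    drawn from n generations of which c are correct, is correct (pass@1 reduces to c / n)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

assert abs(pass_at_k(20, 5, 1) - 0.25) < 1e-9
```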
Title | Authors | Summary |
---|---|---|
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization (Read more on arXiv or HuggingFace) | Tao Yuan, Yuxin Song, Yifan Sun, Xiu-Shen Wei, axxkaya | The research introduces Explanatory Instructions, a novel approach for defining computer vision tasks through linguistic descriptions, to improve zero-shot generalization in vision-language models. The main research objective is to enable vision-language models to genuinely understand and generalize to unseen vision tasks by using detailed linguistic transformations from input to output images. The key methodology involves creating a dataset (DECVT) with 12 million "image input → explanatory instruction → output" triplets and training an auto-regressive-based vision-language model (AR-based VLM) on this dataset. The primary results show that the trained model achieved instruction-level zero-shot capabilities and demonstrated promising vision task-level zero-shot generalization, with the model achieving a 20.69 F1 score on the Canny-to-Image task using unseen instructions. The principal implication for AI practitioners is that Explanatory Instructions can enhance the adaptability of vision-language models, allowing them to perform unseen tasks without task-specific fine-tuning, although the paper notes that the model's task-level zero-shot ability is still limited and unstable. |
On the Compositional Generalization of Multimodal LLMs for Medical Imaging (Read more on arXiv or HuggingFace) | Yonglin Deng, Weihong Wang, Rongsheng Wang, Junying Chen, Zhenyang Cai | This paper investigates compositional generalization (CG) in multimodal large language models (MLLMs) for medical imaging analysis. The main research question is whether MLLMs can leverage CG to understand unseen medical images by recombining learned elements (Modality, Anatomical area, and Task). The key methodology involved constructing a dataset called Med-MAT from 106 medical datasets, defining image elements by MAT-Triplet, and conducting experiments to assess model performance on unseen combinations. A primary result is that MLLMs trained on combinations sharing the same MAT-Triplet demonstrated successful generalization, with the model achieving 91% accuracy on the X-ray, Brain dataset when trained on combinations like CT, Brain(State) and X-ray, Bones. The principal implication for AI practitioners is that CG can be used by MLLMs for medical imaging analysis, which is a way to understand unseen medical images and improve generalization in multi-task training scenarios involving medical image data. |
Efficiently Serving LLM Reasoning Programs with Certaindex (Read more on arXiv or HuggingFace) | Zhongdongming Dai, Zheyu Fu, Siqi Zhu, Junda Chen, Yichao Fu | Dynasor is a system designed to optimize inference-time compute for large language model (LLM) reasoning queries. The main research question is how to effectively schedule and allocate inference compute for LLM reasoning programs that generate multiple outputs for a single query. The key methodology is using "certaindex," a proxy for statistical reasoning progress based on model certainty, to dynamically guide compute allocation and co-adapt scheduling with reasoning progress (a toy certainty-based allocation sketch follows this table). Dynasor reduces compute by up to 50% in batch processing and sustains 3.3 times higher query rates or 4.7 times tighter latency SLOs in online serving compared to existing systems. The principal implication for AI practitioners is that using certaindex to dynamically allocate resources for LLM reasoning tasks can significantly improve efficiency and meet latency targets without sacrificing accuracy. |
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization (Read more on arXiv or HuggingFace) | Rafael Valle, Ambuj Mehrish, Zhifeng Kong, Navonil Majumder, Chia-Yu Hung | TangoFlux is a text-to-audio model that uses flow matching and CLAP-Ranked Preference Optimization for fast and high-quality audio generation. The main research objective is to develop an efficient text-to-audio (TTA) model that addresses the challenges of controllability and preference alignment in audio generation. The key methodology involves a rectified flow-based model trained with CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference pairs using a CLAP model as a proxy reward model. Primary results show that TangoFlux achieves a CLAP score of 0.480 and an FD score of 75.1 in 3.7 seconds using 50 steps, outperforming other models in objective evaluations and aligning well with human preferences. The principal implication for AI practitioners is that TangoFlux provides a highly efficient and effective solution for generating high-quality, text-aligned audio, making it a valuable tool for practical applications where inference speed and audio quality are critical. |
Edicho: Consistent Image Editing in the Wild (Read more on arXiv or HuggingFace) | Ceyuan Yang, Qiuyu Wang, Yinghao Xu, Hao Ouyang, Qingyan Bai | Edicho is a training-free method for consistent image editing across multiple in-the-wild images. The main research objective is to achieve consistent edits across diverse images without requiring paired training data or optimization. The key methodology involves using explicit image correspondence to guide the self-attention mechanism and classifier-free guidance during the denoising process of diffusion models. Primary results demonstrate that Edicho achieves a text alignment score of 0.3228 and an editing consistency score of 0.9355 in global editing tasks, outperforming other methods. For AI practitioners, Edicho offers a plug-and-play solution for consistent image editing that can be integrated with existing diffusion-based editing models, enabling applications like generating coherent visual narratives and maintaining characteristics in marketing materials. |
Bringing Objects to Life: 4D generation from 3D objects (Read more on arXiv or HuggingFace) | Gal Chechik, Dvir Samuel, Ori Malca, Ohad Rahamim | 3to4D generates 4D content from static 3D objects and text prompts. The main research question is how to generate 4D content (dynamic 3D objects) from user-provided 3D assets and text prompts while maintaining the object's identity. The key methodology involves first converting a 3D mesh into a static 4D Neural Radiance Field (NeRF), then animating it using an Image-to-Video diffusion model guided by text, employing incremental viewpoint selection and masked Score Distillation Sampling (SDS) loss for improved motion realism. The primary results show that 3to4D outperforms baseline methods, achieving a threefold improvement in identity preservation measured using LPIPS scores (15.0 ±0.1 for 3to4D vs 44.3 ± 0.2 for the next best method). The principal implication for AI practitioners is that 3to4D provides a more effective method for generating customized 4D content from existing 3D models compared to adapting existing text-to-4D or image-to-4D methods. |
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation (Read more on arXiv or HuggingFace) | Xiao-Ping Zhang, Arman Cohan, Yilun Zhao, Zhaojian Yu | The paper introduces HumanEval Pro and MBPP Pro, benchmarks for evaluating large language models (LLMs) on self-invoking code generation tasks. The main research objective is to assess LLMs' ability to solve a base problem and then utilize that solution to address a more complex, related problem. The key methodology involves generating new, more challenging versions of existing benchmarks (HumanEval and MBPP) using Deepseek-V2.5, then manually reviewing and refining them. The primary result is that most LLMs experience a significant performance drop on self-invoking tasks compared to traditional code generation; for instance, the o1-mini model achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. The principal implication for AI practitioners is that current LLMs, while proficient in isolated code generation, struggle with tasks requiring progressive reasoning and self-invoking code, highlighting a need for further research in this area. |
Facilitating large language model Russian adaptation with Learned Embedding Propagation (Read more on arXiv or HuggingFace) | Daniil Chernyshev, RefalMachine | This paper introduces Learned Embedding Propagation (LEP) as a cost-effective method for adapting large language models (LLMs) to new languages, specifically Russian, while preserving original model knowledge. The main research objective is to address the limitations of language adaptation posed by restricted access to high-quality instruction-tuning data. The key methodology involves training new token embeddings and propagating them to an instruction-tuned LLM using linear transformations derived from parameter decomposition, bypassing the need for full instruction-tuning. The primary results show that LEP applied to LLaMa-3-8B and Mistral-7B achieves competitive performance with OpenChat 3.5, with the LEP-Extended model achieving a Micro-Avg score of 0.632 after calibration. The principal implication for AI practitioners is that LEP offers a viable alternative to traditional language-specific instruction-tuning, reducing costs associated with language adaptation while maintaining or surpassing performance benchmarks. |
Training Software Engineering Agents and Verifiers with SWE-Gym (Read more on arXiv or HuggingFace) | Navdeep Jaitly, Graham Neubig, Xingyao Wang, alsuhr, Jiayi-Pan | SWE-Gym is a new benchmark for training software engineering agents that can solve real-world GitHub issues. The main research objective is to create an environment for training and evaluating language-model-based software engineering agents using real-world Python tasks. The key methodology involves constructing SWE-Gym, containing 2,438 Python tasks with executable runtime environments, unit tests, and natural language task specifications, and using it to train agents via policy improvement algorithms like rejection sampling, fine-tuning and inference-time scaling through verifiers. The primary result is that fine-tuned models achieved up to 19% absolute gains in resolve rate on SWE-Bench Verified and Lite test sets. The principal implication for AI practitioners is that SWE-Gym enables the development of more capable software engineering agents by providing a realistic and scalable training environment with executable feedback. |
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System (Read more on arXiv or HuggingFace) | Mengshu Sun, Lin Yuan, Kangwei Liu, Xiangyuan Ru, Yujie Luo | OneKE is a dockerized system for knowledge extraction that uses LLM-based agents and a configurable knowledge base. The main research objective is to develop a comprehensive system for knowledge extraction that can handle diverse data types, complex schemas, and improve through error debugging. The key methodology involves using three agents (Schema Agent, Extraction Agent, and Reflection Agent) with a configurable knowledge base consisting of a Schema Repository and Case Repository to support schema analysis, knowledge extraction, and error handling. The primary results show that the Case Retrieval method improved performance on both CrossNER and NYT-11-HRL datasets, with F1 scores increasing from approximately 40 to over 60 on CrossNER when using the LLaMA-3-8B-Instruct model. The principal implication for AI practitioners is that OneKE provides a flexible framework for knowledge extraction tasks without requiring model fine-tuning, allowing for easier adaptation to various domains and data formats, although it's unclear how performance compares to other fine-tuned methods. |
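The Dynasor entry above allocates compute using "certaindex," a certainty-based progress proxy. The sketch below shows one simple certainty proxy (agreement entropy over sampled answers) driving an early-stopping loop; it is a hypothetical stand-in whose names and thresholds are mine, and the paper's certaindex and scheduler operate differently at the serving-system level.

```python
import math
import random
from collections import Counter

def answer_certainty(answers: list) -> float:
    """Agreement-based certainty: 1 minus the normalized entropy of the sampled-answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    probs = [c / n for c in counts.values()]
    ent = -sum(p * math.log(p) for p in probs)
    max_ent = math.log(n) if n > 1 else 1.0
    return 1.0 - ent / max_ent

def adaptive_self_consistency(sample_fn, max_samples: int = 16, threshold: float = 0.8, check_every: int = 4):
    """Spend more samples only while certainty stays low; return the majority answer and samples used."""
    answers = []
    for i in range(1, max_samples + 1):
        answers.append(sample_fn())  # one reasoning path -> one final answer
        if i % check_every == 0 and answer_certainty(answers) >= threshold:
            break
    return Counter(answers).most_common(1)[0][0], len(answers)

# Toy usage with a fake sampler that answers "42" most of the time.
answer, used = adaptive_self_consistency(lambda: "42" if random.random() < 0.85 else "7")
print(answer, used)
```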
Title | Authors | Summary |
---|---|---|
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (Read more on arXiv or HuggingFace) | Wanlong Liu, Xidong Wang, Ke Ji, Zhenyang Cai, Junying Chen | Here is a concise summary of the research paper "HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs": i) The paper introduces HuatuoGPT-o1, a medical large language model (LLM) designed to enhance complex reasoning in the medical domain using verifiable medical problems and a two-stage training approach. ii) The main research objective is to develop an LLM capable of performing complex medical reasoning verifiable through objective ground-truth answers. iii) The key methodology involves a two-stage approach: (1) using a verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, and (2) applying reinforcement learning (RL) with verifier-based rewards to enhance reasoning. iv) The primary result is that the 70B parameter version of HuatuoGPT-o1 outperformed other open-source general and medical-specific LLMs across multiple medical benchmarks, achieving an average score of 73.4. v) The principal implication for AI practitioners is that using verifiable problems and a two-stage training process (fine-tuning with complex reasoning trajectories followed by RL with verifier feedback) can significantly enhance the complex reasoning abilities of LLMs in specialized domains like medicine. |
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models (Read more on arXiv or HuggingFace) | Hengshuang Zhao, Chao Du, Tianyu Pang, Ziang Zhang, Zehan Wang | Here is a concise summary of the research paper "Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models": i) Summary: This paper introduces Orient Anything, a novel model for estimating the 3D orientation of objects in single- and free-view images by learning from a dataset of rendered 3D models. ii) Main research question or objective: How can a robust and generalizable model be developed to accurately estimate object orientation in images, overcoming the scarcity of labeled training data? iii) Key methodology: A pipeline was developed to annotate the front face of 3D objects and render 2 million images from random views; the model is trained to predict 3D orientation by fitting probability distributions of three angles, incorporating strategies for synthetic-to-real transfer. iv) Primary results: Orient Anything achieves state-of-the-art accuracy in orientation estimation on both rendered and real images; specifically, it achieved 73.94% accuracy in predicting the azimuth of objects in rendered images. v) Principal implication for AI practitioners: AI practitioners can leverage Orient Anything as a foundational tool for tasks requiring accurate object orientation estimation, such as enhancing spatial reasoning in vision-language models and improving the generation of images with specific object poses. |
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment (Read more on arXiv or HuggingFace) | Kunchang Li, Chenting Wang, Yinan He, Zhilin Li, Ziang Yan | Here is a concise summary of the research paper "Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment": i) This paper introduces Task Preference Optimization (TPO), a novel method to enhance multimodal large language models (MLLMs) by aligning them with fine-grained visual tasks. ii) The main research objective is to improve MLLMs' fine-grained visual understanding and performance on specific visual tasks without compromising their general multimodal capabilities. iii) The key methodology is the use of differentiable task preferences derived from visual tasks, learnable task tokens, and multi-task co-training of task-specific heads with the MLLM. iv) The primary result is that TPO improves the performance of VideoChat and LLaVA on multimodal benchmarks, achieving an overall 14.6% improvement in multimodal performance compared to baseline models. v) For AI practitioners, TPO provides a scalable method to enhance MLLMs with specialized visual perception skills, enabling the development of more robust and versatile multimodal AI systems. |
The Superposition of Diffusion Models Using the Itô Density Estimator (Read more on arXiv or HuggingFace) | Kirill Neklyudov, Alexander Tong, Avishek Joey Bose, Lazar Atanackovic, Marta Skreta | Here is a concise summary of the AI research paper: i) Summary: The paper introduces SUPERDIFF, a novel framework for combining pre-trained diffusion models during inference using a scalable Itô density estimator. ii) Main research question/objective: Can multiple pre-trained diffusion models be combined solely at inference in a theoretically sound and efficient manner? iii) Key methodology: SUPERDIFF leverages a new Itô density estimator for the log-likelihood of the diffusion SDE to enable superposition, combining models through an automated re-weighting scheme during inference. iv) Primary results: SUPERDIFF outperforms individual models on CIFAR-10, with a Feature Likelihood Divergence (FLD) of 5.33 ± 0.05 compared to 7.51 ± 0.11 for the best single model, and enables effective prompt-based image editing and de novo protein structure design. v) Principal implication for AI practitioners: AI practitioners can use SUPERDIFF to combine multiple pre-trained diffusion models without retraining, enabling efficient generation, improved performance, and novel applications like concept interpolation and protein design. |
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition (Read more on arXiv or HuggingFace) | Ji Li, Ting Liu, Danqing Huang, Shizhao Sun, Jiawei Lin | Here's a concise summary of the research paper: i) Summary: This paper introduces LaDeCo, a novel framework for automatic graphic design composition from multimodal elements using a layered approach. ii) Main research question/objective: How to automatically compose multimodal graphic elements into a cohesive and aesthetically pleasing design. iii) Key methodology: LaDeCo employs a layer planning module using GPT-4o to categorize elements and a layered design composition process that uses fine-tuned Large Multimodal Models (LMMs) to predict element attributes layer-by-layer, incorporating rendered images of previous layers as context. iv) Primary results: LaDeCo significantly outperforms baseline models in design composition, achieving an overall LLaVA-OV score of 8.08 compared to 5.34 for FlexDM and 6.53 for GPT-4o on the design composition task. v) Principal implication for AI practitioners: AI practitioners can leverage LaDeCo's layered approach and LMMs to build more effective and efficient automatic graphic design systems, enabling applications such as resolution adjustment, element filling, and design variation. |
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging (Read more on arXiv or HuggingFace) | Shang-Tse Chen, Saurav Sahay, Shachi H Kumar, Hsuan Su, farnhua | Here is a concise summary of the research paper: i) This paper proposes a method to mitigate safety degradation in fine-tuned large language models (LLMs) by merging the weights of pre- and post-fine-tuned models. ii) The main research question is how to improve downstream task performance while preserving safety in LLMs without relying on additional safety data. iii) The key methodology used is a two-step approach: fine-tuning the base model on a downstream task, then merging the base model with the fine-tuned model via weight interpolation (a generic interpolation sketch follows this table). iv) The primary result shows that merging the models significantly reduces the Attack Success Rate (ASR) across various downstream tasks; for instance, on the medical assistance task, the ASR is reduced by over 30%. v) For AI practitioners, this method offers a practical solution for adapting safety-aligned LLMs to downstream tasks while preserving their inherent safety features without requiring additional safety data. |
SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images (Read more on arXiv or HuggingFace) | Yoshitaka Ushiku, Tosho Hirasawa, Shohei Tanaka, Kuniaki Saito, Risa Shinoda | Here's a concise summary of the research paper "SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images": i) Summary: The paper introduces SBS Figures, a synthetic dataset for pre-training figure-based question-answering models, generated through a novel stage-by-stage pipeline. ii) Main research question/objective: The main objective is to develop a method for creating a large-scale, diverse, synthetic figure QA dataset to improve the performance of figure QA models. iii) Key methodology: A three-stage pipeline was used: (1) generate visualization target data, (2) render figures via Python code, and (3) generate QA pairs using LLMs, all progressively transforming seed data. iv) Primary results: Pre-training with SBS Figures improved the average accuracy on the ChartQA dataset by 6.42 points for the Pix2Struct model. v) Principal implication for AI practitioners: AI practitioners can use the SBS Figures dataset and pipeline to pre-train and fine-tune their models, enhancing performance on figure QA tasks without the need for manual annotation. |
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models (Read more on arXiv or HuggingFace) | Junfu Pu, Zhongang Qi, Xiaodong Cun, Yong Zhang, Tao Wu | Here is a concise summary of the research paper "VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models": i) Summary: VideoMaker is a framework for zero-shot customized video generation that leverages the inherent capabilities of video diffusion models (VDMs) for subject feature extraction and injection without requiring additional modules. ii) Main research question/objective: Can VDMs be utilized to extract and inject subject features for customized video generation without the need for external modules or extensive retraining? iii) Key methodology: The method uses the VDM itself to extract fine-grained subject features from a reference image and injects these features using a modified spatial self-attention mechanism within the VDM, along with a Guidance Information Recognition Loss. iv) Primary results: VideoMaker outperformed existing methods in customized human video generation, achieving a Face Similarity score of 0.8047 compared to the next best result of 0.7323 from ID-Animator. v) Principal implication for AI practitioners: AI practitioners can achieve high-quality, zero-shot customized video generation by fine-tuning the pre-trained VDM to activate the inherent force of video diffusion model, offering a more efficient alternative to existing methods that rely on external modules. |
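
The pre-/post-tuning model merging described above boils down to a per-parameter linear interpolation between the safety-aligned base weights and the task fine-tuned weights. Below is a minimal, hedged sketch of that idea in PyTorch; the `alpha` mixing coefficient, the `state_dict`-level loop, and the commented usage are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def merge_state_dicts(base_sd, tuned_sd, alpha=0.5):
    """Linearly interpolate two compatible state dicts.

    alpha = 0.0 returns the safety-aligned base model,
    alpha = 1.0 returns the downstream fine-tuned model.
    The 0.5 default is an illustrative assumption, not the paper's setting.
    """
    merged = {}
    for name, base_param in base_sd.items():
        tuned_param = tuned_sd[name]
        merged[name] = (1.0 - alpha) * base_param + alpha * tuned_param
    return merged

# Usage sketch (model loading elided): load the base and fine-tuned checkpoints,
# merge their weights, then load the merged weights back into the base architecture.
# base_model.load_state_dict(
#     merge_state_dicts(base_model.state_dict(), tuned_model.state_dict(), alpha=0.5)
# )
```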
Title | Authors | Summary |
---|---|---|
YuLan-Mini: An Open Data-efficient Language Model (Read more on arXiv or HuggingFace) | Jie Chen, Jiapeng Wang, Jia Deng, Huatong Song, Yiwen Hu | Here is a concise summary of the AI research paper "YuLan-Mini: An Open Data-efficient Language Model": i) YuLan-Mini is a 2.42B parameter language model designed for efficient pre-training, achieving high performance with limited data. ii) The main research objective was to develop a high-performing, small-scale language model using only publicly available data with a restricted compute budget, focusing on data efficiency and training stability. iii) Key methodologies used include an elaborate data pipeline with cleaning and scheduling, a robust optimization method to mitigate training instability using scaled initialization, and an annealing approach with targeted data selection and long-context training. iv) The primary result is that YuLan-Mini, trained on 1.08T tokens, achieved a score of 64.00 on the HumanEval (zero-shot) benchmark, comparable to industry-leading models. v) For AI practitioners, YuLan-Mini demonstrates that competitive language models can be developed with limited data and computational resources by focusing on data quality, optimization methods, and efficient training strategies. |
A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression (Read more on arXiv or HuggingFace) | Xinting Huang, Shuaiyi Li, Kelong Mao, Zhisong Zhang, ChenlongDeng | Here is a concise summary of the research paper: i) Summary: This paper investigates gist token-based context compression methods for improving long-context processing in large language models (LLMs). ii) Main research question/objective: To what extent can gist-based architectures replace full attention models, and what failure patterns arise from compression? iii) Key methodology: The authors propose a unified framework to categorize gist-based models and conduct experiments on language modeling, weakly context-dependent, and long-context tasks using Llama3-8B and Qwen2-7B models (a toy gist-attention mask sketch follows this table). iv) Primary results: Fine-grained KV cache architecture achieves near-lossless performance on many tasks, but struggles with tasks like synthetic recall; at a compression ratio of 4, Fine-KV achieves 40.6% accuracy on synthetic recall compared to full attention's 93.9%. v) Principal implication for AI practitioners: While gist token-based compression can effectively reduce computational costs for many tasks, practitioners should be aware of its limitations in tasks requiring precise token-level recall and explore the proposed mitigation strategies (fine-grained autoencoding and segment-wise token importance estimation) to enhance performance. |
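
To make the gist-token idea in the row above concrete, here is a toy sketch of the attention pattern such compression induces: raw tokens attend causally within their own segment, while earlier segments remain visible only through their appended gist tokens. The segment layout, block sizes, and mask convention are illustrative assumptions about the general scheme, not the paper's exact architecture.

```python
import numpy as np

def gist_attention_mask(num_segments: int, seg_len: int, num_gist: int) -> np.ndarray:
    """Boolean mask where mask[q, k] == True means query q may attend to key k.

    Layout assumption: each segment is `seg_len` raw tokens followed by
    `num_gist` gist tokens that summarize it.
    """
    block = seg_len + num_gist
    n = num_segments * block
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        q_seg = q // block
        for k in range(q + 1):                 # causal: keys up to and including q
            k_seg, k_off = divmod(k, block)
            if k_seg == q_seg:
                mask[q, k] = True              # full causal attention inside the current segment
            else:
                mask[q, k] = k_off >= seg_len  # past segments: only their gist tokens stay visible
    return mask

# Example: 3 segments of 8 raw tokens, each compressed into 2 gist tokens,
# i.e. a compression ratio of 4 as discussed in the row above.
m = gist_attention_mask(num_segments=3, seg_len=8, num_gist=2)
print(m.shape, "KV entries kept per past segment: 2 instead of 8")
```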
Title | Authors | Summary |
---|---|---|
Token-Budget-Aware LLM Reasoning (Read more on arXiv or HuggingFace) | Zhenyu Chen, Shiqing Ma, Shiyu Zhao, Chunrong Fang, Tingxu Han | Here is a concise summary of the paper "Token-Budget-Aware LLM Reasoning": i) Summary: This paper introduces TALE, a framework to reduce token redundancy in large language model (LLM) reasoning by dynamically estimating and incorporating token budgets into prompts. ii) Main research question or objective: How to effectively reduce token costs in Chain-of-Thought (CoT) reasoning while preserving LLM performance. iii) Key methodology: TALE estimates a token budget based on reasoning complexity and uses it to guide the LLM's reasoning process via a token-budget-aware prompt (a minimal prompt-construction sketch follows this table). iv) Primary results: TALE reduces token usage by 68.64% on average compared to vanilla CoT, with less than a 5% decrease in accuracy. v) Principal implication for AI practitioners: AI practitioners can use TALE to optimize token efficiency in LLM reasoning tasks, significantly reducing computational costs and resource usage while maintaining performance. |
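
TALE's core mechanism, as summarized above, is to estimate a token budget for a question and fold that budget into the prompt before the model reasons. The sketch below shows only that prompt-construction step; the `estimate_budget` heuristic and the instruction wording are assumptions made for illustration, and the actual LLM call is left abstract.

```python
def estimate_budget(question: str, base: int = 50, per_word: int = 2) -> int:
    """Crude complexity proxy: longer questions get a larger reasoning budget.
    Illustrative heuristic only; the paper derives budgets from reasoning complexity."""
    return base + per_word * len(question.split())

def budget_aware_prompt(question: str) -> str:
    """Fold the estimated budget into a CoT-style instruction."""
    budget = estimate_budget(question)
    return (
        f"{question}\n"
        f"Let's think step by step and use fewer than {budget} tokens for the reasoning."
    )

if __name__ == "__main__":
    q = "A train travels 120 km in 2 hours. What is its average speed?"
    print(budget_aware_prompt(q))
    # The resulting string would then be sent to the LLM in place of a vanilla CoT prompt.
```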
Title | Authors | Summary |
---|---|---|
DepthLab: From Partial to Complete (Read more on arXiv or HuggingFace) | Hao Ouyang, Shuzhe Wang, Qiuyu Wang, Ka Leong Cheng, Zhiheng Liu | Here's a summary of the research paper "DepthLab: From Partial to Complete" following your guidelines: i) Summary: DepthLab is a foundation model for RGB image-conditioned depth inpainting that leverages image diffusion priors to complete missing or occluded depth information. ii) Main research question or objective: To develop a robust and generalizable model for depth inpainting that preserves scale consistency and demonstrates resilience to depth-deficient regions. iii) Key methodology: A dual-branch depth inpainting diffusion framework is used, processing a reference image through a Reference U-Net for RGB feature extraction and integrating these features into an Estimation U-Net that handles depth and mask inputs. iv) Primary results: DepthLab achieved an AbsRel of 2.3 on the ScanNet dataset, outperforming other methods in numerical performance and visual quality across various downstream tasks. v) Principal implication for AI practitioners: AI practitioners can leverage DepthLab as a foundation model for various depth-related tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction, and LiDAR depth completion, without the need for extensive task-specific training. |
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding (Read more on arXiv or HuggingFace) | Dmitry Yudin, wingrune | Here's a summary of the AI research paper following your strict guidelines: i) 3DGraphLLM combines semantic graphs and large language models for improved 3D scene understanding in vision-language tasks. ii) The research objective was to develop a method for constructing a learnable representation of a 3D scene graph to improve the accuracy of LLMs in performing 3D vision-language tasks. The paper specifically focuses on solving 3D referred object grounding, 3D dense scene captioning, and 3D visual question answering. iii) The key methodology involved creating a learnable representation of a 3D scene graph using object embeddings and their semantic relationships, encoded as triplets, which were fed as input to a pre-trained LLM. The model uses VL-SAT for semantic relationship extraction and k-nearest neighbor selection to create the flat sequence of graph tokens. iv) 3DGraphLLM achieved a 5.8% improvement in F1@0.5 on the Multi3DRefer benchmark for 3D referred object grounding compared to a baseline. (Other quantitative results are presented, but this is one specific example) v) The significant finding, a substantial performance improvement on visual grounding with the integration of semantic relationships, directly implies that incorporating semantic graph structures into LLM inputs can substantially enhance 3D vision-language task performance. This suggests a valuable approach for AI practitioners developing embodied AI agents or systems requiring robust 3D scene understanding. |
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization (Read more on arXiv or HuggingFace) | Ning Ding, Kaiyan Zhang, Xingtai Lv, Che Jiang, Ermo Hua | Here is a concise summary of the research paper "Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization": i) Summary: This paper introduces Fourier Position Embedding (FoPE) to improve the length generalization of language models (LMs) by enhancing the frequency-domain properties of attention in Rotary Position Embedding (RoPE). ii) Main research question/objective: How to address the limitations of RoPE that hinder length generalization in language models. iii) Key methodology used: The authors use Discrete Signal Processing theory to analyze RoPE, identifying spectral damage as a key issue, and propose FoPE, which constructs Fourier Series and zero-outs destructive frequency components. iv) Primary results: FoPE maintains a more stable perplexity and achieves better accuracy in a needle-in-haystack task compared to RoPE and ALiBi; for example, FoPE achieved an accuracy of 100% on the Passkey Retrieval task with a sequence length of 512, while RoPE's accuracy dropped to nearly 0% at sequence length of 2048. v) Principal implication for AI practitioners: FoPE offers a method to enhance the length generalization of LMs without significant computational overhead, making it a valuable technique for AI/ML engineers and data scientists working with transformer-based models. |
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation (Read more on arXiv or HuggingFace) | Zhaoyang Zhang, Wenze Liu, Xiaoyu Li, Xiaodong Cun, Minghong Cai | Here's a summary of the AI research paper following your strict guidelines: i) DiTCtrl is a tuning-free method for generating coherent multi-prompt longer videos using a pre-trained Multi-Modal Diffusion Transformer (MM-DiT). ii) The research objective was to develop a training-free method for multi-prompt video generation capable of producing long videos with smooth transitions and accurate prompt following, overcoming limitations of existing single-prompt methods. iii) The key methodology involved analyzing the MM-DiT's attention mechanism, designing a KV-sharing mechanism and a latent blending strategy to achieve smooth transitions between video segments generated from sequential prompts. iv) DiTCtrl achieved state-of-the-art performance on the MPVBench benchmark, a new benchmark specifically designed for multi-prompt video generation. A specific quantitative result was not clearly presented, though the paper mentions state-of-the-art performance on CSCV metric. v) The most impactful finding is the development of a training-free method for multi-prompt video generation; this is highly relevant to AI practitioners as it allows leveraging existing pre-trained MM-DiT models for complex video generation tasks without requiring extensive retraining, reducing computational costs and data requirements. |
In Case You Missed It: ARC 'Challenge' Is Not That Challenging (Read more on arXiv or HuggingFace) | Borchmann | Here's a summary of the AI research paper: i) Summary: The paper challenges the established evaluation methodology for several multiple-choice question benchmarks, demonstrating that a seemingly simple change in setup dramatically impacts model performance and potentially misrepresents model capabilities. ii) Main research question or objective: To investigate the impact of different evaluation setups (separate vs. simultaneous presentation of answer choices) on the performance of large language models (LLMs) across multiple-choice question benchmarks. iii) Key methodology used: The authors compared LLM performance on established benchmarks (ARC, OpenBookQA, SIQA) using two evaluation setups: one presenting answer choices separately, and another presenting them simultaneously (a sketch of the two prompt formats follows this table). They then compared accuracy scores reported in the literature with their own replications under each setup. The paper does not explicitly detail all aspects of the model training or testing procedures used in its replications. iv) Primary results: Switching from presenting ARC Challenge answer choices separately to presenting them all at once increased Llama 3.1 70B accuracy from 64% to 93%. v) Principal implication for AI practitioners: The evaluation setup significantly influences performance metrics and model rankings on multiple-choice question benchmarks. AI practitioners should carefully evaluate the impact of the evaluation setup and potentially reconsider established protocols for existing benchmarks and future benchmark design. |
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models (Read more on arXiv or HuggingFace) | Jianyuan Wang, Tom Monnier, Iro Laina, Roman Shapovalov, Minghao Chen | Here is a concise summary of the research paper "PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models": i) Summary: PartGen is a novel method that generates or reconstructs 3D objects as compositions of meaningful parts, starting from text, images, or unstructured 3D objects. ii) Main research question/objective: How can we automatically segment a 3D object into its meaningful parts and reconstruct these parts in high quality, even when they are partially or fully occluded? iii) Key methodology: PartGen uses a two-stage approach employing multi-view diffusion models, first segmenting objects into parts by generating consistent 2D segmentation maps across multiple views, and then completing and reconstructing each part in 3D while considering the context of the entire object. iv) Primary results: PartGen outperforms segmentation baselines on a dataset of artist-created 3D assets, achieving a 59.3% mAP50 score for automatic segmentation with 10 samples, compared to 37.4% for a fine-tuned SAM2 model. v) Principal implication for AI practitioners: PartGen provides a method for generating structured 3D assets composed of complete, semantically meaningful parts, which is crucial for downstream applications like 3D editing, animation, and robotic manipulation that currently requires significant manual effort. |
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing (Read more on arXiv or HuggingFace) | Jun Zhu, Jianfei Chen, Ziteng Wang | Here is a summary of the AI research paper "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing" following your strict guidelines: i) One-line summary: This paper introduces ReMoE, a fully differentiable Mixture-of-Experts (MoE) model using ReLU routing to improve performance and scalability compared to traditional TopK routing. ii) Main research question/objective: How can the non-differentiable nature of TopK routing in MoE models be addressed to improve performance and scalability? iii) Key methodology: The authors propose ReMoE, replacing the TopK+Softmax routing mechanism with a ReLU-based router and introduce an adaptive L1 regularization for controlling sparsity and load balancing. iv) Primary results: ReMoE consistently outperforms TopK-routed MoE across various model sizes, expert counts, and levels of granularity; for example, on downstream tasks, ReMoE achieved a 40.03% average zero-shot accuracy compared to MoE's 38.20% on a specific configuration. v) Principal implication for AI practitioners: ReMoE offers a drop-in replacement for TopK routing in MoE models, enabling fully differentiable training and improved scalability, leading to potentially more efficient and performant large language models. The paper lacks clear details on the computational cost differences between ReMoE and standard MoE during training. |
SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval (Read more on arXiv or HuggingFace) | Divya Chaudhary, Vinija Jain, Aman Chadha, Vinesh Kumar Gande, Aakash Mahalingam | Here's a summary of the AI research paper following your strict guidelines: i) SKETCH enhances Retrieval-Augmented Generation (RAG) systems by integrating semantic text retrieval with knowledge graphs for improved text comprehension. ii) The research objective was to improve the efficiency and accuracy of RAG systems in processing large datasets while maintaining a comprehensive understanding of the context. iii) The key methodology involved a novel approach called SKETCH, which integrates semantic text chunking with knowledge graphs to merge structured and unstructured data for holistic comprehension. iv) SKETCH consistently outperformed baseline approaches on multiple datasets; notably, on the Italian Cuisine dataset, it achieved an answer relevancy of 0.94 and a context precision of 0.99. v) The significantly high answer relevancy and context precision (0.94 and 0.99 respectively) on the Italian Cuisine dataset demonstrates SKETCH's potential to improve the accuracy and contextual relevance of RAG systems, particularly beneficial for applications requiring precise and contextually rich information retrieval. The paper does not explicitly detail the implications for specific engineering or application tasks beyond this general finding. |
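
The ARC 'Challenge' finding above hinges entirely on how the multiple-choice prompt is framed: scoring each answer option in isolation versus showing all options at once. The sketch below builds both prompt variants for one toy question; the exact wording is an assumption, and the scoring step (e.g., by option log-likelihood or by predicted letter) is intentionally left out.

```python
QUESTION = "Which gas do plants absorb from the atmosphere during photosynthesis?"
OPTIONS = ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"]

def prompts_separate(question: str, options: list[str]) -> list[str]:
    """One prompt per option; each continuation is scored independently
    (the traditional ARC Challenge setup)."""
    return [f"Question: {question}\nAnswer: {opt}" for opt in options]

def prompt_simultaneous(question: str, options: list[str]) -> str:
    """All options shown together; the model picks a letter
    (the setup that raised Llama 3.1 70B accuracy from 64% to 93% in the paper)."""
    letters = "ABCD"
    lines = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\n{lines}\nAnswer with the letter of the correct option."

if __name__ == "__main__":
    for p in prompts_separate(QUESTION, OPTIONS):
        print(p, end="\n---\n")
    print(prompt_simultaneous(QUESTION, OPTIONS))
```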
Title | Authors | Summary |
---|---|---|
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners (Read more on arXiv or HuggingFace) | Zifei Shan, Yijun Wang, Lulu Zhao, Yuzhen Huang, Weihao Zeng | Here is a concise summary of the research paper "B-STAR: MONITORING AND BALANCING EXPLORATION AND EXPLOITATION IN SELF-TAUGHT REASONERS" based on your guidelines: i) This paper introduces B-STAR, a self-improvement framework for enhancing AI reasoning by dynamically balancing exploration and exploitation during iterative training. ii) The main research question is how to monitor and balance the model's ability to generate diverse, high-quality responses (exploration) and the effectiveness of external rewards in selecting the best responses (exploitation) during self-improvement. iii) The key methodology involves tracking exploration and exploitation metrics (e.g., Pass@K, Reward@K-S) and automatically adjusting configurations like sampling temperature and reward threshold to maximize a "balance score" that quantifies the interplay between these factors. iv) B-STAR achieved a Pass@1 score of 27.8 on the MATH dataset, outperforming the online RFT baseline, which achieved 23.2 in the same setting. v) For AI practitioners, B-STAR demonstrates that dynamically balancing exploration and exploitation during self-improvement is crucial for maximizing performance gains, particularly in complex reasoning tasks. |
RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response (Read more on arXiv or HuggingFace) | Zhiping Xiao, Jingyang Yuan, Xiao Luo, Junyu Luo, kaize0409 | Here's a concise summary of the research paper "ROBUSTFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response" following the specified guidelines: i) ROBUSTFT is a framework designed to improve the robustness of supervised fine-tuning for large language models (LLMs) when training data contains noisy responses. ii) Can LLMs detect inevitable noise and enhance data quality to improve their performance on target tasks? iii) The methodology involves a multi-expert collaborative system for noise detection, context-enhanced reasoning for data relabeling, and response entropy-based data selection. iv) ROBUSTFT demonstrated that with 30% noise in the training data, model performance deteriorates by 8.9% compared to the vanilla LLM baseline on the MMLU dataset. v) For AI practitioners, ROBUSTFT provides a method to enhance the performance of fine-tuned LLMs in practical applications where noisy data is unavoidable, emphasizing the need for noise detection and denoising mechanisms. |
Diving into Self-Evolving Training for Multimodal Reasoning (Read more on arXiv or HuggingFace) | Yu Cheng, Fan Zhou, Xiwen Zhang, Junlong Li, Wei Liu | Here is a concise summary of the research paper "Diving into Self-Evolving Training for Multimodal Reasoning": i) Summary: This paper investigates self-evolving training methods to enhance the multimodal reasoning capabilities of Large Multimodal Models (LMMs) without relying on human-annotated data. ii) Main Research Question/Objective: How can different factors in self-evolving training, such as training method, reward model, and prompt variation, be optimized to improve multimodal reasoning in LMMs? iii) Key Methodology: The authors conduct controlled experiments, varying factors like training method (iterative, continuous), reward model (binary, process-based), and prompt variation (labeled, unlabeled), while monitoring the dynamics of the self-evolution process. iv) Primary Results: Continuous self-evolving training with a process-based reward model (PRM) and a moderate number of selected responses (Top-2) achieves the best performance; specifically, on the MathVista benchmark, the M-STAR model achieved a 59.5% accuracy. v) Principal Implication for AI Practitioners: AI practitioners can leverage the proposed M-STAR framework, which incorporates optimized design choices and dynamic temperature adjustments, to enhance the multimodal reasoning capabilities of LMMs without additional human annotations. The paper does not clearly indicate how the framework can be integrated into existing LLM development or training pipelines. |
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching (Read more on arXiv or HuggingFace) | Yu Wang, Xuefei Ning, Enshu Liu, fjxmlzn | Here is a concise summary of the research paper "Distilled Decoding 1: One-Step Sampling of Image Auto-regressive Models with Flow Matching": i) The paper introduces Distilled Decoding (DD), a novel method to accelerate image generation from pre-trained autoregressive (AR) models by enabling one- or few-step sampling. ii) The main research question is whether a pre-trained AR model can be adapted to generate outputs in just one or two steps. iii) The key methodology is leveraging flow matching to create a deterministic mapping from a Gaussian distribution to the output distribution of a pre-trained AR model, then training a network to distill this mapping for few-step generation. iv) Primary results show that for the LlamaGen model, DD reduces generation from 256 steps to 1, achieving a 217.8x speed-up with a comparable FID increase from 4.11 to 11.35 on ImageNet-256. v) The principal implication for AI practitioners is that DD offers a way to significantly speed up inference for image AR models, challenging the notion that they are inherently slow. |
Large Motion Video Autoencoding with Cross-modal Video VAE (Read more on arXiv or HuggingFace) | Jiaxin Xie, Jingye Chen, Yingqing He, Yang Fei, Yazhou Xing | Here is a concise summary of the research paper "Large Motion Video Autoencoding with Cross-modal Video VAE": i) This paper introduces a novel cross-modal Video Variational Autoencoder (VAE) designed for high-fidelity video encoding and reconstruction, particularly for videos with large motions. ii) The main research objective is to develop a robust Video VAE that effectively compresses both spatial and temporal dimensions of videos while preserving detail and motion information, and explore the benefits of integrating text guidance. iii) The key methodology involves a two-stage spatiotemporal modeling approach combining temporal-aware spatial compression with a lightweight motion compression model, enhanced by cross-modal learning using text descriptions and joint image-video training. iv) The proposed Video VAE achieves a PSNR of 34.5022 on the WebVid test set, outperforming existing state-of-the-art methods. v) For AI practitioners, this Video VAE offers an effective solution for video compression and reconstruction, directly applicable to improving the performance of Latent Video Diffusion Models by providing a more robust and high-quality latent space representation. |
Deliberation in Latent Space via Differentiable Cache Augmentation (Read more on arXiv or HuggingFace) | Arthur Szlam, Jun Xie, Jiaxing Wu, Jonas Pfeiffer, Luyang Liu | Here's a summary of the paper "Deliberation in Latent Space via Differentiable Cache Augmentation" following your guidelines: i) Summary: This paper introduces a method to augment frozen language models with a trainable "coprocessor" that enhances the model's key-value cache with learned latent embeddings, improving reasoning and prediction capabilities. ii) Main research question or objective: How can a frozen language model be augmented to improve its ability to generate text and perform reasoning tasks without modifying its parameters? iii) Key methodology: A coprocessor is trained to augment the key-value cache of a frozen language model with latent embeddings. This is achieved by predicting future tokens based on the augmented cache, using a modified training framework that allows for multi-position augmentation and ahead-token prediction in a single forward pass. iv) Primary results: Cache augmentation consistently reduces perplexity and improves performance on reasoning tasks. For example, the augmented Gemma-2 2B model with 64 latent embeddings achieved a 10.05% improvement on the GSM8K benchmark compared to the baseline. v) Principal implication for AI practitioners: AI practitioners can enhance the performance of frozen language models on downstream tasks by training a coprocessor to augment the model's cache, offering a computationally efficient alternative to full model fine-tuning or retraining. |
Revisiting In-Context Learning with Long Context Language Models (Read more on arXiv or HuggingFace) | Oh, Geunseob, Prakhar Gupta, Sun Jae Lee, Jinheon Baek | Here is a concise summary of the research paper: i) This paper investigates the effectiveness of various sample selection strategies for in-context learning (ICL) with long context language models (LCLMs). ii) The main research question is whether previous sample selection strategies for ICL generalize to the many-shot ICL regime enabled by LCLMs. iii) The key methodology involves extensive experiments on 18 datasets across four tasks (classification, translation, summarization, and reasoning) using three types of sample selection methods (relevance, diversity, and difficulty-based); a minimal random-vs-relevance selection sketch follows this table. iv) The primary result is that sophisticated example selection techniques do not yield significant improvements over random sample selection in many-shot ICL with LCLMs, with statistical significance in fewer than 15% of instances. v) For AI practitioners, the principal implication is that random sampling is about as effective as complex sample selection strategies in many-shot ICL scenarios with LCLMs, while offering computational efficiency through key-value caching. |
Outcome-Refining Process Supervision for Code Generation (Read more on arXiv or HuggingFace) | Jindong Wang, Zhengran Zeng, Yidong Wang, Weizheng Gu, Zhuohao Yu | Here's a concise summary of the research paper "Outcome-Refining Process Supervision for Code Generation": i) Summary: The paper introduces Outcome-Refining Process Supervision (ORPS), a new method for code generation that treats the refinement of outcomes as the process to be supervised, using a tree-structured search and execution feedback. ii) Main research question/objective: How to improve the performance of large language models (LLMs) in complex code generation tasks that require deep algorithmic reasoning. iii) Key methodology: ORPS leverages a tree-structured exploration space with beam search to maintain multiple solution trajectories, grounding supervision in concrete execution signals rather than solely relying on human-annotated data or reward model judgments. iv) Primary results: ORPS achieves an average Pass@1 improvement of 26.9% across three datasets and five models, demonstrating significant gains in code generation accuracy and performance. v) Principal implication for AI practitioners: AI practitioners can use ORPS to enhance LLMs' code generation capabilities, particularly for complex tasks, by providing a more structured and verifiable approach to guide the models' reasoning and solution refinement process without the need for extensive training data. |
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought (Read more on arXiv or HuggingFace) | Jie Zhou, Yunlong Liang, Fandong Meng, Jiaan Wang | Here is a concise summary of the AI research paper "DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought": i) Summary: This paper introduces DRT-o1, a novel system designed to enhance neural machine translation (MT) by incorporating a long chain-of-thought (CoT) approach, specifically for translating literature containing similes and metaphors. ii) Main Research Question/Objective: How to improve the performance of neural machine translation for literary text involving similes and metaphors by simulating the long chain-of-thought process used by human translators. iii) Key Methodology: A multi-agent framework was developed, involving a translator, an advisor, and an evaluator, to iteratively translate sentences via long thought. This framework synthesizes MT data with long thought processes, which is then refined using GPT-4o and used to train the DRT-o1 models. iv) Primary Results: DRT-o1-7B outperformed Qwen2.5-7B-Instruct by 8.26 BLEU points on literature translation tasks. v) Principal Implication for AI Practitioners: AI practitioners can leverage the multi-agent framework and long-thought training data developed in this study to enhance the ability of large language models to perform nuanced machine translation, especially for complex literary texts. |
Agent-SafetyBench: Evaluating the Safety of LLM Agents (Read more on arXiv or HuggingFace) | Junxiao Yang, Jingzhuo Zhou, Yida Lu, Shiyao Cui, Zhexin Zhang | Here is a concise summary of the research paper "AGENT-SAFETYBENCH: Evaluating the Safety of LLM Agents": i) Summary: This paper introduces AGENT-SAFETYBENCH, a new benchmark for evaluating the safety of large language model (LLM) agents in interactive environments. ii) Main research question or objective: The main objective is to develop a comprehensive benchmark to evaluate the safety of LLM agents across various risk categories and failure modes. iii) Key methodology used: The methodology involves constructing 349 interaction environments and 2,000 test cases, and evaluating 16 LLM agents using a fine-tuned scoring model. iv) Primary results: None of the 16 tested LLM agents achieved a safety score above 60% on the AGENT-SAFETYBENCH benchmark. v) Principal implication for AI practitioners: AI practitioners should focus on improving the robustness and risk awareness of LLM agents, as current defense prompts alone are insufficient to address safety issues. |
NILE: Internal Consistency Alignment in Large Language Models (Read more on arXiv or HuggingFace) | Hongru Wang, Bowei He, Yufei Wang, Qiyuan Zhang, Minda Hu | Here's a summary of the paper "NILE: Internal Consistency Alignment in Large Language Models" following your guidelines: i) The paper introduces NILE, a framework designed to improve the alignment of Instruction Fine-Tuning (IFT) datasets with Large Language Models' (LLMs) internal knowledge to enhance performance. ii) Main research question/objective: How can IFT datasets be optimized to enhance consistency with an LLM's internal knowledge, thereby improving its performance? iii) Key methodology used: NILE uses a three-step process: Internal Knowledge Extraction (IKE), Knowledge-Aware Sample Revision (KSR), and Internal Consistency Filtering (ICF). iv) Primary results: NILE-aligned IFT datasets significantly boost LLM performance across various benchmarks, achieving up to a 66.6% gain on the Arena-Hard dataset. v) Principal implication for AI practitioners: AI practitioners should consider the internal consistency between IFT datasets and LLMs' pre-trained knowledge to maximize model performance, suggesting a need for methods like NILE in dataset optimization. |
LearnLM: Improving Gemini for Learning (Read more on arXiv or HuggingFace) | Andrea Huber, Aliya Rysbek, Aditya Srikanth Veerubhotla, Abhinit Modi, LearnLM Team | Here is a concise summary of the research paper "LearnLM: Improving Gemini for Learning" based on your specified format: i) Summary: This paper details the development of LearnLM, a model based on Gemini 1.5 Pro, optimized for educational applications via pedagogical instruction following. ii) Main research question or objective: How can large language models be trained to follow pedagogical system instructions to improve their performance in learning scenarios? iii) Key methodology used: The researchers used supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to train LearnLM, with a novel scenario-based human evaluation pipeline to assess pedagogical capabilities. iv) Primary results: Expert raters preferred LearnLM over other models, with an average preference strength of 31% over GPT-4o. v) Principal implication for AI practitioners: AI practitioners can leverage pedagogical instruction following and scenario-based evaluations to develop more effective AI systems for educational use cases, enabling personalized learning at scale. |
OpenAI o1 System Card (Read more on arXiv or HuggingFace) | Adam Richardson, Adam Lerer, Adam Kalai, Aaron Jaech, OpenAI | Here's a concise summary of the OpenAI o1 System Card, strictly following your guidelines: i) Summary: OpenAI introduces the o1 model series, trained with large-scale reinforcement learning to reason using the chain of thought, enhancing safety and robustness through deliberate alignment. ii) Main research question or objective: The main objective was to evaluate the safety and robustness of the o1 model series, focusing on its advanced reasoning capabilities and performance on safety benchmarks. iii) Key methodology used: The methodology involved large-scale reinforcement learning with chain-of-thought reasoning, safety evaluations, external red teaming, and Preparedness Framework evaluations, utilizing diverse datasets including publicly available data, proprietary data, and custom datasets. iv) Primary results: The o1 model demonstrated state-of-the-art performance on safety benchmarks, such as achieving 92% accuracy on the challenging refusal evaluation compared to 71.3% for GPT-4o. v) Principal implication for AI practitioners: AI practitioners should prioritize building robust alignment methods and conducting extensive stress-testing, as o1's enhanced reasoning capabilities improve safety but also highlight the need for meticulous risk management protocols. |
OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning (Read more on arXiv or HuggingFace) | Jinlin Xiao, Yuhang Wang, Jiangming Shu, Yuqi Yang, Yuxiang Zhang | Here is a concise summary of the AI research paper "OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning" based on your guidelines: i) OpenRFT is a framework for fine-tuning generalist reasoning models for domain-specific tasks using reinforcement learning. ii) The main research objective is to adapt generalist reasoning foundation models to domain-specific tasks when reasoning step data and sufficient training samples are lacking. iii) The key methodology involves data augmentation, supervised fine-tuning with synthesized reasoning processes, and reinforcement learning with a process reward model and few-shot in-context learning. iv) The primary result is that OpenRFT achieved an average performance increase of 11% on the SciKnowEval benchmark using only 100 domain-specific samples per task. v) The principal implication for AI practitioners is that OpenRFT offers a method to create specialized reasoning models from generalist foundation models efficiently, even with limited domain-specific data, although the paper notes that alignment between the teacher and student policy models is important and the absence of a strong open-source generalist reasoning model limits the full potential of RFT. |
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding (Read more on arXiv or HuggingFace) | Qun Liu, Jianxin Liang, Xiaojun Meng, Yueqian Wang, ColorfulAI | Here is a concise summary of the research paper "Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding": i) This paper introduces Friends-MMC, a new dataset for multi-modal multi-party conversation (MMC) understanding, derived from the TV series "Friends," and studies conversation speaker identification and response prediction tasks. ii) The main research objective is to develop a dataset and baseline methods for understanding multi-modal multi-party conversations, focusing on speaker identification and response prediction in a more complex and realistic setting than existing datasets. iii) The key methodology involves collecting and annotating video clips, utterances, speaker identities, and facial bounding boxes from the TV show "Friends," and developing a baseline model that combines visual and textual information using an optimization solver. iv) The primary results show that the proposed baseline method for conversation speaker identification achieves 83.21% accuracy on the test set when using both video and text modalities. v) For AI practitioners, the principal implication is that modeling speaker information is crucial for multi-modal multi-party conversation understanding, and the Friends-MMC dataset provides a valuable resource for developing and evaluating models in this domain. |
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World (Read more on arXiv or HuggingFace) | Runze Fan, Jiadi Su, Shijie Xia, Jiahe Jin, Yanheng He | Here is a concise summary of the AI research paper "PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World": i) Summary: This paper introduces PC Agent, a novel AI system designed to autonomously perform complex computer work by learning from human cognitive processes. ii) Main research question/objective: The main objective is to develop an AI agent capable of efficiently handling complex digital work by transferring human cognitive processes during computer use. iii) Key methodology: The authors introduce a three-part framework: PC Tracker for collecting human-computer interaction data, a cognition completion pipeline to transform raw data into cognitive trajectories, and a multi-agent system for action planning and visual grounding. iv) Primary results: PC Agent, trained on 133 cognitive trajectories, can execute complex tasks with up to 50 steps in PowerPoint presentation creation. v) Principal implication for AI practitioners: AI practitioners can leverage the open-sourced PC Agent framework to develop digital agents that learn from human cognitive data, potentially automating a wide range of complex computer-based tasks. |
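
The many-shot ICL result above (the "Revisiting In-Context Learning" row) says that, once long-context models allow hundreds of demonstrations, uniformly random sampling performs about as well as relevance-based selection. The sketch below contrasts the two selection strategies on toy embeddings; the placeholder pool, embedding arrays, and function names are assumptions for illustration, not the paper's setup.

```python
import random
import numpy as np

def random_selection(pool: list[str], k: int, seed: int = 0) -> list[str]:
    """Uniformly random demonstrations; the paper finds this hard to beat
    in the many-shot regime."""
    rng = random.Random(seed)
    return rng.sample(pool, k)

def relevance_selection(pool: list[str], pool_emb: np.ndarray,
                        query_emb: np.ndarray, k: int) -> list[str]:
    """Top-k demonstrations by cosine similarity to the query embedding."""
    sims = pool_emb @ query_emb / (
        np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [pool[i] for i in top]

if __name__ == "__main__":
    pool = [f"example {i}" for i in range(100)]
    pool_emb = np.random.default_rng(0).normal(size=(100, 16))   # placeholder embeddings
    query_emb = np.random.default_rng(1).normal(size=16)
    print(random_selection(pool, k=5))
    print(relevance_selection(pool, pool_emb, query_emb, k=5))
```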
Title | Authors | Summary |
---|---|---|
Parallelized Autoregressive Visual Generation (Read more on arXiv or HuggingFace) | jshfeng, zhenheny, Ikuinen, ShuhuaiRen, Epiphqny | Here is a concise summary of the research paper "Parallelized Autoregressive Visual Generation": i) Summary: This paper introduces a novel approach for parallelized autoregressive visual generation that improves efficiency while maintaining the quality of generated images and videos. ii) Main research question or objective: Can parallel visual generation be achieved while preserving the simplicity and flexibility of standard autoregressive models? iii) Key methodology: The authors propose a parallel generation strategy that generates weakly dependent tokens in parallel across non-local regions while maintaining sequential generation for strongly dependent local tokens, implemented by dividing the image into regions and using a token re-ordering mechanism. iv) Primary results: The proposed method achieves a 3.6x speedup with comparable image quality and up to a 9.5x speedup with minimal quality degradation on image and video generation tasks. Specifically, the method reduces generation time from 12.41s to 3.46s (PAR-4x) on the ImageNet dataset. v) Principal implication for AI practitioners: AI practitioners can integrate this approach into existing autoregressive models to significantly accelerate the visual generation process with minimal impact on quality, enabling more efficient deployment in real-world applications. |
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation (Read more on arXiv or HuggingFace) | Yilong Lai, Zhenglin Wang, zhoudeyu, lzhang472, callanwu | Here is a concise summary of the research paper "SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation": i) Summary: This paper introduces SCOPE, a framework for optimizing Key-Value (KV) cache compression in large language models (LLMs) during long-context generation by separately compressing the prefill and decoding phases. ii) Main research question or objective: How to effectively compress the KV cache in LLMs for long-context generation tasks without significantly degrading performance. iii) Key methodology: SCOPE preserves the KV cache during the prefill phase and uses a sliding strategy with adaptive and discontinuous optimizations to select and manage heavy hitters during the decoding phase. iv) Primary results: SCOPE achieved comparable performance to the full KV cache when the overall compression rate was 35% on the LONGGENBENCH benchmark. v) Principal implication for AI practitioners: AI practitioners can use SCOPE to optimize memory usage and transfer during long-context generation without losing the performance, particularly for reasoning tasks, making it easier to deploy LLMs in resource-constrained environments. |
Offline Reinforcement Learning for LLM Multi-Step Reasoning (Read more on arXiv or HuggingFace) | yiwu, ZhangShenao, hendrydong, Shibo-UCSD, jwhj | Here is a concise summary of the research paper "Offline Reinforcement Learning for LLM Multi-Step Reasoning": i) Summary: This paper introduces OREO, an offline reinforcement learning algorithm designed to improve the multi-step reasoning capabilities of large language models (LLMs). ii) Main research question or objective: The main objective is to develop an offline RL method that enhances LLM multi-step reasoning without requiring paired preference data or treating all tokens uniformly. iii) Key methodology used: OREO jointly learns a policy model and value function by optimizing the soft Bellman Equation, enabling finer-grained credit assignment and leveraging unpaired data with sparse rewards. iv) Primary results: OREO outperforms baseline methods, including rejection sampling, DPO, and KTO, on math reasoning and embodied agent control tasks; a 1.5B model trained with OREO achieves a 52.5% accuracy on the MATH dataset. v) Principal implication for AI practitioners: AI practitioners can use OREO to enhance LLMs' multi-step reasoning abilities using pre-existing datasets without live interaction, and leverage the learned value function for test-time improvements via beam search. |
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up (Read more on arXiv or HuggingFace) | wxcTest, ZhenxiongTang, flyingman | Here is a concise summary of the paper "CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up": i) Summary: This paper introduces CLEAR, a method to linearize the attention mechanism in pre-trained Diffusion Transformers (DiTs) for efficient high-resolution image generation. ii) Main Research Question/Objective: Can a pre-trained DiT be converted to achieve linear computational complexity without significant performance degradation? iii) Key Methodology: CLEAR employs a convolution-like local attention strategy that limits feature interactions to a local window around each query token, ensuring linear complexity. Knowledge distillation is used during fine-tuning. iv) Primary Results: CLEAR reduces attention computations by 99.5% and accelerates generation by 6.3 times for 8K-resolution images, achieving comparable results to the teacher model after fine-tuning on 10K self-generated samples. v) Principal Implication for AI Practitioners: AI practitioners can leverage CLEAR to significantly improve the efficiency of high-resolution image generation using DiTs, enabling faster inference and reduced computational costs, particularly for ultra-high-resolution outputs. |
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis (Read more on arXiv or HuggingFace) | Akio Hayakawa, mittu1204, TakashiShibuyaSony, mi141, hkchengrex | Here's a concise summary of the paper, following your guidelines: i) Summary: This paper introduces MMAudio, a multimodal framework for generating high-quality and temporally aligned audio for video and text inputs, using joint training on audio-visual and audio-text datasets. ii) Main research question or objective: How to synthesize high-quality audio that is semantically and temporally aligned to video inputs, with optional text conditioning. iii) Key methodology: MMAudio utilizes a multimodal transformer network trained with a flow-matching objective and incorporates a conditional synchronization module for frame-level audio-visual alignment. Additionally, it leverages joint training on large-scale audio-visual and audio-text datasets. iv) Primary results: MMAudio achieves state-of-the-art performance in video-to-audio synthesis among public models, demonstrating improved audio quality, semantic alignment, and temporal alignment; the smallest model (157M parameters) achieves a 10% lower Fréchet Distance compared to previous methods. v) Principal implication for AI practitioners: AI practitioners can leverage MMAudio's multimodal joint training paradigm and conditional synchronization module to develop more effective video-to-audio synthesis models, enabling the creation of higher-quality, more realistic audio for video content. |
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design (Read more on arXiv or HuggingFace) | chuanjieliu, xiaonans, JamesTheZ | Here is a concise summary of the paper "MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design": i) MixLLM is a quantization method that applies mixed-precision to different output features based on their globally assessed impact on model loss, achieving high accuracy and system efficiency. ii) The main research objective is to develop a quantization solution for Large Language Models (LLMs) that simultaneously optimizes accuracy, memory consumption, and system efficiency. iii) Key methodology involves identifying high-salience output features globally, applying mixed-precision (4-bit and 8-bit) quantization to weights, using 8-bit symmetric quantization for activations, and designing a two-step dequantization process with optimized GPU kernel execution (a minimal salience-based bit-assignment sketch follows this table). iv) Primary results show that MixLLM with only 10% more bits (W4.4A8) reduces the perplexity (PPL) increase from about 0.5 in state-of-the-art methods to within 0.2 for Llama 3.1 70B. v) The principal implication for AI practitioners is that MixLLM provides a method for deploying LLMs with significantly reduced memory footprint and improved inference speed without substantial accuracy loss, facilitating more efficient use of computational resources. |
LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps (Read more on arXiv or HuggingFace) | navigli, mbrack, PSaiml, sted97, felfri | Here is a concise summary of the AI research paper "LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps": i) Summary: This paper introduces M-ALERT, a multilingual benchmark for evaluating the safety of Large Language Models (LLMs) across five languages, revealing significant safety inconsistencies. ii) Main research question or objective: The main objective is to evaluate the safety performance of LLMs across multiple languages (English, French, German, Italian, and Spanish) and identify potential safety gaps. iii) Key methodology: The authors developed a translation pipeline using advanced machine translation models to create M-ALERT, a benchmark with 75k safety prompts (15k per language), and evaluated 10 state-of-the-art LLMs using an automated evaluation framework involving a multilingual judge model (LlamaGuard-3). iv) Primary results: The study found that no model achieved the safe threshold (99%) across all languages, and the c4ai-command model exhibited the lowest safety performance, with scores predominantly below 90%. v) Principal implication for AI practitioners: AI practitioners must prioritize language-specific safety analysis and implement robust multilingual safety measures to ensure responsible LLM deployment globally, as current models exhibit significant safety inconsistencies across different languages. |
Sequence Matters: Harnessing Video Models in 3D Super-Resolution (Read more on arXiv or HuggingFace) | juxhee, blee, yi0109-park, HEOK, lanikoisgod | Here is a concise summary of the AI research paper "Sequence Matters: Harnessing Video Models in 3D Super-Resolution": i) This paper introduces a novel approach for 3D super-resolution by leveraging video super-resolution (VSR) models to enhance the quality of 3D models reconstructed from low-resolution multi-view images. ii) The main research objective is to improve the consistency and detail of high-fidelity 3D models generated from low-resolution inputs by utilizing VSR models. iii) The key methodology involves ordering unordered low-resolution multi-view images into a sequence using a simple greedy algorithm based on either camera poses or visual features, and applying adaptive-length subsequencing and multiple thresholds to refine the input for VSR models. iv) The proposed method achieved a PSNR of 31.41 on the NeRF-synthetic dataset, outperforming other baseline models. v) The principal implication for AI practitioners is that they can generate more accurate and detailed 3D models from low-resolution images by effectively ordering input images, without requiring additional fine-tuning or training of 3D Gaussian Splatting (3DGS) on low-resolution images to render 'smooth' video. |
Fietje: An open, efficient LLM for Dutch (Read more on arXiv or HuggingFace) | BramVanroy | Here's a concise summary of the research paper "Fietje: An open, efficient LLM for Dutch" by Bram Vanroy, following your guidelines: i) Summary: This paper introduces Fietje, a 2.7 billion parameter language model specifically adapted for Dutch, alongside instruction-tuned and chat-optimized variants, with a focus on transparency and reproducibility. ii) Main research question/objective: To develop and evaluate an efficient, open-source language model specifically for the Dutch language that demonstrates competitive performance. iii) Key methodology: Continued pretraining of the English-centric Phi-2 model on 28 billion Dutch tokens sourced from filtered web data (CulturaX) and Wikipedia, followed by supervised fine-tuning and preference alignment using synthetic Dutch datasets. iv) Primary results: Fietje Chat outperformed larger models like GEITje 7B Ultra in two out of five tasks, and on the DBRD benchmark, Boreas Chat achieved a 94.38% F1 score. v) Principal implication for AI practitioners: AI practitioners can leverage Fietje's open-source nature (model weights, datasets, training, and evaluation code) to advance the development and assessment of efficient, high-performing LLMs and SLMs for underrepresented languages like Dutch, but should be aware of rapid changes in state-of-the-art models and the limitations of current evaluation methodologies. |
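
MixLLM's headline idea, per the summary above, is to rank output features by a globally estimated salience and give only the most salient slice the larger bit-width. The sketch below shows just that assignment step with made-up salience scores; the 10% high-precision fraction mirrors the W4.4A8 configuration mentioned above, while the salience scoring itself and any quantization kernels are out of scope.

```python
import numpy as np

def assign_bit_widths(salience: np.ndarray, high_frac: float = 0.10) -> np.ndarray:
    """Give the top `high_frac` most salient output features 8 bits, the rest 4 bits.

    `salience` is assumed to be one non-negative score per output feature
    (e.g., an estimate of each feature's impact on the loss).
    """
    n_high = max(1, int(round(high_frac * salience.size)))
    order = np.argsort(-salience)                 # most salient first
    bits = np.full(salience.shape, 4, dtype=np.int8)
    bits[order[:n_high]] = 8
    return bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    salience = rng.gamma(shape=2.0, scale=1.0, size=4096)   # placeholder scores
    bits = assign_bit_widths(salience)
    print(f"average weight bit-width: {bits.mean():.2f}")    # ~4.4 bits, i.e. the 'W4.4' in W4.4A8
```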
Title | Authors | Summary |
---|---|---|
Qwen2.5 Technical Report (Read more on arXiv or HuggingFace) | Losin94, bowenYu, bzheng, huybery, Baosong | Here's a concise summary of the Qwen2.5 Technical Report: i) Summary: Qwen2.5 is a series of large language models designed with enhanced pre-training and post-training techniques to improve performance across various tasks. ii) Main research question or objective: The main objective was to develop Qwen2.5, an improved iteration of large language models (LLMs) with enhanced capabilities in language understanding, reasoning, mathematics, coding, and human preference alignment. iii) Key methodology: The key methodology involved scaling pre-training data to 18 trillion tokens, implementing supervised fine-tuning with over 1 million samples, and using multistage reinforcement learning including offline DPO and online GRPO. iv) Primary results: The Qwen2.5-72B-Instruct model outperformed numerous open and proprietary models, achieving a score of 83.1 on the MATH benchmark. v) Principal implication for AI practitioners: AI practitioners can leverage Qwen2.5's architecture and training techniques as a foundation for developing specialized models or applications requiring advanced language understanding and generation capabilities, particularly in domains requiring strong mathematical reasoning. |
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval (Read more on arXiv or HuggingFace) | BoZhaoHuggingFace, yzwang, Shitao, zl101, JUNJIE99 | Here is a concise summary of the AI research paper "MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval": i) Summary: The paper introduces MegaPairs, a new method for synthesizing large-scale multimodal datasets for training universal multimodal retrieval models. ii) Main Research Question/Objective: To develop a method for creating high-quality, large-scale instruction-tuning datasets to improve multimodal retrieval performance. iii) Key Methodology: MegaPairs constructs heterogeneous KNN triplets from open-domain images using multiple similarity models and utilizes open-source VLM and LLM annotators to generate instructions for sampled image pairs. iv) Primary Results: Models trained on MegaPairs achieved state-of-the-art zero-shot performance on composed image retrieval benchmarks; notably, the MMRet-MLLM model achieved 42.2% mAP@5 on the CIRCO benchmark. v) Principal Implication for AI Practitioners: AI practitioners can leverage the publicly available MegaPairs dataset, well-trained models, and data synthesis pipeline to develop more powerful and versatile multimodal retrieval systems. |
Progressive Multimodal Reasoning via Active Retrieval (Read more on arXiv or HuggingFace) | douzc, yutaozhu94, dengmengjie, Snow-Nation, dongguanting | Here's a concise summary of the research paper "Progressive Multimodal Reasoning via Active Retrieval": i) This paper introduces AR-MCTS, a framework that enhances multimodal reasoning in large language models (MLLMs) by integrating active retrieval with Monte Carlo Tree Search (MCTS). ii) The main research objective is to improve the performance of MLLMs on complex multi-step multimodal reasoning tasks. iii) The key methodology involves a unified retrieval module for acquiring key insights, an active retrieval strategy during MCTS expansion, and a progressively aligned process reward model (PRM). iv) The primary results show that AR-MCTS significantly improves performance across various MLLMs; for example, Qwen2-VL-7B with AR-MCTS achieved a 5.3% improvement on the MATHVISTA benchmark compared to its zero-shot setting. v) For AI practitioners, AR-MCTS offers a plug-and-play framework to enhance MLLMs' reasoning capabilities without retraining the foundational models, providing a way to optimize sampling diversity and accuracy in multimodal reasoning tasks. |
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks (Read more on arXiv or HuggingFace) | wangxz098, haopeng01, NeoZ123, tsq2000, bys0318 | Here is a concise summary of the paper "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks": i) Summary: LongBench v2 is a benchmark designed to evaluate the deep understanding and reasoning capabilities of large language models (LLMs) on long-context, real-world multitasks. ii) Main research question or objective: The main objective is to create a challenging benchmark to assess whether LLMs can genuinely comprehend, learn from, and reason over long texts, ranging from 8k to 2M words, across diverse real-world scenarios. iii) Key methodology used: The researchers collected 503 multiple-choice questions from nearly 100 human experts, categorized into six task types, and implemented a rigorous annotation and review process involving both automated checks using LLMs and manual verification by human experts to ensure data quality and difficulty. iv) Primary results: The best-performing LLM (o1-preview model) achieved 57.7% accuracy when incorporating longer reasoning, whereas human experts achieved only 53.7% accuracy under a 15-minute time constraint. v) Principal implication for AI practitioners: AI practitioners should focus on enhancing the reasoning capabilities and scaling inference-time compute of LLMs to address the challenges posed by long-context tasks that require deep understanding, as opposed to mere retrieval or shallow processing of information. |
How to Synthesize Text Data without Model Collapse? (Read more on arXiv or HuggingFace) | XingtaiHF, iseesaw, Hengli, daixuancheng, xuekai | Here is a concise summary of the research paper "How to Synthesize Text Data without Model Collapse?": i) This paper investigates the impact of synthetic data on language model training and proposes a token-level editing method to mitigate model collapse. ii) The main research questions are: what is the impact of synthetic data on language model training, and how can data be synthesized without causing model collapse? iii) The key methodology used is pre-training language models on varying proportions of synthetic and human-produced data, statistical analysis of synthetic data distributions, and a proposed token-level editing approach with theoretical proof and empirical validation. iv) The primary results show a negative correlation between the proportion of synthetic data and model performance, with the perplexity of models trained on synthetic data reaching 49.30 on average compared to 21.37 for human data. v) The principal implication for AI practitioners is that directly using synthetic data in training can lead to performance degradation (model collapse), and token-level editing can be used to improve data quality and enhance model performance. |
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution (Read more on arXiv or HuggingFace) | Andrew Brown, Alan Yuille, Xi Yin, mannatsingh, QHL067 | Here is a concise summary of the research paper "Flowing from Words to Pixels: A Framework for Cross-Modality Evolution": i) The paper introduces CrossFlow, a framework that directly evolves one modality into another using flow matching without additional conditioning. ii) The main research question is whether flow matching models can learn a direct mapping between the distributions of different modalities, obviating the need for noise inputs and conditioning mechanisms. iii) The key methodology involves using Variational Encoders to encode source modality data to the same shape as the target modality and a novel method to enable Classifier-free guidance in a cross-modal flow matching setting. iv) CrossFlow achieved a zero-shot FID-30K score of 9.63 on COCO for text-to-image generation, outperforming standard flow matching baselines. v) For AI practitioners, CrossFlow offers a simpler and more scalable framework for cross-modal generation tasks, demonstrating that direct evolution between modalities is achievable and efficient. |
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis (Read more on arXiv or HuggingFace) | lmwang, cqf, felixcheng97, qiuyuu, hlwang06 | Here is a concise summary of the research paper "LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis": i) Summary: LeviTor is a novel image-to-video synthesis method that enables precise 3D trajectory control of objects by combining depth information with K-means clustered points. ii) Main research question or objective: The main objective was to develop a method for controlling object trajectories in image-to-video synthesis that can handle out-of-plane movements and occlusions in 3D space, overcoming the limitations of existing 2D trajectory-based methods. iii) Key methodology: The authors propose representing control signals by combining depth information with K-means clustered points derived from object masks and using this representation to guide a fine-tuned video diffusion model (Stable Video Diffusion). iv) Primary results: LeviTor achieves accurate 3D trajectory control, demonstrated by a Frechet Video Distance (FVD) of 190.44 on the DAVIS dataset in the multi-point setting, compared to 330.17 for DragNUWA 1.5 in the single-point setting. v) Principal implication for AI practitioners: AI practitioners can utilize LeviTor to generate videos with precise control over object movements in 3D space, enabling more realistic and complex video synthesis without requiring explicit 3D trajectory inputs from users. |
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion (Read more on arXiv or HuggingFace) | Ye Liu, hpfister, dwei, EthanTaylor, Kakituken | Here is a concise summary of the research paper "Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion": i) Summary: This paper introduces a new task and method for inserting objects into images realistically, guided by affordance and position prompts, using a novel dataset and a dual-diffusion model. ii) Main research question/objective: How to develop a model for affordance-aware object insertion that can seamlessly integrate any object into any scene with various position prompts. iii) Key methodology: The authors propose a Mask-Aware Dual Diffusion (MADD) model, which uses a dual-stream architecture to denoise the RGB image and the insertion mask simultaneously, trained on a new dataset (SAM-FB) derived from SA-1B. iv) Primary results: MADD outperforms state-of-the-art methods on the affordance-aware object insertion task; for example, it achieves an FID score of 13.53 with mask prompts, compared to 15.41 for Stable Diffusion. v) Principal implication for AI practitioners: AI practitioners can utilize the MADD model and the SAM-FB dataset for realistic image composition, with explicit control over object placement and appearance via diverse prompts. |
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation (Read more on arXiv or HuggingFace) | Yuejiang Dong, yshan2u, bluestyle97, pookiefoof, thuzhaowang | Here is a concise summary of the research paper "DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation" based on the provided guidelines: i) DI-PCG is a diffusion-based method for efficient inverse procedural content generation (I-PCG) that creates high-quality 3D assets from image conditions. ii) The main research objective is to automatically estimate the best-fit parameters for procedural generators under given image conditions to achieve controllable 3D content generation. iii) The key methodology is a lightweight diffusion transformer model that treats PCG parameters as the denoising target and observed images as conditions to control parameter generation. iv) The primary result is that DI-PCG achieves a Chamfer Distance (CD) of 0.093 on the ShapeNet chair subset, demonstrating accurate parameter recovery (a brief sketch of the Chamfer Distance metric follows this table). v) The principal implication for AI practitioners is that DI-PCG offers an efficient and effective way to perform inverse procedural content generation, which can be used for high-quality image-to-3D generation. |
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling (Read more on arXiv or HuggingFace) | wping, ctnzr, shoeybi, ychenNLP, zihanliu | Here is a concise summary of the research paper "AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling": i) Summary: The paper introduces AceMath, a suite of math-specialized language models and reward models designed to enhance mathematical reasoning capabilities. ii) Main research question or objective: The main objective is to develop advanced supervised fine-tuning (SFT) and reward modeling (RM) techniques to improve the performance of large language models (LLMs) on complex mathematical reasoning tasks. iii) Key methodology used: The methodology involves a two-stage SFT process (general domain followed by math-specific fine-tuning) using curated prompts and synthetically generated responses, and a systematic approach to build math reward models evaluated on a new benchmark called AceMath-RewardBench. iv) Primary results: The resulting AceMath-72B-Instruct model outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet on math reasoning benchmarks. Specifically, AceMath-72B-Instruct achieves an average score of 71.84 across seven math reasoning benchmarks, compared to 68.16 for Qwen2.5-Math-72B-Instruct. v) Principal implication for AI practitioners: AI practitioners can leverage the proposed SFT and RM techniques, along with the provided open-source models and data, to develop more powerful and accurate math-specialized LLMs, pushing the boundaries of automated mathematical reasoning. |
UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency (Read more on arXiv or HuggingFace) | Federico Tombari, Yongqin Xian, thofmann, Alessiot, enisimsar | Here's a concise summary of the research paper "UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency" based on the provided guidelines: i) Summary: The paper introduces UIP2P, an unsupervised instruction-based image editing model that uses Cycle Edit Consistency (CEC) to enable reversible and coherent edits without requiring ground-truth edited images during training. ii) Main research question or objective: How to develop an instruction-based image editing model that does not rely on supervised datasets containing triplets of input image, edited image, and edit instruction. iii) Key methodology used: Cycle Edit Consistency (CEC) is enforced by applying forward and reverse edits in one training step and ensuring consistency in image, attention, and CLIP embedding spaces, leveraging unified prediction with varying diffusion steps. iv) Primary results: UIP2P outperforms InstructPix2Pix on the IP2P test dataset in both CLIP image similarity and CLIP text-image similarity metrics; for instance, it achieves a 22% preference score in user studies compared to 8% for InstructPix2Pix when evaluating how well the edit matches the instruction and localization. v) Principal implication for AI practitioners: AI practitioners can leverage UIP2P to train image editing models on real-image datasets without the need for ground-truth edited images, enabling the use of large-scale datasets that lack such annotations. |
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception (Read more on arXiv or HuggingFace) | Ke Zhu, Jing Hao, FuNz, cloud913, syp115 | Here's a summary of the paper, following your specified guidelines: i) The paper introduces Descriptive Caption Enhancement (DCE), a method that enhances image captions by integrating outputs from multiple visual specialist models. ii) The main objective is to generate more detailed and accurate image captions than existing methods, which rely on human annotations or large multimodal models (LMMs). iii) DCE leverages various visual specialists (e.g., for object detection, depth estimation, emotion recognition) to extract attributes, then uses a large language model (LLM) to combine these into a coherent caption. iv) When trained with DCE, LLaVA-v1.5 achieved an accuracy of 80.9 on the VQAv2 benchmark. v) AI practitioners can use DCE to improve the performance of LMMs on visual understanding tasks by providing them with more comprehensive and detailed image captions, generated without relying on expensive human annotation. |
TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation (Read more on arXiv or HuggingFace) | Qing Li, Yunqing Liu, Jiatong Li, schrodingers-tiger, Duke-de-Artois | Here is a concise summary of the research paper "TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation": i) Summary: This paper introduces TOMG-Bench, a benchmark for evaluating large language models (LLMs) on text-based open molecule generation, alongside an instruction-tuning dataset, OpenMolIns. ii) Main research question or objective: The main objective was to evaluate the capability of LLMs to generate novel molecules based on open-ended textual instructions, moving beyond targeted molecule generation. iii) Key methodology: The authors developed a benchmark (TOMG-Bench) with three tasks (molecule editing, optimization, and customized generation), each with three subtasks. They also used an automated evaluation system and a new instruction-tuning dataset (OpenMolIns) to assess 25 LLMs. iv) Primary results: The best performing model, Claude-3.5, achieved a weighted average accuracy of 35.92% on TOMG-Bench, while instruction-tuned Llama3.1-8B outperformed all open-source general LLMs. v) Principal implication for AI practitioners: AI practitioners can leverage TOMG-Bench to assess LLMs for open-domain molecule generation tasks and use OpenMolIns to improve model performance in this area, although there is still significant room for improvement in generating molecules from scratch. |
Move-in-2D: 2D-Conditioned Human Motion Generation (Read more on arXiv or HuggingFace) | Feng Liu, Difan Liu, Jui-Hsien Wang, Yang Zhou, hsinh | Here is a concise summary of the research paper "Move-in-2D: 2D-Conditioned Human Motion Generation": i) This paper introduces a novel method, Move-in-2D, for generating realistic human motion sequences conditioned on a 2D scene image and a text prompt. ii) The main research objective is to generate diverse human motion sequences that are semantically aligned with a text prompt and spatially compatible with a given 2D background image. iii) The key methodology is a multi-conditional diffusion model that utilizes a transformer architecture with in-context learning to integrate scene image and text prompt conditions. iv) The proposed model achieved an FID score of 44.639, which is better than other compared models. v) For AI practitioners, this method provides a new modality for motion generation by incorporating scene awareness without requiring 3D scene data and improves motion quality in human video generation tasks. |
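The DI-PCG entry above reports reconstruction quality as a Chamfer Distance (CD) of 0.093 on the ShapeNet chair subset. For readers unfamiliar with the metric, the following is a minimal sketch of the standard symmetric Chamfer Distance between two point clouds; conventions vary across papers (squared vs. unsquared distances, normalization, scale), so this is not asserted to be the exact evaluation code used by DI-PCG.

```python
# Minimal sketch of a symmetric Chamfer Distance between two point clouds.
# Conventions (squaring, normalization) are illustrative assumptions.
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared nearest-neighbor distance, averaged over both directions."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Toy usage with random clouds standing in for a reconstruction and its target.
rng = np.random.default_rng(0)
pred = rng.normal(size=(1024, 3))
gt = rng.normal(size=(1024, 3))
print(f"CD = {chamfer_distance(pred, gt):.4f}")
```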
Title | Authors | Summary |
---|---|---|
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks (Read more on arXiv or HuggingFace) | Kritanjali Jain, Yuxuan Tang, Boxuan Li, Yufan Song, Frank F. Xu | Here is a concise summary of the paper "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks" based on your specified guidelines: i) Summary: This paper introduces TheAgentCompany, a benchmark for evaluating large language model (LLM) agents on realistic, consequential tasks within a simulated software company environment. ii) Main research question or objective: To assess the capability of LLM agents to autonomously perform complex, multi-step, work-related tasks in a realistic setting. iii) Key methodology used: A self-contained, simulated software company environment was created using internal websites and data, with tasks requiring agents to browse the web, code, run programs, and communicate with simulated coworkers. iv) Primary results: The best-performing agent, powered by Claude 3.5 Sonnet, achieved a 24.0% task completion rate and a 34.4% partial completion score. v) Principal implication for AI practitioners: The benchmark demonstrates that while current LLM agents can complete some work-related tasks, significant improvements are needed, particularly in handling complex user interfaces, social interactions, and tasks that lack public training data before they can be reliably deployed for a wide range of real-world applications. |
AniDoc: Animation Creation Made Easier (Read more on arXiv or HuggingFace) | Wen Wang, Qiuyu Wang, Hanlin Wang, Hao Ouyang, Yihao Meng | Here is a concise summary of the research paper "AniDoc: Animation Creation Made Easier": i) AniDoc is a novel AI model designed to automate 2D animation coloring by converting sketch sequences into colored animations based on a reference character image. ii) Main research question/objective: How to automate the colorization of 2D animation line art while maintaining fidelity to a reference character design and ensuring temporal consistency across frames? iii) Key methodology: A video diffusion model with correspondence-guided colorization, binarization, background augmentation, and a two-stage sparse sketch training strategy. iv) Primary results: AniDoc achieved a PSNR of 19.23, demonstrating superior performance in colorization accuracy compared to existing methods. v) Principal implication for AI practitioners: AI practitioners can utilize AniDoc to significantly reduce the labor costs and time required for 2D animation production by automating the colorization process. |
FashionComposer: Compositional Fashion Image Generation (Read more on arXiv or HuggingFace) | Hao Luo, Xiaogang Xu, Xi Chen, Yiyang Wang, Sihui Ji | Here is a concise summary of the research paper "FashionComposer: Compositional Fashion Image Generation": i) FashionComposer is a novel framework for generating fashion images that allows for detailed control over garment styles, human poses, and appearances using multi-modal inputs. ii) The main research objective is to develop a highly flexible system capable of handling diverse input modalities and composing multiple visual assets (garments, faces) in a single fashion image generation process. iii) The key methodology involves a diffusion-based model with a universal framework for multi-modal inputs, a reference UNet for extracting appearance features from an "asset library", and a subject-binding attention mechanism to bind appearance features to corresponding text features. iv) The primary result is that FashionComposer outperforms existing methods in multi-object reference generation, achieving a CLIP-I score of 77.60 compared to 69.70 for Emu2. v) For AI practitioners, FashionComposer offers a powerful and flexible framework for compositional fashion image generation, which has direct applications in virtual try-on, controllable model image generation, and human album generation. |
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning (Read more on arXiv or HuggingFace) | Rudolf Lioutikov, Pulkit Agrawal, Jyothish Pari, Moritz Reuss | Here's a concise summary of the research paper, strictly adhering to the specified guidelines: i) Summary: The paper introduces Mixture-of-Denoising Experts (MoDE), a novel policy for Imitation Learning that uses a Mixture-of-Experts Transformer architecture with noise-conditioned routing and self-attention for efficient multitask learning. ii) Main research question or objective: The main objective is to develop a more computationally efficient Diffusion Policy for Imitation Learning that maintains or surpasses the performance of state-of-the-art Transformer-based Diffusion Policies. iii) Key methodology used: The key methodology is a Mixture-of-Experts (MoE) Transformer architecture with a novel noise-conditioned router that assigns tokens to experts based on noise levels during the denoising process, combined with a noise-conditioned self-attention mechanism. iv) Primary results: MoDE outperforms existing Diffusion Policies on 134 tasks across four benchmarks, achieving 4.01 on the CALVIN ABC benchmark and surpassing baselines by an average of 57% while using 90% fewer FLOPs. v) Principal implication for AI practitioners: AI practitioners can leverage MoDE's architecture for more efficient and scalable Imitation Learning, reducing computational costs during training and inference of Diffusion Policies without sacrificing performance, particularly in multitask settings. |
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation (Read more on arXiv or HuggingFace) | Jiaming Sun, Songyou Peng, Jingxiao Chen, Sida Peng, Haotong Lin | Here is a concise summary of the research paper "Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation" following the specified guidelines: i) Summary: This paper introduces "Prompt Depth Anything," a novel paradigm for metric depth estimation that utilizes low-cost LiDAR data as a prompt to guide a depth foundation model, achieving accurate depth output at up to 4K resolution. ii) Main research question or objective: How to effectively prompt depth foundation models to achieve accurate metric depth estimation at high resolution. iii) Key methodology: A concise prompt fusion architecture is used to integrate LiDAR depth at multiple scales within the depth decoder, combined with a scalable data pipeline that includes synthetic LiDAR simulation and real data pseudo-GT depth generation, along with an edge-aware depth loss. iv) Primary results: The method achieves state-of-the-art results on ARKitScenes and ScanNet++ datasets, with a quantitative finding of 0.0132 L1 error on the ARKitScenes dataset at 384 x 512 resolution. v) Principal implication for AI practitioners: AI practitioners can leverage Prompt Depth Anything to enhance the accuracy and resolution of metric depth estimation in applications such as 3D reconstruction and robotic grasping by effectively integrating low-cost LiDAR prompts with depth foundation models. |
GUI Agents: A Survey (Read more on arXiv or HuggingFace) | Namyong Park, Gang Wu, Yu Wang, Jian Chen, dangmn | Here is a concise summary of the research paper "GUI Agents: A Survey": i) This survey provides a comprehensive overview of GUI agents powered by Large Foundation Models (LFMs) that automate human-computer interactions. ii) The main objective is to categorize and analyze existing GUI agent benchmarks, evaluation metrics, architectures, and training methods. iii) The key methodology used is a literature review, synthesizing various types of contributions within the field and proposing a unified framework based on GUI agents' perception, reasoning, planning, and acting capabilities. iv) The primary results include a structured analysis of datasets (e.g., Mind2Web contains 2000 diverse tasks) and environments for evaluating GUI agents across various platforms, along with architectural designs and training strategies. v) The principal implication for AI practitioners is the need for standardized benchmarks and evaluation metrics to systematically assess and advance the development of GUI agents. |
AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities (Read more on arXiv or HuggingFace) | Loic Landrieu, Clement Mallet, Nicolas Gonthier, Guillaume Astruc | Here is a concise summary of the research paper "AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities": i) AnySat is a novel self-supervised multimodal Earth observation (EO) model designed to handle heterogeneous data with varying resolutions, scales, and modalities. ii) The main research objective is to develop a single EO model capable of integrating diverse datasets for training and prediction without modality-specific adaptations. iii) The key methodology is a joint embedding predictive architecture (JEPA) with scale-adaptive spatial encoders, trained on a new multimodal dataset collection called GeoPlex. iv) The primary results show that AnySat achieves state-of-the-art or near state-of-the-art performance on multiple EO tasks; for instance, it achieved a 72.8 weighted F1 score on the TreeSatAI-TS classification task. v) For AI practitioners, AnySat offers a versatile pretrained model that can be fine-tuned or linearly probed for various downstream EO tasks, even with new combinations of modalities not seen during pretraining, simplifying the development of applications with diverse EO data. |
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment (Read more on arXiv or HuggingFace) | Yubo Chen, Pengfei Cao, Tianyi Men, Hongbang Yuan, Zhuoran Jin | Here is a concise 4-5 sentence summary of the paper: i) Summary: The paper introduces RAG-RewardBench, a benchmark for evaluating reward models (RMs) in retrieval-augmented generation (RAG) systems tailored to align with human preferences. ii) Research Question/Objective: How to evaluate and select a reliable reward model for preference alignment in RAG language models. iii) Methodology: The authors designed four RAG-specific scenarios (multi-hop reasoning, fine-grained citation, appropriate abstain, conflict robustness), incorporated 18 RAG subsets, six retrievers, and 24 RAG language models, and used an LLM-as-a-judge approach for preference annotation. iv) Results: Existing RMs are challenged by RAG-RewardBench, with the top-ranked RM, Skywork-Critic-Llama-3.1-70B, achieving only 78.3% accuracy. v) Implication: AI practitioners should prioritize developing specialized reward models tailored for RAG systems to improve the alignment of these models with human preferences, as existing reward models show limitations in RAG-specific scenarios. |
Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN (Read more on arXiv or HuggingFace) | Shiwei Liu, Lu Yin, Pengxiang Li | Here's a concise summary of the research paper "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN": i) Summary: This paper introduces Mix-LN, a novel normalization technique that combines Pre-Layer Normalization (Pre-LN) and Post-Layer Normalization (Post-LN) to improve the training and performance of deep layers in Large Language Models (LLMs). ii) Main research question/objective: The main research objective is to investigate whether the choice of layer normalization (Pre-LN vs. Post-LN) impacts the effectiveness of deeper layers in LLMs and to develop a method that addresses the limitations of both approaches. iii) Key methodology: The authors empirically evaluated layer effectiveness using angular distance and performance drop metrics across various model sizes (70M to 7B parameters) and compared Pre-LN, Post-LN, and the proposed Mix-LN, which applies Post-LN to earlier layers and Pre-LN to deeper layers (a minimal sketch of this placement rule follows this table). iv) Primary results: Mix-LN consistently outperformed both Pre-LN and Post-LN in pre-training; specifically, Mix-LN achieved a perplexity of 18.18 on the LLaMA-1B model, compared to 18.65 for Pre-LN. v) Principal implication for AI practitioners: AI practitioners can leverage Mix-LN to enhance the training of LLMs by ensuring more uniform gradient norms across all layers, leading to improved model capacity without increasing model size. |
Learning from Massive Human Videos for Universal Humanoid Pose Control (Read more on arXiv or HuggingFace) | Junjie Ye, Tianheng Shi, Siqi Song, Siheng Zhao, Jiageng Mao | Here's a concise summary of the AI research paper "Learning from Massive Human Videos for Universal Humanoid Pose Control": Summary: i) This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, and UH-1, a Transformer-based model for universal language-conditioned pose control of humanoid robots. ii) The main research objective is to investigate whether a universal humanoid pose control model can be trained using large-scale text-action pairs derived from massive human videos. iii) The key methodology involves curating Humanoid-X through data mining, video captioning, motion retargeting from humans to humanoids, and reinforcement learning, followed by training UH-1 to map text instructions to humanoid actions using a Transformer architecture. iv) The primary results show that UH-1 achieves state-of-the-art performance on the HumanoidML3D benchmark, with a Frechet Inception Distance (FID) score of 0.379. v) The principal implication for AI practitioners is that leveraging massive human video data and the proposed training pipeline can enable the development of highly generalizable and scalable humanoid control models, significantly advancing the deployment of adaptable humanoid robots in real-world applications. |
ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers (Read more on arXiv or HuggingFace) | Yupeng Shi, Zhi-Fan Wu, Wei Wang, Lianghua Huang, bibona | Here is a concise summary of the research paper "ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers": i) Summary: ChatDiT is a zero-shot, general-purpose, interactive visual generation framework that uses pretrained diffusion transformers to perform various visual tasks based on free-form natural language instructions, without any additional training. ii) Main research question or objective: The main objective was to develop a training-free framework leveraging the inherent in-context generation capabilities of pretrained diffusion transformers for interactive and general-purpose image generation. iii) Key methodology used: The methodology involved a multi-agent system with Instruction-Parsing, Strategy-Planning, and Execution Agents, using an in-context toolkit to perform actions with diffusion transformers. iv) Primary results: ChatDiT achieved a Top-1 performance score of 23.19 out of 100 on the IDEA-Bench, outperforming other models. v) Principal implication for AI practitioners: AI practitioners can leverage ChatDiT as a baseline for zero-shot task generalization in image generation, but should be aware of its limitations in handling long contexts and preserving fine-grained details, and work towards addressing these. |
VidTok: A Versatile and Open-Source Video Tokenizer (Read more on arXiv or HuggingFace) | Li Song, Xinle Cheng, Junliang Guo, Tianyu He, Anni Tang | Here is a concise summary of the paper "VidTok: A Versatile and Open-Source Video Tokenizer" adhering to the specified guidelines: Summary: i) The paper introduces VidTok, an open-source video tokenizer that achieves state-of-the-art performance in both continuous and discrete video tokenization. ii) The main research objective is to develop a versatile video tokenizer that outperforms existing methods in video reconstruction quality across various metrics. iii) The key methodology includes a novel model architecture with separate spatial and temporal sampling, the integration of Finite Scalar Quantization (FSQ) for discrete tokenization, and a two-stage training strategy. iv) In discrete tokenization, VidTok with FSQ (codebook size 262,144) achieves a PSNR of 29.82 on the MCL-JCV dataset, outperforming previous methods. v) For AI practitioners, VidTok offers an advanced tool for video generation and understanding tasks, providing improved video tokenization performance. |
CAD-Recode: Reverse Engineering CAD Code from Point Clouds (Read more on arXiv or HuggingFace) | Anis Kacem, Kseniya Cherenkova, Dimitrios Mallis, Elona Dupont, Danila Rukhovich | Here is a concise summary of the research paper "CAD-Recode: Reverse Engineering CAD Code from Point Clouds" based on your specific guidelines: i) CAD-Recode translates 3D point clouds into executable Python code to reconstruct CAD models. ii) The main research objective is to develop a method for reverse engineering CAD models from point clouds by leveraging the code generation capabilities of large language models (LLMs). iii) The key methodology involves fine-tuning a pre-trained LLM (Qwen2-1.5B) augmented with a point cloud projector to map input point clouds into Python code representations of CAD sketch-extrude sequences, utilizing a novel synthetic dataset of one million CAD models. iv) The primary results show that CAD-Recode achieves a 10 times lower mean Chamfer distance compared to state-of-the-art methods on the DeepCAD dataset. v) The principal implication for AI practitioners is that CAD-Recode offers a new approach to CAD model reconstruction, providing an effective way to generate editable and interpretable CAD models directly from point cloud data using LLMs, without the need for large, hand-crafted datasets. |
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge (Read more on arXiv or HuggingFace) | Shuai Zhao, Ruiwen Zhou, Yuxi Xie, Liangming Pan, Xiaobao Wu | Here is a concise summary of the research paper "AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge": i) Summary: This paper introduces AntiLeak-Bench, a framework for automatically constructing contamination-free benchmarks for evaluating large language models (LLMs) using updated real-world knowledge. ii) Main research question/objective: To develop a method for creating LLM evaluation benchmarks that are free from data contamination and can be easily updated without human labor. iii) Key methodology: The authors use Wikidata to identify knowledge updated after an LLM's cutoff time, construct question-answering samples based on this knowledge with supporting documents from Wikipedia, and automate the entire benchmark creation and update process. iv) Primary results: Evaluations on AntiLeak-Bench show most models score below 50 in Exact Match (EM), with only GPT-4o-mini and GPT-4o achieving EM scores around 70. v) Principal implication for AI practitioners: AI practitioners should use AntiLeak-Bench to obtain a more reliable assessment of LLMs' true capabilities, ensuring evaluations are not inflated by data contamination, especially when evaluating on knowledge-dependent tasks. |
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer (Read more on arXiv or HuggingFace) | Xuesong Yang, Yidan Zhang, Yifan Liu, Yipeng Zhang, guozonghao96 | Here is a concise summary of the research paper "LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer": i) Summary: The paper introduces LLaVA-UHD v2, a multimodal large language model (MLLM) that integrates a high-resolution feature pyramid via a hierarchical window transformer to enhance visual understanding. ii) Main research question/objective: The main objective is to address the limitation of vision transformers (ViTs) in capturing diverse visual granularity in MLLMs by constructing and integrating a high-resolution feature pyramid. iii) Key methodology: The key methodology involves a Hiwin transformer comprising an inverse feature pyramid constructed by a ViT-derived feature up-sampling process and a hierarchical window attention mechanism that condenses multi-level feature maps. iv) Primary results: LLaVA-UHD v2 achieved superior performance over existing MLLMs, demonstrating an average boost of 3.7% across 14 benchmarks compared with the baseline method. v) Principal implication for AI practitioners: AI practitioners can leverage the Hiwin transformer to develop MLLMs capable of handling tasks requiring diverse visual granularity, such as high-resolution image perception and visual grounding, with improved accuracy. |
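The Mix-LN entry above describes applying Post-LN to earlier transformer layers and Pre-LN to deeper ones. The sketch below illustrates that placement rule in PyTorch; the 25% cutoff, sublayer sizes, and block structure are illustrative assumptions rather than the authors' released configuration.

```python
# Hedged sketch of depth-dependent LayerNorm placement in the spirit of Mix-LN:
# Post-LN for the earliest layers, Pre-LN for the rest. Hyperparameters are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class MixLNBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, layer_idx: int,
                 n_layers: int, post_ln_fraction: float = 0.25):
        super().__init__()
        # Earlier layers use Post-LN; deeper layers use Pre-LN.
        self.use_post_ln = layer_idx < int(post_ln_fraction * n_layers)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_post_ln:   # residual add first, then normalize
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.mlp(x))
        else:                  # normalize first, then residual add
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.ln2(x))
        return x

blocks = nn.ModuleList(MixLNBlock(256, 4, i, n_layers=12) for i in range(12))
x = torch.randn(2, 16, 256)
for blk in blocks:
    x = blk(x)
print(x.shape)  # torch.Size([2, 16, 256])
```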
Title | Authors | Summary |
---|---|---|
Are Your LLMs Capable of Stable Reasoning? (Read more on arXiv or HuggingFace) | Linchen Xiao, Hongwei Liu, Junnan Liu, zsytony, Harold-lkk | Here's a concise summary of the research paper "Are Your LLMs Capable of Stable Reasoning?": i) Summary: This paper introduces G-Pass@k, a new metric to evaluate both the problem-solving ability and performance consistency of Large Language Models (LLMs), alongside a new benchmark, LiveMathBench, for assessing mathematical reasoning. ii) Main research question or objective: How can we assess both the peak performance and stability of LLMs in complex reasoning tasks, particularly in mathematical problem-solving? iii) Key methodology used: The authors propose G-Pass@k, which measures performance consistency across multiple sampling attempts (a small sketch of such an estimator follows this table), and LiveMathBench, a dynamic benchmark with contemporary mathematical problems. They evaluate various LLMs using these tools. iv) Primary results: The study found significant instability in LLM reasoning on challenging tasks, with performance drops exceeding 50% in many cases when evaluated using G-Pass@k. For instance, the Llama-3.1-8B-Instruct model's accuracy plummeted from 18.1% (Greedy) to 0.8% (G-Pass@16 at threshold τ = 1.0) on the LiveMathBench. v) Principal implication for AI practitioners: AI practitioners should use G-Pass@k to gain a more realistic assessment of LLM capabilities in complex reasoning, as it reveals that current evaluation metrics may overestimate actual performance consistency, highlighting the need for more stable models in real-world applications. |
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models (Read more on arXiv or HuggingFace) | Xiaoshuai Song, Zhuoma GongQue, Runqi Qiao, Shanglin Lei, YiFan Zhang | Here is a concise summary of the AI research paper "Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models" based on your guidelines: i) This paper introduces the Multi-Dimensional Insights (MDI) benchmark to evaluate the performance of large multimodal models (LMMs) on real-world personalization tasks across various scenarios, age groups, and problem complexities. ii) The main research objective is to assess whether LMMs can align with the diverse needs of humans in real-world scenarios and address the specific demands of distinct demographic groups. iii) The key methodology involves constructing a dataset of over 500 images and 1.2k human-posed questions spanning six common scenarios, stratified by three age groups and two levels of complexity, and evaluating several LMMs using this benchmark. iv) The primary result is that the strongest model tested, GPT-4o, achieved 79% accuracy on age-related tasks, but with noticeable gaps across different scenarios and complexities. v) The principal implication for AI practitioners is that current LMMs still have considerable room for improvement in addressing real-world applications, particularly in tailoring responses to diverse user needs, highlighting the need for continued development to enhance personalized AI assistant capabilities. |
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain (Read more on arXiv or HuggingFace) | Ji-Rong Wen, Zhicheng Dou, Jiejun Tan, ShootingWong | Here is a concise summary of the research paper "OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain": i) Summary: This paper introduces OmniEval, an automatic and multidimensional benchmark for evaluating Retrieval-Augmented Generation (RAG) models in the financial domain. ii) Main research question/objective: The main objective is to develop a comprehensive benchmark to evaluate the performance of RAG models on various financial topics and tasks. iii) Key methodology: The methodology involves a matrix-based RAG scenario evaluation system, multi-dimensional evaluation data generation using GPT-4 and human annotation, a multi-stage evaluation of retrieval and generation, and multi-dimensional evaluation metrics including rule-based and Large Language Model (LLM)-based ones. iv) Primary results: The automated data generation approach achieved an 87.47% acceptance ratio in human evaluations. v) Principal implication for AI practitioners: OmniEval provides a standardized framework for evaluating and improving RAG models in specialized domains like finance, using the benchmark's publicly available code. |
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers (Read more on arXiv or HuggingFace) | Pulkit Agrawal, Jeff Gore, Jinyeop Song, Seungwook Han | Here is a concise summary of the research paper: i) This paper introduces a concept encoding-decoding mechanism to explain how transformers perform in-context learning (ICL). ii) The main research question is how transformers form and use internal abstractions during ICL. iii) The key methodology involves analyzing the training dynamics of a small transformer on synthetic ICL tasks and evaluating concept encoding-decoding across pretrained models of varying scales using techniques like UMAP visualization, concept decodability, and mechanistic intervention. iv) The primary results are that transformers concurrently learn to map latent concepts into separable representations and develop context-specific decoding algorithms, with a positive correlation (R² = 0.781) between concept decodability and ICL performance observed in the POS tagging task using the Llama-3.1 8B model. v) The principal implication for AI practitioners is that enhancing the quality of concept encoding (e.g., through early layer finetuning) can directly improve the ICL performance of transformers. |
MIVE: New Design and Benchmark for Multi-Instance Video Editing (Read more on arXiv or HuggingFace) | Munchurl Kim, Jihyong Oh, Soo Ye Kim, Agus Gunawan, Samuel Teodoro | Here is a concise summary of the research paper "MIVE: New Design and Benchmark for Multi-Instance Video Editing" based on the provided guidelines: i) The paper introduces MIVE, a zero-shot mask-based framework for multi-instance video editing that disentangles edits and prevents editing leakage. ii) The main research objective is to develop a method for localized editing of multiple objects in videos without unintended changes to other parts of the video. iii) The key methodology uses Disentangled Multi-instance Sampling (DMS) to prevent editing leakage and Instance-centric Probability Redistribution (IPR) to ensure precise localization. iv) Primary results show that MIVE outperforms state-of-the-art methods in multi-instance video editing, achieving a Cross-Instance Accuracy (CIA) Score of 0.7100 in evaluations. v) For AI practitioners, MIVE provides a framework for performing precise, multi-instance video edits without requiring additional training, enabling more efficient and accurate video editing applications. |
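The G-Pass@k numbers quoted for "Are Your LLMs Capable of Stable Reasoning?" can be read as the probability that at least ⌈τ·k⌉ of k sampled generations are correct. The snippet below is a small sketch of such an estimator under the assumption of a hypergeometric formulation (n generations per problem, c of them correct); the paper's exact definition and normalization may differ.

```python
# Assumed hypergeometric formulation of a consistency-aware pass metric:
# the probability that at least ceil(tau * k) of k generations drawn without
# replacement from n samples (c of which are correct) are correct.
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """P(at least ceil(tau*k) correct among k draws from n samples with c correct)."""
    need = ceil(tau * k)
    return sum(comb(c, j) * comb(n - c, k - j)
               for j in range(need, min(c, k) + 1)) / comb(n, k)

# Example: 48 generations per problem, 30 of them correct.
print(round(g_pass_at_k(n=48, c=30, k=16, tau=0.5), 4))  # lenient threshold
print(round(g_pass_at_k(n=48, c=30, k=16, tau=1.0), 6))  # demands all 16 correct
```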
Title | Authors | Summary |
---|---|---|
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation (Read more on arXiv or HuggingFace) | douzc, Benen2024, wuyongkang, jinjiajie, lixiaoxi45 | Here is a concise summary of the research paper "RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation" based on the provided guidelines: i) Summary: RetroLLM is a unified framework that integrates retrieval and generation into a single process, enabling large language models (LLMs) to directly generate fine-grained evidence from a corpus during the generation process using constrained decoding. ii) Main Research Question/Objective: How to address the limitations of existing retrieval-augmented generation (RAG) methods, such as the need for separate retrievers, redundant input tokens, and the lack of joint optimization of retrieval and generation. iii) Key Methodology: The authors propose hierarchical FM-Index constraints and a forward-looking constrained decoding strategy to guide the LLM in generating corpus-constrained clues and relevant evidence. iv) Primary Results: RetroLLM outperforms RAG methods across both in-domain and out-of-domain tasks; for example, RetroLLM achieves an accuracy of 61.6% on the NQ dataset, compared to 52.4% for the Naive RAG method. v) Principal Implication for AI Practitioners: AI practitioners can leverage RetroLLM to develop more efficient and accurate RAG systems by eliminating the need for separate retrievers and enabling joint optimization of retrieval and generation, leading to improved performance in knowledge-intensive tasks. |
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models (Read more on arXiv or HuggingFace) | Yu Qiao, liuziwei7, Ziqi, shulin16, Fan-s | Here is a concise summary of the research paper "Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models": i) The paper introduces Evaluation Agent, a framework for efficiently evaluating visual generative models using dynamic, multi-round assessments tailored to user-specified criteria. ii) The main research objective is to develop an evaluation framework that overcomes the limitations of existing methods by efficiently assessing visual generative models' capabilities based on user needs and providing detailed, interpretable results. iii) The key methodology employs Large Language Model (LLM)-based agents in a two-stage process: a proposal stage for planning and prompt generation, and an execution stage for sampling and evaluating visual content using an extensible toolkit. iv) The primary result is that Evaluation Agent reduces evaluation time to 10% of traditional methods while achieving comparable accuracy to standard benchmarks like VBench and T2I-CompBench. v) The principal implication for AI practitioners is that they can leverage Evaluation Agent to conduct faster, more flexible, and user-specific evaluations of visual generative models, facilitating more targeted development and refinement. |
BrushEdit: All-In-One Image Inpainting and Editing (Read more on arXiv or HuggingFace) | yshan2u, ZyZcuhk, juxuan27, BianYx, Yw22 | Here is a concise summary of the BrushEdit research paper, strictly adhering to your guidelines: i) BrushEdit is a novel framework for inpainting-based, instruction-guided image editing that integrates multimodal large language models (MLLMs) and a dual-branch image inpainting model. ii) The main research objective is to develop a new image editing paradigm that overcomes challenges related to inference efficiency, scalable data curation, editability, and controllability in existing methods. iii) The key methodology involves a four-step process: editing category classification, primary editing object identification, acquisition of editing mask and target caption via MLLMs and detection models, and image inpainting using a dual-branch model (BrushNet). iv) Primary results demonstrate that BrushEdit achieves superior performance across seven metrics, including a PSNR score of 32.16 for background preservation in edited images, which is the best result compared to other methods. v) The principal implication for AI practitioners is that BrushEdit provides a user-friendly, free-form, multi-turn interactive framework for instruction-based image editing, enabling more precise control and superior editing quality without the need for extensive training. |
ColorFlow: Retrieval-Augmented Image Sequence Colorization (Read more on arXiv or HuggingFace) | Yong Liu, yshan2u, ZyZcuhk, juxuan27, JunhaoZhuang | Here is a concise summary of the research paper "ColorFlow: Retrieval-Augmented Image Sequence Colorization": i) The paper introduces ColorFlow, a novel three-stage diffusion-based framework for reference-based colorization of black-and-white image sequences that preserves object and character identity. ii) The main research objective is to develop a method for automatic image sequence colorization that maintains color consistency and identity preservation across frames, using a pool of color reference images. iii) The key methodology involves a three-stage pipeline: Retrieval-Augmented Pipeline (RAP) for extracting relevant color patches, In-context Colorization Pipeline (ICP) for performing colorization with a two-branch design using a self-attention mechanism, and Guided Super-Resolution Pipeline (GSRP) for upsampling to high-resolution images. iv) ColorFlow outperforms existing models across multiple metrics, achieving over 37% reduction in FID score compared to state-of-the-art colorization models. v) For AI practitioners, ColorFlow offers a robust framework for high-quality, reference-based image sequence colorization, setting a new standard with the potential for direct industrial application in fields such as manga and animation production. |
Byte Latent Transformer: Patches Scale Better Than Tokens (Read more on arXiv or HuggingFace) | spermwhale, Chunting, marg33, benjamin-mlr, artidoro | Here's a concise summary of the AI research paper "Byte Latent Transformer: Patches Scale Better Than Tokens": i) Summary: This paper introduces the Byte Latent Transformer (BLT), a new byte-level language model architecture that dynamically groups bytes into patches to improve efficiency and robustness compared to tokenization-based models. ii) Main research question/objective: How can a byte-level language model be designed to match the performance of tokenization-based models at scale while improving inference efficiency and robustness? iii) Key methodology: BLT uses a dynamic, learnable method for grouping bytes into patches based on next-byte entropy and a new model architecture that mixes byte and patch information processed by local and global transformer blocks. iv) Primary results: BLT models match training FLOP-controlled performance of Llama 3 up to 8B parameters and achieve up to 50% inference FLOP savings; a BLT-Entropy model outperforms the Llama 3 tokenizer-based model on 4 out of 7 tasks while trained on the same amount of data. v) Principal implication for AI practitioners: BLT demonstrates that dynamically allocating compute based on input complexity via patching can lead to more efficient and robust language models, offering a viable alternative to tokenization-based models. |
Causal Diffusion Transformers for Generative Modeling (Read more on arXiv or HuggingFace) | Haoqi Fan, Shi Guan, Deyao Zh, Chaorui Deng, Andy1621 | Here's a concise summary of the research paper "Causal Diffusion Transformers for Generative Modeling": i) Summary: This paper introduces CausalFusion, a decoder-only transformer that unifies autoregressive (AR) and diffusion models for generative modeling by factorizing data across both sequential tokens and diffusion noise levels. ii) Main research question or objective: How can sequential factorization be introduced to a diffusion model to improve its performance and enable a smooth transition between AR and diffusion generation modes? iii) Key methodology: The authors propose a dual-factorization approach in a decoder-only transformer that processes data across sequential tokens and diffusion noise levels, with adjustable AR and diffusion steps, and introduce a generalized causal attention mechanism. iv) Primary results: CausalFusion achieves state-of-the-art results on the ImageNet class-conditional generation benchmark; for instance, CausalFusion-XL achieves a FID-50k score of 1.77 on 256x256 images with classifier-free guidance. v) Principal implication for AI practitioners: AI practitioners can leverage CausalFusion as a powerful and versatile generative modeling framework that combines the strengths of AR and diffusion models, offering improved performance and flexibility for tasks like image generation, multimodal modeling, and zero-shot image manipulation. |
Smaller Language Models Are Better Instruction Evolvers (Read more on arXiv or HuggingFace) | Hua Zhou, Yaqi Zhang, Lulu Zhao, dongguanting, Chaox72 | Here is a concise summary of the research paper "Smaller Language Models Are Better Instruction Evolvers": i) Summary: This study investigates the efficacy of smaller language models (SLMs) in evolving instructions for large language models (LLMs) compared to larger models, challenging the notion that larger models inherently possess superior instruction evolution capabilities. ii) Main research question/objective: Do SLMs outperform LLMs in evolving instructions, and if so, why? iii) Key methodology: The authors conducted experiments across three instruction evolution scenarios (Evol-Instruct, AutoIF, and Auto Evol-Instruct) using SLMs and LLMs from the Llama-3 and Qwen-2 families and evaluated performance on various benchmarks, including IFEval and FollowBench. iv) Primary results: SLMs can synthesize more effective and diverse instructions than LLMs; specifically, on the FollowBench benchmark, SLM-evolved instructions (SLM-INST) achieved nearly a 10% improvement over Llama-3-8B and Llama-3.1-8B when supervised by Llama-3.1-70B-Instruct. v) Principal implication for AI practitioners: AI practitioners can leverage SLMs to generate more complex and diverse instructions for instruction tuning, potentially leading to more capable LLMs while using fewer computational resources. |
IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations (Read more on arXiv or HuggingFace) | Jiaqiwang, Dubhe-zmc, jingtan, tongwu2020, lizb6626 | Here is a concise summary of the research paper "IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations": i) Summary: IDArb is a diffusion-based model for intrinsic decomposition of an arbitrary number of images under varying illuminations, achieving multi-view consistency and disentangling intrinsic components from lighting effects. ii) Main research question or objective: The main objective is to develop a model that can perform accurate and multi-view consistent intrinsic decomposition (surface normals, albedo, roughness, metallic) on an arbitrary number of images captured under varying, unconstrained illuminations. iii) Key methodology used: The proposed method, IDArb, utilizes a diffusion-based model with a cross-view, cross-component attention module and an illumination-augmented, view-adaptive training strategy, trained on a new dataset (ARB-Objaverse) containing 5.7M multi-view RGB images. iv) Primary results: IDArb outperforms state-of-the-art methods in intrinsic decomposition, achieving a PSNR of 33.62 for albedo estimation in multi-view settings. v) Principal implication for AI practitioners: IDArb provides a unified solution for inverse rendering across different input regimes, offering AI practitioners a robust method for generating accurate intrinsic components from arbitrary image sets, directly applicable in tasks like relighting, photometric stereo, and 3D reconstruction. |
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models (Read more on arXiv or HuggingFace) | howang, yuxiaod, lrxl, wangcunxiang, CCCCCC | Here's a summary of the paper "SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models" following your guidelines: i) Summary: This paper introduces SPaR, a self-play framework that uses tree-search refinement to improve instruction-following in large language models (LLMs) by creating better preference pairs. ii) Main research question/objective: How to improve the instruction-following capabilities of LLMs using a self-play framework that addresses limitations of existing preference learning methods. iii) Key methodology: SPaR employs a self-play framework where an LLM acts as both an actor and a refiner, using a tree-search algorithm to refine responses and generate valid preference pairs for training. iv) Primary results: After three iterations, SPaR improved a LLaMA3-8B-Instruct model to surpass GPT-4-Turbo on the IFEval benchmark, achieving an average accuracy of 81.8. v) Principal implication for AI practitioners: AI practitioners can use SPaR to enhance the instruction-following abilities of LLMs without relying on external models, enabling the development of more accurate and reliable AI systems. |
Wonderland: Navigating 3D Scenes from a Single Image (Read more on arXiv or HuggingFace) | Hanwen Liang, ZanyRumata, guochengqian, vidit98, jlcao2 | Here is a concise summary of the research paper "Wonderland: Navigating 3D Scenes from a Single Image": i) Wonderland is a novel framework for efficiently generating high-quality, wide-scope 3D scenes from a single image using a feed-forward reconstruction model operating on the latent space of a video diffusion model. ii) Main research question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? iii) Key methodology: A large-scale reconstruction model uses latents from a camera-guided video diffusion model to predict 3D Gaussian Splattings in a feed-forward manner, with a dual-branch camera conditioning module for precise pose control and a progressive training strategy. iv) Primary results: The method significantly outperforms existing methods for single-view 3D scene generation, achieving a FID score of 16.16 on the RealEstate10K dataset, compared to 20.89 for the next best method, ViewCrafter. v) Principal implication for AI practitioners: Wonderland demonstrates that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation, providing a novel and effective approach to single image 3D scene generation. |
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs (Read more on arXiv or HuggingFace) | junweiliang, StarYDY, zhifeichen097, spongy, Xxlbigbrother | Here is a concise summary of the research paper "GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs": i) Summary: This paper introduces GaussianProperty, a training-free framework that leverages Large Multimodal Models (LMMs) to assign physical properties to 3D Gaussian representations for applications in physics-based simulation and robotic grasping. ii) Main research question/objective: The main objective is to develop a method for accurately estimating and integrating physical properties of materials into 3D Gaussian representations from multi-view 2D images. iii) Key methodology: The methodology combines global-local physical property reasoning using Segment Anything (SAM) for image segmentation and GPT-4V for property recognition, followed by a multi-view projection and voting strategy to assign properties to 3D Gaussians. iv) Primary results: The proposed method achieved a material segmentation mean Intersection over Union (mIoU) of 55.83% on the ABO dataset, demonstrating the effective integration of physical properties into 3D Gaussian representations. v) Principal implication for AI practitioners: AI practitioners can leverage this method to enhance 3D models with physical properties without the need for manual annotation, enabling more realistic physics-based simulations and improved robotic grasping strategies directly from visual data. |
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator (Read more on arXiv or HuggingFace) | Xiaozhe Ren, Yihang Gao, Jiawei Li, Guoxuan Chen, shihan96 | Here is a concise summary of the research paper "SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator": i) Summary: This paper introduces SepLLM, a novel framework that accelerates large language models (LLMs) by compressing segments of text into separator tokens within a sparse attention mechanism. ii) Main research question/objective: The main objective is to accelerate LLM inference and training by addressing the quadratic complexity of self-attention through a data-dependent sparse attention mechanism. iii) Key methodology: The key methodology involves identifying and leveraging the disproportionate attention scores of separator tokens to condense segment information, implementing a sparse attention mechanism that retains only initial, neighboring, and separator tokens, and utilizing efficient kernels for training acceleration. iv) Primary results: SepLLM achieves over 50% reduction in KV cache usage on the GSM8K-CoT benchmark using the Llama-3-8B backbone while maintaining comparable performance to the original model. v) Principal implication for AI practitioners: AI practitioners can leverage SepLLM as a plug-and-play framework to accelerate the inference and training of LLMs, particularly in streaming settings with long sequences, without significant loss of performance, by strategically managing and compressing the KV cache. (A toy sketch of the separator-keeping attention mask appears below this table.) |
Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture (Read more on arXiv or HuggingFace) | wubingheng, JingzeShi | Here is a concise summary of the paper "Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture": i) The paper introduces "Wonderful Matrices," a novel foundation model architecture that integrates sequence and state transformations to enhance efficiency and effectiveness. ii) The main research objective is to develop a foundation model architecture that combines the strengths of State Space Duality and Quadratic Causal Self-Attention algorithms while mitigating their respective limitations. iii) The key methodology involves unifying position encoding with Rotary Position Embedding, introducing Dynamic Mask Attention for selective information filtering, and designing Cross Domain Mixture of Experts for efficient parameter utilization. iv) Primary results show that Dynamic Mask Attention maintains 100% accuracy in the multi-query associative recall task, outperforming Quadratic Causal Self-Attention and State Space Duality. v) The principal implication for AI practitioners is that Wonderful Matrices provides a more efficient and effective architecture for language modeling, as demonstrated by improved performance on benchmark tasks. |
StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors (Read more on arXiv or HuggingFace) | Jian Yang, Zeyu Cai, yingtai, JesseZhang, XiaokunSun | Here is a concise summary of the research paper "StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors": i) StrandHead is a novel framework that generates 3D head avatars with strand-disentangled hair from text descriptions without using 3D hair data for supervision. ii) The main research objective is to develop a method for generating realistic 3D head avatars with detailed, strand-based hair directly from text prompts. iii) The key methodology involves distilling 2D generative diffusion models, using a differentiable prismatization algorithm to convert hair strands into meshes, and applying orientation consistency and curvature regularization losses based on hair geometric priors. iv) Primary results show that StrandHead outperforms state-of-the-art methods in head and hair generation; for example, it achieved a 58.00% Text-Image Alignment Preference (TAP) score in head generation tasks. v) The principal implication for AI practitioners is that StrandHead provides a new, effective way to generate high-fidelity 3D head avatars with realistic hair from text descriptions, which can be directly integrated into existing simulation and rendering systems without requiring 3D hair data. |
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes (Read more on arXiv or HuggingFace) | YuLiu, BuzzBeater, JunfengNi, YixinChen, JasonAplp | Here is a concise summary of the research paper "MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes": i) Summary: This paper introduces MOVIS, a novel method designed to improve the structural awareness and cross-view consistency of diffusion-based novel view synthesis (NVS) models for multi-object indoor scenes. ii) Main research question or objective: How can the structural awareness of current diffusion-based novel view synthesizers be enhanced to improve cross-view consistency in multi-object scenarios? iii) Key methodology: MOVIS incorporates structure-aware features (depth and object mask) as inputs, employs an auxiliary novel view mask prediction task, and utilizes a structure-guided timestep sampling scheduler during training. iv) Primary results: MOVIS outperforms existing methods on multi-object NVS tasks, demonstrating superior object placement, geometry, and appearance recovery; quantitatively, MOVIS achieves a PSNR of 17.432 on the C3DFS test set, compared to 14.811 for the next best method, Zero-1-to-3+. v) Principal implication for AI practitioners: MOVIS provides AI practitioners with a method to generate more consistent and realistic novel views in complex multi-object scenes by enhancing the structural awareness of diffusion models, making them more viable for real-world applications like AR/VR and robotics. |
Whisper-GPT: A Hybrid Representation Audio Large Language Model (Read more on arXiv or HuggingFace) | prateekv | Here is a concise summary of the research paper "WHISPER-GPT: A Hybrid Representation Audio Large Language Model": i) Summary: This paper introduces WHISPER-GPT, a generative large language model (LLM) for speech and music that combines continuous audio representations (mel-spectrogram) with discrete acoustic tokens (ENCODEC) in a hybrid architecture. ii) Main research question or objective: Can an architecture that simultaneously utilizes continuous and discrete representations in the LLM setup improve next token prediction compared to a token-based LLM for speech and music? iii) Key methodology used: The authors adapted a Whisper-like encoder-decoder architecture to a seq-to-seq model for generative modeling, replacing the Whisper encoder with a decoder and performing early fusion of learned representations with a decoder-only architecture on acoustic tokens. They also employed a Transformer decoder-only architecture trained on the LibriSpeech TTS dataset and a dataset of instrumental music to predict the next coarse acoustic token. iv) Primary results: The hybrid model outperformed a purely token-based GPT model in next token prediction. Specifically, for the music dataset, the hybrid model achieved a negative log-likelihood (NLL) of 2.52 compared to 2.78 for the baseline GPT-S model. v) Principal implication for AI practitioners: AI/ML/Software Engineers and Data Scientists can leverage this hybrid input representation approach to achieve better performance in generative audio models, potentially enabling smaller, more efficient models with performance comparable to larger, purely token-based models. |
TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning (Read more on arXiv or HuggingFace) | Yihuai Gao, Aaditya Prasad, Robert Holmberg, William Chong, jimmyyhwu | Here is a concise summary of the research paper "TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning": i) Summary: This paper introduces TidyBot++, an open-source holonomic mobile manipulator designed for robot learning, featuring a powered-caster mobile base and a mobile phone teleoperation interface. ii) Main research question/objective: The main objective is to develop an inexpensive, robust, and flexible holonomic mobile manipulator to facilitate the collection of large-scale demonstration data for mobile manipulation tasks. iii) Key methodology: The key methodology involves designing a holonomic base using powered casters, developing a mobile phone teleoperation interface using the WebXR API, and training diffusion policies with collected demonstration data. iv) Primary results: The researchers successfully trained policies for six household tasks, with the open fridge task achieving a 10/10 success rate in policy rollouts. v) Principal implication for AI practitioners: This open-source design and teleoperation interface can enable AI practitioners to easily collect mobile manipulation data and develop policies for real-world applications, significantly lowering the barrier to entry for mobile manipulation research. |
Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning (Read more on arXiv or HuggingFace) | Aleksandr Beznosikov, Philip Zmushko, pichuginad, Andron00e | Here is a concise summary of the research paper "Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning": i) This paper investigates data protection in Vertical Federated Learning (VFL) against feature reconstruction attacks, focusing on the impact of model architecture. ii) The main research objective is to determine whether Multi-Layer Perceptron (MLP)-based models are more resistant to feature reconstruction attacks than Convolutional Neural Network (CNN)-based models in VFL. iii) The key methodology involves theoretical analysis of orthogonal transformations on data and weights in VFL, and empirical evaluation of state-of-the-art Model Inversion and Feature-space Hijacking attacks on various datasets using MLP and CNN architectures. iv) The primary results show that MLP-based models, unlike CNN-based models, are resistant to UnSplit and Feature-space Hijacking attacks; for instance, the Feature-space Hijacking attack on MNIST with a CNN-based model achieved a reconstruction error of 0.25, while on an MLP-based model, the error was 0.8. v) The principal implication for AI practitioners is that using MLP architectures in VFL can enhance data protection against feature reconstruction attacks without requiring additional defense mechanisms, although they might provide less utility compared to CNNs on image datasets. (A small numpy sketch of the orthogonal-transformation argument appears below this table.) |
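The SepLLM entry above describes a sparse attention pattern that keeps only initial, neighboring, and separator tokens. The snippet below is a minimal, self-contained sketch of what such a causal mask could look like; the function name, window sizes, and choice of separator ids are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a SepLLM-style sparse causal mask:
# each query attends only to (a) the first few "initial" tokens, (b) a local
# window of recent neighbours, and (c) all earlier separator tokens.
import torch

def sepllm_style_mask(token_ids, separator_ids, n_init=4, n_local=64):
    """Return a [T, T] boolean mask; True means 'may attend'."""
    T = token_ids.shape[0]
    idx = torch.arange(T)
    causal = idx[None, :] <= idx[:, None]                    # only positions j <= i
    initial = idx[None, :] < n_init                          # first n_init tokens
    local = (idx[:, None] - idx[None, :]) < n_local          # recent neighbours
    is_sep = torch.isin(token_ids, separator_ids)[None, :]   # separator columns
    return causal & (initial | local | is_sep)

# Toy usage: treat token ids 11 and 13 (e.g. ',' and '.') as separators.
ids = torch.randint(0, 100, (256,))
mask = sepllm_style_mask(ids, separator_ids=torch.tensor([11, 13]), n_init=4, n_local=16)
print(mask.shape, mask.float().mean())  # fraction of attention positions kept
```

The mean of the mask gives a rough sense of how much of the full quadratic attention (and hence KV cache) such a pattern would retain.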
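The vertical federated learning entry above argues that a simple orthogonal transformation protects client features without hurting an MLP's utility. Below is a small numpy sketch of that invariance argument under a toy setup of my own (random weights, ReLU MLP): rotating the features with an orthogonal matrix Q and compensating in the first layer leaves the model's output unchanged, while the raw features are never exposed.

```python
# Minimal numpy sketch (not the paper's code): if the client sends Q @ x instead
# of x and the first MLP layer uses W1 @ Q.T instead of W1, every downstream
# activation is identical, since (W1 @ Q.T) @ (Q @ x) = W1 @ x for orthogonal Q.
import numpy as np

rng = np.random.default_rng(0)
d, h = 16, 32
x = rng.normal(size=d)                        # client's raw features
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(1, h)), rng.normal(size=1)

def mlp(z, W_first):
    hidden = np.maximum(W_first @ z + b1, 0.0)  # ReLU hidden layer
    return W2 @ hidden + b2

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # random orthogonal matrix
out_plain = mlp(x, W1)
out_protected = mlp(Q @ x, W1 @ Q.T)            # transformed data + adjusted weights
print(np.allclose(out_plain, out_protected))    # True: utility is preserved
```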
Title | Authors | Summary |
---|---|---|
GenEx: Generating an Explorable World (Read more on arXiv or HuggingFace) | danyaljj, jiahaoplus, lambertxiao, tshu, TaiMingLu | Here is a concise summary of the research paper "GenEx: Generating an Explorable World": 1. Summary: GenEx is a system that generates explorable, 3D-consistent virtual worlds from a single RGB image, enabling embodied AI agents to navigate and interact within these generated environments. 2. Main research question/objective: How can an agent make more informed decisions through exploration in a generative 360° world? 3. Key methodology: GenEx employs a physics-based data engine to create panoramic video streams representing 360° environments, uses GPT-assisted agents for exploration, and implements an imagination-augmented policy for decision-making. 4. Primary results: GenEx achieves high-quality world generation, with its earlier version demonstrating a PSNR of 30.2 and SSIM of 0.94 in video quality metrics. 5. Principal implication for AI practitioners: GenEx provides a platform for AI practitioners to develop and evaluate embodied AI agents in realistic, dynamically generated environments, enabling advancements in areas such as navigation, interactive gaming, and VR/AR. |
Apollo: An Exploration of Video Understanding in Large Multimodal Models (Read more on arXiv or HuggingFace) | minione, lichengyu, YannDubs, nicholswang, orrzohar | This paper explores design choices impacting video understanding in Large Multimodal Models (LMMs). The research investigates how various architectural and training decisions affect video-LMM performance. A combination of controlled experiments on smaller models (demonstrating "Scaling Consistency") and large-scale training was used, leading to the development of the Apollo family of models. Apollo-3B achieved a score of 68.7 on the MLVU benchmark, outperforming most existing 7B models. This work suggests AI practitioners can leverage Scaling Consistency to perform efficient experimentation on smaller models before scaling up, thereby saving computational resources and accelerating the development of high-performing video-LMMs. |
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities (Read more on arXiv or HuggingFace) | Saeed Yahya Alseiari, Mohammed Irfan Kurpath, hishamcholakkal, HuggingSara, sahalshajim | Here is a concise summary of the research paper "BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities": i) Summary: BiMediX2 is a bilingual Arabic-English Large Multimodal Model (LMM) designed for advanced medical image understanding and text-based interactions, leveraging the Llama3.1 architecture. ii) Main research question or objective: To develop a unified bilingual (Arabic-English) multimodal AI model that excels in both medical image understanding and text-based medical tasks. iii) Key methodology used: The model was trained on a 1.6M sample bilingual healthcare dataset, utilizing a Vision Encoder, a Projector for image-text alignment, and LoRA adapters for fine-tuning the Llama 3.1 language model. iv) Primary results: BiMediX2 achieved state-of-the-art performance on several medical benchmarks, outperforming GPT-4 by over 9% in UPHILL factual accuracy evaluations. v) Principal implication for AI practitioners: AI practitioners can leverage BiMediX2's unified architecture and training methodology to develop advanced, multilingual medical AI systems capable of handling diverse modalities and achieving high accuracy in both image and text-based tasks without compromising the advanced text-based medical understanding of LLMs. |
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (Read more on arXiv or HuggingFace) | BradyFU, zhenheny, SherryX, nankepan, AnonMegumi | Here is a concise summary of the paper "InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption": i) This paper introduces InstanceCap, a novel instance-aware structured captioning framework for text-to-video generation, enhancing video fidelity and consistency. ii) The main research objective is to develop a method for generating detailed, instance-level video captions that improve the accuracy and fidelity of text-to-video generation models. iii) The key methodology involves an Auxiliary Models Cluster (AMC) to isolate video instances and an improved Chain-of-Thought (CoT) process with Multimodal Large Language Models (MLLMs) to refine dense prompts into structured phrases. iv) Primary results show that InstanceCap significantly outperforms previous models, with finetuned models achieving a 37.88% average score in the paper's quantitative evaluation (Table 2). v) For AI practitioners, InstanceCap provides a method to enhance the fidelity of text-to-video models by utilizing detailed, structured captions, enabling the generation of videos with accurate instance details and motion actions. |
Large Action Models: From Inception to Implementation (Read more on arXiv or HuggingFace) | Eliblo1969, substill, shilhe, Lujunting, vyokky | This paper introduces Large Action Models (LAMs), designed to perform actions in digital and physical environments. The objective is to develop a framework for creating LAMs, transitioning from Large Language Models (LLMs) limited to textual output, focusing on action generation and execution within dynamic environments. A four-phase training approach is employed, encompassing task-plan pretraining, expert imitation, self-boosting exploration, and reward model-based optimization, using a Windows OS-based GUI agent as a case study. The developed LAM achieved a Task Success Rate (TSR) of 81.2% in offline evaluation on Word tasks, surpassing the 67.2% TSR of GPT-4o. This demonstrates the effectiveness of specialized training for action-oriented tasks and provides a practical workflow for AI practitioners developing agents capable of interacting with and manipulating real-world environments through actions rather than just text. |
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion (Read more on arXiv or HuggingFace) | JacobYuan, Ruihang, weilllllls, StevenZhang, MoonQiu | Here is a concise summary of the research paper "FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion": i) Summary: This paper introduces FreeScale, a tuning-free inference paradigm that enhances the resolution of pre-trained diffusion models for image and video generation via scale fusion. ii) Main Research Objective: The main research objective is to enable pre-trained diffusion models to generate high-fidelity, high-resolution visual content without requiring additional training or fine-tuning. iii) Key Methodology: FreeScale employs tailored self-cascade upscaling, restrained dilated convolution, and scale fusion, which processes and fuses information from different receptive scales by extracting desired frequency components within the self-attention layers. iv) Primary Results: FreeScale successfully generates 8K-resolution images and outperforms existing methods; for example, when generating 4096x4096 images, it achieves a FID score of 49.796, compared to 72.378 for DemoFusion. v) Principal Implication: AI practitioners can use FreeScale to extend the capabilities of existing diffusion models to generate higher-resolution images and videos without the need for model retraining, offering a practical solution for high-resolution visual content creation. (A toy sketch of the low/high-frequency fusion idea appears below this table.) |
ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation (Read more on arXiv or HuggingFace) | Dana Berman, Matan Cohen, Asaf Shul, yedid, danielwinter | Here is a concise summary of the research paper "ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation": i) Summary: This paper introduces ObjectMate, a tuning-free method for photorealistic object insertion and subject-driven generation using a recurrence prior over large unlabeled datasets. ii) Main research question/objective: How to achieve photorealistic object composition into a scene while preserving the object's identity without requiring test-time tuning. iii) Key methodology: ObjectMate leverages a recurrence prior to create a supervised dataset from mass-produced objects across multiple images, then trains a text-to-image diffusion architecture to map object and scene descriptions to a composited image. iv) Primary results: ObjectMate demonstrates superior identity preservation and photorealistic composition compared to state-of-the-art methods in both object insertion and subject-driven generation; users preferred ObjectMate's composition over ObjectDrop's 76% of the time. v) Principal implication for AI practitioners: AI practitioners can use the recurrence prior, which exploits the natural repetition of objects in large-scale datasets, to build more powerful and efficient models for object insertion and subject-driven generation, without the need for test-time fine-tuning or manual data collection. |
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing (Read more on arXiv or HuggingFace) | Fan Tang, Changwang Mei, duke1852022, MagicBag, yingying87 | Here is a concise summary of the research paper "FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing": i) This paper introduces FireFlow, a novel zero-shot method for fast inversion and semantic editing of images using Rectified Flow (ReFlow) models. ii) Main research question/objective: How to achieve accurate and efficient inversion and editing in ReFlow-based generative models, specifically within 8 steps. iii) Key methodology: A new numerical solver is proposed that achieves second-order precision while maintaining the computational cost of a first-order Euler method by reusing intermediate velocity approximations. iv) Primary results: FireFlow achieves a 3x runtime speedup compared to state-of-the-art ReFlow inversion techniques, with a reconstruction error of 0.1579 in the proposed method compared to 0.2926 for the next best performing method (RF-Solver). v) Principal implication for AI practitioners: AI practitioners can leverage FireFlow for faster and more accurate image inversion and editing using ReFlow models, enabling more efficient development of applications requiring fine-grained control over image generation. (A toy sketch of the velocity-reuse integrator appears below this table.) |
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation (Read more on arXiv or HuggingFace) | morninghaze, baochenxi, wzk1015, JackyZhuo, wbs2788 | Here is a concise summary of the research paper "Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation": i) Summary: This paper introduces VMB, a novel multimodal music generation framework that utilizes text and music as explicit bridges for aligning and generating music from various input modalities. ii) Main research question/objective: The main objective is to address challenges in multimodal music generation such as data scarcity, weak cross-modal alignment, and limited controllability. iii) Key methodology: The key methodology involves a Multimodal Music Description Model to create text bridges, a Dual-track Music Retrieval module to provide music bridges, and an Explicitly Conditioned Music Generation framework based on a diffusion transformer. iv) Primary results: VMB achieved a KLpasst score of 48.84 on the SymMV dataset for video-to-music generation, outperforming existing methods. v) Principal implication for AI practitioners: AI practitioners can leverage VMB's explicit text and music bridges to improve the quality, alignment, and controllability of multimodal music generation models, which could be applied in areas like automatic video soundtrack creation. |
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding (Read more on arXiv or HuggingFace) | wzk1015, Einsiedler, hehesang, Changyao, cpsxhao | Here is a concise summary of the research paper "SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding": i) SynerGen-VL is an encoder-free Multimodal Large Language Model (MLLM) that integrates image understanding and generation capabilities using vision experts and token folding. ii) The main research objective is to develop a unified MLLM that simplifies the model architecture and training pipeline while effectively supporting high-resolution image understanding and generation. iii) Key methodologies include a token folding mechanism to reduce visual token sequence length, a vision-expert-based progressive alignment pretraining strategy, and a unified next-token prediction objective for both image understanding and generation. iv) Primary results show that SynerGen-VL achieves competitive performance; for instance, with only 2.4B activated parameters, it achieves a Multi-Modal Massive Multitask Understanding (MMMU) score of 34.2, comparable to existing encoder-free unified MLLMs with larger parameter sizes. v) For AI practitioners, SynerGen-VL offers a simplified and scalable approach to building unified MLLMs, potentially streamlining development by eliminating the need for separate encoders or complex training objectives for image understanding and generation tasks. |
SCBench: A KV Cache-Centric Analysis of Long-Context Methods (Read more on arXiv or HuggingFace) | Chengruidong, luoxufang, qianhuiwu, iofu728, liyucheng | SCBench benchmarks long-context language models (LLMs) focusing on KV cache usage. The research investigates the performance of long-context methods in scenarios involving KV cache reuse, like multi-turn dialogue. A comprehensive benchmark comprising 12 tasks across four long-context abilities (string retrieval, semantic retrieval, global information processing, and multi-tasking) was created. MInference, a dynamic sparse attention method, shows superior performance in shared context and multi-turn scenarios, particularly in retrieval tasks, achieving up to 51.2% accuracy. AI practitioners can leverage these insights to choose efficient long-context methods based on task needs, especially in dynamic conversational applications, focusing on strategies that maintain or dynamically compress the KV cache for optimal performance. (A toy sketch of multi-turn KV-cache reuse appears below this table.) |
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers (Read more on arXiv or HuggingFace) | Pinar Yanardag, Kavana Venkatesh, ydalva | Here is a concise summary of the research paper "FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers": i) Summary: The paper introduces FluxSpace, a novel method for performing disentangled semantic editing on images generated by rectified flow transformers. ii) Main research question/objective: To develop a domain-agnostic image editing method that allows for precise, attribute-specific modifications without affecting unrelated aspects of the image in rectified flow models. iii) Key methodology: FluxSpace leverages the attention layer outputs within the joint transformer blocks of rectified flow models to create a semantically interpretable representation space, enabling linear editing operations for both fine-grained and coarse-level image modifications. iv) Primary results: FluxSpace achieves disentangled image editing, outperforming existing methods in quantitative evaluations; for instance, it achieved a CLIP-I score of 0.9417 for eyeglass editing, indicating high content preservation. v) Principal implication for AI practitioners: AI practitioners can utilize FluxSpace for precise and disentangled semantic editing of images generated by rectified flow transformers without additional training, offering enhanced control and efficiency in image generation and manipulation tasks. |
SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs (Read more on arXiv or HuggingFace) | SultanR | Here is a concise summary of the paper "SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs": i) The paper introduces SmolTulu, a 1.7B parameter instruction-tuned language model that achieves state-of-the-art performance among sub-2B parameter models by adapting the Tulu 3 post-training pipeline. ii) The main research question is how the relationship between learning rate and batch size impacts the performance of small language models (SLMs) during supervised finetuning across different types of tasks. iii) The key methodology involved empirical analysis using a 135M parameter model and a 1.7B parameter model, with ablations of learning rate and batch size during supervised finetuning and direct preference optimization. iv) The primary result is that higher learning rate to batch size ratios improved performance on reasoning tasks, with SmolTulu-DPO-1130 achieving 67.7% on IFEval. v) The principal implication for AI practitioners is that optimal learning rate to batch size ratios for SLMs may differ significantly from larger models and are task-dependent, necessitating careful tuning for optimal performance in different applications. |
Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images (Read more on arXiv or HuggingFace) | Ilker Hacihaliloglu, Leonid Sigal, Clayton Allard, moein99, yasimed | Here is a summary of the research paper "Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images": i) The paper introduces Prompt2Perturb (P2P), a novel method for generating text-guided adversarial attacks on breast ultrasound images using diffusion models without retraining. ii) Main research question/objective: How can adversarial examples be generated for breast ultrasound images using text prompts, bypassing the need for retraining diffusion models and ensuring clinical relevance? iii) Key methodology: P2P leverages learnable prompts within a frozen text encoder to directly update text embeddings, optimizing only the early reverse diffusion steps to create subtle yet impactful perturbations guided by text instructions. iv) Primary results: P2P achieved a 98% attack success rate on the DenseNet121 model using the BUSI dataset, while maintaining low LPIPS (0.13) and FID (45.84) scores, indicating high visual quality and stealthiness. v) Principal implication for AI practitioners: AI practitioners can use P2P to generate effective and stealthy adversarial attacks on medical imaging models using only text prompts, highlighting potential vulnerabilities in these systems without requiring extensive data or model retraining. |
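For the FreeScale entry above, here is a deliberately simplified torch sketch of the scale-fusion intuition: keep the low-frequency layout from one branch and splice in the high-frequency detail from another. The box-blur low-pass filter and the feature shapes are my stand-ins for illustration; the paper performs this split on representations inside the self-attention layers of a diffusion model.

```python
# Minimal torch sketch (my simplification, not the authors' implementation) of
# fusing two scales: low frequencies from a global/structure branch plus high
# frequencies from a local/detail branch.
import torch
import torch.nn.functional as F

def low_pass(x, k=9):
    """Depthwise box-blur low-pass over an NCHW feature tensor."""
    kernel = torch.ones(x.shape[1], 1, k, k) / (k * k)
    return F.conv2d(x, kernel, padding=k // 2, groups=x.shape[1])

def fuse_scales(global_feat, detail_feat, k=9):
    """Low-frequency content of the global branch + high-frequency residual of the detail branch."""
    return low_pass(global_feat, k) + (detail_feat - low_pass(detail_feat, k))

g = torch.randn(1, 4, 64, 64)   # e.g. features from the upscaled, structure-preserving pass
d = torch.randn(1, 4, 64, 64)   # e.g. features from the local, detail-rich pass
print(fuse_scales(g, d).shape)  # torch.Size([1, 4, 64, 64])
```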
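For the FireFlow entry above, the toy integrator below illustrates the general idea of reusing an intermediate velocity estimate so that each step costs roughly one new function evaluation while behaving like a midpoint-style update. The scalar ODE and the exact update rule are assumptions of mine for illustration, not the authors' solver.

```python
# Toy sketch: midpoint-flavoured integration of dx/dt = -x where the previous
# step's midpoint velocity is reused as the predictor for the next step, so
# each step calls velocity() only once after the first.
import math

def velocity(x, t):                 # stands in for a learned flow field v_theta(x, t)
    return -x

def integrate_with_reuse(x0, t0, t1, steps):
    h = (t1 - t0) / steps
    x, t = x0, t0
    v_prev = velocity(x, t)                       # one extra evaluation at the start
    for _ in range(steps):
        x_mid = x + 0.5 * h * v_prev              # predict midpoint with the reused velocity
        v_mid = velocity(x_mid, t + 0.5 * h)      # the single new evaluation this step
        x = x + h * v_mid                         # midpoint-style update
        v_prev = v_mid                            # reuse in the next step
        t += h
    return x

x0, T, steps = 1.0, 1.0, 8
approx = integrate_with_reuse(x0, 0.0, T, steps)
exact = x0 * math.exp(-T)
print(f"approx={approx:.6f} exact={exact:.6f} |err|={abs(approx - exact):.2e}")
```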
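For the SCBench entry above, this toy single-head attention with an explicit key/value cache shows the usage pattern the benchmark targets: a long shared context is encoded once and reused across turns, so later turns only pay for their own tokens. All shapes, weights, and the lack of causal masking are simplifications of mine.

```python
# Toy sketch (not SCBench code) of multi-turn KV-cache reuse with a single
# attention head: the shared prefix is projected to keys/values once, and each
# new turn only appends its own keys/values before attending over the cache.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 32
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def extend_cache(cache, new_tokens):
    """Append keys/values for the new tokens to the running cache."""
    k, v = new_tokens @ Wk, new_tokens @ Wv
    if cache is None:
        return k, v
    K, V = cache
    return torch.cat([K, k]), torch.cat([V, v])

def attend(query_tokens, cache):
    K, V = cache
    q = query_tokens @ Wq
    attn = F.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V

shared_context = torch.randn(1024, d)        # long document shared across turns
cache = extend_cache(None, shared_context)   # encoded once, reused below
for turn in range(3):
    question = torch.randn(16, d)            # a new user turn
    cache = extend_cache(cache, question)    # only the new tokens are processed
    out = attend(question, cache)
    print(f"turn {turn}: cache length = {cache[0].shape[0]}, output {tuple(out.shape)}")
```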
Title | Authors | Summary |
---|---|---|
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions (Read more on arXiv or HuggingFace) | Rui Qian, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Pan Zhang | Here is a concise summary of the research paper "InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions": i) Summary: The paper introduces InternLM-XComposer2.5-OmniLive (IXC2.5-OL), a multimodal system designed for real-time interaction with streaming video and audio, featuring disentangled perception, memory, and reasoning modules. ii) Main research question/objective: The main objective is to develop an AI system that can continuously process and interact with long-term streaming multimodal (video and audio) inputs and outputs, similar to human cognition. iii) Key methodology: The methodology involves a modular framework with a Streaming Perception Module for real-time multimodal input processing, a Multi-modal Long Memory Module that integrates and compresses short-term and long-term memories, and a Reasoning Module that interacts with the other modules to respond to queries. iv) Primary results: IXC2.5-OL achieves state-of-the-art results among models with less than 10B parameters on the MLVU benchmark, obtaining an M-Avg of 66.2%. v) Principal implication for AI practitioners: AI practitioners can utilize the publicly available IXC2.5-OL framework and models to develop and deploy multimodal AI systems capable of continuous, adaptive interaction with long-term streaming video and audio data, potentially enhancing AI assistants and other real-time applications. |
Phi-4 Technical Report (Read more on arXiv or HuggingFace) | Ronen Eldan, Sébastien Bubeck, Harkirat Behl, Jyoti Aneja, Marah Abdin | Here is a concise summary of the Phi-4 technical report: 1. Summary: Phi-4 is a 14-billion parameter language model that focuses on data quality, incorporating synthetic data to improve reasoning and problem-solving capabilities beyond its predecessor, the Phi-3. 2. Main research question or objective: The paper does not explicitly state a main research question. The objective is to develop a language model that achieves strong performance relative to its size, particularly on reasoning-focused benchmarks, by optimizing data quality. 3. Key methodology used: The key methodology involves generating high-quality synthetic data through techniques like multi-agent prompting, self-revision, and instruction reversal, combined with curated organic data and optimized training curriculum, as well as innovations in the post-training scheme such as pivotal token search. 4. Primary results: Phi-4 surpasses its teacher model, GPT-4o, on STEM-focused QA capabilities, notably scoring 56.1 on the GPQA benchmark compared to GPT-4o's 50.6. 5. Principal implication for AI practitioners: AI practitioners can leverage synthetic data generation and innovative post-training methods detailed in the paper to enhance the reasoning and problem-solving capabilities of smaller language models, achieving performance comparable to or surpassing much larger models. |
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions (Read more on arXiv or HuggingFace) | Willie Neiswanger, Jinyi Hu, Tianyu Yu, Ollie Liu, jrzhang | Here's a concise summary of the research paper "Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions": i) Summary: The paper introduces "Euclid," a multimodal large language model (MLLM) specifically designed to improve low-level visual perception (LLVP) in geometric tasks using synthetic data. ii) Main research question or objective: How can MLLMs' ability to accurately perceive and describe geometric details in images be improved? iii) Key methodology: A new benchmark, "Geoperception," was developed to evaluate MLLMs on 2D geometric perception, and a synthetic data engine was used to create high-fidelity visual descriptions for training a family of models called "Euclid." The paper also explored various model architectures, training techniques, and data strategies, including a curriculum-based training approach. iv) Primary results: Euclid outperformed the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks, demonstrating the effectiveness of using synthetic data and curriculum learning for enhancing geometric perception. v) Principal implication for AI practitioners: AI practitioners can leverage synthetic high-fidelity data and curriculum-based training to enhance MLLMs' performance on tasks requiring precise low-level visual perception, particularly in domains like geometric reasoning. |
Multimodal Latent Language Modeling with Next-Token Diffusion (Read more on arXiv or HuggingFace) | Li Dong, Zhiliang Peng, Wenhui Wang, Hangbo Bao, Yutao Sun | Here is a concise summary of the research paper: i) Summary: The paper introduces Latent Language Modeling (LatentLM), a method that unifies the handling of discrete and continuous data in multimodal generative models using causal Transformers and next-token diffusion. ii) Main Research Question/Objective: How to seamlessly integrate both discrete (e.g., text, code) and continuous data (e.g., image, audio) within a unified multimodal generative model. iii) Key Methodology: LatentLM employs a variational autoencoder (VAE) with a novel σ-VAE to represent continuous data as latent vectors, uses next-token diffusion for autoregressive generation of these vectors, and utilizes causal Transformers for unified processing. iv) Primary Results: LatentLM surpasses Diffusion Transformers in image generation performance and scalability; in image generation tasks on ImageNet, LatentLM achieved a FID score of 2.24. v) Principal Implication for AI Practitioners: AI practitioners can use LatentLM as an effective and scalable approach to develop large multimodal models that unify multimodal generation and understanding with a general-purpose interface. |
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM (Read more on arXiv or HuggingFace) | Hao Shao, Guanglu Song, Bingqi Ma, Dongzhi Jiang, Zhuofan Zong | Here is a concise summary of the research paper "EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM": i) Summary: This paper introduces EasyRef, a plug-and-play method for conditioning diffusion models on multiple reference images and text prompts using a multimodal large language model (MLLM). ii) Main research question/objective: How to enable diffusion models to effectively capture and utilize consistent visual elements from multiple reference images for personalized image generation. iii) Key methodology: EasyRef leverages an MLLM to encode consistent visual elements from multiple images and text prompts, using an efficient reference aggregation strategy and a progressive training scheme. iv) Primary results: EasyRef outperforms existing methods in multi-reference image generation, achieving a 0.223 higher DINO-I score than IP-Adapter-SDXL in single-image reference experiments on the COCO dataset. v) Principal implication for AI practitioners: AI practitioners can use EasyRef to generate high-fidelity images based on multiple images and text descriptions without the need for model finetuning, representing a significant advancement in controllable image generation. |
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (Read more on arXiv or HuggingFace) | Zhennan Shen, Dunjie Lu, Yiheng Xu, cxiong, ZeonLap | Here is a concise summary of the AgentTrek research paper, strictly following your guidelines: i) Summary: AgentTrek is a scalable pipeline that synthesizes high-quality web agent trajectories by leveraging web tutorials to guide agent actions in a digital environment. ii) Main research question/objective: How to generate high-quality, multi-step trajectory data for training GUI agents without relying on expensive and labor-intensive human annotation. iii) Key methodology: The authors used web tutorials to guide a visual-language model (VLM) agent's actions in a real digital environment and employed a VLM-based evaluator to ensure trajectory correctness. iv) Primary results: Training GUI agents with synthesized trajectories improved performance; for instance, fine-tuning with the AgentTrek dataset improved Qwen2-VL's grounding ability on the ScreenSpot benchmark, achieving a score of 67.4. v) Principal implication for AI practitioners: AI practitioners can use AgentTrek as a cost-effective method to generate training data for GUI agents, improving their grounding and planning capabilities without the need for extensive manual annotation. |
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion (Read more on arXiv or HuggingFace) | Ziwei Liu, Xingang Pan, Xin Huang, Tengfei Wang, Zexin He | Here is a concise summary of the research paper "Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion": i) Summary: Neural LightRig is a framework that utilizes a multi-light diffusion model to enhance the estimation of object geometry and materials from a single image. ii) Main research question or objective: Can a multi-light diffusion model simulate images illuminated by different directional light sources to improve surface normal and material estimation from a single image? iii) Key methodology: The authors developed a multi-light diffusion model to generate multiple consistent images of an object under various lighting conditions. This was achieved by training on a synthetic relighting dataset, followed by training a large G-buffer model using a U-Net architecture to predict surface normals and materials from these multi-light images. iv) Primary results: The method significantly outperforms state-of-the-art methods in surface normal and PBR material estimation. Specifically, the proposed method achieved a mean angular error of 6.413 in surface normal estimation, compared to 8.034 for the next best method, StableNormal. v) Principal implication for AI practitioners: AI practitioners can leverage Neural LightRig to obtain more accurate surface normal and PBR material estimations from single images, enhancing the fidelity of 3D object reconstruction and rendering in applications like computer vision and graphics. |
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training (Read more on arXiv or HuggingFace) | Arpit Sahni, Huseyin Coskun, Xijie Huang, Jierun Chen, Dongting Hu | Here is a concise summary of the research paper "SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training": i) Summary: This paper introduces SnapGen, a novel text-to-image (T2I) model designed for efficient, high-resolution image generation on mobile devices. ii) Main research question/objective: How can a T2I model be trained from scratch to generate high-quality, high-resolution images on resource-constrained mobile devices? iii) Key methodology: The authors optimize network architecture (UNet and autoencoder), employ multi-level knowledge distillation with timestep-aware scaling from a larger teacher model (SD3.5-Large), and use adversarial step distillation for few-step generation. iv) Primary results: SnapGen achieves 1024x1024 pixel image generation on mobile devices in approximately 1.4 seconds, and the UNet model with only 379 million parameters achieves a GenEval score of 0.66. v) Principal implication for AI practitioners: AI practitioners can deploy high-resolution T2I models on mobile devices by using the architectural optimizations and training techniques presented, enabling new applications in mobile image generation. |
PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations (Read more on arXiv or HuggingFace) | Eunbyung Park, Youngjoon Hong, Jaemin Oh, kangnamgyu27 | Here is a concise summary of the research paper "PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations" following your guidelines: i) Summary: This paper introduces Physics-Informed Gaussians (PIGs), a novel method for approximating solutions to partial differential equations (PDEs) using a combination of Gaussian functions and neural networks. ii) Main research question or objective: The main objective is to develop a more efficient and accurate PDE solver that overcomes the limitations of existing Physics-Informed Neural Networks (PINNs) and parametric grid-based methods. iii) Key methodology: PIGs employ a mixture of Gaussian functions with trainable parameters (mean, variance) to create adaptive feature embeddings, which are then processed by a lightweight neural network to approximate PDE solutions. iv) Primary results: PIGs demonstrate competitive accuracy and faster convergence compared to state-of-the-art methods across various PDEs; for example, PIG achieved a best relative L² error of 5.93 x 10^-5 on the Allen-Cahn equation. v) Principal implication for AI practitioners: AI practitioners can leverage PIGs as a robust and efficient tool for solving complex PDEs, offering an alternative to traditional PINNs with improved performance in terms of accuracy and computational cost. |
Learned Compression for Compressed Learning (Read more on arXiv or HuggingFace) | Neeraja J. Yadwadkar, Dan Jacobellis | Here is a concise summary of the research paper "Learned Compression for Compressed Learning": i) Summary: This paper introduces WaLLoC, a novel neural codec architecture for lossy compression that combines linear transform coding with nonlinear dimensionality-reducing autoencoders to enable efficient compressed-domain learning. ii) Main research question or objective: The main objective is to develop a compression method that simultaneously achieves computational efficiency, high compression ratios, and uniform dimensionality reduction for accelerating machine learning models. iii) Key methodology used: WaLLoC utilizes a wavelet packet transform followed by a shallow, asymmetric autoencoder and an entropy bottleneck, with a deep, nonlinear synthesis transform in the decoder. iv) Primary results: WaLLoC achieves up to 20x dimensionality reduction and outperforms existing methods in compression ratio, distortion, perceptual quality, and computational efficiency; for image classification, WaLLoC provides a 27.2% accuracy improvement over baseline resolution reduction. v) Principal implication for AI practitioners: WaLLoC enables AI practitioners to train and deploy machine learning models on compressed data with significantly reduced computational cost and latency while maintaining high accuracy, offering a practical solution for resource-constrained environments. |
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition (Read more on arXiv or HuggingFace) | Longxiang Tang, Senqiao Yang, Yuqi Liu, Chengyao Wang, Zhisheng Zhong | Here's a concise summary of the research paper "Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition" following your specified guidelines: i) Summary: Lyra is a new multimodal large language model (MLLM) framework designed for efficient omni-cognition with a focus on enhanced speech processing capabilities. ii) Main research question or objective: How to develop an MLLM that efficiently integrates speech with other modalities (vision, language) to achieve state-of-the-art performance in multi-modal understanding and reasoning while minimizing computational resources and data requirements. iii) Key methodology: Lyra leverages existing open-source LLMs and VLMs, a proposed multi-modality LoRA, a latent multi-modality regularizer and extractor, and a newly constructed dataset including 1.5M multi-modal data samples and 12K long speech samples. iv) Primary results: Lyra outperforms previous models on various vision-language, vision-speech, and speech-language benchmarks, achieving 81.0% accuracy on the image-speech tasks (speech-instruction versions of TextVQA, DocVQA, and ChartQA), and demonstrating significant improvements in processing long speech inputs lasting several hours. v) Principal implication for AI practitioners: AI practitioners can utilize Lyra to develop more efficient and versatile AI assistants capable of advanced speech comprehension, seamless cross-modality interactions, and handling long-context multi-modality applications with reduced computational demands. |
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios (Read more on arXiv or HuggingFace) | Xiaobao Wu, Sitao Cheng, Liangming Pan, Wenyue Hua, Ruiwen Zhou | Here's a concise summary of the research paper "RULEARENA: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios": i) Summary: This paper introduces RULEARENA, a new benchmark for evaluating large language models (LLMs) on their ability to perform rule-guided reasoning in complex, real-world scenarios across domains like airline baggage fees, NBA transactions, and tax regulations. ii) Main research question or objective: To assess the proficiency of LLMs in understanding and applying complex, real-world rules expressed in natural language to solve practical reasoning problems. iii) Key methodology: The authors created 816 test problems across three domains, providing LLMs with task instructions, reference rules, and user instances, and then evaluated the models' reasoning and computation based on a set of proposed metrics, including rule-wise and problem-wise recall, precision, and rule application correctness. iv) Primary results: State-of-the-art LLMs, including GPT-4o and Claude-3.5 Sonnet, generally failed on complex rule-guided reasoning tasks in the benchmark; for example, in the airline domain, even the best-performing model (GPT-4o) achieved a problem-wise accuracy of only 5% on the most challenging problems. v) Principal implication for AI practitioners: AI practitioners should be aware that even the most advanced LLMs currently exhibit significant limitations in accurately performing complex rule-guided reasoning in real-world applications. Therefore, relying solely on these models for tasks that require strict adherence to intricate rules may lead to unreliable or erroneous results. Developing specialized techniques to enhance rule grounding and multi-step reasoning in LLMs is crucial. |
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders (Read more on arXiv or HuggingFace) | Judy Hoffman, Daniel Bolya, Sangmin Lee, Ajay Bati, Fiona Ryan | Here is a concise summary of the research paper "Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders": i) Summary: This paper introduces Gaze-LLE, a novel framework for gaze target estimation that leverages features from a frozen, pre-trained DINOv2 encoder. ii) Main research question or objective: Can a streamlined architecture using a frozen, large-scale learned encoder achieve state-of-the-art performance in gaze target estimation? iii) Key methodology: A transformer-based gaze decoder with a person-specific positional prompt is trained on top of a frozen DINOv2 encoder to predict gaze targets from a single scene representation. iv) Primary results: Gaze-LLE achieves state-of-the-art performance across multiple gaze estimation benchmarks, achieving an AUC of 0.956 on the GazeFollow dataset with only 2.8M learnable parameters. v) Principal implication for AI practitioners: AI practitioners can leverage Gaze-LLE's streamlined architecture and frozen encoder to develop efficient and accurate gaze estimation models, simplifying the process compared to prior multi-branch approaches. (A rough sketch of the frozen-encoder-plus-light-decoder recipe appears below this table.) |
JuStRank: Benchmarking LLM Judges for System Ranking (Read more on arXiv or HuggingFace) | Lilach Eden, Roy Bar-Haim, Yotam Perlitz, Odellia Boni, Ariel Gera | Here's a concise summary of the research paper "JuStRank: Benchmarking LLM Judges for System Ranking" following your guidelines: i) Summary: This paper introduces JuStRank, a benchmark for evaluating the performance of large language models (LLMs) as judges for ranking system outputs, revealing discrepancies between instance-level and system-level judging abilities. ii) Main research question/objective: How effectively can LLMs rank systems based on their outputs, and how does this system-level performance compare to their instance-level judging capabilities? iii) Key methodology: JuStRank evaluates 48 LLM judges by comparing their system rankings, derived from aggregating scores over multiple system outputs, against a human-based ranking using the Arena Hard v0.1 dataset. iv) Primary results: The study found that system-level performance does not directly correlate with instance-level performance; the Qwen2.5-72B-Instruct model achieved the highest agreement with the gold ranking at a Kendall's Tau of 0.83. v) Principal implication for AI practitioners: AI practitioners should prioritize system-level evaluation when selecting LLM judges for system ranking tasks, as strong instance-level performance does not guarantee accurate system-level ranking. (A minimal sketch of system-level aggregation and ranking agreement appears below this table.) |
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation (Read more on arXiv or HuggingFace) | Jianwei Yang, Jianfeng Gao, Humphrey Shi, Zhengyuan Yang, Jitesh Jain | Here is a concise summary of the research paper "OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation": i) Summary: The paper introduces OLA-VLM, a novel approach that enhances visual perception in Multimodal Large Language Models (MLLMs) by distilling knowledge from multiple target visual encoders into the LLM's intermediate representations during pre-training. ii) Main Research Question/Objective: Can the visual understanding ability of MLLMs be improved by optimizing intermediate LLM representations through a vision-centric objective, specifically by distilling knowledge from a set of target visual encoders? iii) Key Methodology: OLA-VLM employs a predictive visual embedding optimization approach alongside the standard next text-token prediction objective during pre-training, using embedding losses to align LLM representations with features from specialized visual encoders for segmentation, depth estimation, and image generation. iv) Primary Results: OLA-VLM outperforms single and multi-encoder baselines on various benchmarks. Notably, it achieves an 8.7% improvement on the Depth task in CV-Bench compared to the baseline. v) Principal Implication for AI Practitioners: AI practitioners can leverage OLA-VLM's embedding distillation technique to improve the visual perception of MLLMs, which directly enhances performance on vision-centric tasks without the need for multiple visual encoders during inference. |
The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective (Read more on arXiv or HuggingFace) | David Samuel, Freddy Wetjen, Lemei Zhang, Vladislav Mikhailov, Javier de la Rosa | Here is a concise summary of the research paper: i) Summary: This study empirically evaluates the impact of copyrighted materials on the performance of large language models (LLMs) for the Norwegian language. ii) Main research question/objective: To assess how the inclusion of copyrighted Norwegian books and newspapers affects LLM performance on a suite of Norwegian benchmarks. iii) Key methodology: Researchers trained various LLMs on datasets with and without copyrighted materials, and compared their performance using quantitative NLP metrics and linguistic analysis. iv) Primary results: Models trained with copyrighted materials outperformed those without, with the model trained on the extended dataset (which includes copyrighted materials) achieving an average gain of 6.73% over the base model trained without copyrighted materials. v) Principal implication for AI practitioners: The inclusion of high-quality copyrighted material enhances the performance of Norwegian LLMs, suggesting that AI practitioners should carefully consider the legal and ethical implications of using such data in model training. |
Word Sense Linking: Disambiguating Outside the Sandbox (Read more on arXiv or HuggingFace) | Roberto Navigli, Alberte Fernández-Castro, Luigi Procopio, Edoardo Barba, Andrei Stefan Bejgu | Here is a concise summary of the research paper "Word Sense Linking: Disambiguating Outside the Sandbox": i) Summary: This paper introduces Word Sense Linking (WSL), a new task that extends Word Sense Disambiguation (WSD) by requiring systems to identify and disambiguate spans in text using a sense inventory, without prior span identification. ii) Main research question/objective: How can WSD be adapted to real-world scenarios where the spans to be disambiguated and their sense candidates are not pre-defined? iii) Key methodology: A retriever-reader architecture is proposed, where the retriever generates sense candidates and the reader identifies spans and assigns the most suitable sense. iv) Primary results: The proposed model achieved an F1-score of 75.9 on the WSL task, outperforming adaptations of state-of-the-art WSD systems. v) Principal implication for AI practitioners: AI practitioners can leverage the proposed WSL framework and architecture for more robust and practical lexical disambiguation in downstream applications, moving beyond the constrained assumptions of traditional WSD. |
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction (Read more on arXiv or HuggingFace) | Ying Shan, Shenghua Gao, Jiale Xu | Here is a concise summary of the research paper "FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction": i) Summary: FreeSplatter is a feed-forward framework for reconstructing 3D scenes as Gaussians from uncalibrated sparse-view images and estimating their camera parameters in mere seconds. ii) Main research question/objective: Can a model directly predict 3D Gaussian maps from multi-view images to achieve both high-quality 3D modeling and instant camera pose estimation without known camera poses? iii) Key methodology: A transformer-based model predicts per-pixel 3D Gaussians from uncalibrated images, enabling simultaneous 3D reconstruction and camera pose estimation using iterative solvers. iv) Primary results: FreeSplatter-O achieved a PSNR of 31.929 on the OmniObject3D dataset for sparse-view reconstruction, outperforming prior methods. v) Principal implication for AI practitioners: AI practitioners can leverage FreeSplatter for efficient 3D reconstruction from sparse-view images without the need for pre-calibrated camera parameters, simplifying 3D content creation pipelines. |
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation (Read more on arXiv or HuggingFace) | Zhihong Zhu, Junjie Cao, Yuhang Yang, Yaowei Li, Hongxiang Li | Here is a concise summary of the research paper "DisPose: Disentangling Pose Guidance for Controllable Human Image Animation": i) DisPose improves controllable human image animation by disentangling sparse pose guidance into a dense motion field and keypoint correspondence. ii) The research objective is to improve controllable human image animation by generating more generalizable and effective control signals from sparse skeleton pose without additional dense input. iii) The key methodology involves disentangling sparse skeleton pose into a dense motion field generated from a sparse motion field and reference image, and extracting diffusion features corresponding to pose keypoints from the reference image for transfer to the target pose. A plug-and-play hybrid ControlNet integrates these signals into existing models. iv) Quantitative results show that DisPose outperforms existing methods, achieving a score of 29.51 on the VBench dynamic-quality metric for the TikTok dataset, improving on the next best result of 28.42. v) For AI practitioners, DisPose offers a plug-and-play module readily integrable into existing human image animation models. Its enhanced control signals, derived from sparse input only, improve animation quality and consistency without requiring additional, computationally expensive dense inputs. The paper does not report on scalability or generalizability across other model architectures and training regimes, which would be valuable to developers. |
LoRACLR: Contrastive Adaptation for Customization of Diffusion Models (Read more on arXiv or HuggingFace) | Pinar Yanardag, Federico Tombari, Thomas Hofmann, enisimsar | Here is a concise summary of the research paper "LoRACLR: Contrastive Adaptation for Customization of Diffusion Models": i) Summary: The paper introduces LoRACLR, a method for merging multiple Low-Rank Adaptation (LoRA) models to enable multi-concept image generation in diffusion models without additional fine-tuning. ii) Main Research Question/Objective: How to effectively combine multiple pre-trained LoRA models, each customized for a distinct concept, into a single unified model for high-fidelity multi-concept image synthesis. iii) Key Methodology: LoRACLR employs a contrastive learning objective to align the weight spaces of multiple LoRA models, attracting positive pairs (same concept) and repelling negative pairs (different concepts) to ensure compatibility and minimize interference during merging (a minimal sketch of this contrastive merging objective appears after this table). iv) Primary Results: LoRACLR achieves competitive performance across text, image, and identity alignment metrics, demonstrating superior visual quality and coherence compared to other methods; for instance, LoRACLR achieved an identity alignment score of 0.828 after merging, compared to 0.745 for Orthogonal Adaptation. v) Principal Implication for AI Practitioners: AI practitioners can leverage LoRACLR to efficiently merge pre-existing LoRA models, enabling scalable and flexible multi-concept image generation without the need for retraining or accessing original training data, thus advancing the capabilities of personalized image generation. |
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts (Read more on arXiv or HuggingFace) | Mohit Bansal, Chongyang Zhao, Zun Wang, Yicong Hong, Gengze Zhou | Here is a concise summary of the research paper "SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts": i) Summary: This paper introduces SAME, a State-Adaptive Mixture of Experts model designed for versatile language-guided visual navigation across various instruction granularities. ii) Main research question/objective: How to create a unified framework for language-guided visual navigation that can handle diverse navigation tasks with varying levels of instruction granularity. iii) Key methodology: A novel State-Adaptive Mixture of Experts (SAME) model is proposed, enabling the agent to infer decisions based on different-granularity language and dynamic observations using a mixture of experts approach, where experts are selected based on the agent's state. iv) Primary results: The SAME model achieves state-of-the-art or highly comparable performance across seven navigation tasks, demonstrating an average improvement of 3% in Success Rate (SR) across all tasks compared to the baseline multi-task-tuned model. v) Principal implication for AI practitioners: AI practitioners can utilize the SAME model to develop more generalizable and robust navigation agents capable of interpreting and executing a wide range of language instructions without requiring task-specific model architectures, potentially making the model easier to deploy in varied real-world scenarios. |
Arbitrary-steps Image Super-resolution via Diffusion Inversion (Read more on arXiv or HuggingFace) | Chen Change Loy, Kang Liao, Zongsheng Yue | Here is a concise summary of the research paper "Arbitrary-steps Image Super-resolution via Diffusion Inversion": i) The paper introduces InvSR, a diffusion inversion-based image super-resolution (SR) technique that allows for arbitrary-step sampling during inference. ii) The main research objective is to develop an efficient and flexible SR method that harnesses the rich image priors of pre-trained diffusion models while allowing users to freely adjust the number of sampling steps. iii) The key methodology is a Partial noise Prediction (PnP) strategy that constructs an intermediate state using a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. iv) In experiments, InvSR achieved a PSNR of 24.14 and an SSIM of 0.6789 on the ImageNet-Test dataset with a single sampling step. v) For AI practitioners, InvSR offers a flexible and efficient approach to image super-resolution, demonstrating superior or comparable performance to recent state-of-the-art methods even with a single sampling step. |
Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages (Read more on arXiv or HuggingFace) | Srinivasan Umesh, rumourscape | Here is a concise summary of the research paper "Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages": i) The paper introduces "Shiksha," a novel dataset for machine translation focused on the technical domain, specifically for eight Indian languages. ii) The main research objective was to create a high-quality multilingual parallel corpus for English-to-Indic and Indic-to-Indic translation pairs in the scientific, technical, and educational domains, and to evaluate its impact on NMT model performance. iii) The key methodology involved extracting and cleaning data from NPTEL lecture transcriptions, followed by bitext mining using SentAlign with LaBSE embeddings to identify parallel sentences. iv) The primary results showed that fine-tuning the NLLB 3.3B model on the Shiksha dataset achieved an average BLEU score of 48.98 on their in-domain test set. v) The principal implication for AI practitioners is that the Shiksha dataset can be used to significantly improve the performance of NMT models on technical domain translation tasks for Indian languages. |
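To make the LoRACLR entry above more concrete, below is a minimal, hypothetical PyTorch sketch of a contrastive objective for merging several LoRA weight updates into one: for each concept, the merged update is pulled toward the outputs of that concept's own LoRA (positive pairs) and pushed at least a margin away from the outputs of the other concepts' LoRAs (negative pairs). The tensor shapes, the margin value, the `merged_delta` parameterization, and the random stand-in activations are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_merge_loss(merged_delta, lora_deltas, inputs, margin=1.0):
    """Toy contrastive objective for merging several LoRA updates into one.

    merged_delta: (d_out, d_in) trainable merged weight update
    lora_deltas:  list of (d_out, d_in) frozen per-concept LoRA updates
    inputs:       list of (n_i, d_in) concept-specific input activations
    """
    attract, repel = 0.0, 0.0
    for i, (delta_i, x_i) in enumerate(zip(lora_deltas, inputs)):
        target_i = x_i @ delta_i.T          # outputs the concept's own LoRA would produce
        merged_i = x_i @ merged_delta.T     # outputs of the merged update on the same inputs
        # Positive pairs: merged output should match the concept's own LoRA output.
        attract = attract + F.mse_loss(merged_i, target_i)
        # Negative pairs: keep outputs of different concepts at least `margin` apart.
        for j, delta_j in enumerate(lora_deltas):
            if j == i:
                continue
            target_j = x_i @ delta_j.T
            dist = (merged_i - target_j).pow(2).mean()
            repel = repel + F.relu(margin - dist)
    return attract + repel

# Usage sketch: optimize merged_delta with Adam over random stand-in activations.
d_out, d_in = 64, 32
lora_deltas = [torch.randn(d_out, d_in) * 0.01 for _ in range(3)]
inputs = [torch.randn(16, d_in) for _ in range(3)]
merged_delta = torch.zeros(d_out, d_in, requires_grad=True)
opt = torch.optim.Adam([merged_delta], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = contrastive_merge_loss(merged_delta, lora_deltas, inputs)
    loss.backward()
    opt.step()
```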
Title | Authors | Summary |
---|---|---|
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints (Read more on arXiv or HuggingFace) | lemonaddie, ziyangy, Xintao, menghanxia, jianhongbai | Here is a concise summary of the AI research paper "SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints": i) Summary: SynCamMaster is a novel framework for generating synchronized multi-camera videos from diverse viewpoints using a pre-trained text-to-video model augmented with a plug-and-play module. ii) Main research question or objective: How to achieve dynamic consistency across multiple viewpoints in open-domain multi-camera video generation. iii) Key methodology: A multi-view synchronization module is introduced to maintain appearance and geometry consistency, and a hybrid training scheme leverages multi-camera images, monocular videos, and Unreal Engine-rendered multi-camera videos. iv) Primary results: SynCamMaster outperforms baseline methods in generating view-synchronized videos, achieving a matching pixel count (Mat. Pix) of 527.1K, compared to the next best method's 116.8K. v) Principal implication for AI practitioners: AI practitioners can utilize SynCamMaster's multi-view synchronization module to generate consistent multi-camera videos, enhancing applications such as virtual filming. |
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations (Read more on arXiv or HuggingFace) | MAJIARUI, SYZhang0805, yeezlee, mengcy, hyllbd | Here is a concise summary of the research paper: i) The paper introduces LAION-SG, a large-scale dataset with scene graph annotations for training text-to-image models to generate complex images with multiple objects and intricate relationships. ii) The main research question is how to improve text-to-image models' performance in generating complex compositional images involving multiple objects and relationships. iii) The key methodology involves automatically generating scene graph annotations using GPT-4 and constructing a new dataset, LAION-SG, based on LAION-Aesthetics V2, along with developing a foundation model, SDXL-SG, that incorporates scene graph information into the Stable Diffusion XL model using graph neural networks. iv) The primary result is that SDXL-SG outperforms existing models on complex scene generation, achieving a 20.1 FID score and 0.558 SG-IoU on LAION-SG, indicating improved image quality and semantic accuracy. v) For AI practitioners, LAION-SG provides a valuable resource for training and evaluating models for complex image generation, and SDXL-SG offers a new approach to incorporating structural information into the generation process, with the potential to enhance the accuracy and controllability of text-to-image models. |
POINTS1.5: Building a Vision-Language Model towards Real World Applications (Read more on arXiv or HuggingFace) | Xiao Zhou, Le Tian, yangyu1, kavio, YuanLiuuuuuu | Here is a concise summary of the paper "POINTS1.5: Building a Vision-Language Model towards Real World Applications": i) POINTS1.5 is a vision-language model designed for enhanced performance in real-world applications like optical character recognition and diagram analysis. ii) The main research objective is to develop an improved vision-language model, POINTS1.5, that surpasses its predecessor, POINTS1.0, by incorporating native dynamic high-resolution image processing and bilingual support, specifically for English and Chinese. iii) Key methodology involves replacing the CLIP vision encoder with a NaViT-style encoder for dynamic resolution support, creating a large Chinese corpus for pre-training and visual instruction tuning, and implementing rigorous filtering methods for the visual instruction tuning datasets. iv) Primary results show that POINTS1.5-7B outperforms all other models under 10 billion parameters on the OpenCompass leaderboard, achieving a score of 67.4 after applying model soup. v) Principal implication for AI practitioners is that POINTS1.5 provides a more accurate and efficient framework for real-world vision-language tasks, particularly those requiring high-resolution image understanding and bilingual (Chinese-English) language processing, offering a strong foundation for developing applications that can handle diverse visual and textual data inputs. |
Learning Flow Fields in Attention for Controllable Person Image Generation (Read more on arXiv or HuggingFace) | AdityaPatel, Wall-dandelion, Yuren, shikunl, franciszzj | Here is a concise summary of the research paper "Learning Flow Fields in Attention for Controllable Person Image Generation": i) This paper introduces Leffa, a regularization loss that improves controllable person image generation by learning flow fields within attention mechanisms to reduce detail distortion. ii) Main research objective: To alleviate the distortion of fine-grained details in controllable person image generation while maintaining high overall image quality. iii) Key methodology: A regularization loss (Leffa) is proposed that guides target queries to attend to correct reference keys in attention layers by transforming attention maps into flow fields and warping the reference image towards the target image. iv) Primary results: Leffa achieves state-of-the-art performance on virtual try-on and pose transfer, achieving a FID of 4.54 on the VITON-HD dataset (paired setting) for virtual try-on. v) Principal implication for AI practitioners: AI practitioners can use Leffa as a model-agnostic loss function to enhance the performance of existing diffusion models in controllable person image generation tasks by reducing fine-grained detail distortion without additional inference costs or parameters. |
StyleMaster: Stylize Your Video with Artistic Generation and Translation (Read more on arXiv or HuggingFace) | Huijuan Huang, whluo, qq8933, Xintao, zixuan-ye | Here is a concise summary of the research paper "StyleMaster: Stylize Your Video with Artistic Generation and Translation": i) StyleMaster is a novel framework for video stylization that achieves high-quality results in both stylized video generation and video-to-video style transfer. ii) Main research question/objective: How to effectively extract and inject style features into video generation models to achieve accurate and consistent stylization while preserving content fidelity? iii) Key methodology: A style extraction module with local patch selection based on prompt-patch similarity and global style projection trained via contrastive learning on a paired style dataset generated through model illusion, coupled with a motion adapter and a gray tile ControlNet. iv) Primary results: StyleMaster outperforms existing methods in style resemblance and temporal coherence, achieving a CLIP-Text similarity score of 0.305 in stylized video generation. v) Principal implication for AI practitioners: AI practitioners can leverage StyleMaster's style extraction and injection techniques to develop advanced video editing tools and creative applications with enhanced control over stylization. |
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction (Read more on arXiv or HuggingFace) | JustinOh, LeeYG, lelady, xysun, stnamjef | Here is a concise summary of the research paper "Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction": i) Summary: This paper introduces Generative Densification (GD), a method to improve the detail representation of generalized feed-forward Gaussian models for 3D reconstruction. ii) Main research question/objective: How can the densification strategy used in per-scene 3D Gaussian Splatting be adapted to enhance the representation of high-frequency details in generalized feed-forward Gaussian models? iii) Key methodology: GD selectively densifies the top K Gaussians with large view-space positional gradients based on learned prior knowledge, up-sampling feature representations and generating corresponding fine Gaussians in a single forward pass using a point-level transformer. iv) Primary results: The proposed method outperforms state-of-the-art approaches on object-level and scene-level reconstruction tasks; for instance, it achieved a PSNR of 28.75 on the Gobjaverse dataset, compared to 27.49 for the LaRa baseline. v) Principal implication for AI practitioners: AI practitioners can leverage GD to improve the fidelity of 3D reconstructions from sparse-view inputs by efficiently densifying Gaussians based on learned prior knowledge, enabling more detailed and accurate 3D models. |
StreamChat: Chatting with Streaming Video (Read more on arXiv or HuggingFace) | Shiyi Lan, hsli-cuhk, LucasFang, Zhiding, jjjjh | Here is a concise summary of the StreamChat paper: i) Summary: StreamChat is a novel approach that enables large multimodal models (LMMs) to dynamically interact with streaming video by updating the visual context at each decoding step. ii) Main Research Question/Objective: How to enable LMMs to effectively interact with streaming videos and utilize up-to-date video content throughout the decoding process. iii) Key Methodology: Introduction of a cross-attention-based architecture that processes dynamic streaming inputs, a parallel 3D-RoPE mechanism for encoding temporal information, and a new dense instruction dataset for training. iv) Primary Results: StreamChat-7B outperforms the state-of-the-art LLaVA-Video-72B model in streaming interaction scenarios, with StreamChat-7B producing equally or more preferable answers than VILA-1.5-40B in 77% of the evaluation cases. v) Principal Implication for AI Practitioners: AI practitioners can use StreamChat to develop more interactive and responsive video understanding models that maintain context continuity in streaming scenarios, enhancing user experience in real-time applications. |
Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation (Read more on arXiv or HuggingFace) | Frag1le | Here is a concise summary of the research paper "Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation" by Frag1le: i) This paper introduces Mogo, a novel GPT-type model for generating high-quality, long, and open-vocabulary 3D human motion sequences. ii) The main research objective is to develop a model that surpasses the quality of BERT-type models in text-to-motion generation while leveraging the streaming output capability of GPT-type models. iii) The key methodology involves a hierarchical residual vector quantization variational autoencoder (RVQ-VAE) for motion sequence discretization and a Hierarchical Causal Transformer for autoregressive generation and residual inference. iv) On the HumanML3D test set, Mogo achieves a Fréchet Inception Distance (FID) score of 0.079, outperforming the T2M-GPT model. v) For AI practitioners, Mogo offers a new approach that combines the strengths of GPT and BERT-type models in a single transformer model, improving the quality and efficiency of 3D human motion generation without adding extra refinement models. |
KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models (Read more on arXiv or HuggingFace) | Jing Tang, Sunghun Kim, Chansung Park, Juyong Jiang, Fan Wang | Here is a concise summary of the research paper "KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models": 1. Summary: The paper introduces Knowledge-aware Singular-value Adaptation (KaSA), a parameter-efficient fine-tuning (PEFT) method that leverages singular value decomposition (SVD) to dynamically activate relevant knowledge in large language models (LLMs) for specific downstream tasks. 2. Main research question or objective: The main objective is to develop a PEFT method that addresses the limitations of existing methods like LoRA by dynamically activating task-relevant knowledge while minimizing the interference of noisy or irrelevant knowledge during fine-tuning. 3. Key methodology used: KaSA employs SVD with knowledge-aware singular values to adapt LLMs. It performs knowledge-based SVD truncation to remove minor singular components representing noise and reparameterizes task-specific updates in SVD form to maintain a consistent representational space. It introduces knowledge-aware singular values (Δσ1, ..., Δσr) to activate parametric knowledge based on its relevance to specific downstream tasks and incorporates two regularization terms (denoted L2 and L3 in the paper) to constrain the task-specific updates (a simplified sketch of this SVD-form adaptation appears after this table). 4. Primary results: KaSA consistently outperforms full fine-tuning (FFT) and 14 popular PEFT baselines across 16 benchmarks and 4 synthetic datasets. Specifically, on the GLUE benchmark, KaSA achieved an average performance of 86.3% for RoBERTa-base, surpassing other methods. 5. Principal implication for AI practitioners: AI practitioners can leverage KaSA as a superior PEFT method to efficiently adapt LLMs to various downstream tasks, achieving improved performance with significantly reduced computational and memory costs compared to full fine-tuning and other popular PEFT methods. |
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models (Read more on arXiv or HuggingFace) | Tomer Michaeli, Inbar Huberman-Spiegelglas, Matan Kleiner, Vladimir Kulikov | Here is a concise summary of the research paper "FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models": i) Summary: FlowEdit is a novel, inversion-free, and optimization-free method for text-based image editing using pre-trained flow models. ii) Main research question/objective: The main objective is to develop a text-based image editing method for flow models that directly maps between source and target image distributions without relying on inversion, optimization, or model-specific interventions. iii) Key methodology used: FlowEdit constructs an ordinary differential equation (ODE) that directly maps the source image distribution to the target distribution, corresponding to the source and target text prompts, achieving a lower transport cost than inversion-based methods. iv) Primary results: FlowEdit achieves lower transport cost compared to editing-by-inversion (1376 vs. 2239 for MSE between source-target pairs in a synthetic dataset of model-generated images). v) Principal implication for AI practitioners: AI practitioners can use FlowEdit for efficient and structure-preserving text-based image editing with pre-trained flow models, without the need for computationally intensive inversion or optimization steps. |
StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements (Read more on arXiv or HuggingFace) | Chi Zhang, Hao Wang, Beier Zhu, Xue Song, Mingkun Lei | Here is a concise summary of the research paper "StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements": i) StyleStudio is a text-driven style transfer model that improves upon existing methods by enhancing the alignment of generated images with text prompts while preserving style fidelity and layout structure. ii) The main objective is to address the challenges of style overfitting, limited stylistic control, and misalignment with textual content in text-driven style transfer. iii) The key methodology includes a cross-modal Adaptive Instance Normalization (AdaIN) for feature integration, a Style-based Classifier-Free Guidance (SCFG) for selective style control, and a teacher model for stabilizing spatial layouts. iv) The proposed method achieves a text alignment score of 0.235, outperforming other methods evaluated. v) For AI practitioners, the principal implication is that StyleStudio can be integrated into existing style transfer frameworks without fine-tuning to improve text-to-image generation alignment and offer finer control over stylistic elements. |
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation (Read more on arXiv or HuggingFace) | Lijie Wen, Shaolin Zhu, liboaccn | Here is a concise summary of the AI research paper "MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation": i) Summary: This paper introduces MIT-10M, a new dataset for multilingual image translation, addressing limitations in existing datasets regarding scale, diversity, and quality. ii) Main research question or objective: The main objective is to create a large-scale, high-quality parallel corpus for multilingual image translation that reflects real-world data complexities. iii) Key methodology used: The methodology involved web crawling, data cleaning, OCR annotation, and multilingual translation with validation using GPT-4 and Google Translate. iv) Primary results: The MIT-10M dataset contains over 10 million image-text pairs across 14 languages and 840K images; fine-tuning the Qwen2-VL model with MIT-10M improved the BLEU score by 230%. v) Principal implication for AI practitioners: AI practitioners can use MIT-10M to train and evaluate multilingual image translation models, leading to more robust models capable of handling diverse, real-world scenarios. |
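As a concrete illustration of the KaSA entry above, here is a minimal, hypothetical PyTorch sketch of an SVD-form parameter-efficient update: the frozen base weight is truncated via SVD to drop its minor (noisier) singular components, and a task-specific low-rank update is parameterized in SVD form with learnable singular values. The rank choices, the initialization, and the omission of the paper's regularization terms are simplifying assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SVDAdaptedLinear(nn.Module):
    """Frozen linear layer with (i) SVD truncation of the base weight and
    (ii) a learnable task update in SVD form: delta_W = U diag(delta_sigma) V^T."""

    def __init__(self, base: nn.Linear, keep: int = 128, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(base.weight.data, full_matrices=False)
        # Keep only the `keep` largest singular components of the pre-trained weight.
        W_trunc = (U[:, :keep] * S[:keep]) @ Vh[:keep, :]
        self.register_buffer("weight", W_trunc)
        self.bias = base.bias
        # Task-specific update in SVD form with learnable singular values.
        d_out, d_in = base.weight.shape
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.delta_sigma = nn.Parameter(torch.zeros(rank))

    def forward(self, x):
        delta_w = self.U @ torch.diag(self.delta_sigma) @ self.V.T
        return nn.functional.linear(x, self.weight + delta_w, self.bias)

# Usage sketch: wrap an existing projection layer and fine-tune only the adapter parameters.
base = nn.Linear(512, 512)
layer = SVDAdaptedLinear(base, keep=128, rank=8)
trainable = [p for n, p in layer.named_parameters() if n in {"U", "V", "delta_sigma"}]
out = layer(torch.randn(4, 512))
```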
Title | Authors | Summary |
---|---|---|
Evaluating and Aligning CodeLLMs on Human Preference (Read more on arXiv or HuggingFace) | JustinLin610, huybery, misakamage, instro, jx-yang | Here is a concise summary of the paper "Evaluating and Aligning CodeLLMs on Human Preference": i) Summary: This paper introduces CodeArena, a new benchmark for evaluating code language models (codeLLMs) based on human preferences, and SynCode-Instruct, a large-scale synthetic instruction dataset for enhancing codeLLM alignment with human preferences. ii) Main Research Question/Objective: How to evaluate and improve the alignment of codeLLMs with human preferences in realistic code generation scenarios. iii) Key Methodology: Development of CodeArena with 397 human-curated samples across 40 categories and 44 programming languages, and creation of SynCode-Instruct, a 20 billion token synthetic instruction dataset derived from web data. iv) Primary Results: CodeArena reveals a significant performance gap between open-source and proprietary LLMs, with Qwen2.5-SynCoder achieving the best performance among open-source models evaluated (49.2/22.3 win rate/tie rate). v) Principal Implication for AI Practitioners: AI practitioners should consider human preference alignment in codeLLM evaluation and training, utilizing benchmarks like CodeArena and large-scale synthetic instruction datasets for improved performance. |
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation (Read more on arXiv or HuggingFace) | Chao Tang, LXT, zengyh1900, JingboWang, jianzongwu | Here's a summary of the research paper "DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation" following your specified guidelines: i) Summary: DiffSensei is a novel framework for customized manga generation that integrates diffusion models with a multimodal large language model (MLLM) for dynamic, multi-character control based on text prompts and user inputs. ii) Main research question/objective: How to generate customized manga panels with multiple characters, precise layout control, and dynamic adaptation to textual prompts. iii) Key methodology: The approach employs an MLLM as a text-compatible identity adapter for diffusion-based image generation, using masked cross-attention to incorporate character features and a dialog embedding technique for precise dialog placement. iv) Primary results: DiffSensei outperforms existing models in experiments, achieving a 0.06 improvement in CLIP metrics compared to the multi-subject customization baseline, MS-Diffusion. v) Principal implication for AI practitioners: AI practitioners can leverage DiffSensei to create manga generation tools with enhanced character customization and layout control, enabling more dynamic and interactive storytelling capabilities. |
STIV: Scalable Text and Image Conditioned Video Generation (Read more on arXiv or HuggingFace) | jefflai, JesseAllardice, tsujuifu, wenzehu, Jiasenlu | Here is a concise summary of the research paper "STIV: Scalable Text and Image Conditioned Video Generation" following your guidelines: i) Summary: This paper introduces STIV, a scalable text-image-conditioned video generation model based on a Diffusion Transformer (DiT) architecture that can perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks. ii) Main research question/objective: How to develop a robust and scalable video generation model that effectively integrates text and image conditioning within a unified framework. iii) Key methodology: The authors integrated image conditioning into a DiT through frame replacement and text conditioning via joint image-text conditional classifier-free guidance, and conducted a systematic study on model architectures, training recipes, and data curation strategies. iv) Primary results: The 8.7B parameter STIV model achieved a state-of-the-art VBench T2V score of 83.1 and a VBench I2V score of 90.1 at 512x512 resolution, surpassing models like CogVideoX-5B, Pika, Kling, and Gen-3. v) Principal implication for AI practitioners: AI practitioners can leverage the STIV framework and the provided recipes for building and scaling video generation models, enabling the development of more versatile and reliable video generation solutions for various downstream applications. |
Hidden in the Noise: Two-Stage Robust Watermarking for Images (Read more on arXiv or HuggingFace) | Niv Cohen, chegde, rtealwitter, penfever, kasraarabi | Here is a concise summary of the research paper "Hidden in the Noise: Two-Stage Robust Watermarking for Images": i) Summary: The paper introduces WIND, a two-stage watermarking method for images generated by diffusion models, designed to be robust against removal and forgery attacks. ii) Main research question/objective: How to develop a distortion-free watermarking technique for diffusion-generated images that is robust to common attacks while maintaining detection efficiency? iii) Key methodology: WIND employs a two-stage approach, first embedding a group identifier in the Fourier space of the initial noise and then using a secret salt and hash function to generate a unique, reproducible initial noise for watermarking (a minimal sketch of this salt-and-hash noise generation appears after this table). iv) Primary results: WIND achieved a 94.7% average detection accuracy across various image transformation attacks when using 128 groups of initial noises, and the proposed method demonstrates resilience against a regeneration attack. v) Principal implication for AI practitioners: AI practitioners can utilize WIND to watermark images generated by their models, enabling them to verify image origins and protect against unauthorized use, with negligible impact on image quality. |
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics (Read more on arXiv or HuggingFace) | Yuqian Zhou, He Zhang, Zhifei Zhang, jimmie33, xichenhku | Here is a concise summary of the research paper "UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics": i) Summary: UniReal is a unified framework for diverse image generation and editing tasks, treating image tasks as discontinuous video generation and learning from large-scale videos. ii) Main research question/objective: To develop a unified framework that can address various image generation and editing tasks within a single model using a scalable training paradigm. iii) Key methodology: The paper proposes leveraging a video generation framework based on a diffusion transformer, treating input/output images as video frames, and employing hierarchical prompts and image index embeddings for task and image coordination. iv) Primary results: UniReal outperforms existing methods in instructive image editing, customized image generation, and object insertion; e.g. UniReal achieves a CLIP score of 0.851 and a DINO score of 0.790 on the EMU Edit test set. v) Principal implication for AI practitioners: AI practitioners can leverage UniReal as a versatile tool for various image generation and editing tasks, simplifying development by using a single model trained on readily available video data instead of task-specific datasets. |
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations (Read more on arXiv or HuggingFace) | conghui, friskit, Liam-Liu, wanderkid, ouyanglinke | Here is a concise summary of the research paper "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations": i) Summary: This paper introduces OmniDocBench, a new benchmark for evaluating PDF document parsing methods, featuring a diverse dataset with comprehensive annotations. ii) Main research question/objective: To develop a robust, diverse, and fair evaluation standard for document content extraction methods. iii) Key methodology: Construction of a high-quality dataset with 981 PDF pages across nine types, with 19 layout category labels and 14 attribute labels for evaluating pipeline and end-to-end document parsing methods. iv) Primary results: Pipeline-based methods like MinerU and Mathpix achieved the best overall parsing performance (e.g., MinerU achieved 0.188 average edit distance across 9 PDF types); however, general VLMs showed stronger generalization on specialized data. v) Principal implication for AI practitioners: OmniDocBench provides a standardized benchmark for systematically evaluating and improving the accuracy, robustness, and generalization of document parsing models across diverse document types and layouts. |
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models (Read more on arXiv or HuggingFace) | myownskyW7, guandao, Dubhe-zmc, justimyhxu, tongwu2020 | Here's a concise summary of the paper: i) Summary: The paper introduces FiVA, a new dataset of 1 million images with fine-grained visual attribute annotations, and FiVA-Adapter, a framework for controlling image generation using these attributes. ii) Main research question or objective: To develop a method for decomposing the aesthetics of an image into specific visual attributes and enable users to control image generation based on these attributes. iii) Key methodology: Construction of a dataset (FiVA) using a pipeline involving attribute definition, prompt creation, LLM-based filtering, and human validation, followed by the development of an adaptation framework (FiVA-Adapter) that integrates a multimodal encoder into an image feature encoder for attribute extraction. iv) Primary results: The FiVA-Adapter achieved a subject accuracy of 0.817 in user studies, outperforming baseline methods. v) Principal implication for AI practitioners: AI practitioners can leverage the FiVA dataset and FiVA-Adapter to enhance the controllability of text-to-image diffusion models, enabling more precise manipulation of fine-grained visual attributes in generated images. |
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models (Read more on arXiv or HuggingFace) | Dongping Chen, Ethan Shen, Cheng-Yu Hsieh, Zelun Luo, Mahtab Bigverdi | Here is a concise summary of the research paper "Perception Tokens Enhance Visual Reasoning in Multimodal Language Models": i) Summary: This paper introduces "Perception Tokens," a novel approach to enhance visual reasoning in multimodal language models (MLMs) by using intermediate image representations as auxiliary reasoning tokens. ii) Main research question or objective: The main objective is to develop a method for augmenting MLMs with the ability to reason over intrinsic image representations, such as depth maps and bounding boxes, to improve performance on visual reasoning tasks. iii) Key methodology: The authors propose AURORA, a multi-task training framework that uses a VQVAE to transform intermediate image representations into tokenized formats and bounding box tokens, which are then used to train MLMs to leverage these "Perception Tokens" as chain-of-thought prompts. iv) Primary results: AURORA significantly improves performance on counting benchmarks, achieving a +10.8% improvement on BLINK. v) Principal implication for AI practitioners: AI practitioners can leverage AURORA to expand the scope of MLMs beyond language-based reasoning, enabling more effective visual reasoning capabilities by incorporating intermediate visual representations directly into the model's reasoning process. |
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation (Read more on arXiv or HuggingFace) | Menghan Xia, Sida Peng, Xintao Wang, Xian Liu, lemonaddie | Here is a concise summary of the research paper "3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation": i) 3DTrajMaster achieves state-of-the-art accuracy in controlling multi-entity 3D motions in video generation using 6DoF pose sequences as input. ii) The research objective was to manipulate multi-entity 3D motions in video generation, overcoming the limitations of prior methods that primarily used 2D control signals. iii) The core methodology involved a plug-and-play 3D-motion grounded object injector that fused multiple input entities with their 3D trajectories via a gated self-attention mechanism. A 360°-Motion Dataset was created for training, incorporating a domain adaptor and annealed sampling strategy to improve video quality. iv) The primary results showed that 3DTrajMaster achieved a 0.398m translation error and a 0.277-degree rotation error on average in controlling multiple entity motions. v) For AI practitioners, 3DTrajMaster provides a novel approach for controlling multi-entity 3D motions in video generation, and its new dataset of synchronized multi-camera recordings of diverse 3D entities addresses the limited availability of training data for this task. The paper does not detail specific architectural components (e.g., layer sizes, activation functions), so further clarification may be needed before direct reimplementation. |
Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation (Read more on arXiv or HuggingFace) | Kazuhiro Fukui, Erica K. Shimomoto, Lincon S. Souza, Pedro H. V. Valois | Here is a concise summary of the research paper "Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation": i) Summary: This paper introduces the Frame Representation Hypothesis (FRH) to interpret and control Large Language Models (LLMs) by representing words as frames (ordered sequences of linearly independent token vectors) and concepts as the average of word frames. ii) Main research question/objective: How can multi-token words be effectively modeled to enhance LLM interpretability and control? iii) Key methodology: The authors propose representing words as frames and concepts as the average of word frames within a defined Semantic Frame Space and introduce Top-k Concept-Guided Decoding to steer text generation. iv) Primary results: The FRH is validated by showing that over 99% of words across multiple languages in the Open Multilingual WordNet (OMW) are composed of linearly independent token vectors, and concept-guided generation effectively steers output towards desired concepts. v) Principal implication for AI practitioners: The FRH offers a novel framework for AI researchers and engineers to enhance LLM interpretability and control by leveraging multi-token word representations, enabling more precise manipulation of model outputs. |
Video Motion Transfer with Diffusion Transformers (Read more on arXiv or HuggingFace) | Sergey Tulyakov, fabvio, philiptorr, aliaksandr-siarohin, alexpondaven | Here is a concise summary of the paper "Video Motion Transfer with Diffusion Transformers": i) Summary: The paper introduces DiTFlow, a novel method for transferring motion from a reference video to a newly synthesized video using Diffusion Transformers (DiTs). ii) Main research question/objective: How to transfer the motion of a reference video to a newly synthesized one, specifically for Diffusion Transformers (DiT). iii) Key methodology: DiTFlow extracts an Attention Motion Flow (AMF) from a reference video by analyzing cross-frame attention maps in a pre-trained DiT, then uses this AMF to guide the latent denoising process in an optimization-based, training-free manner. iv) Primary results: DiTFlow outperforms all baseline methods in motion transfer on multiple metrics; specifically, it achieves a Motion Fidelity (MF) score of 0.785 on the 5B parameter model, compared to 0.766 for the best-performing baseline. v) Principal implication for AI practitioners: AI practitioners can leverage DiTFlow for improved motion transfer in video synthesis using DiTs, enabling more precise control over the motion of generated video content without the need for model retraining. |
EMOv2: Pushing 5M Vision Model Frontier (Read more on arXiv or HuggingFace) | Zhucun Xue, Teng Hu, Jiangning Zhang, LXT, hhy724 | Here is a concise summary of the research paper "EMOv2: Pushing 5M Vision Model Frontier": i) This paper introduces EMOv2, a new family of efficient vision models designed for resource-constrained scenarios, focusing on optimizing the trade-off between parameters, FLOPs, and performance within the 5M parameter magnitude. ii) The main research objective is to establish a new performance frontier for 5M parameter magnitude lightweight models on various downstream visual tasks. iii) The key methodology involves abstracting a Meta Mobile Block (MMBlock) to unify the design of Inverted Residual Block (IRB) and attention-based modules, and deducing an improved Inverted Residual Mobile Block (i2RMB) with a novel spanning attention mechanism. iv) EMOv2-5M achieves 79.4% Top-1 accuracy on ImageNet-1K classification, outperforming prior state-of-the-art models of similar size. v) For AI practitioners, EMOv2 provides a highly efficient and versatile backbone that can be readily adapted to various vision tasks, including classification, detection, segmentation, and generation, offering a strong baseline for mobile and edge device applications with strict parameter constraints. |
Granite Guardian (Read more on arXiv or HuggingFace) | Tejaswini Pedapati, Subhajit Chaudhury, Manish Nagireddy, Inkit Padhi, Giandomenico | Here is a concise summary of the research paper "Granite Guardian": 1. Summary: The paper introduces Granite Guardian, a suite of open-source Large Language Model (LLM) safeguards designed for risk detection in prompts and responses across various dimensions, including harmful content and Retrieval-Augmented Generation (RAG) hallucination. 2. Main research question/objective: To develop and evaluate a unified risk detection model family capable of identifying a broad spectrum of risks in LLM inputs and outputs, including those typically overlooked by traditional risk detection models. 3. Key methodology: Supervised fine-tuning of Granite 3.0 language models on a dataset combining human annotations from diverse sources and synthetic data, with a specialized safety instruction template for risk categorization. 4. Primary results: Granite Guardian achieves state-of-the-art risk detection with an AUC score of 0.871 on harmful content benchmarks. 5. Principal implication for AI practitioners: AI practitioners can use Granite Guardian as adaptable, plug-and-play components to enhance the safety and reliability of LLMs in various applications by enabling robust risk detection across multiple risk dimensions. |
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance (Read more on arXiv or HuggingFace) | Jianhua Han, Runhui Huang, Junwei Yang, Guansong Lu, Chunwei Wang | Here is a concise summary of the research paper "ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance": i) ILLUME is a unified multimodal large language model (MLLM) that integrates visual understanding and generation through a unified next-token prediction formulation. ii) Main research question/objective: Can a unified MLLM be developed more efficiently, and can the discriminative and generative capabilities of an MLLM enhance each other? iii) Key methodology: A semantic vision tokenizer incorporating semantic information and a progressive multi-stage training procedure are used to enhance data efficiency, alongside a novel self-enhancing multimodal alignment scheme. iv) Primary results: ILLUME requires only 15M data for image-text alignment during pretraining and achieves 7.76 FID score on the MJHQ30K benchmark. v) Principal implication for AI practitioners: AI practitioners can leverage ILLUME's efficient training approach and architecture for developing unified MLLMs with strong visual understanding and generation capabilities, potentially reducing the data and computational resources typically required. |
ObjCtrl-2.5D: Training-free Object Control with Camera Poses (Read more on arXiv or HuggingFace) | Chen Change Loy, Shangchen Zhou, Yushi Lan, Zhouxia Wang | Here is a concise summary of the research paper "ObjCtrl-2.5D: Training-free Object Control with Camera Poses": i) Summary: The paper introduces ObjCtrl-2.5D, a training-free method for controlling object motion in image-to-video generation by extending 2D trajectories to 3D and representing them as camera poses. ii) Main research question or objective: The main objective is to achieve more precise and versatile object control in image-to-video (I2V) generation compared to existing methods. iii) Key methodology used: ObjCtrl-2.5D extends 2D trajectories to 3D using depth information, models object movement as camera poses, and utilizes a Layer Control Module and Shared Warping Latent to adapt a camera motion control model for object motion control. iv) Primary results: ObjCtrl-2.5D achieved an Object Motion Control (ObjMC) score of 91.42 on the DAVIS dataset when combining a 2D trajectory with depth from the conditional image. v) Principal implication for AI practitioners: ObjCtrl-2.5D provides a training-free approach for precise object motion control in video generation, offering more diverse control capabilities than existing 2D trajectory-based methods without the need for model training. |
LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation (Read more on arXiv or HuggingFace) | Umberto Michieli, Pietro Zanuttigh, Mete Ozay, obohdal, donaldssh | Here is a concise summary of the research paper "LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation": i) Summary: LoRA.rar is a novel method that efficiently merges subject and style LoRAs using a pre-trained hypernetwork for fast, high-quality, personalized image generation. ii) Main research question or objective: The main objective is to develop a method for merging content and style LoRAs that achieves superior image quality compared to state-of-the-art methods while enabling real-time performance on resource-constrained devices. iii) Key methodology used: The key methodology involves pre-training a hypernetwork on a diverse dataset of content-style LoRA pairs to predict merging coefficients, enabling generalization to unseen pairs during deployment. iv) Primary results: LoRA.rar outperforms existing methods, including ZipLoRA, in both content and style fidelity, achieving a merging speedup of over 4000x and an average score of 0.71 on the proposed Multimodal Assistant Rating Subject & Style (MARS2) metric, compared to 0.58 for the next best method. v) Principal implication for AI practitioners: AI practitioners can leverage LoRA.rar for efficient, high-quality, subject-style conditioned image generation, particularly in applications requiring real-time performance on devices with limited computational resources. |
Fully Open Source Moxin-7B Technical Report (Read more on arXiv or HuggingFace) | Sung-En Chang, Yixin Shen, Zhenglun Kong, Xuan Shen, Pu Zhao | Here is a concise summary of the research paper "Fully Open Source Moxin-7B Technical Report": i) Summary: This paper introduces Moxin-7B, a fully open-source large language model (LLM) developed in accordance with the Model Openness Framework (MOF), emphasizing complete transparency in training, datasets, and implementation. ii) Main research question or objective: The main objective is to develop a high-performing, fully open-source 7B parameter LLM that adheres to the principles of open science, open source, open data, and open access as defined by the MOF. iii) Key methodology used: The model architecture extends the Mistral model, utilizing grouped-query attention and sliding window attention, trained on a mix of SlimPajama and DCLM-BASELINE datasets, with capability enhancement using data from HuggingFace. iv) Primary results: Moxin-7B-finetuned achieves superior performance in zero-shot evaluation compared with popular 7B models, notably scoring 82.24% on the PIQA benchmark. v) Principal implication for AI practitioners: AI practitioners can leverage Moxin-7B's open-source nature, including its training code, datasets, and checkpoints, to further innovate, customize, and deploy LLMs across diverse applications, fostering a more transparent and collaborative AI ecosystem. |
Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation (Read more on arXiv or HuggingFace) | Felice Dell'Orletta, Marco Avvenuti, Amaury Trujillo, Alessio Miaschi, Lorenzo Cima | Here is a concise summary of the paper "Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation": i) This paper investigates strategies for generating tailored counterspeech using the LLaMA2-13B model, focusing on adaptation to conversation context and personalization to the user. ii) The main research question is whether contextualized counterspeech, adapted to the community and conversation and personalized to the user, is more persuasive than generic counterspeech. iii) The key methodology involved fine-tuning LLaMA2-13B with various configurations of contextual information (community, conversation, user history) and evaluating the generated counterspeech through quantitative indicators and a crowdsourced human evaluation. iv) The primary results show that contextualized counterspeech can outperform generic counterspeech in adequacy and persuasiveness; for instance, the configuration [Ba Pr Hi] outperformed the baseline in user-persuasiveness with a statistically significant difference (p < 0.01). v) The principal implication for AI practitioners is that incorporating contextual information like conversation history can significantly enhance the effectiveness of AI-generated counterspeech, though there exists a discrepancy between algorithmic and human evaluations of the output. |
Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment (Read more on arXiv or HuggingFace) | Jitendra Malik, Masayoshi Tomizuka, Chenfeng Xu, Yilin Wu, Ran Tian | Here is a concise summary of the research paper: i) Summary: The paper introduces Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from human preference feedback to align visuomotor robot policies. ii) Main research question or objective: How can visuomotor robot policies be aligned with end-user preferences using minimal human feedback? iii) Key methodology: RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user's visual representation, then constructs a dense visual reward via feature matching using optimal transport in this aligned representation space. iv) Primary results: RAPL can fine-tune visuomotor policies with 5x less real human preference data compared to traditional reinforcement learning from human feedback (RLHF) methods. v) Principal implication for AI practitioners: AI practitioners can leverage RAPL to align pre-trained visuomotor policies with significantly less human feedback, making it more feasible to deploy such policies in real-world scenarios where collecting extensive human feedback is impractical. |
Chimera: Improving Generalist Model with Domain-Specific Experts (Read more on arXiv or HuggingFace) | Renrui Zhang, Renqiu Xia, Hongbin Zhou, Mingsheng Li, Tianshuo Peng | Here is a concise summary of the research paper "Chimera: Improving Generalist Model with Domain-Specific Experts": i) Summary: This paper introduces Chimera, a multi-modal pipeline that integrates domain-specific expert models into a generalist large multi-modal model (LMM) to enhance performance on specialized tasks. ii) Main research question or objective: How to effectively improve the performance of generalist LMMs on domain-specific tasks without sacrificing their general capabilities. iii) Key methodology: A progressive training strategy with a Generalist-Specialist Collaboration Masking (GSCM) mechanism was used to merge features from expert models into the input of a generalist LMM, along with a router to determine expert model invocation. iv) Primary results: Chimera achieved state-of-the-art performance on multi-modal reasoning benchmarks, with an overall accuracy of 64.9 on MathVista. v) Principal implication for AI practitioners: AI practitioners can leverage Chimera's pipeline to scale up existing LMMs with domain-specific experts, significantly enhancing performance on specialized tasks without extensive retraining or compromising generalist capabilities. |
A New Federated Learning Framework Against Gradient Inversion Attacks (Read more on arXiv or HuggingFace) | Weihong Ren, Xiaodan Zhang, Wenhao Chen, Shuang Zeng, gpx333 | Here is a concise summary of the paper "A New Federated Learning Framework Against Gradient Inversion Attacks": i) This paper introduces HyperFL, a new federated learning framework designed to protect against gradient inversion attacks. ii) The main research objective is to develop a federated learning framework that offers a favorable privacy-utility trade-off against gradient inversion attacks without relying on existing defense mechanisms such as secure multi-party computation (SMC), homomorphic encryption (HE), and differential privacy (DP). iii) The key methodology involves using hypernetworks to generate the parameters of local models, sharing only hypernetwork parameters for server aggregation, and decomposing local models into shared feature extractors and private classifiers. iv) Primary results show that HyperFL achieves comparable performance to state-of-the-art methods while enhancing privacy; for instance, HyperFL achieved 76.29% accuracy on the EMNIST dataset with 20 clients, surpassing several existing methods. v) The principal implication for AI practitioners is that HyperFL can be used as a more privacy-preserving alternative to traditional federated learning frameworks, particularly in applications where data sensitivity is a critical concern. |
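To ground the WIND entry above, the following is a minimal, hypothetical sketch of the "secret salt and hash" half of such a scheme: the initial diffusion noise is derived deterministically from a secret salt and an image identifier, so a verifier holding the salt can regenerate candidate noises and check which one best matches a noise recovered from a suspect image. The Fourier-space group identifier, the inversion step used to recover the noise, and any decision threshold are omitted or stand-ins here, not the authors' implementation.

```python
import hashlib
import torch

def watermark_noise(secret_salt: bytes, image_id: str, shape=(4, 64, 64)) -> torch.Tensor:
    """Derive a reproducible initial noise tensor from a secret salt and an image id."""
    digest = hashlib.sha256(secret_salt + image_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(shape, generator=gen)

def detect(recovered_noise: torch.Tensor, secret_salt: bytes, candidate_ids):
    """Return the candidate id whose regenerated noise best matches the recovered noise."""
    best_id, best_score = None, -1.0
    for image_id in candidate_ids:
        candidate = watermark_noise(secret_salt, image_id, recovered_noise.shape)
        score = torch.nn.functional.cosine_similarity(
            candidate.flatten(), recovered_noise.flatten(), dim=0
        ).item()
        if score > best_score:
            best_id, best_score = image_id, score
    return best_id, best_score

# Usage sketch: a noise recovered from a suspect image (e.g., via inversion) is matched
# against candidates; here the recovered noise is simulated by adding perturbation.
salt = b"my-secret-salt"
noise = watermark_noise(salt, "img-0042")
recovered = noise + 0.3 * torch.randn_like(noise)  # stand-in for inversion + attack noise
print(detect(recovered, salt, ["img-0001", "img-0042", "img-0099"]))
```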
Title | Authors | Summary |
---|---|---|
ProcessBench: Identifying Process Errors in Mathematical Reasoning (Read more on arXiv or HuggingFace) | Keming Lu, Beichen Zhang, Zhenru Zhang, RunjiLin, chujiezheng | Here is a concise summary of the research paper "PROCESSBENCH: Identifying Process Errors in Mathematical Reasoning": i) PROCESSBENCH is a new benchmark for evaluating the ability of language models to identify erroneous steps in mathematical reasoning. ii) The main research objective is to develop and evaluate a benchmark, PROCESSBENCH, for measuring the capability of models to identify the earliest erroneous step in mathematical reasoning solutions. iii) The key methodology involves curating a dataset of 3,400 mathematical problems with expert-annotated step-by-step solutions, and evaluating various process reward models (PRMs) and critic models (i.e., prompted general language models) on their ability to identify the first incorrect step. iv) The primary result is that the best open-source model, QwQ-32B-Preview, achieved an average F1 score of 71.5 across all subsets, surpassing the proprietary model GPT-4o (61.9 F1 score) but lagging behind o1-mini (87.9 F1 score). v) The principal implication for AI practitioners is that existing PRMs generally fail to identify process errors in challenging math problems, while prompting large language models as critics shows promise, highlighting the need for better methods for scalable oversight of mathematical reasoning in AI systems. |
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models (Read more on arXiv or HuggingFace) | Wanxiang Che, Libo Qin, Yuxi Xie, Tianhao Niu, LooperXX | Here is a concise summary of the AI research paper "Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models": 1. Summary: This paper introduces MMGIC, a new multimodal dataset featuring multi-grained concept annotations, and demonstrates its effectiveness in improving the performance of Multimodal Large Language Models (MLLMs) on vision-language tasks. 2. Main Research Question/Objective: The main objective was to investigate whether integrating fine-grained concept annotations (e.g., object labels, attributes, and relationships) with coarse-grained annotations (e.g., image captions) can enhance MLLMs' performance in multimodal comprehension and generation. 3. Key Methodology: The authors constructed the MMGIC dataset by integrating multi-grained concept annotations into image-text interleaved documents using a structured template and trained MLLMs with an autoregressive objective to predict the next visual or textual token in a multimodal sequence. They also evaluated different data recipes and compared MMGIC against image-caption data. 4. Primary Results: Experiments showed that multi-grained concept annotations in MMGIC integrate and complement each other, leading to improved performance on 12 multimodal comprehension and generation benchmarks. For instance, the appropriate combination of MMGIC with image-caption data achieved a 3.95% absolute improvement over image-caption data alone on the POPE benchmark. 5. Principal Implication for AI Practitioners: AI practitioners can leverage the MMGIC dataset and the proposed training framework to develop MLLMs with enhanced capabilities in aligning vision and language at multiple granularities, leading to better performance on downstream vision-language tasks. |
Training Large Language Models to Reason in a Continuous Latent Space (Read more on arXiv or HuggingFace) | Zhiting Hu, Xian Li, DiJia Su, Sainbayar Sukhbaatar, Shibo Hao | Here is a concise summary of the research paper: i) Summary: The paper introduces COCONUT, a novel paradigm that enables large language models (LLMs) to reason in a continuous latent space instead of the discrete language space. ii) Main research question or objective: Can LLMs reason more effectively in an unrestricted continuous latent space compared to the traditional language space? iii) Key methodology: COCONUT utilizes the last hidden state of the LLM as a "continuous thought" and feeds it back as the subsequent input embedding, training with a multi-stage curriculum that replaces language reasoning steps with continuous thoughts (a minimal sketch of this latent-feedback decoding loop appears after this table). iv) Primary results: COCONUT outperforms the Chain-of-Thought (CoT) method in certain logical reasoning tasks, achieving 97.0% accuracy on the ProsQA dataset compared to 77.5% for CoT. v) Principal implication for AI practitioners: AI practitioners can leverage COCONUT to develop LLMs with enhanced reasoning capabilities, especially for tasks requiring substantial planning and fewer inference tokens. |
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation (Read more on arXiv or HuggingFace) | Ying Shan, Yixiao Ge, Yizhuo Li, Yuying Ge | Here is a concise summary of the paper "Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation" based on your specified format: i) Summary: This paper introduces Divot, a diffusion-powered video tokenizer that learns spatiotemporal video representations for unified video comprehension and generation within a large language model (LLM). ii) Main research question/objective: To develop a video tokenizer that captures spatial and temporal video features, enabling LLMs to perform both video comprehension and generation. iii) Key methodology: A diffusion model is trained to de-noise video clips conditioned on the tokenizer's spatiotemporal representations, thereby optimizing the tokenizer. The tokenizer is then integrated with a pre-trained LLM, Divot-LLM, to predict the parameters of a Gaussian Mixture Model (GMM) for modeling the distribution of continuous video features. iv) Primary results: Divot-LLM achieves competitive performance on video comprehension benchmarks; for example, it obtains a 76.4% accuracy on the MVBench video comprehension benchmark. v) Principal implication for AI practitioners: AI practitioners can leverage the proposed diffusion-based video tokenizer to build unified models for video understanding and generation tasks. |
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale (Read more on arXiv or HuggingFace) | Tiejun Huang, Zhengxiong Luo, Haoge Deng, Infinite888, bruiiii | Okay, here is a concise summary of the research paper "You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale", strictly adhering to your guidelines: i) Summary: This paper introduces See3D, a visual-conditional multi-view diffusion model for 3D content creation trained on a large-scale dataset of internet videos without pose annotations. ii) Main research question or objective: How can we effectively learn 3D knowledge from large-scale Internet videos without explicit 3D geometry or camera pose annotations? iii) Key methodology: A four-step data curation pipeline was used to create WebVi3D dataset, and a novel visual-conditional multi-view diffusion model, See3D, was trained on this dataset using a time-dependent visual signal generated by adding noise to masked video data, thereby eliminating the need for pose conditions. iv) Primary results: See3D achieved a PSNR of 24.28 on the CO3D dataset for single-view reconstruction, outperforming models trained on constrained 3D datasets. v) Principal implication for AI practitioners: AI practitioners can leverage See3D to develop 3D generation models using large-scale, readily available video data without the need for costly 3D or pose annotations, significantly reducing the barriers to creating scalable 3D content generation systems. |
Robust Multi-bit Text Watermark with LLM-based Paraphrasers (Read more on arXiv or HuggingFace) | Hang Li, Yang Liu, Yuanshun Yao, Jinghan Jia, xiaojunxu | Here is a concise summary of the research paper: i) Summary: This paper introduces a method for embedding multi-bit watermarks into text using fine-tuned, LLM-based paraphrasers and a trained decoder, achieving high detection accuracy and robustness. ii) Main research question/objective: How can a multi-bit watermark be robustly embedded into text while preserving its semantic meaning and remaining imperceptible? iii) Key methodology: The authors fine-tune a pair of LLM paraphrasers as encoders to inject watermark bits by alternatively paraphrasing text segments, and train an LLM-based text classifier as a decoder to extract the watermark. The encoder-decoder pair is co-trained using PPO-based reinforcement learning techniques. iv) Primary results: The proposed method achieves over 99.99% detection AUC with small (1.1B) text paraphrasers, outperforming existing methods. The watermark is evaluated as robust under word substitution and sentence paraphrasing perturbations. v) Principal implication for AI practitioners: AI practitioners can use this watermarking technique to embed robust and imperceptible multi-bit watermarks in text generated by language models, enabling applications such as copyright protection and tracking of misinformation. |
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction (Read more on arXiv or HuggingFace) | Mingyang Sun, Siteng Huang, Shangke Lyu, Pengxiang Ding, Zhefei Gong | Here is a concise summary of the research paper "CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction": i) Summary: The paper introduces Coarse-to-Fine AutoRegressive Policy (CARP), a novel visuomotor policy learning paradigm that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach for robotic tasks. ii) Main research question/objective: Can a coarse-to-fine autoregressive approach achieve the high performance of diffusion-based models while maintaining the efficiency of traditional autoregressive models in visuomotor policy learning? iii) Key methodology: CARP decouples action generation into two stages: a multi-scale action autoencoder learns representations of the action sequence, and a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. iv) Primary results: CARP achieves competitive success rates on state-based and image-based simulation benchmarks and real-world tasks, delivering 10x faster inference compared to state-of-the-art policies. v) Principal implication for AI practitioners: AI practitioners can leverage CARP as a high-performance, efficient, and flexible framework for action generation in robotic tasks, offering a superior balance of performance and efficiency compared to existing methods. |
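The COCONUT entry above describes feeding the model's last hidden state back as the next input embedding. The sketch below illustrates that inference-time loop under stated assumptions: it is not the authors' code, and the backbone ("gpt2"), the number of latent steps, and the prompt are placeholders chosen only for illustration.

```python
# Hedged sketch of COCONUT-style latent reasoning: at each latent step, the final
# hidden state is appended as the next input embedding instead of decoding a token.
# Backbone, step count, and prompt are illustrative assumptions, not paper details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"    # stand-in backbone; the paper trains a larger LLM
NUM_LATENT_STEPS = 4   # hypothetical number of continuous thoughts

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

prompt = "Question: If x + 3 = 7, what is x? Think, then answer."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
inputs_embeds = model.get_input_embeddings()(input_ids)        # (1, T, d)

with torch.no_grad():
    for _ in range(NUM_LATENT_STEPS):
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]              # "continuous thought"
        # Feed the thought back as the next input embedding (no token is sampled).
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)
    # After the latent steps, fall back to ordinary token decoding for the answer.
    answer_ids = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=20)

print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```

Note that the paper's multi-stage training curriculum, which gradually replaces chain-of-thought tokens with such latent steps, is not shown here; without that training, a stock backbone will not produce meaningful answers from this loop.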
Title | Authors | Summary |
---|---|---|
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (Read more on arXiv or HuggingFace) | Yangzhou Liu, Yue Cao, Zhe Chen, qishisuren, Weiyun1025 | Here's a summary of the AI research paper: i) InternVL 2.5, an advanced multimodal large language model (MLLM), significantly improves open-source multimodal capabilities through model, data, and test-time scaling. ii) To systematically investigate the relationship between model scaling and performance in MLLMs, focusing on how scaling vision encoders, language models, dataset sizes, and inference times impact performance. iii) The study employed a three-stage training pipeline (MLP warmup, optional ViT incremental learning, and full model instruction tuning) combined with dynamic high-resolution training and data filtering techniques. iv) InternVL 2.5 achieved a 3.7-point improvement on the MMMU benchmark (reaching 70.1%) through Chain-of-Thought (CoT) reasoning, alongside gains on several other benchmarks. v) The significant performance improvement of InternVL 2.5 on MMMU and other benchmarks, especially its surpassing 70% accuracy on MMMU, demonstrates the potential for open-source MLLMs to rival commercial models and provides a strong open-source baseline for future multimodal AI development. Some aspects of the training methodology, such as specifics of the data filtering techniques, are not fully detailed. |
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment (Read more on arXiv or HuggingFace) | Cheng Jin, Xiaomeng Yang, Junyan Wang, Zhiyu Tan, Yibin Wang | Here is a concise summary of the research paper "LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment": i) This paper introduces LiFT, a novel pipeline that utilizes human feedback to improve the alignment of text-to-video (T2V) models with human preferences. ii) Main research question or objective: How can human feedback be effectively leveraged to align T2V models with subjective human expectations regarding video quality and content? iii) Key methodology used: A three-stage pipeline is proposed: human feedback collection to create the LIFT-HRA dataset, training a reward model (LIFT-CRITIC) to predict human feedback scores and reasoning, and fine-tuning the T2V model using reward-weighted likelihood maximization. iv) Primary results: The fine-tuned CogVideoX-2B model using LIFT-CRITIC-40B outperforms the CogVideoX-5B baseline across all 16 metrics of the VBench benchmark. For instance, in the "Object Class" category, CogVideoX-2B-LIFT (40B) achieves a score of 91.77, compared to CogVideoX-5B's score of 88.99. v) Principal implication for AI practitioners: AI practitioners can use the LiFT pipeline and the LIFT-HRA dataset to improve the alignment of T2V models by incorporating human feedback, but the paper does not specify how generalizable this method is to other T2V models. |
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale (Read more on arXiv or HuggingFace) | Yuelin Bai, Tuney Zheng, Jarvis Guo, yuexiang96, luodian | Here's a summary of the AI research paper following your specified guidelines: i) 1-line summary: MAmmoTH-VL, a novel multimodal instruction-tuning dataset constructed using open-source models, significantly improves multimodal reasoning capabilities in large language models (LLMs). ii) Main research question or objective: How can a scalable and cost-effective method be developed to create a large-scale multimodal instruction-tuning dataset that elicits chain-of-thought (CoT) reasoning, thus improving the reasoning capabilities of open-source MLLMs? iii) Key methodology used: A three-step pipeline: (1) collecting and categorizing open-source multimodal data; (2) augmenting and rewriting tasks using open-source LLMs/MLLMs to elicit CoT reasoning; (3) self-filtering the data using an open-source MLLM to ensure data quality. iv) Primary results: Training an 8B parameter MLLM on the resulting 12M instruction-response pairs yielded an 8.1% improvement on the MathVerse benchmark compared to the previous open-source state-of-the-art. v) Principal implication for AI practitioners: The study provides a cost-effective and scalable methodology for building high-quality, rationale-enriched multimodal datasets using only open-source tools, significantly advancing the development and application of open-source MLLMs. The substantial performance gains demonstrate the importance of high-quality, CoT-style instruction data for enhancing reasoning capabilities in MLLMs. |
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases (Read more on arXiv or HuggingFace) | Kyunghoon Bae, Soyoung An, LG AI Research, lhg912, Sunkyoung | Here is a summary of the AI research paper following your specified guidelines: i) This technical report introduces EXAONE 3.5, a series of instruction-tuned large language models (LLMs) with varying parameter sizes (2.4B, 7.8B, and 32B) designed for real-world applications. ii) The main objective is to develop and release a series of LLMs addressing user feedback regarding the need for smaller, efficient models deployable on low-resource devices and larger models with enhanced real-world performance capabilities, including superior instruction following and long-context processing. iii) The key methodology involved pre-training on a massive corpus followed by instruction tuning and preference optimization, including decontamination to remove test-set examples from training data. Long-context capability was improved using a long-context fine-tuning method. iv) EXAONE 3.5 models achieved the highest scores across seven benchmarks for real-world instruction following; one specific finding is the 2.4B model outperformed similarly sized baselines across all three evaluation categories. v) The most impactful finding, the superior performance of the smaller 2.4B model, offers implications for AI practitioners by demonstrating cost-effective and high-performing sLLMs, meeting industry demand for models suitable for on-device deployment and resource-constrained environments. The study's methodology for improving long-context processing also offers insight into improving LLMs. |
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation (Read more on arXiv or HuggingFace) | Mingyu Ding, Yixiao Ge, Yizhuo Li, Yuying Ge, Yi Chen | Here's a concise summary of the research paper "Moto: Latent Motion Token as the Bridging Language for Robot Manipulation": i) Summary: This paper introduces Moto, a novel framework that utilizes latent motion tokens for autoregressive pre-training on videos to enhance robot manipulation learning. ii) Main research question or objective: Can a generative pre-training approach using latent motion tokens, derived from video data, effectively enhance robot learning for manipulation tasks? iii) Key methodology: Moto employs a Latent Motion Tokenizer to convert video content into sequences of latent motion tokens and pre-trains Moto-GPT via next motion token prediction, followed by a co-fine-tuning strategy to bridge motion priors and real robot control. iv) Primary results: Moto outperforms baseline models on the SIMPLER and CALVIN benchmarks; notably, on SIMPLER, Moto achieved an overall success rate of 0.614, surpassing larger models like RT-2-X and OpenVLA. v) Principal implication for AI practitioners: AI practitioners can leverage Moto's pre-training approach on readily available video datasets to enhance the performance of robot manipulation policies, especially in scenarios with limited action-labeled data. |
APOLLO: SGD-like Memory, AdamW-level Performance (Read more on arXiv or HuggingFace) | Sem Park, Xi Liu, Wenyan Cong, Hanqing Zhu, Kyriection | Here is a concise summary of the research paper "APOLLO: SGD-like Memory, AdamW-level Performance": i) Summary: The paper introduces APOLLO, a memory-efficient optimizer for large language model (LLM) training that achieves performance comparable to AdamW while significantly reducing memory usage. ii) Main research question or objective: Can structured learning rate adaptation be converted into a practical, memory-efficient optimization method for LLM training? iii) Key methodology: APOLLO approximates channel-wise or tensor-wise gradient scaling factors using an auxiliary low-rank space based on random projections, eliminating the need for costly SVD operations (an illustrative sketch follows this table). iv) Primary results: APOLLO consistently outperforms AdamW in pre-training experiments across various LLaMA model sizes, achieving up to a 2.8 reduction in validation perplexity, and enables 3x throughput on an 8xA100-80GB setup compared to AdamW. v) Principal implication for AI practitioners: APOLLO allows AI practitioners to train LLMs more efficiently by drastically reducing optimizer memory overhead, enabling larger batch sizes, improved model scalability, and training on lower-end GPUs. |
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion (Read more on arXiv or HuggingFace) | Cuong Pham, Anh Tran, Khoi Nguyen, Quang Nguyen, Tung11 | Here's a concise summary of the research paper "SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion," following your specified guidelines: i) Summary: SwiftEdit is a text-guided image editing tool that achieves editing via a one-step diffusion process. ii) Main research question/objective: Develop an efficient method for instant text-guided image editing that overcomes the speed limitations of existing multi-step diffusion-based methods. iii) Key methodology: A one-step inversion framework for image reconstruction and a mask-guided editing technique with attention rescaling for localized editing are proposed. The inversion framework uses a two-stage training strategy using synthetic and real images. iv) Primary results: SwiftEdit achieves text-guided image editing in 0.23 seconds, which is at least 50 times faster than previous multi-step methods while maintaining competitive editing quality. v) Principal implication for AI practitioners: SwiftEdit offers a highly efficient tool for instant text-guided image editing, enabling faster performance in real-world applications without the need for users to define masks. |
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration (Read more on arXiv or HuggingFace) | Yu Wang, Xuefei Ning, Yukun Huang, fjxmlzn, NinaKarine | Here is a concise summary of the research paper "GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration": i) GENMAC is a multi-agent framework for compositional text-to-video generation that uses an iterative process with DESIGN, GENERATION, and REDESIGN stages. ii) The main research objective is to develop a system that can generate videos adhering to complex compositional text prompts involving multiple objects, attributes, and dynamic actions. iii) The key methodology involves decomposing the REDESIGN stage into sequential tasks (verification, suggestion, correction, and output structuring) handled by specialized MLLM-based agents, and using a self-routing mechanism to select the appropriate correction agent. iv) GENMAC achieved a 0.5166 G-Dino score on the generative numeracy subset of the T2V-CompBench benchmark, outperforming all baselines. v) For AI practitioners, GENMAC offers a framework for enhancing compositional text-to-video generation by leveraging multi-agent collaboration and iterative refinement, demonstrating a method to improve alignment between generated video content and complex textual descriptions. |
Mind the Time: Temporally-Controlled Multi-Event Video Generation (Read more on arXiv or HuggingFace) | Yuwei Fang, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Ziyi Wu | Here is a summary of the paper "Mind the Time: Temporally-Controlled Multi-Event Video Generation" following your guidelines: i) Summary: This paper introduces MinT, a novel video generation model capable of producing multi-event videos with precise temporal control over each event. ii) Main research question/objective: How can AI models generate videos with multiple, temporally distinct events, each with specified start and end times, using individual text prompts? iii) Key methodology: MinT utilizes a temporally-grounded video diffusion transformer with a time-based positional encoding method called ReRoPE to bind each event to its specific time period, enabling time-aware cross-attention between event captions and video tokens. iv) Primary results: MinT outperforms existing open-source video generation models in multi-event video generation, achieving a text-to-video alignment score of 3.00 on the StoryBench dataset, compared to 2.83 for the next best model (MEVG). v) Principal implication for AI practitioners: AI practitioners can leverage MinT to generate videos with multiple events and precise temporal control, enabling more sophisticated and realistic video content creation. |
2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constraints for High-Fidelity Indoor Scene Reconstruction (Read more on arXiv or HuggingFace) | Xiansong Lai, Haodong Xiang, Crayon-Shinchan, ChaosLiao, Valentina-Zhang | Here is a concise summary of the research paper "2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constraints for High-Fidelity Indoor Scene Reconstruction": i) Summary: This paper introduces 2DGS-Room, a novel method for high-fidelity indoor scene reconstruction using 2D Gaussian Splatting with a seed-guided mechanism and geometric constraints. ii) Main research question or objective: The main objective is to develop a method for accurate and high-fidelity geometric reconstruction of indoor scenes. iii) Key methodology used: The key methodology involves a seed-guided mechanism to control the distribution of 2D Gaussians, adaptive growth and pruning of seed points, incorporation of monocular depth and normal priors, and multi-view consistency constraints. iv) Primary results: The method achieves state-of-the-art performance in indoor scene reconstruction on the ScanNet and ScanNet++ datasets; quantitatively, 2DGS-Room achieves an F-score of 0.464 on the ScanNet++ dataset. v) Principal implication for AI practitioners: AI practitioners can utilize 2DGS-Room for improved 3D reconstruction of indoor scenes, leveraging its seed-guided 2D Gaussian Splatting approach for enhanced accuracy in applications like virtual reality and robotics. |
DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling (Read more on arXiv or HuggingFace) | Haiyang Yu, Nan Xu, Kun Chen, Xinghua Zhang, iiiiwis | Here is a summary of the AI research paper "DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling" following your specified guidelines: i) This paper introduces DEMO, a benchmark for Dialogue Element Modeling, encompassing element awareness and dialogue agent interaction, to evaluate large language models' (LLMs) ability to understand and generate dialogues. ii) The main research objective is to develop a comprehensive framework and benchmark for modeling fine-grained dialogue elements across the entire dialogue lifecycle (prelude, interlocution, and epilogue). iii) The key methodology involves a novel data synthesis framework that distills goals, scenes, and personas, generates dialogues using advanced LLMs, and performs quality control through LLM-based annotation and human verification. They also trained a DEMO agent based on imitation learning. iv) The primary results show that while advanced LLMs like GPT-4o demonstrate strong performance, there is still significant room for improvement in dialogue element modeling, with the DEMO agent built on LLaMA achieving a SOTA element awareness score of 6.008. v) The principal implication for AI practitioners is that the DEMO benchmark and the associated agent provide a valuable tool for developing and evaluating LLMs with enhanced capabilities in understanding and generating nuanced, element-driven dialogue, particularly in social intelligence generalization. |
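For the APOLLO entry above, the snippet below sketches the core idea of estimating channel-wise gradient scaling in a random low-rank space for a single weight matrix. The rank, hyperparameters, and exact form of the scaling rule are assumptions made for illustration; they are not taken from the paper's implementation.

```python
# Illustrative APOLLO-style update: Adam-like moments are kept only in a random
# low-rank projection of the gradient, and the resulting channel-wise scaling is
# applied to the raw gradient. Hyperparameters and shapes are hypothetical.
import torch

def apollo_like_update(param, grad, state, lr=1e-3, rank=8,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    m, n = grad.shape
    if "proj" not in state:
        # Fixed random projection (n -> rank); only low-rank moments are stored.
        state["proj"] = torch.randn(n, rank, device=grad.device) / rank ** 0.5
        state["exp_avg"] = torch.zeros(m, rank, device=grad.device)
        state["exp_avg_sq"] = torch.zeros(m, rank, device=grad.device)

    g_low = grad @ state["proj"]                                   # (m, rank)
    state["exp_avg"].mul_(beta1).add_(g_low, alpha=1 - beta1)
    state["exp_avg_sq"].mul_(beta2).addcmul_(g_low, g_low, value=1 - beta2)

    # Channel-wise scaling: ratio of adapted to raw gradient norms in the low-rank space.
    adapted = state["exp_avg"] / (state["exp_avg_sq"].sqrt() + eps)
    scale = adapted.norm(dim=1, keepdim=True) / (g_low.norm(dim=1, keepdim=True) + eps)
    param.data.add_(grad * scale, alpha=-lr)                       # scale the raw gradient

# Toy usage: one update step on a random weight matrix.
W = torch.nn.Parameter(torch.randn(64, 128))
(W ** 2).sum().backward()
state = {}
apollo_like_update(W, W.grad, state)
print(W.shape, state["exp_avg"].shape)   # torch.Size([64, 128]) torch.Size([64, 8])
```

In this sketch only the two rank-8 moment matrices are stored instead of full AdamW states, so optimizer-state memory per matrix shrinks roughly by a factor of n / rank, which is the intuition behind the SGD-like memory footprint described in the summary.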
Title | Authors | Summary |
---|---|---|
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection (Read more on arXiv or HuggingFace) | Zhongyuan Wang, Zhizheng Zhang, Qi Su, chengchi, Zhoues | Code-as-Monitor (CaM) uses a vision-language model to generate code that monitors for and prevents robot failures in real time. The research aims to create a unified system for both reactive (detecting failures after they occur) and proactive (preventing foreseeable failures) open-set failure detection in robotic tasks. The key methodology involves formulating robotic failure detection as a constraint satisfaction problem, using visually-prompted code to monitor if these constraints are met during task execution. In simulated "Stack in Order" tasks with severe disturbances, CaM achieved a 17.5% higher success rate than the DoReMi baseline. This allows AI practitioners to build more robust and reliable closed-loop robotic systems capable of handling unexpected events and complex, long-horizon tasks. |
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (Read more on arXiv or HuggingFace) | tianbaoxiexxx, ludunjie, ZeonLap, kugwzk, ranpox | AGUVIS is a unified, pure vision-based framework for building generalizable GUI agents. The research aimed to develop a cross-platform autonomous GUI agent capable of performing complex tasks independently without relying on external closed-source models. The key methodology involved a two-stage training pipeline using a Vision-Language Model (VLM): first for GUI grounding on a newly created template-augmented dataset, followed by planning and reasoning training on a VLM-augmented trajectory dataset. AGUVIS-72B achieved a task success rate of 89.2% on ScreenSpot, outperforming previous state-of-the-art methods in both offline and real-world online scenarios. This indicates a significant advancement towards creating fully autonomous, vision-based GUI agents, offering AI practitioners a potentially more efficient and adaptable solution for automating interactions with diverse digital environments compared to text-based or LLM-dependent approaches. |
A Noise is Worth Diffusion Guidance (Read more on arXiv or HuggingFace) | Minjae Kim, Sanghyun Lee, Jiwon Kang, Donghoon Ahn, Min-Jaewon | NoiseRefine improves text-to-image diffusion model quality without guidance methods like classifier-free guidance (CFG). The research explores whether guidance can be replaced by refining initial noise in the diffusion pipeline. The authors train a noise refining model using multistep score distillation (MSD) to map standard Gaussian noise to a learned "guidance-free" noise space, derived from inverting guided high-quality images. Refined noise achieved FID scores comparable to, and in some cases better than, CFG guidance. This method offers AI practitioners a faster and potentially higher-quality alternative to computationally expensive guidance methods for text-to-image diffusion models. |
Evaluating Language Models as Synthetic Data Generators (Read more on arXiv or HuggingFace) | Seongyun Lee, Vijay Viswanathan, Xiang Yue, Juyoung Suk, seungone | AGORABENCH benchmarks language models' (LMs) abilities to generate synthetic training data for other LMs. The research aimed to evaluate different LMs as synthetic data generators and understand the characteristics of effective training data generated by LMs. The study employed a controlled setting where various LMs generated 1.26 million training instances using existing data generation methods (instance generation, response generation, quality enhancement) across three domains (math, instruction-following, code), which were then used to fine-tune a student LM (Llama 3.1-8B). GPT-4o achieved the highest average Performance Gap Recovered (PGR) score of 46.8% in instance generation. AI practitioners can utilize AGORABENCH to select appropriate LMs for synthetic data generation based on the specific task and available resources, considering that problem-solving ability does not directly correlate with data generation effectiveness. |
MV-Adapter: Multi-view Consistent Image Generation Made Easy (Read more on arXiv or HuggingFace) | Ran Yi, Haoran Wang, pookiefoof, bennyguo, huanngzh | MV-Adapter is a plug-and-play adapter enabling pre-trained text-to-image (T2I) diffusion models to generate multi-view consistent images. The objective is to efficiently generate multi-view consistent images while preserving the quality and knowledge of pre-trained T2I models, without full fine-tuning. The key methodology involves duplicating and parallelizing the self-attention layers of the base T2I model to create separate multi-view and image cross-attention layers within the adapter. On camera-guided image-to-multiview generation on the GSO dataset, MV-Adapter achieved 22.131 PSNR (Peak Signal-to-Noise Ratio) with SDXL. This allows AI practitioners to efficiently adapt existing high-quality T2I models for multi-view generation at high resolutions, reducing computational costs and mitigating overfitting risks associated with full model fine-tuning. |
Negative Token Merging: Image-based Adversarial Feature Guidance (Read more on arXiv or HuggingFace) | Yejin Choi, Ranjay Krishna, Weijia Shi, Lindsey Li, Jaskirat Singh | NegToMe is a training-free method for adversarial guidance in text-to-image diffusion models using reference images. The research aimed to improve adversarial guidance beyond text-based negative prompts by leveraging visual features. The core methodology involves semantically matching and extrapolating source image tokens from their closest counterparts in a reference image during the reverse diffusion process. NegToMe improved output diversity (lower DreamSim score and higher Entropy) while maintaining or improving image quality (FID and IS) across different classifier-free guidance scales. This provides AI practitioners with a simple, efficient technique to enhance control and diversity of generated images using directly image-based references, overcoming limitations of purely text-based negative prompts. |
Densing Law of LLMs (Read more on arXiv or HuggingFace) | Xu Han, Guoyang Zeng, Weilin Zhao, Jie Cai, xcjthu | Here's a summary of the AI research paper "Densing Law of LLMs" following the provided guidelines: i) 1-line summary: An empirical law, termed the "Densing Law," describes the exponential growth of Large Language Model (LLM) capacity density over time. ii) Main research question or objective: To introduce the concept of "capacity density" as a metric for evaluating LLM training quality, considering both effectiveness and efficiency, and to analyze the trend of LLM capacity density. iii) Key methodology used: Capacity density was defined as the ratio of a model's effective parameter size (minimum parameters needed for equivalent performance) to its actual parameter size. This was estimated using a two-step process: first, fitting a Scaling Law to language modeling loss, and second, fitting a function to relate loss to downstream task performance. Open-source base LLMs released since 2023 were evaluated against five benchmarks. iv) Primary results (include one specific quantitative finding): The maximum capacity density of LLMs doubles approximately every 3.3 months. v) Principal implication for AI practitioners: The Densing Law suggests that achieving comparable performance to state-of-the-art LLMs using significantly fewer parameters is possible within a timeframe of approximately three months, thereby emphasizing the importance of optimizing LLM capacity density for improved efficiency and reduced computational costs in future LLM development. |
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (Read more on arXiv or HuggingFace) | Dianqi Li, Haiping Wu, Jianwei Yang, Jiuhai Chen, zhoutianyi | Florence-VL enhances multimodal large language models (MLLMs) using the generative vision model Florence-2. The research aimed to improve vision-language alignment and performance on diverse multimodal tasks by leveraging Florence-2's enriched visual representations. The key methodology involved a novel "Depth-Breadth Fusion" (DBFusion) that combines visual features extracted from different layers and under multiple prompts of Florence-2, projecting these fused features into a pretrained LLM. Florence-VL 8B achieved 89.9% on MMBench (EN) compared to 67.9% for LLaVA next 8B, demonstrating significant improvements across various benchmarks. This implies that AI practitioners can leverage generative vision models like Florence-2 and fusion techniques like DBFusion to build more robust and versatile MLLMs for tasks requiring detailed image understanding. |
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis (Read more on arXiv or HuggingFace) | Yuqi Zhang, Bin Yan, Yi Jiang, Jinlai Liu, Jian Han | Infinity introduces bitwise modeling for autoregressive high-resolution image synthesis. The research aimed to improve the scaling and visual detail representation of discrete generative models for text-to-image synthesis. The core methodology involved a bitwise multi-scale visual tokenizer, an infinite-vocabulary classifier, and a bitwise self-correction mechanism within a visual autoregressive model. On the GenEval benchmark, Infinity achieved an overall score of 0.73, surpassing the SD3-Medium score of 0.62. This work suggests that scaling tokenizer vocabulary and incorporating bitwise modeling can significantly enhance autoregressive models for image generation, providing AI practitioners with a faster, more detailed, and potentially superior alternative to diffusion-based models. |
Towards Universal Soccer Video Understanding (Read more on arXiv or HuggingFace) | Yanfeng Wang, Ya Zhang, Hao Jiang, haoningwu, Homie0609 | This paper introduces a new framework for multi-modal soccer video understanding. The objective is to develop a comprehensive model adaptable to various soccer video understanding tasks. The researchers constructed SoccerReplay-1988, a dataset of 1,988 soccer matches with rich annotations, and trained MatchVision, a visual-language foundation model, using supervised classification and video-language contrastive learning. MatchVision achieved 80.1% top-1 accuracy on event classification on the SoccerReplay-test benchmark. This work provides AI practitioners with a new dataset and a foundation model for developing more versatile and robust soccer video understanding applications, potentially enabling advancements in automated sports analysis and content generation. |
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing (Read more on arXiv or HuggingFace) | Juncheng Li, Xiangtai Li, Ling Yang, WeiChow, BryanW | HumanEdit is a human-rewarded dataset for instruction-based image editing. The objective was to create a high-quality dataset aligned with human preferences for training and evaluating instruction-guided image editing models, addressing limitations of existing datasets like noisy instructions and low-resolution images. The dataset was created through a four-stage pipeline involving annotator training, image selection, instruction and edited image generation using DALL-E 2, and a two-tiered human quality review process. On the HumanEdit-core subset, the mask-free InstructPix2Pix model achieved a CLIP-I score of 0.8946, while the mask-provided Meissonic model achieved a CLIP-I score of 0.9348. The paper presents quantitative results for multiple baselines across different editing types (add, remove, replace, etc.) but doesn't explicitly compare them or declare a "best" overall. AI practitioners can use HumanEdit to train and benchmark instruction-based image editing models, especially for high-resolution, photorealistic editing tasks that better align with human expectations than previous datasets. The availability of masks, along with a subset allowing mask-free editing, allows for more flexible and diverse model training and evaluation. |
Personalized Multimodal Large Language Models: A Survey (Read more on arXiv or HuggingFace) | Zhehao Zhang, Yu Xia, Hanjia Lyu, Junda Wu, Franck-Dernoncourt | This paper surveys techniques for personalizing multimodal large language models (MLLMs). The objective is to categorize and analyze existing methods for adapting MLLMs to individual user preferences across various modalities (text, image, audio, etc.). The authors propose a taxonomy classifying personalization techniques based on instruction, alignment, generation, and fine-tuning across different MLLM applications like text/image generation, recommendation, and retrieval. While specific quantitative results are inconsistently reported across surveyed works, the paper notes ConCon-Chi dataset contains 4008 images and 20 concepts within 101 contexts for evaluating personalized vision-language tasks. AI practitioners can use this taxonomy to understand the landscape of MLLM personalization techniques and identify suitable approaches for specific applications, though further research on standardized evaluation metrics and benchmark datasets is needed. |
ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality (Read more on arXiv or HuggingFace) | Hong Zhou, Shaoxuan He, Yuanyu He, Feng Chen, Yefei He | ZipAR is a training-free, plug-and-play parallel decoding framework for accelerating auto-regressive visual generation. The research aims to reduce the latency of auto-regressive image generation models which typically decode visual tokens sequentially. ZipAR leverages the spatial locality of images by decoding tokens from different rows in parallel, based on a defined local window size. Experiments demonstrated up to a 91% reduction in forward steps on the Emu3-Gen model with minimal impact on image quality. This allows AI practitioners to significantly accelerate auto-regressive visual generation without retraining or architectural modifications. |
MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities (Read more on arXiv or HuggingFace) | Yanfeng Wang, Weidi Xie, Ya Zhang, Ziheng Zhao, haoningwu | MRGen synthesizes training data for MRI segmentation models targeting modalities without existing mask annotations. The research aims to improve MRI segmentation model performance on unannotated modalities due to the cost and scarcity of annotated data. A two-stage training process involves text-guided pretraining on a large radiology image-text dataset (MedGen-1M) followed by mask-conditioned fine-tuning. On average, MRGen improved Dice Similarity Coefficient (DSC) scores by 25% compared to models trained on source-domain data only. This provides AI practitioners with a method to extend existing segmentation models to new MRI modalities without needing manually annotated data, potentially accelerating development and deployment of robust medical image analysis tools. |
Discriminative Fine-tuning of LVLMs (Read more on arXiv or HuggingFace) | Ioannis Maniadis Metaxas, Anestis Zaganidis, Alexandros Xenos, Adrian Bulat, Yassine Ouali | This paper introduces VladVA, a novel framework for adapting generative Large Vision-Language Models (LVLMs) for discriminative vision-language tasks. The objective is to enhance LVLMs' discriminative capabilities while preserving their compositional strengths, addressing the limitations of contrastively-trained VLMs and autoregressive LVLMs. The key methodology involves fine-tuning LVLMs with both contrastive and next-token prediction losses on image-text pairs of variable lengths, combined with parameter-efficient adaptation using soft prompting and LoRA. On Flickr30k, VladVA achieves 85.0% recall@1 for image retrieval, a 5.5% absolute improvement over the baseline LLaVA 1.5-7B model. This work provides AI practitioners with a method to leverage the strengths of generative LVLMs for discriminative tasks like image-text retrieval, potentially leading to more robust and nuanced multimodal systems. |
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation (Read more on arXiv or HuggingFace) | Jian Gang Ngui, David I. Adelani, Clémentine Fourrier, Angelika Romanou, Shivalika Singh | This paper investigates cultural and linguistic biases in the Massive Multitask Language Understanding (MMLU) benchmark and proposes an improved multilingual version. The research aims to understand how cultural biases in translated datasets influence the performance of multilingual language models and to improve the quality of these datasets. A large-scale evaluation of state-of-the-art language models was conducted using subsets of questions annotated as either culturally sensitive or culturally agnostic, alongside an improved, 42-language translated MMLU dataset called Global-MMLU. Analysis found that 28% of the English MMLU questions require culturally sensitive knowledge, with 86.5% of culturally sensitive questions focused on Western culture. AI practitioners should use Global-MMLU and report performance on culturally sensitive and agnostic subsets separately to better understand model capabilities across diverse cultures and languages, and to avoid inadvertently setting multilingual evaluation standards aligned with a single cultural paradigm. |
Monet: Mixture of Monosemantic Experts for Transformers (Read more on arXiv or HuggingFace) | Jaewoo Kang, Kee-Eung Kim, Young Jin Ahn, affjljoo3581 | Here is a summary of the AI research paper "Monet: Mixture of Monosemantic Experts for Transformers," following the provided guidelines: i) One-line summary: The MONET architecture integrates sparse dictionary learning into Mixture-of-Experts (MoE) transformer training to achieve parameter-efficient scaling of monosemantic experts and enhance mechanistic interpretability. ii) Main research question/objective: How can the internal computations of large language models (LLMs) be made more interpretable by disentangling polysemantic features and scaling the number of experts in a parameter-efficient manner? iii) Key methodology: The MONET architecture uses a novel expert decomposition method within a Mixture-of-Experts framework, employing product key composition of experts to achieve a square root scaling of total parameters with respect to the number of experts. This is implemented via Horizontal and Vertical Decomposition approaches. iv) Primary results: MONET achieves competitive performance with total parameter-matched dense LLMs on various benchmarks; MONET-VD (Vertical Decomposition) consistently outperforms MONET-HD (Horizontal Decomposition) across benchmarks and model sizes. Specific quantitative results from open-ended LLM benchmarks are provided in Table 2 of the paper. v) Principal implication for AI practitioners: The parameter-efficient scaling of monosemantic experts in MONET enables the creation of highly interpretable LLMs with a significantly increased number of experts. This facilitates robust knowledge manipulation (e.g., domain, language, toxicity control) without sacrificing overall model performance. The methodology offers a novel approach to scaling MoE architectures with enhanced interpretability and control. |
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows (Read more on arXiv or HuggingFace) | Yusuke Kato, Zichun Liao, Akash Gokul, Konstantinos Kallidromitis, Shufan Li | OmniFlow is a novel generative AI model for any-to-any multi-modal generation. The research aimed to develop a unified model capable of generating various output modalities (text, image, audio) given any input modality combination. The core methodology involves extending rectified flows (RF) to a multi-modal setting, integrating a multi-modal guidance mechanism within a modular architecture inspired by Stable Diffusion 3. On the GenEval benchmark, OmniFlow achieves a score of 0.62 for text-to-image generation. This modular design, allowing for pretraining of individual components and subsequent merging, offers AI practitioners a more efficient and resource-conscious approach to developing and training unified multi-modal generative models, potentially reducing computational overhead compared to training large unified models from scratch. |
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models (Read more on arXiv or HuggingFace) | Zhichao Liao, Fulong Ye, Pengze Zhang, Qichao Sun, Crayon-Shinchan | AnyDressing generates customized images of characters wearing multiple garments based on user-provided garments and text prompts. The research aims to address the limitations of existing virtual dressing methods that struggle with multi-garment combinations and text prompt fidelity. The proposed AnyDressing model uses two primary networks: GarmentsNet, with a Garment-Specific Feature Extractor for parallel encoding of garment textures, and DressingNet, with a Dressing-Attention mechanism and Instance-Level Garment Localization Learning for integrating features and preserving text-image consistency. On a multi-garment evaluation, AnyDressing achieves a CLIP-T score of 0.296, demonstrating improved text consistency. This provides AI practitioners with a more robust and controllable approach for generating virtual dressing images, enabling diverse combinations of attire and improved adherence to user-specified text prompts. |
KV Shifting Attention Enhances Language Modeling (Read more on arXiv or HuggingFace) | Weipeng Chen, Bingning Wang, Wei Cheng, xumingyu16 | Here's a concise summary of the AI research paper: i) 1-line summary: A novel KV shifting attention mechanism is proposed and empirically shown to improve language model training efficiency and performance, reducing the depth and width requirements of induction heads. ii) Main research question/objective: Can modifications to the transformer's attention mechanism improve the efficiency and effectiveness of learning induction heads, thus enhancing language modeling performance? iii) Key methodology: A novel "KV shifting attention" mechanism was proposed, decoupling keys and values in the attention mechanism to reduce the structural requirements for depth and width needed for induction heads; it was theoretically analyzed and empirically validated through experiments on both toy and large-scale language models (a schematic re-implementation follows this table). iv) Primary results: KV shifting attention demonstrated superior performance to conventional multi-layer transformers, with a 2.9B parameter model achieving an average benchmark score of 38.57 (compared to 36.45 for the vanilla baseline) after 500B training tokens. v) Principal implication for AI practitioners: KV shifting attention offers a method to potentially improve the efficiency of training large language models by reducing the computational resources required for induction heads, leading to better performance or faster convergence; further investigation is needed to assess its applicability across a wider range of architectures and model sizes. |
Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement (Read more on arXiv or HuggingFace) | Yu Zhao, Tianqi Shi, Chenyang Lyu, Bo Zeng, Lingfeng Ming | Here is a summary of the AI research paper following your guidelines: i) Marco-LLM, a multilingual large language model (LLM), is developed using massive multilingual continual pre-training and post-training to bridge the performance gap between high- and low-resource languages. ii) The main objective is to develop a multilingual LLM that performs exceptionally well in multilingual tasks, including low-resource languages, while maintaining strong performance in high-resource languages like English. iii) The key methodology involves compiling a large-scale multilingual dataset, conducting two-stage continual pre-training using Qwen2 models, and performing extensive multilingual post-training including supervised fine-tuning and preference alignment. iv) Marco-LLM achieved substantial improvements over state-of-the-art LLMs in various multilingual benchmarks, for example, Marco-72B achieved a 93.7% accuracy on CEVAL and 81.2% accuracy on X-MMLU. v) The significant improvement in multilingual understanding and reasoning tasks across various benchmarks, especially for low-resource languages, highlights the efficacy of massive multilingual training and demonstrates the potential to improve LLM capabilities for under-resourced languages. Further investigation of continual learning parameters and data quality will be essential for future model iterations. |
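The KV shifting attention entry above can be made concrete with the short, schematic re-implementation below: keys and values are mixed with their one-position-shifted copies through learnable scalars before standard causal attention. The parameterization and initialization here are assumptions for illustration and may differ from the paper's.

```python
# Schematic KV shifting attention: k_t and v_t are replaced by learnable mixes of
# the current and previous positions before attention. Shapes and init are toy choices.
import torch
import torch.nn.functional as F

def shift_mix(x, w_cur, w_prev):
    """Return w_cur * x_t + w_prev * x_{t-1}, with zeros at position 0."""
    prev = torch.roll(x, shifts=1, dims=1)
    prev[:, 0, :] = 0.0                          # no previous token at position 0
    return w_cur * x + w_prev * prev

def kv_shifting_attention(q, k, v, k_weights, v_weights):
    k = shift_mix(k, k_weights[0], k_weights[1])
    v = shift_mix(v, v_weights[0], v_weights[1])
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    causal = torch.triu(torch.ones(q.shape[1], q.shape[1], dtype=torch.bool), 1)
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: batch=2, seq=5, dim=16; the mixing scalars would be learned in practice.
q = k = v = torch.randn(2, 5, 16)
k_weights = torch.nn.Parameter(torch.tensor([1.0, 0.5]))
v_weights = torch.nn.Parameter(torch.tensor([1.0, 0.5]))
out = kv_shifting_attention(q, k, v, k_weights, v_weights)
print(out.shape)   # torch.Size([2, 5, 16])
```

Mixing in the previous position lets one attention layer read a shifted key while emitting the current value (or vice versa), which is the kind of structural shortcut for induction heads that the summary refers to.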
Title | Authors | Summary |
---|---|---|
SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance (Read more on arXiv or HuggingFace) | Khoi Nguyen, anhttran1111, termanteus, aengusng, viettmab | SNOOPI enhances one-step text-to-image diffusion model training stability and control via novel guidance techniques. The research aimed to address the instability of Variational Score Distillation (VSD) across different architectures and the lack of negative prompt guidance in one-step diffusion models. The authors introduced Proper Guidance - SwiftBrush (PG-SB), which utilizes a random guidance scale during training, and Negative-Away Steer Attention (NASA), which integrates negative prompts during inference via cross-attention manipulation. Integrating PG-SB and NASA with a PixArt-a backbone achieved a Human Preference Score v2 (HPSv2) of 31.08. This offers AI practitioners a more stable and controllable method for developing efficient one-step text-to-image diffusion models with enhanced image quality and adherence to both positive and negative prompts. |
Imagine360: Immersive 360 Video Generation from Perspective Anchor (Read more on arXiv or HuggingFace) | liuziwei7, guoyww, mimihe, tongwu2020, jingtan | Imagine360 generates immersive 360° videos from standard perspective videos. The research aimed to develop a framework for transforming perspective videos into 360° equirectangular videos. The core methodology involved a dual-branch video denoising structure with antipodal masking and elevation-aware design, trained on a combined dataset of WEB360 and a newly collected YouTube dataset. Imagine360 achieved a VQA score of 0.8672, outperforming comparison methods like 360DVD and Follow-Your-Canvas. This provides AI practitioners with a new tool for generating high-quality 360° videos from readily available perspective video data, facilitating easier creation of immersive content. |
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion (Read more on arXiv or HuggingFace) | An Zhao, slysun, haoranxu, mengcy, SYZhang0805 | ScoreLiDAR, a novel distillation method, accelerates 3D LiDAR scene completion using diffusion models. The research aimed to improve the speed of diffusion-based 3D LiDAR scene completion while maintaining high quality. The method uses Variational Score Distillation (VSD) adapted for 3D data and incorporates a novel Structural Loss to preserve geometric details. On the SemanticKITTI dataset, ScoreLiDAR achieved a 5x speedup, reducing completion time from 30.55 seconds to 5.37 seconds per frame while improving Chamfer Distance by 8%. This allows AI practitioners to utilize diffusion models for real-time or near real-time 3D LiDAR scene completion in applications like autonomous driving where fast processing is crucial. |
PaliGemma 2: A Family of Versatile VLMs for Transfer (Read more on arXiv or HuggingFace) | mjlm, AlexeyG, yonatanbitton, dkeysers, mitsch | Here's a summary of the AI research paper: i) 1-line summary: PaliGemma 2, a family of versatile vision-language models (VLMs), was developed and evaluated on a broad range of transfer tasks, demonstrating improved performance over its predecessor. ii) Main research question/objective: To investigate the impact of model size and resolution on VLM transfer performance and expand the breadth of transfer tasks beyond those in the original PaliGemma. iii) Key methodology: A family of VLMs was created by combining the SigLIP-So400m vision encoder with various Gemma 2 language models (2B, 9B, and 27B), trained at three resolutions (224², 448², and 896² pixels) using a three-stage training process. These models were then fine-tuned on a wide array of transfer tasks, including new tasks such as table and molecular structure recognition. iv) Primary results: PaliGemma 2 achieved state-of-the-art results on many transfer tasks; for example, on ICDAR'15 Incidental and Total-Text, it outperformed the previous state-of-the-art in text detection and recognition (HTS), achieving F1 scores of 75.9 and 74.2, respectively. v) Principal implication for AI practitioners: The release of PaliGemma 2 as open-weight models provides a resource for fine-tuning on a wide range of tasks, and its extensive analysis of how model size and resolution affect transfer performance offers practical guidance for model design and model selection in VLM development. |
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation (Read more on arXiv or HuggingFace) | sweetrabor, gaozong, xuwang, liqingzju, leo1117 | TokenFlow is a novel unified image tokenizer designed to bridge the gap between multimodal understanding and generation. The central research question is whether a single image tokenizer can derive representations suitable for both multimodal understanding and generation. The key methodology involves a dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining alignment via shared index mapping, enabling simultaneous access to both feature types. In multimodal understanding benchmarks, TokenFlow surpasses LLaVA-1.5 13B by 7.2% average improvement, marking the first time discrete visual input outperforms this baseline. This improvement significantly impacts AI practitioners by providing a more efficient and performant approach to unify image representations for both understanding and generation tasks within a single framework. |
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding (Read more on arXiv or HuggingFace) | asdfg80, slvjul, zd11024 | Video-3D LLM enhances 3D scene understanding by incorporating 3D positional information into video representations. The research aimed to develop a generalist model for various 3D scene understanding tasks, addressing the limitations of current MLLMs in handling 3D spatial information. The authors developed Video-3D LLM, which leverages a pre-trained Video LLM and integrates 3D position encodings derived from depth images into video features, along with a maximum coverage sampling strategy for efficient frame selection. The model achieved state-of-the-art performance on benchmarks like ScanRefer (58.1% Acc@0.25), Scan2Cap (41.3 BLEU-4@0.5IoU), ScanQA (30.1% EM), and SQA3D (58.6% EM). AI practitioners can utilize this approach to enhance performance in applications requiring 3D spatial reasoning, such as robotics, 3D visual grounding, and question answering. The improvement in accuracy on ScanRefer, by incorporating 3D positional data, highlights the practical benefit for developing more robust 3D scene understanding applications. |
NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images (Read more on arXiv or HuggingFace) | Chengwh, bluestyle97, Yw22, ZyZcuhk, l-li | NVComposer synthesizes novel views from multiple sparse and unposed images without requiring external alignment. The objective is to generate novel views at specified target camera poses from unposed conditional images without explicit pose estimation or pre-reconstruction. The approach uses an image-pose dual-stream diffusion model to generate views and implicitly predict poses, combined with a geometry-aware feature alignment adapter distilling geometric priors from a pre-trained dense stereo model. On the RealEstate10K dataset, NVComposer achieves a PSNR of 22.55 with four input views, outperforming comparison methods. This provides AI practitioners with a more robust and accessible method for generative novel view synthesis, eliminating the need for potentially unstable external alignment pre-processing. |
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models (Read more on arXiv or HuggingFace) | SunYoung Park, Daeyoung Kim, kimyoungjune, hojunssss | VARCO-VISION is a novel open-source, Korean-English bilingual vision-language model (VLM). The research aimed to develop a high-performing bilingual VLM and accompanying Korean evaluation benchmarks. The authors employed a four-stage training strategy involving feature alignment pre-training, basic and advanced supervised fine-tuning, and preference optimization using translated and human-validated datasets. VARCO-VISION-14B achieved 82.21% accuracy on the K-MMBench benchmark, outperforming similarly sized open-source models. This release provides AI practitioners with a powerful tool for developing Korean-focused multimodal applications and resources for further research in bilingual VLM training and evaluation. |
CleanDIFT: Diffusion Features without Noise (Read more on arXiv or HuggingFace) | Björn Ommer, FrankFundel, kolja-b, stefan-baumann, kliyer | CleanDIFT is a novel method for extracting noise-free, timestep-independent features from pre-trained diffusion models. The research aimed to improve the quality and efficiency of diffusion feature extraction by eliminating the need for adding noise to input images. The methodology involved fine-tuning a trainable copy of a diffusion model on clean images while aligning its internal representations with the timestep-dependent features of the original model using projection heads and a cosine similarity loss. On the SPair-71k dataset for zero-shot unsupervised semantic correspondence, CleanDIFT improved PCKbbox accuracy by 1.86 percentage points compared to standard diffusion features. AI practitioners can use CleanDIFT to extract superior, noise-free features from diffusion models more efficiently, eliminating the need for noise or timestep ensembling for various downstream tasks like semantic correspondence, depth estimation, and semantic segmentation. |
MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation (Read more on arXiv or HuggingFace) | zouzx, yhyang-myron, XingqiaoAn, bennyguo, huanngzh | MIDI generates compositional 3D scenes from single images by extending pretrained image-to-3D object generation models to multi-instance diffusion. The objective is to generate multiple spatially correlated 3D instances with accurate relationships from a single image. MIDI employs a novel multi-instance attention mechanism within a denoising transformer, trained on scene-level and single-object data, to model cross-instance interactions and spatial coherence directly during 3D generation. On the BlendSwap dataset, MIDI achieves a scene-level Chamfer Distance of 0.077 and F-Score of 78.21, outperforming other single-image 3D scene generation methods. AI practitioners can use MIDI to create coherent and high-fidelity 3D scenes from single images, potentially impacting applications like 3D content creation and scene understanding. |
One Shot, One Talk: Whole-body Talking Avatar from a Single Image (Read more on arXiv or HuggingFace) | Boyang Guo, Leipeng Hu, JuyongZhang, YudongGuo, xiangjun-xj | This paper introduces a method for creating animatable, expressive, whole-body talking avatars from a single image. The objective is to reconstruct a 3D talking avatar from a single image that can be animated with realistic gestures and expressions. The method uses pose-guided image-to-video diffusion models to generate pseudo-labels and trains a coupled 3D Gaussian Splatting (3DGS)-mesh hybrid avatar representation with several regularizations. On a self-driven motion reenactment task, the method achieved a peak signal-to-noise ratio (PSNR) of 29.31, outperforming comparison methods. This provides AI practitioners with a new technique to create realistic and controllable talking avatars from limited input data, potentially impacting applications in virtual reality, augmented reality, and telepresence. |
Mimir: Improving Video Diffusion Models for Precise Text Understanding (Read more on arXiv or HuggingFace) | Dandan Zheng, Kecheng Zheng, Yutong Feng, Shuai Tan, BiaoGong | Mimir is a novel text-to-video generation framework that enhances text comprehension in video diffusion models. The research aims to address the limited text understanding of current video diffusion models, especially when processing short captions or complex motions, by integrating the capabilities of large language models (LLMs). The key methodology involves a "token fuser" that harmonizes the outputs of text encoders and decoder-only LLMs, enabling the model to leverage both learned video priors and advanced text comprehension of LLMs. Mimir achieves 97.68% on Background Consistency in the VBench benchmark, outperforming all other compared models. This implies that AI practitioners can utilize Mimir’s architecture to improve video generation quality and text comprehension, particularly for short, complex prompts. |
Weighted-Reward Preference Optimization for Implicit Model Fusion (Read more on arXiv or HuggingFace) | Xiaojun Quan, Tianyuan Shi, Longguang Zhong, Fanqi Wan, Ziyi Yang | The paper introduces Weighted-Reward Preference Optimization (WRPO) for fusing heterogeneous large language models (LLMs). The research aims to improve the capabilities of a target LLM by implicitly learning from multiple robust open-source LLMs without vocabulary alignment or distribution merging. WRPO uses a progressive adaptation strategy and weighted reward mechanism within a preference optimization framework, mitigating distributional deviations between source and target LLMs. When applied to LLaMA3-8B-Instruct, WRPO achieves a 55.9% length-controlled win rate against GPT-4-Preview-1106 on AlpacaEval-2. This provides AI practitioners with a more efficient and effective method for integrating strengths from various LLMs into a single model, potentially outperforming larger, computationally expensive ensembles. |
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training (Read more on arXiv or HuggingFace) | Yi-Zhe Song, Kai Zou, Hmrishav Bandyopadhyay, ChenDY | NitroFusion introduces a dynamic adversarial training framework for high-fidelity single-step text-to-image diffusion. The objective is to improve the quality of single-step diffusion models, which typically suffer from quality degradation compared to multi-step models, while maintaining speed advantages. The key methodology involves a dynamic discriminator pool with specialized and periodically refreshed discriminator heads, employing multi-scale and dual-objective (conditional/unconditional) GAN training. NitroFusion achieves an Aesthetic Score of 5.92 and an Image Reward of 0.991 on the COCO-5k validation dataset, exceeding its 8-step teacher model in these metrics. This offers AI practitioners a single model capable of both rapid generation and high-fidelity image synthesis, dynamically adjustable through bottom-up refinement with 1-4 denoising steps. |
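
The CleanDIFT entry above describes aligning a trainable clean-image copy of a diffusion model with the frozen original's timestep-dependent features via projection heads and a cosine-similarity loss. The sketch below illustrates one way such an alignment step could look; `student`, `teacher`, `proj_heads`, `extract_features`, and `noise_schedule` are assumed names, not the authors' API.

```python
import torch
import torch.nn.functional as F

def cleandift_alignment_loss(student, teacher, proj_heads, x_clean, t, noise_schedule):
    """Hypothetical sketch of a CleanDIFT-style feature alignment step.

    student:    trainable copy of the diffusion backbone, fed the clean image
    teacher:    frozen original diffusion model, fed the noised image at timestep t
    proj_heads: per-timestep projection heads mapping clean features into the
                teacher's timestep-dependent feature space
    """
    # Noise the input only for the frozen teacher (standard forward diffusion).
    alpha_bar = noise_schedule(t)                     # assumed helper returning \bar{alpha}_t
    eps = torch.randn_like(x_clean)
    x_noisy = alpha_bar.sqrt() * x_clean + (1 - alpha_bar).sqrt() * eps

    with torch.no_grad():
        feats_teacher = teacher.extract_features(x_noisy, t)    # assumed feature hook

    feats_student = student.extract_features(x_clean, t=None)   # timestep-independent features
    feats_student = proj_heads(feats_student, t)                 # project into the teacher's space

    # Negative cosine similarity, averaged over the batch.
    loss = 1 - F.cosine_similarity(
        feats_student.flatten(1), feats_teacher.flatten(1), dim=-1
    ).mean()
    return loss
```
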
Title | Authors | Summary |
---|---|---|
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation (Read more on arXiv or HuggingFace) | cqf, tfl01, AI4VR, Jethro37, Cheliosoops | VideoGen-of-Thought (VGoT) is a training-free architecture for generating multi-shot, coherent videos. The research aimed to address the challenge of creating multi-shot videos that maintain narrative logic and visual consistency across different shots. VGoT employs a four-module pipeline: Script Generation, Keyframe Generation, Shot-Level Video Generation, and a novel cross-shot Smooth Mechanism using latent features and reset boundaries. VGoT achieved higher Face Consistency (FC) and Style Consistency (SC) scores, particularly across shots, compared to baseline models (0.2738 cross-shot FC score for VGoT vs. a maximum of 0.0686 for baselines). This provides AI practitioners with a novel method to enhance narrative coherence and cross-shot consistency in generated multi-shot videos, particularly improving transitions between shots for a more natural visual flow. |
Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability (Read more on arXiv or HuggingFace) | zptu, Thu-redrobot, SihengLi, Chufan, Jiahao004 | This paper introduces cDPO, a token-level contrastive preference optimization framework for enhancing LLM reasoning capabilities. The research investigates the impact of individual tokens, particularly "critical tokens," on the outcomes of reasoning tasks. The core methodology involves contrastive estimation using separately trained positive and negative models on correct and incorrect reasoning trajectories, coupled with a token-level extension of Direct Preference Optimization (DPO). On the GSM8K benchmark, cDPO achieves an average accuracy of 77.2%, significantly outperforming baseline methods (p < 0.005). This result suggests that AI practitioners can leverage token-level contrastive estimation during preference optimization to improve the accuracy of LLMs on reasoning tasks, specifically by mitigating the negative impact of critical tokens. |
Free Process Rewards without Process Labels (Read more on arXiv or HuggingFace) | iseesaw, stingning, ganqu, wendili, lievan | This paper introduces a method for deriving process reward models (PRMs) without step-level labels. The research aimed to reduce the cost and complexity of training PRMs compared to outcome reward models (ORMs) and existing PRM training methods. The core methodology involves parameterizing the outcome reward as the log-likelihood ratio of policy and reference language models and training an ORM on response-level data. Experiments on MATH showed that the resulting implicit PRM, when instantiated with cross-entropy loss, outperformed a strong MCTS baseline (Math-Shepherd) by 0.6% while using less than 1/38 of the training data. This implies that AI practitioners can obtain high-performing PRMs at substantially lower cost by leveraging response-level data and this specific reward parameterization, potentially simplifying the development and deployment of reward models for complex reasoning tasks. A sketch of this reward parameterization appears after this table. |
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? (Read more on arXiv or HuggingFace) | shijiay, MoFanCheng, BreakLee, KaituoFeng, kxgong | This paper introduces AV-Odyssey Bench, a benchmark designed to evaluate audio-visual comprehension in Multimodal Large Language Models (MLLMs). The research investigates whether MLLMs genuinely understand audio-visual information, or if their performance relies on surface-level patterns. The benchmark employs 4,555 multiple-choice questions across 26 tasks requiring integration of text, image/video, and audio. On AV-Odyssey, the best-performing model, GPT-4o (audio caption method), achieved only 34.5% accuracy. This indicates current MLLMs struggle with complex audio-visual integration, highlighting a critical area for model and dataset improvement, particularly the integration of audio information within multi-modal contexts. |
OmniCreator: Self-Supervised Unified Generation with Universal Editing (Read more on arXiv or HuggingFace) | Harry Yang, Lan Wang, sernam, Harold328 | OmniCreator is a self-supervised framework that unifies image and video generation with universal text-guided editing by using the original video as a denoising condition. The objective is to build a single framework capable of both text-prompted image and video generation and universal text-guided editing, addressing the limitations of methods restricted to specific editing types or requiring additional controls. The key methodology uses original text-video pairs as conditions, with the same video serving as the denoising target, combined with an adapter and query transformer for multimodal fusion and spatiotemporal low-rank adaptations (LoRA) for efficiency. OmniCreator achieves an average overall user-study score of 4.33 on OmniBench-99 for video editing, compared with 2.00 to 3.33 for other methods. For AI practitioners, the self-supervised design and strong video-editing results point toward more controllable generative models with unified image/video processing and efficient, flexible editing, though the paper lacks a detailed quantitative evaluation on a standardized image-editing benchmark. |
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation (Read more on arXiv or HuggingFace) | zichenwen, ouyanglinke, binwang, qintong21, Carkham | OHRBench, a new benchmark for evaluating the impact of OCR on Retrieval-Augmented Generation (RAG) systems, reveals that OCR noise degrades RAG performance. The research investigates how OCR noise affects RAG by creating a dataset of PDFs, ground truth structured data, Q&As, and perturbed data with varying OCR noise levels. The key methodology involves evaluating several OCR solutions and then systematically analyzing the impact of semantic and formatting noise on retrieval and generation components of RAG. Results show even the best OCR solution reduces end-to-end RAG F1-score by at least 2.93 points compared to ground truth, and semantic noise consistently degrades performance across different RAG components. AI practitioners developing RAG systems should prioritize mitigating OCR noise for optimal performance, particularly focusing on semantic accuracy. |
Scaling Image Tokenizers with Grouped Spherical Quantization (Read more on arXiv or HuggingFace) | Jiangtao Wang, kessel666, briqnn, yifAI, Doreamonzzz | This paper introduces Grouped Spherical Quantization (GSQ) for training image tokenizers. The research aims to address limitations in current image tokenizers related to GAN-based hyperparameters, biased comparisons, and a lack of scaling analysis. GSQ employs spherical codebook initialization, lookup regularization, and latent decomposition to improve training and reconstruction quality. GSQ-GAN achieves a reconstruction FID (rFID) of 0.50 with 16x downsampling on ImageNet at 256x256 resolution. This research suggests that AI practitioners can achieve improved reconstruction quality and efficiency in image tokenizers using GSQ, especially for tasks involving high spatial compression. A hedged sketch of a grouped spherical lookup appears after this table. |
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences (Read more on arXiv or HuggingFace) | Sunxy111, Xiaomabufei, senfu, PeihaoChen, Hoyard | LSceneLLM enhances 3D scene understanding in large and complex environments. The research aimed to improve 3D Vision-Language Models' (3D-VLMs) ability to locate task-relevant visual information in large 3D scenes. The authors developed LSceneLLM, a framework incorporating a coarse scene understanding module and a scene magnifier module that uses the LLM's visual preference for adaptive identification and detailed examination of relevant regions. LSceneLLM outperformed existing methods on the proposed XR-Scene cross-room understanding benchmark and other existing benchmarks; on XR-QA, LSceneLLM achieved a CIDEr score of 117.21 compared to 112.80 for the next best method. AI practitioners can use the plug-and-play scene magnifier module to enhance existing 3D-VLMs for improved accuracy in tasks involving large and complex 3D scene understanding. |
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation (Read more on arXiv or HuggingFace) | Dongyoon Han, Song Park, Seungho Lee, Minhyun Lee, bhheo | MaskRIS improves Referring Image Segmentation (RIS) by using a novel masking-based data augmentation strategy. The research aimed to develop a more effective data augmentation technique for RIS than conventional methods, which degrade performance due to semantic conflicts. The key methodology involves masking image and text inputs, combined with Distortion-aware Contextual Learning (DCL) to leverage both original and masked data. MaskRIS achieved state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg, increasing overall Intersection-over-Union (oIoU) scores by up to 2.25% compared to previous methods. This implies that AI practitioners working on RIS can significantly enhance model robustness and accuracy by incorporating the MaskRIS data augmentation framework into their training pipelines. |
A dynamic parallel method for performance optimization on hybrid CPUs (Read more on arXiv or HuggingFace) | Liu Yucheng, Luo Yu, Haihao | This paper introduces a dynamic parallel method for optimizing Large Language Model (LLM) inference on hybrid CPUs. The research aims to address the low inference performance on hybrid CPUs caused by imbalanced hardware capabilities among cores. The proposed method dynamically balances the workload for each core before parallel work begins, integrating a new thread scheduler and CPU runtime with the Neural Speed framework. Results show a 20%-30% improvement in prefill phase latency compared to using OpenMP in Neural Speed, and over 90% of memory bandwidth utilization is achieved for INT4 GEMV on an Ultra-125H. This provides AI practitioners with a more efficient method for running LLM inference on hybrid CPUs, particularly relevant for client-side deployments where these processors are increasingly prevalent. |
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval (Read more on arXiv or HuggingFace) | Nabeel Mohammed, Md Rizwan Parvez, shafin5, dpaul06 | VideoLights is a novel framework for jointly performing video highlight detection (HD) and moment retrieval (MR). The research aimed to improve joint HD/MR by addressing limitations in cross-task and cross-modal interactions in existing models. The framework utilizes a Feature Refinement and Alignment (FRA) module, Bi-Directional Cross-Modal Fusion (Bi-CMF) network, Unidirectional Joint-Task Feedback Mechanism (Uni-JFM), and leverages LVLMs like BLIP-2. On the QVHighlights dataset, VideoLights-B-pt achieved a state-of-the-art R@0.5 of 70.36% for moment retrieval. This research provides AI practitioners with a new state-of-the-art model and framework for developing more robust and effective video understanding systems for tasks like content management and recommendation. |
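
The "Free Process Rewards without Process Labels" entry parameterizes the outcome reward as the log-likelihood ratio between policy and reference models, from which per-step process rewards fall out without step-level labels. The function below is a hedged reading of that idea: `policy_logprobs`, `ref_logprobs`, and `step_boundaries` are assumed inputs, and the formula r(y) = β·log(π_policy(y)/π_ref(y)) follows the summary rather than the released code.

```python
import torch

def implicit_process_rewards(policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             step_boundaries: list[int],
                             beta: float = 1.0) -> torch.Tensor:
    """Sketch of an implicit PRM built from an outcome-level parameterization.

    policy_logprobs, ref_logprobs: per-token log-probabilities of one response, shape [T]
    step_boundaries: token indices where each reasoning step ends (e.g. newline positions)
    Returns one implicit reward per reasoning step.
    """
    # Cumulative log-likelihood ratio up to every token position.
    cum_ratio = beta * torch.cumsum(policy_logprobs - ref_logprobs, dim=0)

    step_rewards = []
    prev = 0.0
    for end in step_boundaries:
        # Process reward of a step = increase of the cumulative ratio over that step.
        step_rewards.append(cum_ratio[end] - prev)
        prev = cum_ratio[end]
    return torch.stack(step_rewards)
```
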
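
For the GSQ entry, the summary mentions spherical codebooks and latent decomposition into groups. The snippet below sketches what a grouped spherical lookup could look like under assumed tensor shapes ([B, G, D] latents, [G, K, D] codebooks); it is an illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def grouped_spherical_quantize(z: torch.Tensor, codebooks: torch.Tensor):
    """Sketch of a GSQ-style lookup (shapes are illustrative assumptions).

    z:         latents of shape [B, G, D] -- the latent decomposed into G groups of dim D
    codebooks: per-group codebooks of shape [G, K, D]
    """
    z_norm = F.normalize(z, dim=-1)            # project each group vector onto the unit sphere
    cb_norm = F.normalize(codebooks, dim=-1)   # keep codebook entries spherical as well

    # Cosine similarity between every group vector and its group's K code entries.
    sims = torch.einsum("bgd,gkd->bgk", z_norm, cb_norm)
    idx = sims.argmax(dim=-1)                                   # nearest code per group, [B, G]

    groups = torch.arange(codebooks.shape[0], device=z.device)  # [G]
    quant = cb_norm[groups, idx]                                 # selected codes, [B, G, D]

    # Straight-through estimator: forward pass uses the code, gradients flow to z.
    quant_st = z_norm + (quant - z_norm).detach()
    return quant_st, idx
```
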
Title | Authors | Summary |
---|---|---|
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models (Read more on arXiv or HuggingFace) | lindahua, TheYJ, yuhangzang, tongwu2020, Zery | X-Prompt enhances in-context image generation in auto-regressive vision-language models. The research aimed to improve auto-regressive VLM performance across diverse seen and unseen image generation tasks within a unified in-context learning framework. The key methodology involved compressing in-context example features into fixed-length tokens, unifying image generation and description tasks, and using a retrieval-augmented image editing strategy. On the GenEval benchmark, X-Prompt with text prediction improved overall text-to-image generation by 0.08 compared to the baseline Chameleon model. This research provides AI practitioners with a method for enhancing the generalizability and efficiency of auto-regressive VLMs in diverse image generation applications, by enabling effective in-context learning with shorter context lengths. |
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation (Read more on arXiv or HuggingFace) | LiruiZhao, yefly, xuzhaopan, xiaopengpeng, lyuukuu | OpenING is a new benchmark for evaluating open-ended interleaved image-text generation. The research aimed to create a comprehensive benchmark and robust judge model for open-ended interleaved image-text generation. The authors curated a dataset of 5,400 human-annotated instances across 56 real-world tasks and developed a judge model, IntJudge, trained with a novel reference-augmented generation approach. IntJudge achieved an 82.42% agreement rate with human judgments, outperforming GPT-based evaluators by 11.34%. AI practitioners can use OpenING to evaluate and benchmark new interleaved generation models and IntJudge as a more robust automated evaluation tool compared to GPT-based judges. |
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis (Read more on arXiv or HuggingFace) | Dmitry Baranchuk, Valentin Khrulkov, Mikhail Khoroshikh, Anton Voronov, SpiridonSunRotator | SWITTI is a scale-wise transformer model for text-to-image synthesis designed for improved speed and quality. The research aimed to develop a faster, higher-quality text-to-image generation model using a scale-wise transformer architecture while investigating the role of autoregression and text conditioning across scales. The key methodology involved modifying a scale-wise autoregressive transformer architecture to improve training stability, removing the autoregressive component based on analysis of attention maps, and disabling classifier-free guidance at the highest resolution scales. SWITTI achieves comparable performance to state-of-the-art diffusion models on automated metrics and human evaluations while being up to 7x faster, with a single-step generation time of 9.5 milliseconds for a batch of 8 512x512 images on an NVIDIA A100 80GB GPU. The removal of the autoregressive component and disabling of classifier-free guidance at later stages significantly improved sampling speed while maintaining or slightly enhancing quality, offering practitioners a more efficient model for text-to-image generation. |
Open-Sora Plan: Open-Source Large Video Generation Model (Read more on arXiv or HuggingFace) | Xinhua Cheng, Yunyang Ge, Lin-Chen, BestWishYsh, LanguageBind | Open-Sora Plan is an open-source project for generating high-resolution, long-duration videos. The objective is to develop a large generation model capable of producing desired videos from various user inputs, including text, images, and structure control signals. The project uses a Wavelet-Flow Variational Autoencoder (WF-VAE), a Joint Image-Video Skiparse Denoiser with 3D attention, and various condition controllers, along with training and inference optimization strategies like a min-max token strategy and adaptive gradient clipping. WF-VAE-L achieves a throughput of 5.55 videos/second when encoding 33-frame 512x512 videos, 7.8 times faster than Allegro with 8 times less memory usage. This project offers AI practitioners a comprehensive framework and efficient methods for developing and implementing high-quality video generation models. |
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video (Read more on arXiv or HuggingFace) | Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Hongyang Li, Jinyuan Qu | TAPTRv3 enhances point tracking robustness in long videos using spatial and temporal context. The research aimed to improve the long-video tracking performance of TAPTRv2, which struggles with feature querying due to increasing target variation and scene cuts. The authors introduce Context-aware Cross-Attention (CCA) and Visibility-aware Long-Temporal Attention (VLTA) to enhance spatial and temporal feature querying, respectively, along with a global matching module for scene cut handling. TAPTRv3 achieves state-of-the-art performance on multiple datasets, showing a 9.3 average Jaccard (AJ) improvement over TAPTRv2 on long video datasets (Kinetics, RGB-Stacking, and RoboTAP). This allows AI practitioners to implement more accurate and robust point tracking in long videos for applications such as video editing, SLAM, and robotic manipulation, even without large amounts of real training data. |
o1-Coder: an o1 Replication for Coding (Read more on arXiv or HuggingFace) | Jinlin Xiao, Jiangming Shu, Yuqi Yang, Shangxi Wu, Yuxiang Zhang | O1-CODER replicates OpenAI's o1 model, focusing on coding tasks. The objective is to enhance a language model's System-2 thinking (deliberate, analytical processing) for code generation using reinforcement learning (RL) and Monte Carlo Tree Search (MCTS). The methodology involves training a Test Case Generator, using MCTS to generate reasoning-enhanced code data, and iteratively fine-tuning a policy model with a process reward model. Pseudocode-based code generation with Qwen2.5-Coder-7B achieved an Average Sampling Pass Rate (ASPR) of 74.9% on the MBPP benchmark, significantly exceeding vanilla Qwen2.5-7B's 49.3% ASPR. This implies that generating accurate pseudocode is crucial for correct code generation, highlighting the importance of methods like RL and MCTS for refining the reasoning process in LLMs for coding tasks. |
TinyFusion: Diffusion Transformers Learned Shallow (Read more on arXiv or HuggingFace) | Xinchao Wang, Xinyin Ma, Kunjun Li, Gongfan Fang | TinyFusion is a learnable depth pruning method for compressing diffusion transformers. The objective is to create shallower diffusion transformer models with reduced inference costs while maintaining competitive post-fine-tuning performance. The method utilizes a differentiable sampling technique for layer mask selection, co-optimized with a weight update (using LoRA or full fine-tuning) to estimate recoverability. Experiments on DiT-XL show TinyFusion achieves an FID score of 2.86 after pruning to 14 layers and fine-tuning with Masked Knowledge Distillation, using only 7% of the original training cost. This allows AI practitioners to significantly reduce the computational cost of deploying diffusion transformers for image generation without drastically sacrificing generative quality. A sketch of a differentiable layer-keep gate appears after this table. |
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models (Read more on arXiv or HuggingFace) | Yueh-Hua Wu, Yong Man Ro, Yu-Chiang Frank Wang, Ryo Hachiuma, BK-Lee | VLsI is a new family of efficient vision-language models (VLMs) in 2B and 7B sizes. The research aimed to develop smaller VLMs that perform comparably to larger models without architectural changes. The key methodology involves layer-wise distillation using intermediate "verbalizers" that map each layer's output to natural language, aligning the smaller VLM's reasoning process with a larger one. VLsI-7B achieved a 17.4% performance improvement over GPT-4V on ten vision-language benchmarks. AI practitioners can utilize VLsI's layer-wise verbalization technique for efficient VLM distillation, enabling deployment on resource-constrained devices without significant performance degradation. |
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model (Read more on arXiv or HuggingFace) | Liuhan Chen, Yang Ye, Zongjian Li, BestWishYsh, LanguageBind | WF-VAE enhances video reconstruction quality and computational efficiency for latent video diffusion models. The research aimed to address the computational bottlenecks and latent space discontinuities in existing video VAEs, particularly for long, high-resolution videos. The authors introduce Wavelet Flow VAE (WF-VAE), leveraging multi-level wavelet transforms to prioritize low-frequency information and a Causal Cache mechanism for lossless block-wise inference. WF-VAE-L achieves a PSNR of 35.87 and an LPIPS of 0.0175 on the Panda70M dataset with 16 latent channels, outperforming CogVideoX VAE in these metrics. This improvement enables AI practitioners to train and deploy more efficient and higher-quality video generation models, especially for resource-intensive, large-scale applications. |
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters (Read more on arXiv or HuggingFace) | Huaizhong Zhang, Zhengyu Lin, Weiye Xiao, Jianping Jiang, caizhongang | SOLAMI is a novel end-to-end social Vision-Language-Action (VLA) framework for immersive interaction with 3D autonomous characters. The research aimed to create 3D autonomous characters capable of perceiving, understanding, and interacting with humans in immersive environments using multiple modalities. The researchers developed a unified social VLA architecture trained on a synthesized multimodal social interaction dataset (SynMSI) and implemented in a VR interface. SOLAMI achieved a lower inference latency (2.639 seconds) than the LLM+Speech and DLP baseline methods. This lower latency, coupled with improved performance in motion quality and context relevance, indicates that an end-to-end VLA model like SOLAMI can enable more natural and responsive real-time interactions with 3D characters in immersive applications. |
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation (Read more on arXiv or HuggingFace) | Yuan Zhou, Qiuyue Wang, Yuxuan Cai, hyang0511, Cakeyan | Presto generates 15-second videos with enhanced content richness and long-range coherence. The research aimed to address the challenges of generating long videos with diverse scenarios and consistent storylines. The core methodology involves Segmented Cross-Attention (SCA), dividing hidden states into segments that cross-attend to corresponding sub-captions, and a curated LongTake-HD dataset of long videos with progressive sub-captions. Presto achieved a 78.5% VBench Semantic Score, outperforming state-of-the-art models. This provides AI practitioners with a novel architecture and dataset for generating longer, more coherent, and content-rich videos using diffusion models. A sketch of segmented cross-attention appears after this table. |
Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input (Read more on arXiv or HuggingFace) | Alessandro Farinelli, Alberto Castellini, Gianni Franchi, e-zorzi, ftaioli | AIUTA enables embodied agents to locate target objects in unknown environments through collaborative dialogue with users. The research addresses the challenge of instance navigation with minimal initial user input. The proposed method, AIUTA (Agent-user Interaction with Uncertainty Awareness), utilizes a self-questioning module with a VLM and LLM to refine object descriptions and an interaction trigger to determine when to query the user. On the CoIN-Bench with simulated users, AIUTA achieved a 14.47% success rate on the Train split, substantially outperforming a zero-shot baseline that lacked user interaction. This work provides a framework for building more practical and user-friendly instance navigation systems by reducing the burden of providing detailed upfront instructions. |
VLSBench: Unveiling Visual Leakage in Multimodal Safety (Read more on arXiv or HuggingFace) | Jing Shao, Xuanjing Huang, LLLeo612, Max9803, Foreshhh | VLSBench, a new multimodal safety benchmark, is designed to address visual safety information leakage (VSIL) in existing multimodal datasets. The research aimed to understand why textual alignment performs comparably to multimodal alignment on existing multimodal safety benchmarks, suspecting a VSIL problem. The authors constructed VLSBench with 2.4k image-text pairs, preventing leakage from image to text through an automated pipeline involving harmful query generation, detoxification, iterative image generation, and filtration. Multimodal alignment methods outperformed textual alignment methods on VLSBench, with the best close-source model (Gemini-1.5-pro) achieving a 49.78% safety rate. This highlights the need for AI practitioners to prioritize multimodal alignment over textual alignment when addressing safety in multimodal models, especially in scenarios where sensitive visual content is not explicitly described in the text. |
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge (Read more on arXiv or HuggingFace) | atcbosselut, jjzha, jebish7, shayekh, angelika | INCLUDE benchmarks multilingual LLMs' understanding of regional knowledge. The study investigates how large language models perform on questions requiring cultural and regional knowledge across diverse languages. Researchers compiled a novel dataset of 197,243 multiple-choice questions from local exams in 44 languages and 15 scripts, avoiding translation artifacts by using original-language sources and annotating questions for regionality and academic domain. GPT-4 achieved the highest overall accuracy of 77.1% on the INCLUDE-BASE subset. AI practitioners should account for regional knowledge variance when developing and evaluating multilingual LLMs and consider that model performance varies considerably based on language and question type, even within a single model. |
Efficient Track Anything (Read more on arXiv or HuggingFace) | Chenchen Zhu, Lemeng Wu, Xiaoyu Xiang, Chong Zhou, yunyangx | EfficientTAMs are lightweight models for video object segmentation and tracking with reduced computational complexity compared to SAM 2. The research aimed to create more efficient track-anything models with low latency and small model size, suitable for mobile deployment. The methodology involves utilizing a vanilla Vision Transformer (ViT) as the image encoder and introducing an efficient memory module based on coarser representations of memory spatial tokens for cross-attention. On the SA-V test dataset for semi-supervised video object segmentation, EfficientTAM-S achieves 74.5 J&F, comparable to SAM 2, with ~2x speedup on A100 GPUs and ~2.4x parameter reduction. This allows AI practitioners to deploy real-time video object segmentation models on resource-constrained devices, such as mobile phones, broadening the potential applications of this technology. |
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information (Read more on arXiv or HuggingFace) | Rui Zhang, Ranran Haoran Zhang, Sarkar Snigdha Sarathi Das, Yusen Zhang, ryokamoi | VisOnlyQA, a new dataset, reveals that Large Vision Language Models (LVLMs) struggle with visual perception of geometric information in scientific figures. The research aimed to evaluate the visual perception capabilities of LVLMs independent of reasoning and knowledge. The authors created VisOnlyQA, including real and synthetically generated scientific figures paired with multiple-choice questions about geometric and numerical information, and tested 20 different LVLMs. State-of-the-art models like GPT-4o and Gemini 1.5 Pro achieved only 51.4% and 54.2% accuracy respectively on the real image split, compared to near-perfect human performance (93.5%). The principal implication for AI practitioners is that both training data and model architectures need improvement to enhance the visual perception capabilities of LVLMs, as this weakness significantly limits performance on visual tasks. |
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation (Read more on arXiv or HuggingFace) | Wenhu Chen, Cong Wei, Jie Min, hyang0511, wren93 | VISTA improves long and high-resolution video understanding in Large Multimodal Models (LMMs) through data augmentation. The research aimed to address the scarcity of high-quality, long/high-resolution video instruction-following datasets. The key methodology involved spatially and temporally combining videos from existing datasets to create synthetic long and high-resolution video samples, followed by generating corresponding question-answer pairs using a language model (Gemini). Finetuning LMMs on VISTA-400K resulted in an average 3.3% improvement across four long-video understanding benchmarks and a 6.5% gain on the newly introduced HRVideoBench for high-resolution video understanding. This provides AI practitioners with a cost-effective method to improve LMM performance on long and high-resolution video understanding tasks through data augmentation, eliminating the need for costly manual annotation. |
Steering Rectified Flow Models in the Vector Field for Controlled Image Generation (Read more on arXiv or HuggingFace) | Yezhou Yang, Dimitris N. Metaxas, Song Wen, mpatel57 | FlowChef steers rectified flow models' denoising trajectories for controlled image generation. The paper investigates how to efficiently guide rectified flow models (RFMs) for tasks like image editing, classifier guidance, and solving linear inverse problems without computationally expensive inversion or backpropagation. The key methodology involves leveraging the smooth vector field dynamics of RFMs and a gradient skipping approach to directly adjust the trajectory during denoising. On linear inverse problems, FlowChef achieves 26.32 PSNR on box inpainting with a 20x20 mask, surpassing baselines on the pixel-space Rectified Flow++ model. This offers AI practitioners a computationally efficient and inversion-free method for controlled image generation using RFMs, potentially improving performance and reducing resource demands for applications like image editing and guided synthesis. |
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos (Read more on arXiv or HuggingFace) | Hangyu Guo, Haoze Zhao, Haoran Tang, Meng Cao, zhangysk | PhysGame introduces a benchmark to evaluate the ability of video LLMs to understand physical commonsense violations in gameplay videos. The research aimed to assess and improve video LLMs' ability to recognize glitches that defy real-world physics. Researchers created PhysGame, a benchmark with 880 videos of glitches, PhysInstruct, an instruction tuning dataset with 140,057 question-answer pairs, and PhysDPO, a preference optimization dataset with 34,358 pairs using misleading video data. Their proposed PhysVLM model, trained on these datasets, achieved state-of-the-art performance on PhysGame and an overall accuracy of 61.1% on the Video-MME benchmark with subtitles. This work provides a benchmark and resources for training video LLMs capable of robust physical commonsense reasoning, crucial for developing more realistic and reliable AI agents in game development and broader applications. |
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (Read more on arXiv or HuggingFace) | Gyoungsu Chae, Dongchan Min, Taekyung Ki | FLOAT generates talking portrait videos from a single source image and audio using a flow matching generative model. The objective is to synthesize realistic talking motions from audio, including lip synchronization, head movements, and facial expressions, while addressing limitations of diffusion-based methods like slow sampling. The key methodology involves modeling talking motion within a learned motion latent space using a transformer-based vector field predictor and decoding the sampled motion latents into video frames. On the HDTF dataset, FLOAT achieves a Fréchet Inception Distance (FID) of 21.100, outperforming compared baselines. This efficient and high-quality approach offers AI practitioners a more effective method for generating realistic and temporally consistent talking portrait videos. |
A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models (Read more on arXiv or HuggingFace) | Jingren Zhou, Bolin Ding, Yaliang Li, Xuchen Pan, yanxi-chen | This paper proposes a two-stage algorithm (generation followed by knockout) for scaling the test-time compute of Large Language Models (LLMs). The research aims to boost the success probability of LLMs by increasing test-time compute, specifically addressing the challenge of ensuring high reliability in high-stakes scenarios. The proposed algorithm involves generating multiple candidate solutions and selecting the best one through a knockout tournament with pairwise comparisons. On a subset of the MMLU-Pro benchmark, the algorithm's accuracy improved from approximately 60% to over 65% for the “engineering” category when scaling the number of initial candidate solutions (N) from 1 to 32 with comparison parameter K=2 using Llama3.1. AI practitioners can leverage this method to enhance LLM reliability for complex tasks by scaling test-time computation with provable performance guarantees, provided the underlying assumptions regarding solution generation and comparison probabilities hold. A sketch of the knockout tournament appears after this table. |
Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning (Read more on arXiv or HuggingFace) | Noel Crespi, Reza Farahbaksh, callmesan | This paper explores cross-lingual few-shot learning for audio abuse detection in low-resource languages. The research objective is to develop a model capable of detecting abusive language in multiple Indian languages using limited labeled data. The methodology involves extracting audio features using pre-trained Wav2Vec and Whisper models, normalizing these features using Temporal Mean or L2-Norm, and classifying them with a Model-Agnostic Meta-Learning (MAML) based few-shot classifier. Whisper with L2-Norm normalization achieved the highest accuracy, reaching 85.22% for Malayalam in the 100-shot setting. AI practitioners can leverage pre-trained audio representations and meta-learning techniques to develop robust abuse detection systems for low-resource languages, even with limited labeled data, highlighting the potential for improved content moderation across diverse linguistic groups. |
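
The TinyFusion entry describes differentiable sampling of layer masks co-optimized with a weight update. One common way to make such a binary keep/drop choice differentiable is a Gumbel-softmax gate on each block's residual branch; the class below sketches that substitute mechanism and should not be read as TinyFusion's exact sampler.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBlock(nn.Module):
    """A transformer block whose residual branch is scaled by a learnable keep-gate.

    Hedged sketch of differentiable depth pruning via a Gumbel-softmax relaxation,
    not TinyFusion's exact layer-mask sampling scheme.
    """
    def __init__(self, block: nn.Module, tau: float = 1.0):
        super().__init__()
        self.block = block
        self.tau = tau
        self.keep_logits = nn.Parameter(torch.zeros(2))  # logits for [drop, keep]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Differentiable (approximately one-hot) sample over {drop, keep}.
            gate = F.gumbel_softmax(self.keep_logits, tau=self.tau, hard=True)[1]
        else:
            gate = (self.keep_logits[1] > self.keep_logits[0]).float()
        return x + gate * self.block(x)   # gated residual branch
```
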
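
Presto's Segmented Cross-Attention, per the summary above, splits hidden states into temporal segments that each cross-attend to their own sub-caption. The function below is a minimal sketch of that routing, assuming pre-computed per-sub-caption text embeddings and a shared `nn.MultiheadAttention` module created with `batch_first=True`.

```python
import torch
import torch.nn as nn

def segmented_cross_attention(hidden: torch.Tensor,
                              sub_captions: list[torch.Tensor],
                              attn: nn.MultiheadAttention) -> torch.Tensor:
    """Sketch of segmented cross-attention (SCA).

    hidden:       video hidden states, shape [B, T, D], split along T into len(sub_captions) segments
    sub_captions: list of text embeddings, each of shape [B, L_i, D]
    attn:         a shared cross-attention module (batch_first=True assumed)
    """
    segments = hidden.chunk(len(sub_captions), dim=1)   # temporal segments
    outputs = []
    for seg, cap in zip(segments, sub_captions):
        # Each temporal segment attends only to its own sub-caption.
        out, _ = attn(query=seg, key=cap, value=cap)
        outputs.append(out)
    return torch.cat(outputs, dim=1)                    # reassemble the full sequence
```
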
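
The two-stage test-time compute algorithm (generation then knockout) summarized above maps almost directly to code: sample N candidates, then repeatedly pair them off and keep the majority winner of K pairwise comparisons. `generate` and `compare` below are assumed callables wrapping an LLM, not a specific API.

```python
import random
from typing import Callable

def generate_and_knockout(prompt: str,
                          generate: Callable[[str], str],
                          compare: Callable[[str, str, str], int],
                          n: int = 32,
                          k: int = 2) -> str:
    """Sketch of the two-stage (generation + knockout) test-time compute algorithm.

    generate(prompt)      -> one candidate solution
    compare(prompt, a, b) -> 0 if candidate a wins the pairwise comparison, 1 if b wins
    n: number of initial candidates; k: comparisons per pair (majority vote)
    """
    candidates = [generate(prompt) for _ in range(n)]

    while len(candidates) > 1:
        random.shuffle(candidates)
        next_round = []
        # Pair candidates off; an odd one out advances automatically.
        if len(candidates) % 2 == 1:
            next_round.append(candidates.pop())
        for a, b in zip(candidates[0::2], candidates[1::2]):
            b_wins = sum(compare(prompt, a, b) for _ in range(k))
            next_round.append(b if b_wins > k / 2 else a)   # ties default to a
        candidates = next_round
    return candidates[0]
```

With n=32 and k=2 this matches the N=32, K=2 setting mentioned in the summary; the provable guarantees rest on the stated assumptions about generation and comparison probabilities.
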
Title | Authors | Summary |
---|---|---|
On Domain-Specific Post-Training for Multimodal Large Language Models (Read more on arXiv or HuggingFace) | Xintong Zhang, doubling, edward2021, buaahsh, daixuancheng | This paper investigates domain-specific post-training for adapting general Multimodal Large Language Models (MLLMs) to specialized domains like biomedicine and food. The research aims to improve MLLM performance in specific domains through data synthesis and a novel single-stage training pipeline. A visual instruction synthesizer generates domain-specific tasks from image-caption pairs, filtered by a consistency check, and used for single-stage training alongside image captioning data. AdaMLLM, the resulting adapted MLLM, outperformed general MLLMs across various domain-specific tasks, with a 58.3% average performance on biomedical tasks using PMC-Raw image-caption data and single-stage training. This research provides AI practitioners with a method for efficiently adapting pre-trained MLLMs to specialized domains using readily available image-caption datasets, enabling enhanced performance on domain-specific downstream tasks. |
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS (Read more on arXiv or HuggingFace) | Zengqi Wen, Feihu Che, Shuai Zhang, fmk345, Jinyang23 | HiAR-ICL enhances in-context learning for complex reasoning tasks by focusing on high-level thinking patterns rather than specific examples. The research aims to improve LLM performance on complex reasoning tasks by shifting from example-based in-context learning to a paradigm based on abstract thinking patterns. The core methodology uses Monte Carlo Tree Search (MCTS) to explore reasoning paths and construct “thought cards” representing these patterns, which are then selected based on a cognitive complexity metric. HiAR-ICL achieves 79.6% accuracy on the MATH benchmark using Qwen2.5-7B-Instruct, outperforming GPT-4o (76.6%) and Claude 3.5 (71.1%). This implies AI practitioners can leverage high-level reasoning patterns and MCTS to enhance the performance and generalization of LLMs, especially smaller models, on complex reasoning tasks without extensive demonstration engineering. |
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model (Read more on arXiv or HuggingFace) | MoonQiu, weilllllls, Jeff-Wang, StevenZhang, LiewFeng | TeaCache accelerates video diffusion model inference by selectively caching intermediate model outputs. The research aimed to improve the inference speed of diffusion-based video generation models without compromising visual quality. The method estimates output differences using timestep embedding modulated noisy inputs and a rescaling strategy based on polynomial fitting to determine caching schedules. Experiments showed up to a 4.41x speedup on Open-Sora-Plan with a negligible -0.07% VBench score degradation. This training-free caching strategy offers AI practitioners a way to substantially reduce the computational cost of deploying state-of-the-art video diffusion models. A sketch of the reuse-or-recompute decision appears after this table. |
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding (Read more on arXiv or HuggingFace) | Mingu Kang, Minseo Kim, Jisoo Kim, junwann, whwjdqls99 | DisCoRD decodes discrete motion tokens into continuous motion using rectified flow to enhance naturalness while preserving faithfulness to conditioning signals. The research aimed to address the limitations of existing discrete and continuous human motion generation methods, specifically under-reconstruction and frame-wise noise in discrete methods, and cross-modal mapping ambiguity in continuous methods. The core methodology involves training a rectified flow model conditioned on frame-wise features extracted from discrete motion tokens, enabling iterative refinement in continuous space. On HumanML3D, DisCoRD achieved a Fréchet Inception Distance (FID) of 0.032, surpassing existing discrete methods in naturalness. This provides AI practitioners with a method to generate more realistic and faithful human motion from discrete representations, applicable to various motion generation tasks such as text-to-motion and music-to-dance generation. |
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs (Read more on arXiv or HuggingFace) | nav4, nailon-nvidia, talor-abr, tomer-nv, abercovich | Puzzle is a framework for accelerating LLM inference on specific hardware while preserving model capabilities. The research aimed to optimize large language model architectures for efficient inference on specific hardware while maintaining accuracy. The methodology involved decomposed neural architecture search (NAS) using blockwise local knowledge distillation (BLD), mixed-integer programming for constraint optimization, and global knowledge distillation (GKD). The derived model, Nemotron-51B, achieved a 2.17x inference throughput speedup on a single NVIDIA H100 GPU compared to its parent model, Llama-3.1-70B-Instruct, while preserving 98.4% of its capabilities. This provides AI practitioners with access to state-of-the-art language models optimized for efficient deployment with minimal accuracy trade-offs, enabling wider adoption across various applications and hardware. |
Trajectory Attention for Fine-grained Video Motion Control (Read more on arXiv or HuggingFace) | Xingang-Pan, Jianlou, PKUWilliamYang, Vicky0522, zeqixiao | This paper introduces trajectory attention for precise camera motion control in video generation. The research aims to improve the precision and consistency of camera motion control in generated videos, addressing limitations of existing methods that struggle with temporal coherence or rely on implicit control mechanisms. The core methodology involves modeling trajectory attention as an auxiliary branch alongside traditional temporal attention in video diffusion models, allowing explicit injection of trajectory information while maintaining the model's generative capabilities. Experiments on camera motion control for images show the method achieves an Absolute Trajectory Error (ATE) of 0.0396 meters on 25-frame sequences. This provides AI practitioners with a plug-and-play module for enhanced camera motion control in video diffusion models, improving the precision and consistency of generated video motion, particularly valuable for tasks requiring fine-grained control over camera movement. |
Video Depth without Video Models (Read more on arXiv or HuggingFace) | toshas, PeterTor, peterjohnson, dnarnhofer, Bingxin | RollingDepth estimates temporally consistent video depth using a modified single-image latent diffusion model (LDM). The research aimed to develop accurate and temporally stable video depth estimation without computationally expensive video diffusion models. The key methodology involved adapting a single-image LDM (Marigold) to process short video snippets, incorporating cross-frame self-attention and a robust, optimization-based global alignment algorithm. RollingDepth achieved a 9.6% absolute mean relative error on the PointOdyssey dataset, outperforming existing video and single-image depth models. This implies that AI practitioners can leverage modified single-image LDMs for efficient and accurate video depth estimation, avoiding the computational burden of dedicated video models. |
AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos (Read more on arXiv or HuggingFace) | bys0318, AlbertHuyb, lshmouse, thuzhaowang, hyz317 | AlphaTablets is a novel 3D plane representation for reconstructing planar surfaces from monocular videos. The research aimed to develop a more accurate and generalizable method for 3D planar reconstruction from monocular video input. The core methodology involved representing 3D planes as rectangles with alpha channels (AlphaTablets), differentiable rasterization for rendering, and a bottom-up pipeline incorporating optimization and a merging scheme. On the ScanNet dataset, the method achieved a 0.456 F-score for 3D geometry reconstruction, outperforming existing methods. This new representation and pipeline offer AI practitioners a more effective and flexible way to reconstruct and edit 3D planar structures from monocular videos, potentially improving applications in scene understanding, robotics, and mixed reality. |
Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing (Read more on arXiv or HuggingFace) | Hyunjun Kim, dwightro, arkimjh, lakelee | Video-Ma²mba is a novel large multimodal model designed for efficient long-form video understanding. The research aimed to address the challenge of quadratic memory and computational demands of transformer-based models when processing long video sequences. The key methodology involved replacing the transformer backbone with the linear-complexity Mamba-2 architecture and introducing Multi-Axis Gradient Checkpointing (MA-GC) for memory efficiency. Video-Ma²mba achieved a 4.1% improvement on the Video-MME benchmark compared to a 16-frame limited baseline. This implies that AI practitioners can leverage MA-GC within the Mamba-2 framework to process long video sequences (up to 2 hours at 1 FPS on a single GPU) more efficiently than transformer-based models, potentially improving performance in video understanding tasks by capturing more complete temporal information. |
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers (Read more on arXiv or HuggingFace) | willi-menapace, aliaksandr-siarohin, guochengqian, universome, sherwinbahmani | AC3D analyzes and improves 3D camera control within pre-trained video diffusion transformers. The research aims to enable precise 3D camera manipulation in video diffusion models without sacrificing video quality. The key methodology involves analyzing motion spectral volumes, linearly probing internal model representations for camera pose knowledge, and curating a dataset of dynamic videos with static cameras. Results show an 18% improvement in video fidelity (FVD) and 25% improvement in camera steering accuracy compared to the closest baseline. AI practitioners can leverage these insights to develop more precise and efficient camera control mechanisms for text-to-video generation and related applications by understanding how to condition camera pose within video diffusion transformer architectures and tailor training data to enhance scene dynamism while preserving camera control fidelity. |
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion (Read more on arXiv or HuggingFace) | Xiatian Zhu, Hai X. Pham, Isma Hadji, Adrian Bulat, Haosen Yang | FAM diffusion introduces two novel modules to improve high-resolution image generation with pre-trained latent diffusion models. The objective is to enable high-resolution image generation without retraining, addressing issues like object repetition and inconsistent local textures seen when upscaling. The key methodology involves a Frequency Modulation (FM) module, operating in the Fourier domain to enhance global structure consistency, and an Attention Modulation (AM) module to improve local texture consistency. FAM diffusion achieves state-of-the-art performance, demonstrating a CLIP score of 32.33 at 4x upscaling with SDXL, and significantly reducing latency compared to patch-based methods. This allows AI practitioners to generate high-quality, high-resolution images from pre-trained models without computationally expensive retraining or significant latency overheads. A sketch of a Fourier-domain low-frequency swap appears after this table. |
LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification (Read more on arXiv or HuggingFace) | nljubesi, TajaKuzman | This paper proposes a teacher-student framework using LLMs for multilingual news topic classification without manual annotation. The research aims to develop accurate and computationally efficient multilingual IPTC news topic classifiers for languages lacking annotated training data. The methodology employs GPT-4o to automatically annotate news articles in four languages, creating a training dataset for fine-tuning an XLM-RoBERTa student model. The XLM-RoBERTa model, trained on 15,000 automatically labeled instances, achieves a macro-F1 score of 0.746. This demonstrates the feasibility of using LLM-generated labels to train smaller, more efficient models for multilingual text classification, enabling AI practitioners to build robust classifiers for low-resource languages without extensive manual annotation efforts. |
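
TeaCache, per the entry above, decides when to reuse a cached model output by accumulating a rescaled estimate of how much the timestep-embedding-modulated input has changed between steps. The class below sketches that reuse-or-recompute control flow; the threshold and polynomial coefficients are placeholders, not the fitted values from the paper.

```python
import torch

class TeaCacheLikeScheduler:
    """Hedged sketch of a TeaCache-style 'reuse or recompute' decision."""

    def __init__(self, model, threshold: float = 0.1, poly=(1.0, 0.0)):
        self.model = model
        self.threshold = threshold
        self.poly = poly            # placeholder rescaling polynomial coefficients
        self.prev_input = None
        self.cached_output = None
        self.accum = 0.0            # accumulated (rescaled) relative change

    def step(self, modulated_input: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        if self.prev_input is not None:
            rel_change = ((modulated_input - self.prev_input).abs().mean()
                          / self.prev_input.abs().mean()).item()
            # Rescale the raw relative L1 change with a fitted polynomial (placeholder here).
            self.accum += sum(c * rel_change ** i for i, c in enumerate(reversed(self.poly)))
        self.prev_input = modulated_input.detach()

        if self.cached_output is not None and self.accum < self.threshold:
            return self.cached_output          # cheap step: reuse the cached model output

        # Expensive step: run the full model and reset the accumulated change.
        self.cached_output = self.model(modulated_input, *args, **kwargs)
        self.accum = 0.0
        return self.cached_output
```
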
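
The Frequency Modulation module in the FAM Diffusion entry operates in the Fourier domain to keep global structure consistent. A generic way to express that idea is to take low frequencies from a reference (low-resolution) pass and high frequencies from the high-resolution pass; the function below sketches such a swap with an assumed cutoff radius and equal feature shapes, and is not the paper's exact module.

```python
import torch

def swap_low_frequencies(hi_res_feat: torch.Tensor,
                         ref_feat: torch.Tensor,
                         cutoff: float = 0.25) -> torch.Tensor:
    """Keep low frequencies from ref_feat and high frequencies from hi_res_feat.

    Both tensors are assumed to share shape [B, C, H, W]; `cutoff` is the fraction
    of the spectrum (around DC) treated as 'low frequency'.
    """
    hi_fft = torch.fft.fftshift(torch.fft.fft2(hi_res_feat), dim=(-2, -1))
    ref_fft = torch.fft.fftshift(torch.fft.fft2(ref_feat), dim=(-2, -1))

    _, _, h, w = hi_res_feat.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    low_mask = (dist <= cutoff * min(h, w) / 2).to(hi_fft.device)

    mixed = torch.where(low_mask, ref_fft, hi_fft)      # low freqs from the reference pass
    mixed = torch.fft.ifftshift(mixed, dim=(-2, -1))
    return torch.fft.ifft2(mixed).real
```
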
Title | Authors | Summary |
---|---|---|
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning (Read more on arXiv or HuggingFace) | Jingdi Lei, jwu323, ZonglinY, Duke-de-Artois, qq8933 | Critic-V is a framework for enhancing the reasoning capabilities of Vision-Language Models (VLMs). The research aims to address the issue of VLMs generating inaccurate or irrelevant responses in multimodal reasoning tasks. The key methodology involves a Reasoner-Critic architecture, where a Reasoner VLM generates reasoning paths and a Critic VLM provides feedback for refinement using Direct Preference Optimization (DPO) trained on a critique-VQA dataset. Qwen2-VL-7B with Critic-V achieved the highest scores on five out of eight benchmarks, with an 11.8% improvement on MathVista compared to the baseline. This provides AI practitioners with a method to improve the reliability and accuracy of VLMs in reasoning-heavy multimodal applications by integrating an external critic model for real-time feedback during inference. |
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting (Read more on arXiv or HuggingFace) | Hangwei Qian, Weijia Wu, Zhuohang Dang, Changliang Xia, ChengyouJia | ChatGen automates the text-to-image generation process from free-form user input. The research aimed to develop a model that automatically generates prompts, selects appropriate models, and configures arguments for text-to-image generation from freestyle user text, image, or chat history. The authors introduce a multi-stage evolution strategy (ChatGen-Evo) incorporating supervised fine-tuning for prompt generation, ModelTokens for model selection, and in-context learning for argument configuration. ChatGen-Evo achieved a Unified Metric score of 65.9 in supervised settings, surpassing other baselines and demonstrating comparable performance to a much larger 8B parameter model while using only 2B parameters. This work suggests that focusing on stage-wise training for complex automated text-to-image generation tasks can yield significant performance improvements with smaller models, offering a potential path towards more efficient and accessible automated image generation for AI practitioners. |
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models (Read more on arXiv or HuggingFace) | Barbara Hammer, Robin Chan, Petra Bevandic, rizavelioglu | TryOffDiff reconstructs standardized garment images from photos of clothed individuals. The research objective is to generate canonical garment images from real-world photos, a task termed Virtual Try-Off (VTOFF). The key methodology involves adapting Stable Diffusion with SigLIP-based visual conditioning, replacing text prompts with image features. On the modified VITON-HD dataset, TryOffDiff achieves a DISTS score of 22.5, outperforming adapted VTON and pose transfer baselines. The paper mentions no background removal post-processing was applied to TryOffDiff while some form of removal was applied to baseline models; how this affects the comparison remains unclear. This work provides AI practitioners with a novel approach for high-fidelity garment reconstruction, potentially improving e-commerce product imagery and generative model evaluation. |
Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models (Read more on arXiv or HuggingFace) | Jong Chul Ye, Bryan S Kim, kjm981995 | Free$^2$Guide enhances text-video alignment in diffusion-based generative models without needing reward function gradients. The research aims to improve text alignment in text-to-video generation using non-differentiable reward functions like Large Vision-Language Models (LVLMs). The method approximates guidance by combining path integral control with zeroth-order gradient estimations and enables ensembling multiple reward models. Using GPT-4o with LaVie for text-video alignment showed a 28.6% improvement on the Spatial Relationship metric compared to the baseline LaVie model. This offers AI practitioners a way to leverage powerful black-box LVLMs for improved text-video alignment without needing model fine-tuning or differentiable reward functions, thereby potentially reducing computational overhead. |
Morph: A Motion-free Physics Optimization Framework for Human Motion Generation (Read more on arXiv or HuggingFace) | Hao Liu, Xin Zhao, Ruibing Hou, Mingshuang Luo, Zhuo Li | Morph enhances the physical plausibility of generated human motion without using real motion data. The research aimed to develop a model-agnostic physics optimization method that doesn't require costly real motion capture data. A two-stage process trains a Motion Physics Refinement (MPR) module on synthetic noisy motion data from a generator, then uses the refined output to fine-tune the original generator. On the HumanML3D dataset, Morph-MoMask reduced ground penetration errors from 23.152 to 0.0. AI practitioners can use Morph to improve the physical realism of generated motions across diverse motion generation models and tasks (text-to-motion, music-to-dance) without needing expensive real-world motion datasets. |
LongKey: Keyphrase Extraction for Long Documents (Read more on arXiv or HuggingFace) | Jean Paul Barddal, Cinthia Obladen de Almendra Freitas, Jeovane Honorio Alves, RaduState | LongKey is a novel framework for extracting keyphrases from long documents. The research aimed to address the limitations of existing keyphrase extraction methods on long-context documents (more than 512 tokens). The methodology uses Longformer for word embeddings, a max-pooling-based keyphrase embedding pooler (a simplified pooling sketch follows this table), and a ranking loss combined with a chunking loss for candidate scoring. On the LDKP10K dataset, LongKey achieved an F1@5 score of 41.81%. The keyphrase embedding pooler contributes significantly to LongKey's improved performance, giving AI practitioners a more effective technique for extracting keyphrases from lengthy texts in information-retrieval and summarization pipelines. |
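
The TryOffDiff entry above describes adapting Stable Diffusion so that SigLIP image features take the place of text-prompt embeddings. The following is a minimal sketch of that conditioning idea only, not the paper's implementation; the class name, the dimensions (1152-d image tokens, 1024-d conditioning, at most 77 tokens), and the simple linear-plus-LayerNorm design are illustrative assumptions.

```python
import torch
from torch import nn

class ImageConditioningAdapter(nn.Module):
    """Project image-encoder patch tokens into the shape a diffusion UNet expects
    for cross-attention, standing in for the usual text-encoder output.
    All dimensions here are placeholder assumptions, not the paper's values."""

    def __init__(self, img_dim: int = 1152, cond_dim: int = 1024, max_tokens: int = 77):
        super().__init__()
        self.proj = nn.Linear(img_dim, cond_dim)   # map image features to conditioning width
        self.norm = nn.LayerNorm(cond_dim)
        self.max_tokens = max_tokens

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, img_dim) from a frozen image encoder
        cond = self.norm(self.proj(image_feats))
        return cond[:, : self.max_tokens]          # (batch, <=max_tokens, cond_dim)
```

The resulting tensor would be passed wherever the text-encoder output normally enters the denoising network's cross-attention layers.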
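
Free$^2$Guide steers sampling with rewards whose gradients are unavailable (e.g., an LVLM judge), which is why its summary mentions zeroth-order gradient estimation. Below is a generic two-sided Gaussian-smoothing estimator for a black-box scalar reward; it illustrates the gradient-free estimation idea only, not the paper's path-integral procedure, and `reward_fn`, `latents`, and the hyperparameters are placeholders.

```python
import torch

def zeroth_order_grad(reward_fn, latents: torch.Tensor,
                      num_samples: int = 8, sigma: float = 0.1) -> torch.Tensor:
    """Estimate d(reward)/d(latents) for a non-differentiable reward function.

    Two-sided smoothing estimator:
        grad ~= mean over u of (r(x + sigma*u) - r(x - sigma*u)) / (2*sigma) * u,  u ~ N(0, I)
    `reward_fn` maps a latent tensor to a scalar score (e.g. an LVLM alignment rating).
    """
    grad = torch.zeros_like(latents)
    for _ in range(num_samples):
        u = torch.randn_like(latents)
        r_plus = reward_fn(latents + sigma * u)
        r_minus = reward_fn(latents - sigma * u)
        grad += (r_plus - r_minus) / (2.0 * sigma) * u
    return grad / num_samples
```

An estimate like this can stand in for a true reward gradient during guided sampling; ensembling several reward models could, for instance, be approximated by averaging their estimates.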
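
LongKey's summary mentions a max-pooling-based keyphrase embedding pooler over Longformer token embeddings. The sketch below is one simplified reading of that idea: each occurrence of a candidate phrase is pooled over its tokens, then occurrences are pooled into a single candidate embedding. The actual LongKey pooler and its scoring losses differ in detail; the names and shapes here are assumptions.

```python
import torch

def keyphrase_embeddings(token_embs: torch.Tensor, candidate_spans: dict) -> dict:
    """Max-pool token embeddings into one vector per keyphrase candidate.

    token_embs: (seq_len, dim) contextual embeddings, e.g. from Longformer.
    candidate_spans: {phrase: [(start, end), ...]} token spans of each occurrence.
    """
    out = {}
    for phrase, spans in candidate_spans.items():
        # pool each occurrence over its tokens, then pool across occurrences
        occurrence_embs = torch.stack(
            [token_embs[start:end].max(dim=0).values for start, end in spans]
        )
        out[phrase] = occurrence_embs.max(dim=0).values
    return out
```

Each candidate embedding could then be scored against the document, e.g. by a ranker trained with the ranking and chunking losses mentioned in the summary.
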
Title | Authors | Summary |
---|---|---|
ROICtrl: Boosting Instance Control for Visual Generation (Read more on arXiv or HuggingFace) | KevinQHLin, pcma, ynie, 365sleep, guyuchao | Here is a concise summary of the paper: i) ROICtrl enhances diffusion models for precise multi-instance visual generation by introducing regional instance control via ROI-Align and a novel ROI-Unpool operation. ii) The research aimed to improve the accuracy and efficiency of multi-instance generation by addressing the difficulty of associating positional and attribute information with multiple instances through natural-language prompts alone. iii) The key methodology pairs ROI-Align with the complementary ROI-Unpool operation to enable efficient, accurate manipulation of regions of interest (ROIs) on high-resolution feature maps, followed by a learnable attention-blending mechanism that integrates instance captions with the global caption. iv) ROICtrl achieved a 0.73 instance success rate on the ROICtrl-Bench benchmark, surpassing previous methods in both template-based and free-form instance caption tasks; results on additional benchmarks are reported in the paper but not reproduced here. v) For AI practitioners working on visual generation, ROI-Unpool enables more precise control over multiple instances within generated images while improving the accuracy and computational efficiency of multi-instance image synthesis. |
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment (Read more on arXiv or HuggingFace) | ranjaykrishna, Tim666, lzy8465, Dipsy0830, shuaishuaicdp | This paper introduces ISG, a framework for evaluating interleaved text-and-image generation. The research aims to address the lack of robust evaluation metrics for models generating interleaved text and images. The ISG framework uses a scene graph representation and a four-level (holistic, structural, block, image) evaluation protocol leveraging question-answering feedback. Compositional models achieved a higher holistic score of 6.262 compared to 2.961 for the best unified model, though still lagging behind human performance. AI practitioners developing multimodal generative models should consider compositional architectures and the fine-grained insights provided by ISG for improving model performance and addressing limitations like instruction following and consistency across modalities. |
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models (Read more on arXiv or HuggingFace) | Ruiqi Gao, holynski, atrevithick, doinkda, rundi | Here is a concise summary of the paper: i) CAT4D generates dynamic 3D scenes from monocular video using a multi-view video diffusion model and a deformable 3D Gaussian representation. ii) The objective is to create 4D (dynamic 3D) scenes from monocular video input, removing the need for synchronized multi-view video data in 4D reconstruction. iii) A multi-view video diffusion model trained on diverse datasets transforms a single monocular video into multi-view videos, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation; a novel sampling strategy generates nearly consistent multi-view videos beyond the model's native output length. iv) The model achieves competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks and demonstrates disentangled camera and time control (21.97 PSNR, 0.683 SSIM, 0.121 LPIPS on disentangled-control experiments on the NSFF dataset). v) The disentangled camera and time control benefits AI practitioners working on video generation, 3D reconstruction, and augmented/virtual reality by providing a more robust way to create dynamic 3D content from readily available monocular video; the paper notes that robustness on highly dynamic scenes remains uncertain and calls for further research. |
Large Language Model-Brained GUI Agents: A Survey (Read more on arXiv or HuggingFace) | Gezelligheid520, liqul, bowenli, shilhe, vyokky | This paper surveys Large Language Model (LLM)-brained GUI agents, intelligent agents operating within GUI environments using LLMs. The objective is to provide a comprehensive overview of this burgeoning field, covering historical evolution, core components, and advanced techniques. The survey analyzes existing frameworks, data collection methods, model training strategies, evaluation benchmarks, and applications of LLM GUI agents. SeeAct, a multimodal LLM GUI agent, achieved a 51.1% task success rate on real-time web tasks. AI practitioners can use this survey as a guide for constructing LLM-powered GUI agents and as a reference for advancing research in this domain, particularly in optimizing model performance for complex, real-world GUI interactions. |
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation (Read more on arXiv or HuggingFace) | Sankalp Sinha, mzafzal, saali14, alootikki, SadilKhan | This paper introduces MARVEL-40M+, a large-scale, multi-level annotated dataset for text-to-3D content generation. The objective is to address the limitations of existing text-to-3D datasets in size, diversity, and annotation depth, hindering high-fidelity 3D model generation. A multi-stage annotation pipeline combining multi-view VLMs (InternVL2), LLMs (Qwen 2.5), and filtered human metadata creates five levels of descriptions for over 8.9 million 3D assets. Evaluation shows MARVEL-40M+ achieves a 72.41% win rate against existing datasets in image-text alignment as judged by GPT-4. AI practitioners can leverage MARVEL-40M+ to train and evaluate more robust and higher-fidelity text-to-3D generation models, benefiting applications in gaming, AR, and VR by providing a significantly richer and larger training resource. |
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient (Read more on arXiv or HuggingFace) | Xinchao Wang, Gongfan Fang, horseee, Zigeng | Here is a concise summary of the paper: i) One-line summary: Collaborative Decoding (CoDe) improves Visual Auto-Regressive (VAR) model efficiency by partitioning multi-scale inference between a large and a small model, yielding significant speed and memory reductions with minimal quality loss. ii) Main research question/objective: How can the efficiency of Visual Auto-Regressive (VAR) image generation models be improved, particularly the memory consumption and computational redundancy associated with long token sequences? iii) Key methodology: A decoding strategy called Collaborative Decoding (CoDe) divides the multi-scale inference process between a "drafter" (a large model generating low-frequency content) and a "refiner" (a small model generating high-frequency details), with model-specific fine-tuning. iv) Primary results: CoDe achieves a 1.7x speedup and reduces memory usage by approximately 50% compared to the original VAR model, with only a negligible increase in FID (from 1.95 to 1.98); a 2.9x speedup is achieved under different drafting-step settings. v) Principal implication for AI practitioners: CoDe offers a practical way to significantly enhance the efficiency of VAR image generation, reducing both computational cost and memory requirements without substantial quality degradation, which is particularly relevant for deploying high-resolution image generation on resource-constrained platforms. |
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving (Read more on arXiv or HuggingFace) | Haoran Yin, xinggangw, bojiang-bentoml, csy71, LegendBC | Here is a concise summary of the paper: i) DiffusionDrive, a truncated diffusion model, achieves real-time end-to-end autonomous driving performance superior to existing methods. ii) The objective is a real-time, high-quality, multi-mode end-to-end autonomous driving policy that addresses the limitations of existing methods (mode collapse and computational cost). iii) The key methodology is a truncated diffusion policy incorporating prior multi-mode anchors, an efficient cascade diffusion decoder, and a reduced number of denoising steps. iv) On the NAVSIM navtest split, DiffusionDrive achieved 88.1 PDMS without post-processing, exceeding the state of the art. v) The significant speed improvement (45 FPS on an NVIDIA 4090 GPU) and high performance with a ResNet-34 backbone demonstrate the potential of truncated diffusion models for real-time autonomous driving, directly affecting the feasibility of deploying diffusion policies in resource-constrained real-world settings. |
DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching (Read more on arXiv or HuggingFace) | Diego Valsesia, emagli, mosams, u-michieli, Ema97x | DreamCache is a finetuning-free, lightweight approach for personalized image generation. The research aimed to develop an efficient and high-quality personalized image generation method overcoming limitations of existing approaches. DreamCache employs a feature caching mechanism with lightweight, trained conditioning adapters to dynamically modulate generated image features. The method achieved state-of-the-art image and text alignment with only 25M additional parameters; specifically, DreamCache achieved a DINO score of 0.767 on the SD 2.1 backbone with a single reference image. This efficient personalization approach significantly reduces computational costs and memory demands, making it suitable for resource-constrained devices and real-time applications. |
Identity-Preserving Text-to-Video Generation by Frequency Decomposition (Read more on arXiv or HuggingFace) | Yunyuan Ge, LiuhanChen, hexianyi, Jinfa, BestWishYsh | Here is a concise summary of the paper: i) One-line summary: ConsisID, a tuning-free diffusion-transformer-based model, generates high-fidelity, identity-preserving videos by controlling identity features in the frequency domain. ii) Main research question/objective: To develop a tuning-free identity-preserving text-to-video generation model that maintains consistent human identity in generated videos and addresses limitations of existing Diffusion Transformer (DiT) based models. iii) Key methodology: Frequency decomposition of identity features into high-frequency (intrinsic) and low-frequency (global) components injected into different DiT layers, plus a hierarchical training strategy combining coarse-to-fine training, a dynamic mask loss, and a dynamic cross-face loss. iv) Primary results: ConsisID outperforms ID-Animator across multiple metrics, achieving a FaceSim-Arc score of 0.73 versus ID-Animator's 0.32; other metrics, including FID, CLIPScore, and FaceSim-Cur, are also reported. v) Principal implication for AI practitioners: the frequency-decomposition approach and hierarchical training strategy provide a tuning-free method for identity-preserving video generation with DiT models, improving efficiency and generalization over previous tuning-based methods and reducing the computational cost of applying DiT to this task. |
Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis (Read more on arXiv or HuggingFace) | Xiaoming Li, cavanloy, OAOA, itsmag11 | Here is a concise summary of the paper: i) One-line summary: A single parameter, ω