Computation and Language 74
☆ RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy
Reasoning before action and imagining potential outcomes (i.e., world models)
are essential for embodied agents operating in complex open-world environments.
Yet, prior work either incorporates only one of these abilities in an
end-to-end agent or integrates multiple specialized models into an agent
system, limiting the learning efficiency and generalization of the policy.
Thus, this paper makes the first attempt to synergize Reasoning and Imagination
in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end
manner, we construct a data pipeline that progressively integrates and enriches
the content of imagination and reasoning in the trajectories collected from
existing agents. The joint learning of reasoning and next image generation
explicitly models the inherent correlation between reasoning, action, and
dynamics of environments, and thus exhibits more than $17\times$ sample
efficiency improvements and generalization in comparison with previous works.
During inference, RIG first reasons about the next action, produces potential
action, and then predicts the action outcomes, which offers the agent a chance
to review and self-correct based on the imagination before taking real actions.
Experimental results show that the synergy of reasoning and imagination not
only improves the robustness, generalization, and interoperability of
generalist policy but also enables test-time scaling to enhance overall
performance.
☆ Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models
Rui Wang, Hongru Wang, Boyang Xue, Jianhui Pang, Shudong Liu, Yi Chen, Jiahao Qiu, Derek Fai Wong, Heng Ji, Kam-Fai Wong
Recent advancements in Large Language Models (LLMs) have significantly
enhanced their ability to perform complex reasoning tasks, transitioning from
fast and intuitive thinking (System 1) to slow and deep reasoning (System 2).
While System 2 reasoning improves task accuracy, it often incurs substantial
computational costs due to its slow thinking nature and inefficient or
unnecessary reasoning behaviors. In contrast, System 1 reasoning is
computationally efficient but leads to suboptimal performance. Consequently, it
is critical to balance the trade-off between performance (benefits) and
computational costs (budgets), giving rise to the concept of reasoning economy.
In this survey, we provide a comprehensive analysis of reasoning economy in
both the post-training and test-time inference stages of LLMs, encompassing i)
the cause of reasoning inefficiency, ii) behavior analysis of different
reasoning patterns, and iii) potential solutions to achieve reasoning economy.
By offering actionable insights and highlighting open challenges, we aim to
shed light on strategies for improving the reasoning economy of LLMs, thereby
serving as a valuable resource for advancing research in this evolving area. We
also provide a public repository to continually track developments in this
fast-evolving field.
comment: In Progress; Paper list Repo:
https://github.com/DevoAllen/Awesome-Reasoning-Economy-Papers
☆ Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
Recent advancements in Chain of Thought (COT) generation have significantly
improved the reasoning capabilities of Large Language Models (LLMs), with
reinforcement learning (RL) emerging as an effective post-training approach.
Multimodal Large Language Models (MLLMs) inherit this reasoning potential but
remain underexplored in tasks requiring both perception and logical reasoning.
To address this, we introduce SEED-Bench-R1, a benchmark designed to
systematically evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in
the format of multiple-choice questions, requiring sophisticated perception and
reasoning. SEED-Bench-R1 assesses generalization through a three-level
hierarchy: in-distribution, cross-environment, and cross-environment-task
scenarios, equipped with a large-scale training dataset with easily verifiable
ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL
with supervised fine-tuning (SFT), demonstrating RL's data efficiency and
superior performance on both in-distribution and out-of-distribution tasks,
even outperforming SFT on general video understanding benchmarks like
LongVideoBench. Our detailed analysis reveals that RL enhances visual
perception but often produces less logically coherent reasoning chains. We
identify key limitations such as inconsistent reasoning and overlooked visual
cues, and suggest future improvements in base model reasoning, reward modeling,
and RL robustness against noisy signals.
comment: Technical Report (In Progress); Code released at:
https://github.com/TencentARC/SEED-Bench-R1
☆ Effectively Controlling Reasoning Models through Thinking Intervention
Reasoning-enhanced large language models (LLMs) explicitly generate
intermediate reasoning steps prior to generating final answers, helping the
model excel in complex problem-solving. In this paper, we demonstrate that this
emerging generation framework offers a unique opportunity for more fine-grained
control over model behavior. We propose Thinking Intervention, a novel paradigm
designed to explicitly guide the internal reasoning processes of LLMs by
strategically inserting or revising specific thinking tokens. We conduct
comprehensive evaluations across multiple tasks, including instruction
following on IFEval, instruction hierarchy on SEP, and safety alignment on
XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention
significantly outperforms baseline prompting approaches, achieving up to 6.7%
accuracy gains in instruction-following scenarios, 15.4% improvements in
reasoning about instruction hierarchies, and a 40.0% increase in refusal rates
for unsafe prompts using open-source DeepSeek R1 models. Overall, our work
opens a promising new research avenue for controlling reasoning LLMs.
☆ Query and Conquer: Execution-Guided SQL Generation
We propose a novel approach for generating complex outputs that significantly
improves accuracy in text-to-SQL tasks. Our method leverages execution results
to select the most semantically consistent query from multiple candidates,
enabling smaller, cost-effective models to surpass computationally intensive
reasoning methods such as o1, o3-mini, and DeepSeek R1 while reducing inference
cost by as much as 30 times. It integrates effortlessly with existing models,
offering a practical and scalable pathway to state-of-the-art SQL generation.
☆ SQuat: Subspace-orthogonal KV Cache Quantization
The key-value (KV) cache accelerates LLMs decoding by storing KV tensors from
previously generated tokens. It reduces redundant computation at the cost of
increased memory usage. To mitigate this overhead, existing approaches compress
KV tensors into lower-bit representations; however, quantization errors can
accumulate as more tokens are generated, potentially resulting in undesired
outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache
quantization). It first constructs a subspace spanned by query tensors to
capture the most critical task-related information. During key tensor
quantization, it enforces that the difference between the (de)quantized and
original keys remains orthogonal to this subspace, minimizing the impact of
quantization errors on the attention mechanism's outputs. SQuat requires no
model fine-tuning, no additional calibration dataset for offline learning, and
is grounded in a theoretical framework we develop. Through numerical
experiments, we show that our method reduces peak memory by 2.17 to 2.82,
improves throughput by 2.45 to 3.60, and achieves more favorable benchmark
scores than existing KV cache quantization algorithms.
☆ ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion
Parameter generation has emerged as a novel paradigm for neural network
development, offering an alternative to traditional neural network training by
synthesizing high-quality model weights directly. In the context of Low-Rank
Adaptation (LoRA) for evolving ($\textit{i.e.}$, constantly updated) large
language models (LLMs), this approach promises efficient adaptation without
costly retraining. However, existing methods face critical limitations in
simultaneously achieving scalability and controllability. In this paper, we
introduce $\texttt{ORAL}$, a novel $\textbf{conditional recurrent diffusion}$
framework that addresses these challenges. $\texttt{ORAL}$ incorporates a novel
conditioning mechanism that integrates model architecture and textual task
specifications, enabling the generation of task-specific LoRA parameters that
can seamlessly transfer across evolving foundation models. Our approach
successfully scales to billions-of-parameter LLMs and maintains
controllability. Through extensive experiments across seven language tasks,
four vision tasks, and three multimodal tasks using five pre-trained LLMs, we
demonstrate that $\texttt{ORAL}$ generates high-quality LoRA parameters that
achieve comparable or superior performance to vanilla trained counterparts.
☆ BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models
In this research, we introduce BEATS, a novel framework for evaluating Bias,
Ethics, Fairness, and Factuality in Large Language Models (LLMs). Building upon
the BEATS framework, we present a bias benchmark for LLMs that measure
performance across 29 distinct metrics. These metrics span a broad range of
characteristics, including demographic, cognitive, and social biases, as well
as measures of ethical reasoning, group fairness, and factuality related
misinformation risk. These metrics enable a quantitative assessment of the
extent to which LLM generated responses may perpetuate societal prejudices that
reinforce or expand systemic inequities. To achieve a high score on this
benchmark a LLM must show very equitable behavior in their responses, making it
a rigorous standard for responsible AI evaluation. Empirical results based on
data from our experiment show that, 37.65\% of outputs generated by industry
leading models contained some form of bias, highlighting a substantial risk of
using these models in critical decision making systems. BEATS framework and
benchmark offer a scalable and statistically rigorous methodology to benchmark
LLMs, diagnose factors driving biases, and develop mitigation strategies. With
the BEATS framework, our goal is to help the development of more socially
responsible and ethically aligned AI models.
comment: 32 pages, 33 figures, preprint version
☆ A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG
This study presents a systematic comparison of three approaches for the
analysis of mental health text using large language models (LLMs): prompt
engineering, retrieval augmented generation (RAG), and fine-tuning. Using LLaMA
3, we evaluate these approaches on emotion classification and mental health
condition detection tasks across two datasets. Fine-tuning achieves the highest
accuracy (91% for emotion classification, 80% for mental health conditions) but
requires substantial computational resources and large training sets, while
prompt engineering and RAG offer more flexible deployment with moderate
performance (40-68% accuracy). Our findings provide practical insights for
implementing LLM-based solutions in mental health applications, highlighting
the trade-offs between accuracy, computational requirements, and deployment
flexibility.
☆ Is analogy enough to draw novel adjective-noun inferences? SC
Recent work (Ross et al., 2025, 2024) has argued that the ability of humans
and LLMs respectively to generalize to novel adjective-noun combinations shows
that they each have access to a compositional mechanism to determine the
phrase's meaning and derive inferences. We study whether these inferences can
instead be derived by analogy to known inferences, without need for
composition. We investigate this by (1) building a model of analogical
reasoning using similarity over lexical items, and (2) asking human
participants to reason by analogy. While we find that this strategy works well
for a large proportion of the dataset of Ross et al. (2025), there are novel
combinations for which both humans and LLMs derive convergent inferences but
which are not well handled by analogy. We thus conclude that the mechanism
humans and LLMs use to generalize in these cases cannot be fully reduced to
analogy, and likely involves composition.
comment: 8 pages (16 pages with appendix). Submitted to SCiL 2025
☆ Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
We introduce Open-Reasoner-Zero, the first open source implementation of
large-scale reasoning-oriented RL training focusing on scalability, simplicity
and accessibility. Through extensive experiments, we demonstrate that a
minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and
straightforward rule-based rewards, without any KL regularization, is
sufficient to scale up both response length and benchmark performance, similar
to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as
DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on
AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating
remarkable efficiency -- requiring only a tenth of the training steps, compared
to DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our
source code, parameter settings, training data, and model weights across
various sizes.
☆ Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning
We propose Rec-R1, a general reinforcement learning framework that bridges
large language models (LLMs) with recommendation systems through closed-loop
optimization. Unlike prompting and supervised fine-tuning (SFT), Rec-R1
directly optimizes LLM generation using feedback from a fixed black-box
recommendation model, without relying on synthetic SFT data from proprietary
models such as GPT-4o. This avoids the substantial cost and effort required for
data distillation. To verify the effectiveness of Rec-R1, we evaluate it on two
representative tasks: product search and sequential recommendation.
Experimental results demonstrate that Rec-R1 not only consistently outperforms
prompting- and SFT-based methods, but also achieves significant gains over
strong discriminative baselines, even when used with simple retrievers such as
BM25. Moreover, Rec-R1 preserves the general-purpose capabilities of the LLM,
unlike SFT, which often impairs instruction-following and reasoning. These
findings suggest Rec-R1 as a promising foundation for continual task-specific
adaptation without catastrophic forgetting.
☆ MaintainCoder: Maintainable Code Generation Under Dynamic Requirements
Modern code generation has made significant strides in functional correctness
and execution efficiency. However, these systems often overlook a critical
dimension in real-world software development: maintainability. To handle
dynamic requirements with minimal rework, we propose MaintainCoder as a
pioneering solution. It integrates Waterfall model, design patterns, and
multi-agent collaboration to systematically enhance cohesion, reduce coupling,
and improve adaptability. We also introduce MaintainBench, a benchmark
comprising requirement changes and corresponding dynamic metrics on
maintainance effort. Experiments demonstrate that existing code generation
methods struggle to meet maintainability standards when requirements evolve. In
contrast, MaintainCoder improves maintainability metrics by 14-30% with even
higher correctness, i.e. pass@k. Our work not only provides the foundation of
maintainable code generation, but also highlights the need for more holistic
code quality research. Resources:
https://github.com/IAAR-Shanghai/MaintainCoder.
☆ Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation
Large language models (LLMs) have made significant progress in
general-purpose natural language processing tasks. However, LLMs are still
facing challenges when applied to domain-specific areas like
telecommunications, which demands specialized expertise and adaptability to
evolving standards. This paper presents a novel framework that combines
knowledge graph (KG) and retrieval-augmented generation (RAG) techniques to
enhance LLM performance in the telecom domain. The framework leverages a KG to
capture structured, domain-specific information about network protocols,
standards, and other telecom-related entities, comprehensively representing
their relationships. By integrating KG with RAG, LLMs can dynamically access
and utilize the most relevant and up-to-date knowledge during response
generation. This hybrid approach bridges the gap between structured knowledge
representation and the generative capabilities of LLMs, significantly enhancing
accuracy, adaptability, and domain-specific comprehension. Our results
demonstrate the effectiveness of the KG-RAG framework in addressing complex
technical queries with precision. The proposed KG-RAG model attained an
accuracy of 88% for question answering tasks on a frequently used
telecom-specific dataset, compared to 82% for the RAG-only and 48% for the
LLM-only approaches.
comment: This work has been accepted to ICC 2025 IEEE International Conference
on Communications. copyright 2025 IEEE
☆ What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, Chen Ma
As enthusiasm for scaling computation (data and parameters) in the
pretraining era gradually diminished, test-time scaling (TTS), also referred to
as ``test-time computing'' has emerged as a prominent research focus. Recent
studies demonstrate that TTS can further elicit the problem-solving
capabilities of large language models (LLMs), enabling significant
breakthroughs not only in specialized reasoning tasks, such as mathematics and
coding, but also in general tasks like open-ended Q&A. However, despite the
explosion of recent efforts in this area, there remains an urgent need for a
comprehensive survey offering a systemic understanding. To fill this gap, we
propose a unified, multidimensional framework structured along four core
dimensions of TTS research: what to scale, how to scale, where to scale, and
how well to scale. Building upon this taxonomy, we conduct an extensive review
of methods, application scenarios, and assessment aspects, and present an
organized decomposition that highlights the unique functional roles of
individual techniques within the broader TTS landscape. From this analysis, we
distill the major developmental trajectories of TTS to date and offer hands-on
guidelines for practical deployment. Furthermore, we identify several open
challenges and offer insights into promising future directions, including
further scaling, clarifying the functional essence of techniques, generalizing
to more tasks, and more attributions.
☆ PAARS: Persona Aligned Agentic Retail Shoppers
In e-commerce, behavioral data is collected for decision making which can be
costly and slow. Simulation with LLM powered agents is emerging as a promising
alternative for representing human population behavior. However, LLMs are known
to exhibit certain biases, such as brand bias, review rating bias and limited
representation of certain groups in the population, hence they need to be
carefully benchmarked and aligned to user behavior. Ultimately, our goal is to
synthesise an agent population and verify that it collectively approximates a
real sample of humans. To this end, we propose a framework that: (i) creates
synthetic shopping agents by automatically mining personas from anonymised
historical shopping data, (ii) equips agents with retail-specific tools to
synthesise shopping sessions and (iii) introduces a novel alignment suite
measuring distributional differences between humans and shopping agents at the
group (i.e. population) level rather than the traditional "individual" level.
Experimental results demonstrate that using personas improves performance on
the alignment suite, though a gap remains to human behaviour. We showcase an
initial application of our framework for automated agentic A/B testing and
compare the findings to human results. Finally, we discuss applications,
limitations and challenges setting the stage for impactful future work.
☆ BAR-Analytics: A Web-based Platform for Analyzing Information Spreading Barriers in News: Comparative Analysis Across Multiple Barriers and Events
This paper presents BAR-Analytics, a web-based, open-source platform designed
to analyze news dissemination across geographical, economic, political, and
cultural boundaries. Using the Russian-Ukrainian and Israeli-Palestinian
conflicts as case studies, the platform integrates four analytical methods:
propagation analysis, trend analysis, sentiment analysis, and temporal topic
modeling. Over 350,000 articles were collected and analyzed, with a focus on
economic disparities and geographical influences using metadata enrichment. We
evaluate the case studies using coherence, sentiment polarity, topic frequency,
and trend shifts as key metrics. Our results show distinct patterns in news
coverage: the Israeli-Palestinian conflict tends to have more negative
sentiment with a focus on human rights, while the Russia-Ukraine conflict is
more positive, emphasizing election interference. These findings highlight the
influence of political, economic, and regional factors in shaping media
narratives across different conflicts.
comment: 46 pages
☆ MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing
We propose a unified framework that integrates object detection (OD) and
visual grounding (VG) for remote sensing (RS) imagery. To support conventional
OD and establish an intuitive prior for VG task, we fine-tune an open-set
object detector using referring expression data, framing it as a partially
supervised OD task. In the first stage, we construct a graph representation of
each image, comprising object queries, class embeddings, and proposal
locations. Then, our task-aware architecture processes this graph to perform
the VG task. The model consists of: (i) a multi-branch network that integrates
spatial, visual, and categorical features to generate task-aware proposals, and
(ii) an object reasoning network that assigns probabilities across proposals,
followed by a soft selection mechanism for final referring object localization.
Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG
datasets, achieving significant improvements over state-of-the-art methods
while retaining classical OD capabilities. The code will be available in our
repository: \url{https://github.com/rd20karim/MB-ORES}.
☆ Synthetic News Generation for Fake News Classification
This study explores the generation and evaluation of synthetic fake news
through fact based manipulations using large language models (LLMs). We
introduce a novel methodology that extracts key facts from real articles,
modifies them, and regenerates content to simulate fake news while maintaining
coherence. To assess the quality of the generated content, we propose a set of
evaluation metrics coherence, dissimilarity, and correctness. The research also
investigates the application of synthetic data in fake news classification,
comparing traditional machine learning models with transformer based models
such as BERT. Our experiments demonstrate that transformer models, especially
BERT, effectively leverage synthetic data for fake news detection, showing
improvements with smaller proportions of synthetic data. Additionally, we find
that fact verification features, which focus on identifying factual
inconsistencies, provide the most promising results in distinguishing synthetic
fake news. The study highlights the potential of synthetic data to enhance fake
news detection systems, offering valuable insights for future research and
suggesting that targeted improvements in synthetic data generation can further
strengthen detection models.
comment: 13 pages, 8 figures
☆ TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers' Guidance
Large Language Models (LLMs) have made significant strides in problem-solving
by incorporating reasoning processes. However, this enhanced reasoning
capability results in an increased number of output tokens during inference,
leading to higher computational costs. To address this challenge, we propose
TwT (Thinking without Tokens), a method that reduces inference-time costs
through habitual reasoning distillation with multi-teachers' guidance, while
maintaining high performance. Our approach introduces a Habitual Reasoning
Distillation method, which internalizes explicit reasoning into the model's
habitual behavior through a Teacher-Guided compression strategy inspired by
human cognition. Additionally, we propose Dual-Criteria Rejection Sampling
(DCRS), a technique that generates a high-quality and diverse distillation
dataset using multiple teacher models, making our method suitable for
unsupervised scenarios. Experimental results demonstrate that TwT effectively
reduces inference costs while preserving superior performance, achieving up to
a 13.6% improvement in accuracy with fewer output tokens compared to other
distillation methods, offering a highly practical solution for efficient LLM
deployment.
☆ Implicit In-Context Learning: Evidence from Artificial Language Experiments
Humans acquire language through implicit learning, absorbing complex patterns
without explicit awareness. While LLMs demonstrate impressive linguistic
capabilities, it remains unclear whether they exhibit human-like pattern
recognition during in-context learning at inferencing level. We adapted three
classic artificial language learning experiments spanning morphology,
morphosyntax, and syntax to systematically evaluate implicit learning at
inferencing level in two state-of-the-art OpenAI models: gpt-4o and o3-mini.
Our results reveal linguistic domain-specific alignment between models and
human behaviors, o3-mini aligns better in morphology while both models align in
syntax.
☆ Multi-Task Learning for Extracting Menstrual Characteristics from Clinical Notes
Menstrual health is a critical yet often overlooked aspect of women's
healthcare. Despite its clinical relevance, detailed data on menstrual
characteristics is rarely available in structured medical records. To address
this gap, we propose a novel Natural Language Processing pipeline to extract
key menstrual cycle attributes -- dysmenorrhea, regularity, flow volume, and
intermenstrual bleeding. Our approach utilizes the GatorTron model with
Multi-Task Prompt-based Learning, enhanced by a hybrid retrieval preprocessing
step to identify relevant text segments. It out- performs baseline methods,
achieving an average F1-score of 90% across all menstrual characteristics,
despite being trained on fewer than 100 annotated clinical notes. The retrieval
step consistently improves performance across all approaches, allowing the
model to focus on the most relevant segments of lengthy clinical notes. These
results show that combining multi-task learning with retrieval improves
generalization and performance across menstrual charac- teristics, advancing
automated extraction from clinical notes and supporting women's health
research.
☆ TeleAntiFraud-28k: A Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
The detection of telecom fraud faces significant challenges due to the lack
of high-quality multimodal training data that integrates audio signals with
reasoning-oriented textual analysis. To address this gap, we present
TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset
specifically designed for automated telecom fraud analysis. Our dataset is
constructed through three strategies: (1) Privacy-preserved text-truth sample
generation using automatically speech recognition (ASR)-transcribed call
recordings (with anonymized original audio), ensuring real-world consistency
through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via
large language model (LLM)-based self-instruction sampling on authentic ASR
outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that
simulates emerging fraud tactics through predefined communication scenarios and
fraud typologies. The generated dataset contains 28,511 rigorously processed
speech-text pairs, complete with detailed annotations for fraud reasoning. The
dataset is divided into three tasks: scenario classification, fraud detection,
fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a
standardized evaluation benchmark comprising proportionally sampled instances
from the dataset, to facilitate systematic testing of model performance on
telecom fraud detection tasks. We also contribute a production-optimized
supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while
open-sourcing the data processing framework to enable community-driven dataset
expansion. This work establishes a foundational framework for multimodal
anti-fraud research while addressing critical challenges in data privacy and
scenario diversity. The project will be released at
https://github.com/JimmyMa99/TeleAntiFraud.
☆ Grounding Agent Reasoning in Image Schemas: A Neurosymbolic Approach to Embodied Cognition
Despite advances in embodied AI, agent reasoning systems still struggle to
capture the fundamental conceptual structures that humans naturally use to
understand and interact with their environment. To address this, we propose a
novel framework that bridges embodied cognition theory and agent systems by
leveraging a formal characterization of image schemas, which are defined as
recurring patterns of sensorimotor experience that structure human cognition.
By customizing LLMs to translate natural language descriptions into formal
representations based on these sensorimotor patterns, we will be able to create
a neurosymbolic system that grounds the agent's understanding in fundamental
conceptual structures. We argue that such an approach enhances both efficiency
and interpretability while enabling more intuitive human-agent interactions
through shared embodied understanding.
☆ Is LLM the Silver Bullet to Low-Resource Languages Machine Translation?
Yewei Song, Lujun Li, Cedric Lothritz, Saad Ezzini, Lama Sleem, Niccolo Gentile, Radu State, Tegawendé F. Bissyandé, Jacques Klein
Low-Resource Languages (LRLs) present significant challenges in natural
language processing due to their limited linguistic resources and
underrepresentation in standard datasets. While recent advancements in Large
Language Models (LLMs) and Neural Machine Translation (NMT) have substantially
improved translation capabilities for high-resource languages, performance
disparities persist for LRLs, particularly impacting privacy-sensitive and
resource-constrained scenarios. This paper systematically evaluates the
limitations of current LLMs across 200 languages using benchmarks such as
FLORES-200. We also explore alternative data sources, including news articles
and bilingual dictionaries, and demonstrate how knowledge distillation from
large pre-trained models can significantly improve smaller LRL translations.
Additionally, we investigate various fine-tuning strategies, revealing that
incremental enhancements markedly reduce performance gaps on smaller LLMs.
☆ Artificial Conversations, Real Results: Fostering Language Detection with Synthetic Data
Collecting high-quality training data is essential for fine-tuning Large
Language Models (LLMs). However, acquiring such data is often costly and
time-consuming, especially for non-English languages such as Italian. Recently,
researchers have begun to explore the use of LLMs to generate synthetic
datasets as a viable alternative. This study proposes a pipeline for generating
synthetic data and a comprehensive approach for investigating the factors that
influence the validity of synthetic data generated by LLMs by examining how
model performance is affected by metrics such as prompt strategy, text length
and target position in a specific task, i.e. inclusive language detection in
Italian job advertisements. Our results show that, in most cases and across
different metrics, the fine-tuned models trained on synthetic data consistently
outperformed other models on both real and synthetic test datasets. The study
discusses the practical implications and limitations of using synthetic data
for language detection tasks with LLMs.
☆ Crossing Boundaries: Leveraging Semantic Divergences to Explore Cultural Novelty in Cooking Recipes
Novelty modeling and detection is a core topic in Natural Language Processing
(NLP), central to numerous tasks such as recommender systems and automatic
summarization. It involves identifying pieces of text that deviate in some way
from previously known information. However, novelty is also a crucial
determinant of the unique perception of relevance and quality of an experience,
as it rests upon each individual's understanding of the world. Social factors,
particularly cultural background, profoundly influence perceptions of novelty
and innovation. Cultural novelty arises from differences in salience and
novelty as shaped by the distance between distinct communities. While cultural
diversity has garnered increasing attention in artificial intelligence (AI),
the lack of robust metrics for quantifying cultural novelty hinders a deeper
understanding of these divergences. This gap limits quantifying and
understanding cultural differences within computational frameworks. To address
this, we propose an interdisciplinary framework that integrates knowledge from
sociology and management. Central to our approach is GlobalFusion, a novel
dataset comprising 500 dishes and approximately 100,000 cooking recipes
capturing cultural adaptation from over 150 countries. By introducing a set of
Jensen-Shannon Divergence metrics for novelty, we leverage this dataset to
analyze textual divergences when recipes from one community are modified by
another with a different cultural background. The results reveal significant
correlations between our cultural novelty metrics and established cultural
measures based on linguistic, religious, and geographical distances. Our
findings highlight the potential of our framework to advance the understanding
and measurement of cultural diversity in AI.
☆ You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation
The goal of translation, be it by human or by machine, is, given some text in
a source language, to produce text in a target language that simultaneously 1)
preserves the meaning of the source text and 2) achieves natural expression in
the target language. However, researchers in the machine translation community
usually assess translations using a single score intended to capture semantic
accuracy and the naturalness of the output simultaneously. In this paper, we
build on recent advances in information theory to mathematically prove and
empirically demonstrate that such single-score summaries do not and cannot give
the complete picture of a system's true performance. Concretely, we prove that
a tradeoff exists between accuracy and naturalness and demonstrate it by
evaluating the submissions to the WMT24 shared task. Our findings help explain
well-known empirical phenomena, such as the observation that optimizing
translation systems for a specific accuracy metric (like BLEU) initially
improves the system's naturalness, while ``overfitting'' the system to the
metric can significantly degrade its naturalness. Thus, we advocate for a
change in how translations are evaluated: rather than comparing systems using a
single number, they should be compared on an accuracy-naturalness plane.
☆ Comparing representations of long clinical texts for the task of patient note-identification
In this paper, we address the challenge of patient-note identification, which
involves accurately matching an anonymized clinical note to its corresponding
patient, represented by a set of related notes. This task has broad
applications, including duplicate records detection and patient similarity
analysis, which require robust patient-level representations. We explore
various embedding methods, including Hierarchical Attention Networks (HAN),
three-level Hierarchical Transformer Networks (HTN), LongFormer, and advanced
BERT-based models, focusing on their ability to process mediumto-long clinical
texts effectively. Additionally, we evaluate different pooling strategies
(mean, max, and mean_max) for aggregating wordlevel embeddings into
patient-level representations and we examine the impact of sliding windows on
model performance. Our results indicate that BERT-based embeddings outperform
traditional and hierarchical models, particularly in processing lengthy
clinical notes and capturing nuanced patient representations. Among the pooling
strategies, mean_max pooling consistently yields the best results, highlighting
its ability to capture critical features from clinical notes. Furthermore, the
reproduction of our results on both MIMIC dataset and Necker hospital data
warehouse illustrates the generalizability of these approaches to real-world
applications, emphasizing the importance of both embedding methods and
aggregation strategies in optimizing patient-note identification and enhancing
patient-level modeling.
☆ BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation
Multimodal emotion recognition in conversation (MERC), the task of
identifying the emotion label for each utterance in a conversation, is vital
for developing empathetic machines. Current MLLM-based MERC studies focus
mainly on capturing the speaker's textual or vocal characteristics, but ignore
the significance of video-derived behavior information. Different from text and
audio inputs, learning videos with rich facial expression, body language and
posture, provides emotion trigger signals to the models for more accurate
emotion predictions. In this paper, we propose a novel behavior-aware
MLLM-based framework (BeMERC) to incorporate speaker's behaviors, including
subtle facial micro-expression, body language and posture, into a vanilla
MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics
during a conversation. Furthermore, BeMERC adopts a two-stage instruction
tuning strategy to extend the model to the conversations scenario for
end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC
achieves superior performance than the state-of-the-art methods on two
benchmark datasets, and also provides a detailed discussion on the significance
of video-derived behavior information in MERC.
☆ Model Hemorrhage and the Robustness Limits of Large Language Models
Large language models (LLMs) demonstrate strong performance across natural
language processing tasks, yet undergo significant performance degradation when
modified for deployment through quantization, pruning, or decoding strategy
adjustments. We define this phenomenon as model hemorrhage - performance
decline caused by parameter alterations and architectural changes. Through
systematic analysis of various LLM frameworks, we identify key vulnerability
patterns: layer expansion frequently disrupts attention mechanisms, compression
techniques induce information loss cascades, and decoding adjustments amplify
prediction divergences. Our investigation reveals transformer architectures
exhibit inherent robustness thresholds that determine hemorrhage severity
across modification types. We propose three mitigation strategies:
gradient-aware pruning preserves critical weight pathways, dynamic quantization
scaling maintains activation integrity, and decoding calibration aligns
generation trajectories with original model distributions. This work
establishes foundational metrics for evaluating model stability during
adaptation, providing practical guidelines for maintaining performance while
enabling efficient LLM deployment. Our findings advance understanding of neural
network resilience under architectural transformations, particularly for
large-scale language models.
comment: 33 pages, 18 figures
☆ Entropy-Based Adaptive Weighting for Self-Training
The mathematical problem-solving capabilities of large language models have
become a focal point of research, with growing interests in leveraging
self-generated reasoning paths as a promising way to refine and enhance these
models. These paths capture step-by-step logical processes while requiring only
the correct answer for supervision. The self-training method has been shown to
be effective in reasoning tasks while eliminating the need for external models
and manual annotations. However, optimizing the use of self-generated data for
model training remains an open challenge. In this work, we propose
Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive
weighting strategy designed to prioritize uncertain data during self-training.
Specifically, EAST employs a mapping function with a tunable parameter that
controls the sharpness of the weighting, assigning higher weights to data where
the model exhibits greater uncertainty. This approach guides the model to focus
on more informative and challenging examples, thereby enhancing its reasoning
ability. We evaluate our approach on GSM8K and MATH benchmarks. Empirical
results show that, while the vanilla method yields virtually no improvement
(0%) on MATH, EAST achieves around a 1% gain over backbone model. On GSM8K,
EAST attains a further 1-2% performance boost compared to the vanilla method.
☆ Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset ACL 2025
Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery
The performance and usability of Large-Language Models (LLMs) are driving
their use in explanation generation tasks. However, despite their widespread
adoption, LLM explanations have been found to be unreliable, making it
difficult for users to distinguish good from bad explanations. To address this
issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of
26k explanations, written and later quality-annotated using the rubric by both
humans and six open- and closed-source LLMs. The CUBE dataset focuses on two
reasoning and two language tasks, providing the necessary diversity for us to
effectively test our proposed rubric. Using Rubrik, we find that explanations
are influenced by both task and perceived difficulty. Low quality stems
primarily from a lack of conciseness in LLM-generated explanations, rather than
cohesion and word choice. The full dataset, rubric, and code will be made
available upon acceptance.
comment: 9 main pages (21 appendix pages), 7 figures, submitted to ACL 2025
☆ Better wit than wealth: Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by
retrieving relevant documents from external sources and incorporating them into
the context. While it improves reliability by providing factual texts, it
significantly increases inference costs as context length grows and introduces
challenging issue of RAG hallucination, primarily caused by the lack of
corresponding parametric knowledge in LLMs. An efficient solution is to enhance
the knowledge of LLMs at test-time. Parametric RAG (PRAG) addresses this by
embedding document into LLMs parameters to perform test-time knowledge
enhancement, effectively reducing inference costs through offline training.
However, its high training and storage costs, along with limited generalization
ability, significantly restrict its practical adoption. To address these
challenges, we propose Dynamic Parametric RAG (DyPRAG), a novel framework that
leverages a lightweight parameter translator model to efficiently convert
documents into parametric knowledge. DyPRAG not only reduces inference,
training, and storage costs but also dynamically generates parametric
knowledge, seamlessly enhancing the knowledge of LLMs and resolving knowledge
conflicts in a plug-and-play manner at test-time. Extensive experiments on
multiple datasets demonstrate the effectiveness and generalization capabilities
of DyPRAG, offering a powerful and practical RAG paradigm which enables
superior knowledge fusion and mitigates RAG hallucination in real-world
applications. Our code is available at https://github.com/Trae1ounG/DyPRAG.
comment: preprint
☆ SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development
High-quality speech dialogue datasets are crucial for Speech-LLM development,
yet existing acquisition methods face significant limitations. Human recordings
incur high costs and privacy concerns, while synthetic approaches often lack
conversational authenticity. To address these challenges, we introduce
\textsc{SpeechDialogueFactory}, a production-ready framework for generating
natural speech dialogues efficiently. Our solution employs a comprehensive
pipeline including metadata generation, dialogue scripting,
paralinguistic-enriched utterance simulation, and natural speech synthesis with
voice cloning. Additionally, the system provides an interactive UI for detailed
sample inspection and a high-throughput batch synthesis mode. Evaluations show
that dialogues generated by our system achieve a quality comparable to human
recordings while significantly reducing production costs. We release our work
as an open-source toolkit, alongside example datasets available in English and
Chinese, empowering researchers and developers in Speech-LLM research and
development.
☆ Expanding RL with Verifiable Rewards Across Diverse Domains
Reinforcement learning (RL) with verifiable rewards (RLVR) has shown
promising results in mathematical reasoning and coding tasks where
well-structured reference answers are available. However, its applicability to
broader domains remains underexplored. In this work, we study the extension of
RLVR to more diverse domains such as medicine, chemistry, psychology, and
economics. We observe high agreement in binary judgments across different large
language models (LLMs) when objective reference answers exist, which challenges
the necessity of large-scale annotation for training domain-specific reward
models. To address the limitations of binary rewards when handling unstructured
reference answers, we further incorporate model-based soft scoring into RLVR to
improve its flexibility. Our experiments show that a distilled generative
reward model can serve as an effective cross-domain verifier, providing
reliable reward signals for RL without requiring domain-specific annotations.
By fine-tuning a base 7B model using various RL algorithms against our reward
model, we obtain policies that outperform state-of-the-art open-source aligned
LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large
margin, across domains in free-form answer settings. This also strengthens
RLVR's robustness and scalability, highlighting its potential for real-world
applications with noisy or weak labels.
☆ Did ChatGPT or Copilot use alter the style of internet news headlines? A time series regression analysis
The release of advanced Large Language Models (LLMs) such as ChatGPT and
Copilot is changing the way text is created and may influence the content that
we find on the web. This study investigated whether the release of these two
popular LLMs coincided with a change in writing style in headlines and links on
worldwide news websites. 175 NLP features were obtained for each text in a
dataset of 451 million headlines/links. An interrupted time series analysis was
applied for each of the 175 NLP features to evaluate whether there were any
statistically significant sustained changes after the release dates of ChatGPT
and/or Copilot. There were a total of 44 features that did not appear to have
any significant sustained change after the release of ChatGPT/Copilot. A total
of 91 other features did show significant change with ChatGPT and/or Copilot
although significance with earlier control LLM release dates (GPT-1/2/3,
Gopher) removed them from consideration. This initial analysis suggests these
language models may have had a limited impact on the style of individual news
headlines/links, with respect to only some NLP measures.
☆ Get the Agents Drunk: Memory Perturbations in Autonomous Agent-based Recommender Systems
Large language model-based agents are increasingly used in recommender
systems (Agent4RSs) to achieve personalized behavior modeling. Specifically,
Agent4RSs introduces memory mechanisms that enable the agents to autonomously
learn and self-evolve from real-world interactions. However, to the best of our
knowledge, how robust Agent4RSs are remains unexplored. As such, in this paper,
we propose the first work to attack Agent4RSs by perturbing agents' memories,
not only to uncover their limitations but also to enhance their security and
robustness, ensuring the development of safer and more reliable AI agents.
Given the security and privacy concerns, it is more practical to launch
attacks under a black-box setting, where the accurate knowledge of the victim
models cannot be easily obtained. Moreover, the practical attacks are often
stealthy to maximize the impact. To this end, we propose a novel practical
attack framework named DrunkAgent. DrunkAgent consists of a generation module,
a strategy module, and a surrogate module. The generation module aims to
produce effective and coherent adversarial textual triggers, which can be used
to achieve attack objectives such as promoting the target items. The strategy
module is designed to `get the target agents drunk' so that their memories
cannot be effectively updated during the interaction process. As such, the
triggers can play the best role. Both of the modules are optimized on the
surrogate module to improve the transferability and imperceptibility of the
attacks. By identifying and analyzing the vulnerabilities, our work provides
critical insights that pave the way for building safer and more resilient
Agent4RSs. Extensive experiments across various real-world datasets demonstrate
the effectiveness of DrunkAgent.
☆ Adaptive Layer-skipping in Pre-trained LLMs
Various layer-skipping methods have been proposed to accelerate token
generation in large language models (LLMs). However, they have overlooked a
fundamental question: How do computational demands vary across the generation
of different tokens? In this work, we introduce FlexiDepth, a method that
dynamically adjusts the number of Transformer layers used in text generation.
By incorporating a plug-in router and adapter, FlexiDepth enables adaptive
layer-skipping in LLMs without modifying their original parameters. Introducing
FlexiDepth to Llama-3-8B model achieves layer skipping of 8 layers out of 32,
and meanwhile maintains the full 100\% benchmark performance. Experimental
results with FlexiDepth demonstrate that computational demands in LLMs
significantly vary based on token type. Specifically, generating repetitive
tokens or fixed phrases requires fewer layers, whereas producing tokens
involving computation or high uncertainty requires more layers. Interestingly,
this adaptive allocation pattern aligns with human intuition. To advance
research in this area, we open sourced FlexiDepth and a dataset documenting
FlexiDepth's layer allocation patterns for future exploration.
☆ WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization
In this study, we take a closer look at how Winograd schema challenges can be
used to evaluate common sense reasoning in LLMs. Specifically, we evaluate
generative models of different sizes on the popular WinoGrande benchmark. We
release WinoWhat, a new corpus, in which each instance of the WinoGrande
validation set is paraphrased. Additionally, we evaluate the performance on the
challenge across five common sense knowledge categories, giving more
fine-grained insights on what types of knowledge are more challenging for LLMs.
Surprisingly, all models perform significantly worse on WinoWhat, implying that
LLM reasoning capabilities are overestimated on WinoGrande. To verify whether
this is an effect of benchmark memorization, we match benchmark instances to
LLM trainingdata and create two test-suites. We observe that memorization has a
minimal effect on model performance on WinoGrande.
☆ CONGRAD:Conflicting Gradient Filtering for Multilingual Preference Alignment
Jiangnan Li, Thuy-Trang Vu, Christian Herold, Amirhossein Tebbifakhr, Shahram Khadivi, Gholamreza Haffari
Naive joint training of large language models (LLMs) for multilingual
preference alignment can suffer from negative interference. This is a known
issue in multilingual training, where conflicting objectives degrade overall
performance. However, the impact of this phenomenon in the context of
multilingual preference alignment remains largely underexplored. To address
this issue, we propose CONGRAD, a scalable and effective filtering method that
selects high-quality preference samples with minimal gradient conflicts across
languages. Our method leverages gradient surgery to retain samples aligned with
an aggregated multilingual update direction. Additionally, we incorporate a
sublinear gradient compression strategy that reduces memory overhead during
gradient accumulation. We integrate CONGRAD into self-rewarding framework and
evaluate on LLaMA3-8B and Gemma2-2B across 10 languages. Results show that
CONGRAD consistently outperforms strong baselines in both seen and unseen
languages, with minimal alignment tax.
☆ Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic
capabilities, achieving impressive performance in various tasks such as image
recognition and object localization. However, their effectiveness in
fine-grained tasks remains an open question. In everyday scenarios, individuals
encountering design materials, such as magazines, typography tutorials,
research papers, or branding content, may wish to identify aesthetically
pleasing fonts used in the text. Given their multimodal capabilities and free
accessibility, many VLMs are often considered potential tools for font
recognition. This raises a fundamental question: Do VLMs truly possess the
capability to recognize fonts? To investigate this, we introduce the Font
Recognition Benchmark (FRB), a compact and well-structured dataset comprising
15 commonly used fonts. FRB includes two versions: (i) an easy version, where
10 sentences are rendered in different fonts, and (ii) a hard version, where
each text sample consists of the names of the 15 fonts themselves, introducing
a stroop effect that challenges model perception. Through extensive evaluation
of various VLMs on font recognition tasks, we arrive at the following key
findings: (i) Current VLMs exhibit limited font recognition capabilities, with
many state-of-the-art models failing to achieve satisfactory performance. (ii)
Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits
in improving font recognition accuracy across different VLMs. (iii) Attention
analysis sheds light on the inherent limitations of VLMs in capturing semantic
features.
☆ Towards a cognitive architecture to enable natural language interaction in co-constructive task learning
This research addresses the question, which characteristics a cognitive
architecture must have to leverage the benefits of natural language in
Co-Constructive Task Learning (CCTL). To provide context, we first discuss
Interactive Task Learning (ITL), the mechanisms of the human memory system, and
the significance of natural language and multi-modality. Next, we examine the
current state of cognitive architectures, analyzing their capabilities to
inform a concept of CCTL grounded in multiple sources. We then integrate
insights from various research domains to develop a unified framework. Finally,
we conclude by identifying the remaining challenges and requirements necessary
to achieve CCTL in Human-Robot Interaction (HRI).
comment: 8 pages, 5 figures, submitted to: IEEE RO-MAN 2025
☆ Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model
Short-video platforms have gained immense popularity, captivating the
interest of millions, if not billions, of users globally. Recently, researchers
have highlighted the significance of analyzing the propagation of short-videos,
which typically involves discovering commercial values, public opinions, user
behaviors, etc. This paper proposes a new Short-video Propagation Influence
Rating (SPIR) task and aims to promote SPIR from both the dataset and method
perspectives. First, we propose a new Cross-platform Short-Video (XS-Video)
dataset, which aims to provide a large-scale and real-world short-video
propagation network across various platforms to facilitate the research on
short-video propagation. Our XS-Video dataset includes 117,720 videos, 381,926
samples, and 535 topics across 5 biggest Chinese platforms, annotated with the
propagation influence from level 0 to 9. To the best of our knowledge, this is
the first large-scale short-video dataset that contains cross-platform data or
provides all of the views, likes, shares, collects, fans, comments, and comment
content. Second, we propose a Large Graph Model (LGM) named NetGPT, based on a
novel three-stage training mechanism, to bridge heterogeneous graph-structured
data with the powerful reasoning ability and knowledge of Large Language Models
(LLMs). Our NetGPT can comprehend and analyze the short-video propagation
graph, enabling it to predict the long-term propagation influence of
short-videos. Comprehensive experimental results evaluated by both
classification and regression metrics on our XS-Video dataset indicate the
superiority of our method for SPIR.
☆ LANID: LLM-assisted New Intent Discovery LREC
Task-oriented Dialogue Systems (TODS) often face the challenge of
encountering new intents. New Intent Discovery (NID) is a crucial task that
aims to identify these novel intents while maintaining the capability to
recognize existing ones. Previous efforts to adapt TODS to new intents have
struggled with inadequate semantic representation or have depended on external
knowledge, which is often not scalable or flexible. Recently, Large Language
Models (LLMs) have demonstrated strong zero-shot capabilities; however, their
scale can be impractical for real-world applications that involve extensive
queries. To address the limitations of existing NID methods by leveraging LLMs,
we propose LANID, a framework that enhances the semantic representation of
lightweight NID encoders with the guidance of LLMs. Specifically, LANID employs
the $K$-nearest neighbors and Density-Based Spatial Clustering of Applications
with Noise (DBSCAN) algorithms to sample selective utterance pairs from the
training set. It then queries an LLM to ascertain the relationships between
these pairs. The data produced from this process is utilized to design a
contrastive fine-tuning task, which is then used to train a small encoder with
a contrastive triplet loss. Our experimental results demonstrate the efficacy
of the proposed method across three distinct NID datasets, surpassing strong
baselines in both unsupervised and semi-supervised settings. Our code is
available at https://github.com/floatSDSDS/LANID.
comment: Published in LREC-COLING 2024
☆ AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization CVPR 2025
Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, Maosong Sun, Yang Liu
Recently, model merging methods have demonstrated powerful strengths in
combining abilities on various tasks from multiple Large Language Models
(LLMs). While previous model merging methods mainly focus on merging
homogeneous models with identical architecture, they meet challenges when
dealing with Multimodal Large Language Models (MLLMs) with inherent
heterogeneous property, including differences in model architecture and the
asymmetry in the parameter space. In this work, we propose AdaMMS, a novel
model merging method tailored for heterogeneous MLLMs. Our method tackles the
challenges in three steps: mapping, merging and searching. Specifically, we
first design mapping function between models to apply model merging on MLLMs
with different architecture. Then we apply linear interpolation on model
weights to actively adapt the asymmetry in the heterogeneous MLLMs. Finally in
the hyper-parameter searching step, we propose an unsupervised hyper-parameter
selection method for model merging. As the first model merging method capable
of merging heterogeneous MLLMs without labeled data, extensive experiments on
various model combinations demonstrated that AdaMMS outperforms previous model
merging methods on various vision-language benchmarks.
comment: CVPR 2025
☆ KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language CVPR
The recent emergence of Large Vision-Language Models(VLMs) has resulted in a
variety of different benchmarks for evaluating such models. Despite this, we
observe that most existing evaluation methods suffer from the fact that they
either require the model to choose from pre-determined responses, sacrificing
open-endedness, or evaluate responses using a judge model, resulting in
subjective and unreliable evaluation. In addition, we observe a lack of
benchmarks for VLMs in the Korean language, which are necessary as a separate
metric from more common English language benchmarks, as the performance of
generative language models can differ significantly based on the language being
used. Therefore, we present KOFFVQA, a general-purpose free-form visual
question answering benchmark in the Korean language for the evaluation of VLMs.
Our benchmark consists of 275 carefully crafted questions each paired with an
image and grading criteria covering 10 different aspects of VLM performance.
The grading criteria eliminate the problem of unreliability by allowing the
judge model to grade each response based on a pre-determined set of rules. By
defining the evaluation criteria in an objective manner, even a small
open-source model can be used to evaluate models on our benchmark reliably. In
addition to evaluating a large number of existing VLMs on our benchmark, we
also experimentally verify that our method of using pre-existing grading
criteria for evaluation is much more reliable than existing methods. Our
evaluation code is available at https://github.com/maum-ai/KOFFVQA
comment: Accepted to CVPRW 2025, Workshop on Benchmarking and Expanding AI
Multimodal Approaches
☆ Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models
Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki
Instruction tuning is crucial for enabling Large Language Models (LLMs) to
solve real-world tasks. Prior work has shown the effectiveness of
instruction-tuning data synthesized solely from LLMs, raising a fundamental
question: Do we still need human-originated signals for instruction tuning?
This work answers the question affirmatively: we build state-of-the-art
instruction-tuning datasets sourced from human-written instructions, by simply
pairing them with LLM-generated responses. LLMs fine-tuned on our datasets
consistently outperform those fine-tuned on existing ones. Our data
construction approach can be easily adapted to other languages; we build
datasets for Japanese and confirm that LLMs tuned with our data reach
state-of-the-art performance. Analyses suggest that instruction-tuning in a new
language allows LLMs to follow instructions, while the tuned models exhibit a
notable lack of culture-specific knowledge in that language. The datasets and
fine-tuned models will be publicly available. Our datasets, synthesized with
open-weight LLMs, are openly distributed under permissive licenses, allowing
for diverse use cases.
comment: 15 pages, 5 figures
☆ Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions
This study systematically analyzes geopolitical bias across 11 prominent
Large Language Models (LLMs) by examining their responses to seven critical
topics in U.S.-China relations. Utilizing a bilingual (English and Chinese) and
dual-framing (affirmative and reverse) methodology, we generated 19,712 prompts
designed to detect ideological leanings in model outputs. Responses were
quantitatively assessed on a normalized scale from -2 (strongly Pro-China) to
+2 (strongly Pro-U.S.) and categorized according to stance, neutrality, and
refusal rates. The findings demonstrate significant and consistent ideological
alignments correlated with the LLMs' geographic origins; U.S.-based models
predominantly favored Pro-U.S. stances, while Chinese-origin models exhibited
pronounced Pro-China biases. Notably, language and prompt framing substantially
influenced model responses, with several LLMs exhibiting stance reversals based
on prompt polarity or linguistic context. Additionally, we introduced
comprehensive metrics to evaluate response consistency across languages and
framing conditions, identifying variability and vulnerabilities in model
behaviors. These results offer practical insights that can guide organizations
and individuals in selecting LLMs best aligned with their operational
priorities and geopolitical considerations, underscoring the importance of
careful model evaluation in politically sensitive applications. Furthermore,
the research highlights specific prompt structures and linguistic variations
that can strategically trigger distinct responses from models, revealing
methods for effectively navigating and influencing LLM outputs.
comment: Preliminary version,20 pages, 10 figures, 1 table
☆ MKA: Leveraging Cross-Lingual Consensus for Model Abstention ICLR 2025
Reliability of LLMs is questionable even as they get better at more tasks. A
wider adoption of LLMs is contingent on whether they are usably factual. And if
they are not, on whether they can properly calibrate their confidence in their
responses. This work focuses on utilizing the multilingual knowledge of an LLM
to inform its decision to abstain or answer when prompted. We develop a
multilingual pipeline to calibrate the model's confidence and let it abstain
when uncertain. We run several multilingual models through the pipeline to
profile them across different languages. We find that the performance of the
pipeline varies by model and language, but that in general they benefit from
it. This is evidenced by the accuracy improvement of $71.2\%$ for Bengali over
a baseline performance without the pipeline. Even a high-resource language like
English sees a $15.5\%$ improvement. These results hint at possible further
improvements.
comment: To appear in Building Trust Workshop at ICLR 2025
☆ Large Language Models Pass the Turing Test
We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two
randomised, controlled, and pre-registered Turing tests on independent
populations. Participants had 5 minute conversations simultaneously with
another human participant and one of these systems before judging which
conversational partner they thought was human. When prompted to adopt a
humanlike persona, GPT-4.5 was judged to be the human 73% of the time:
significantly more often than interrogators selected the real human
participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of
the time -- not significantly more or less often than the humans they were
being compared to -- while baseline models (ELIZA and GPT-4o) achieved win
rates significantly below chance (23% and 21% respectively). The results
constitute the first empirical evidence that any artificial system passes a
standard three-party Turing test. The results have implications for debates
about what kind of intelligence is exhibited by Large Language Models (LLMs),
and the social and economic impacts these systems are likely to have.
☆ WHERE and WHICH: Iterative Debate for Biomedical Synthetic Data Augmentation
In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation
Extraction, Named Entity Recognition, and Text Classification, the scarcity of
high-quality data remains a significant challenge. This limitation poisons
large language models to correctly understand relationships between biological
entities, such as molecules and diseases, or drug interactions, and further
results in potential misinterpretation of biomedical documents. To address this
issue, current approaches generally adopt the Synthetic Data Augmentation
method which involves similarity computation followed by word replacement, but
counterfactual data are usually generated. As a result, these methods disrupt
meaningful word sets or produce sentences with meanings that deviate
substantially from the original context, rendering them ineffective in
improving model performance. To this end, this paper proposes a
biomedical-dedicated rationale-based synthetic data augmentation method. Beyond
the naive lexicon similarity, specific bio-relation similarity is measured to
hold the augmented instance having a strong correlation with bio-relation
instead of simply increasing the diversity of augmented data. Moreover, a
multi-agents-involved reflection mechanism helps the model iteratively
distinguish different usage of similar entities to escape falling into the
mis-replace trap. We evaluate our method on the BLURB and BigBIO benchmark,
which includes 9 common datasets spanning four major BioNLP tasks. Our
experimental results demonstrate consistent performance improvements across all
tasks, highlighting the effectiveness of our approach in addressing the
challenges associated with data scarcity and enhancing the overall performance
of biomedical NLP models.
☆ CrossFormer: Cross-Segment Semantic Fusion for Document Segmentation
Text semantic segmentation involves partitioning a document into multiple
paragraphs with continuous semantics based on the subject matter, contextual
information, and document structure. Traditional approaches have typically
relied on preprocessing documents into segments to address input length
constraints, resulting in the loss of critical semantic information across
segments. To address this, we present CrossFormer, a transformer-based model
featuring a novel cross-segment fusion module that dynamically models latent
semantic dependencies across document segments, substantially elevating
segmentation accuracy. Additionally, CrossFormer can replace rule-based chunk
methods within the Retrieval-Augmented Generation (RAG) system, producing more
semantically coherent chunks that enhance its efficacy. Comprehensive
evaluations confirm CrossFormer's state-of-the-art performance on public text
semantic segmentation datasets, alongside considerable gains on RAG benchmarks.
comment: 10 pages, 4 figures
♻ ☆ EQ-Negotiator: An Emotion-Reasoning LLM Agent in Credit Dialogues
While large language model (LLM)-based chatbots have been applied for
effective engagement in credit dialogues, their capacity for dynamic emotional
expression remains limited. Current agents primarily rely on passive empathy
rather than affective reasoning. For instance, when faced with persistent
client negativity, the agent should employ strategic emotional adaptation by
expressing measured anger to discourage counterproductive behavior and guide
the conversation toward resolution. This context-aware emotional modulation is
essential for imitating the nuanced decision-making of human negotiators. This
paper introduces an EQ-negotiator that combines emotion sensing from
pre-trained language models (PLMs) with emotional reasoning based on Game
Theory and Hidden Markov Models. It takes into account both the current and
historical emotions of the client to better manage and address negative
emotions during interactions. By fine-tuning pre-trained language models (PLMs)
on public emotion datasets and validating them on the credit dialogue datasets,
our approach enables LLM-based agents to effectively capture shifts in client
emotions and dynamically adjust their response tone based on our emotion
decision policies in real-world financial negotiations. This EQ-negotiator can
also help credit agencies foster positive client relationships, enhancing
satisfaction in credit services.
♻ ☆ ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Action models are essential for enabling autonomous agents to perform complex
tasks. However, training large action models remains challenging due to the
diversity of agent environments and the complexity of agentic data. Despite
growing interest, existing infrastructure provides limited support for
scalable, agent-specific fine-tuning. We present ActionStudio, a lightweight
and extensible data and training framework designed for large action models.
ActionStudio unifies heterogeneous agent trajectories through a standardized
format, supports diverse training paradigms including LoRA, full fine-tuning,
and distributed setups, and integrates robust preprocessing and verification
tools. We validate its effectiveness across both public and realistic industry
benchmarks, demonstrating strong performance and practical scalability. We
open-sourced code and data at https://github.com/SalesforceAIResearch/xLAM to
facilitate research in the community.
comment: 15 pages; large action models; xLAM
♻ ☆ How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching?
The classic text preprocessing pipeline, comprising Tokenisation,
Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been
implemented in many systems for syntactic ontology matching (OM). However, the
lack of standardisation in text preprocessing creates diversity in mapping
results. In this paper we investigate the effect of the text preprocessing
pipeline on syntactic OM in 8 Ontology Alignment Evaluation Initiative (OAEI)
tracks with 49 distinct alignments. We find that Phase 1 text preprocessing
(Tokenisation and Normalisation) is more effective than Phase 2 text
preprocessing (Stop Words Removal and Stemming/Lemmatisation). To repair the
unwanted false mappings caused by Phase 2 text preprocessing, we propose a
novel context-based pipeline repair approach that employs a post hoc check to
find common words that cause false mappings. These words are stored in a
reserved word set and applied in text preprocessing. The experimental results
show that our approach improves the matching correctness and the overall
matching performance. We then consider the broader integration of the classic
text preprocessing pipeline with modern large language models (LLMs) for OM. We
recommend that (1) the text preprocessing pipeline be injected via function
calling into LLMs to avoid the tendency towards unstable true mappings produced
by LLM prompting; or (2) LLMs be used to repair non-existent and
counter-intuitive false mappings generated by the text preprocessing pipeline.
comment: 12 pages, 11 figures, 4 tables
♻ ☆ Evil twins are not that evil: Qualitative insights into machine-generated prompts
It has been widely observed that language models (LMs) respond in predictable
ways to algorithmically generated prompts that are seemingly unintelligible.
This is both a sign that we lack a full understanding of how LMs work, and a
practical challenge, because opaqueness can be exploited for harmful uses of
LMs, such as jailbreaking. We present the first thorough analysis of opaque
machine-generated prompts, or autoprompts, pertaining to 6 LMs of different
sizes and families. We find that machine-generated prompts are characterized by
a last token that is often intelligible and strongly affects the generation. A
small but consistent proportion of the previous tokens are prunable, probably
appearing in the prompt as a by-product of the fact that the optimization
process fixes the number of tokens. The remaining tokens fall into two
categories: filler tokens, which can be replaced with semantically unrelated
substitutes, and keywords, that tend to have at least a loose semantic relation
with the generation, although they do not engage in well-formed syntactic
relations with it. Additionally, human experts can reliably identify the most
influential tokens in an autoprompt a posteriori, suggesting these prompts are
not entirely opaque. Finally, some of the ablations we applied to autoprompts
yield similar effects in natural language inputs, suggesting that autoprompts
emerge naturally from the way LMs process linguistic inputs in general.
♻ ☆ Surgical Action Planning with Large Language Models
In robot-assisted minimally invasive surgery, we introduce the Surgical
Action Planning (SAP) task, which generates future action plans from visual
inputs to address the absence of intraoperative predictive planning in current
intelligent applications. SAP shows great potential for enhancing
intraoperative guidance and automating procedures. However, it faces challenges
such as understanding instrument-action relationships and tracking surgical
progress. Large Language Models (LLMs) show promise in understanding surgical
video content but remain underexplored for predictive decision-making in SAP,
as they focus mainly on retrospective analysis. Challenges like data privacy,
computational demands, and modality-specific constraints further highlight
significant research gaps. To tackle these challenges, we introduce LLM-SAP, a
Large Language Models-based Surgical Action Planning framework that predicts
future actions and generates text responses by interpreting natural language
prompts of surgical goals. The text responses potentially support surgical
education, intraoperative decision-making, procedure documentation, and skill
analysis. LLM-SAP integrates two novel modules: the Near-History Focus Memory
Module (NHF-MM) for modeling historical states and the prompts factory for
action planning. We evaluate LLM-SAP on our constructed CholecT50-SAP dataset
using models like Qwen2.5 and Qwen2-VL, demonstrating its effectiveness in
next-action prediction. Pre-trained LLMs are tested in a zero-shot setting, and
supervised fine-tuning (SFT) with LoRA is implemented. Our experiments show
that Qwen2.5-72B-SFT surpasses Qwen2.5-72B with a 19.3% higher accuracy.
comment: 10 pages,4 figures
♻ ☆ Cascade Reward Sampling for Efficient Decoding-Time Alignment
Aligning large language models (LLMs) with human preferences is essential for
their applications. Recently, decoding-time alignment has emerged as an
effective plug-and-play technique that avoids fine-tuning model parameters.
This approach retains the general utility of pretrained LLMs but often suffers
from significant inefficiencies during decoding, primarily due to wasted token
generation and excessive reward evaluations. To address these challenges, we
introduce Cascade Reward Sampling (CARDS) to resolve both efficiency
bottlenecks in decoding-time alignment. Specifically, we develop a
segment-level rejection sampling algorithm that minimizes redundant
computations of both LLMs and reward models (RMs). Central to CARDS is an
uncertainty-based segmentation mechanism, which ensures the accuracy of RMs
evaluations on incomplete segments. Furthermore, we provide a detailed analysis
of reward scores on segments to elucidate the improved alignment performance.
Experimental results demonstrate that CARDS significantly improves decoding
efficiency, alignment quality, and general utility compared to existing
decoding-time alignment methods, achieving approximately a 70% reduction in
decoding time and over 90% win-ties in utility and safety benchmarks.
♻ ☆ ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ICLR 2025
Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, Huan Sun
The advancements of large language models (LLMs) have piqued growing interest
in developing LLM-based language agents to automate scientific discovery
end-to-end, which has sparked both excitement and skepticism about their true
capabilities. In this work, we call for rigorous assessment of agents on
individual tasks in a scientific workflow before making bold claims on
end-to-end automation. To this end, we present ScienceAgentBench, a new
benchmark for evaluating language agents for data-driven scientific discovery.
To ensure the scientific authenticity and real-world relevance of our
benchmark, we extract 102 tasks from 44 peer-reviewed publications in four
disciplines and engage nine subject matter experts to validate them. We unify
the target output for every task to a self-contained Python program file and
employ an array of evaluation metrics to examine the generated programs,
execution results, and costs. Each task goes through multiple rounds of manual
validation by annotators and subject matter experts to ensure its annotation
quality and scientific plausibility. We also propose two effective strategies
to mitigate data contamination concerns. Using ScienceAgentBench, we evaluate
five open-weight and proprietary LLMs, each with three frameworks: direct
prompting, OpenHands CodeAct, and self-debug. Given three attempts for each
task, the best-performing agent can only solve 32.4% of the tasks independently
and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI
o1-preview with direct prompting and self-debug, which can boost the
performance to 42.2%, demonstrating the effectiveness of increasing
inference-time compute but with more than 10 times the cost of other LLMs.
Still, our results underscore the limitations of current language agents in
generating code for data-driven discovery, let alone end-to-end automation for
scientific research.
comment: ICLR 2025. 60 pages
♻ ☆ Concept Navigation and Classification via Open-Source Large Language Model Processing
This paper presents a novel methodological framework for detecting and
classifying latent constructs, including frames, narratives, and topics, from
textual data using Open-Source Large Language Models (LLMs). The proposed
hybrid approach combines automated summarization with human-in-the-loop
validation to enhance the accuracy and interpretability of construct
identification. By employing iterative sampling coupled with expert refinement,
the framework guarantees methodological robustness and ensures conceptual
precision. Applied to diverse data sets, including AI policy debates, newspaper
articles on encryption, and the 20 Newsgroups data set, this approach
demonstrates its versatility in systematically analyzing complex political
discourses, media framing, and topic classification tasks.
comment: 36 pages, 1 figure, 5 tabels
♻ ☆ Continuous Speech Tokenizer in Text To Speech NAACL 2025
The fusion of speech and language in the era of large language models has
garnered significant attention. Discrete speech token is often utilized in
text-to-speech tasks for speech compression and portability, which is
convenient for joint training with text and have good compression efficiency.
However, we found that the discrete speech tokenizer still suffers from
information loss. Therefore, we propose a simple yet effective continuous
speech tokenizer named Cont-SPT, and a text-to-speech model based on continuous
speech tokens. Our results show that the speech language model based on the
continuous speech tokenizer has better continuity and higher estimated Mean
Opinion Scores (MoS). This enhancement is attributed to better information
preservation rate of the continuous speech tokenizer across both low and high
frequencies in the frequency domain. The code and resources for Cont-SPT can be
found in https://github.com/Yixing-Li/Continuous-Speech-Tokenizer
comment: NAACL 2025 Findings Poster
♻ ☆ Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
Alessandro Scirè, Andrei Stefan Bejgu, Simone Tedeschi, Karim Ghonim, Federico Martelli, Roberto Navigli
After the introduction of Large Language Models (LLMs), there have been
substantial improvements in the performance of Natural Language Generation
(NLG) tasks, including Text Summarization and Machine Translation. However,
LLMs still produce outputs containing hallucinations, that is, content not
grounded in factual information. Therefore, developing methods to assess the
factuality of LLMs has become urgent.
Indeed, resources for factuality evaluation have recently emerged. Although
challenging, these resources face one or more of the following limitations: (i)
they are tailored to a specific task or domain; (ii) they are limited in size,
thereby preventing the training of new factuality evaluators; (iii) they are
designed for simpler verification tasks, such as claim verification.
To address these issues, we introduce LLM-Oasis, to the best of our knowledge
the largest resource for training end-to-end factuality evaluators. LLM-Oasis
is constructed by extracting claims from Wikipedia, falsifying a subset of
these claims, and generating pairs of factual and unfactual texts. We then rely
on human annotators to both validate the quality of our dataset and to create a
gold standard test set for benchmarking factuality evaluation systems.
Our experiments demonstrate that LLM-Oasis presents a significant challenge
for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our
proposed end-to-end factuality evaluation task, highlighting its potential to
drive future research in the field.
comment: 15 pages. To be submitted to CL journal
♻ ☆ MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty NAACL 2025
Despite the massive advancements in large language models (LLMs), they still
suffer from producing plausible but incorrect responses. To improve the
reliability of LLMs, recent research has focused on uncertainty quantification
to predict whether a response is correct or not. However, most uncertainty
quantification methods have been evaluated on single-labeled questions, which
removes data uncertainty: the irreducible randomness often present in user
queries, which can arise from factors like multiple possible answers. This
limitation may cause uncertainty quantification results to be unreliable in
practical settings. In this paper, we investigate previous uncertainty
quantification methods under the presence of data uncertainty. Our
contributions are two-fold: 1) proposing a new Multi-Answer Question Answering
dataset, MAQA, consisting of world knowledge, mathematical reasoning, and
commonsense reasoning tasks to evaluate uncertainty quantification regarding
data uncertainty, and 2) assessing 5 uncertainty quantification methods of
diverse white- and black-box LLMs. Our findings show that previous methods
relatively struggle compared to single-answer settings, though this varies
depending on the task. Moreover, we observe that entropy- and consistency-based
methods effectively estimate model uncertainty, even in the presence of data
uncertainty. We believe these observations will guide future work on
uncertainty quantification in more realistic settings.
comment: Findings of NAACL 2025
♻ ☆ Banyan: Improved Representation Learning with Explicit Structure
We present Banyan, a model that efficiently learns semantic representations
by leveraging explicit hierarchical structure. While transformers excel at
scale, they struggle in low-resource settings. Conversely recent structured
models have shown promise as efficient learners, but lack performance. Banyan
bridges this gap with two key innovations: an entangled hierarchical tree
structure and diagonalized message passing, enabling it to outperform larger
transformer models with just 14 non-embedding parameters. It excels in
low-resource settings, offering a viable alternative for under-represented
languages and highlighting its potential for efficient, interpretable NLP in
resource-constrained environments.
comment: v2
♻ ☆ The Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions
A recent paper proposes Dynamic Tanh (DyT) as a drop-in replacement for layer
normalization (LN). Although the method is empirically well-motivated and
appealing from a practical point of view, it lacks a theoretical foundation. In
this work, we shed light on the mathematical relationship between layer
normalization and dynamic activation functions. In particular, we derive DyT
from LN and show that a well-defined approximation is needed to do so. By
dropping said approximation, an alternative activation function is obtained,
which we call Dynamic Inverse Square Root Unit (DyISRU). DyISRU is the exact
counterpart of layer normalization, and we demonstrate numerically that it
indeed resembles LN more accurately than DyT does.
comment: New title, renamed DyISRU, added missing parentheses in proof of
theorem 3, minor language corrections
♻ ☆ EmoVerse: Exploring Multimodal Large Language Models for Sentiment and Emotion Understanding
Sentiment and emotion understanding are essential to applications such as
human-computer interaction and depression detection. While Multimodal Large
Language Models (MLLMs) demonstrate robust general capabilities, they face
considerable challenges in the field of affective computing, particularly in
detecting subtle facial expressions and handling complex emotion-related tasks,
such as emotion reason inference and understanding emotions in long-context
scenarios. Furthermore, there is a lack of a unified MLLM that can effectively
handle both sentiment and emotion-related tasks. To address these challenges,
we explore multi-task training strategies for MLLMs in affective computing and
introduce Emotion Universe (EmoVerse), an MLLM designed to handle a broad
spectrum of sentiment and emotion-related tasks. In addition, EmoVerse is
capable of deeply analyzing the underlying causes of emotional states. We also
introduce the Affective Multitask (AMT) Dataset, which supports multimodal
sentiment analysis, multimodal emotion recognition, facial expression
recognition, emotion reason inference, and emotion cause-pair extraction tasks.
Extensive experiments demonstrate that EmoVerse outperforms existing methods,
achieving state-of-the-art results in sentiment and emotion-related tasks. The
code is available at https://github.com/liaolea/EmoVerse.
♻ ☆ TablePilot: Recommending Human-Preferred Tabular Data Analysis with Large Language Models
Tabular data analysis is crucial in many scenarios, yet efficiently
identifying the most relevant data analysis queries and results for a new table
remains a significant challenge. The complexity of tabular data, diverse
analytical operations, and the demand for high-quality analysis make the
process tedious. To address these challenges, we aim to recommend
query-code-result triplets tailored for new tables in tabular data analysis
workflows. In this paper, we present TablePilot, a pioneering tabular data
analysis framework leveraging large language models to autonomously generate
comprehensive and superior analytical results without relying on user profiles
or prior interactions. The framework incorporates key designs in analysis
preparation and analysis optimization to enhance accuracy. Additionally, we
propose Rec-Align, a novel method to further improve recommendation quality and
better align with human preferences. Experiments on DART, a dataset
specifically designed for comprehensive tabular data analysis recommendation,
demonstrate the effectiveness of our framework. Based on GPT-4o, the tuned
TablePilot achieves 77.0% top-5 recommendation recall. Human evaluations
further highlight its effectiveness in optimizing tabular data analysis
workflows.
♻ ☆ SwiftCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning
Dong Huang, Guangtao Zeng, Jianbo Dai, Meng Luo, Han Weng, Yuhao Qing, Heming Cui, Zhijiang Guo, Jie M. Zhang
As large language models (LLMs) play an increasingly important role in code
generation, enhancing both correctness and efficiency has become crucial.
Current methods primarily focus on correctness, often overlooking efficiency.
To address this gap, we introduce \dataset to improve both aspects by
fine-tuning LLMs on a high-quality dataset comprising correct and efficient
code samples. Our methodology involves leveraging multiple LLMs to generate
diverse candidate code solutions for various tasks across different programming
languages. We then evaluate these solutions by directly measuring their
execution time and memory usage through local execution. The code solution with
the lowest execution time and memory consumption is selected as the final
output for each task. Experimental results demonstrate significant improvements
when fine-tuning with \dataset. For instance, Qwen2.5-Coder-7B-Instruct's
pass@1 score increases from 44.8\% to 57.7\%, while the average execution time
for correct tasks decreases by 48.4\%. \dataset offers a scalable and effective
solution for advancing AI-driven code generation, benefiting both software
development and computational problem-solving. The source code of Effi-Code was
released in https://github.com/huangd1999/Effi-Code.
comment: Under Review
♻ ☆ Know "No'' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
While CLIP has significantly advanced multimodal understanding by bridging
vision and language, the inability to grasp negation - such as failing to
differentiate concepts like "parking" from "no parking" - poses substantial
challenges. By analyzing the data used in the public CLIP model's pre-training,
we posit this limitation stems from a lack of negation-inclusive data. To
address this, we introduce data generation pipelines that employ a large
language model (LLM) and a multimodal LLM to produce negation-inclusive
captions. Fine-tuning CLIP with data generated from our pipelines, we develop
NegationCLIP, which enhances negation awareness while preserving the
generality. Moreover, to enable a comprehensive evaluation of negation
understanding, we propose NegRefCOCOg-a benchmark tailored to test VLMs'
ability to interpret negation across diverse expressions and positions within a
sentence. Experiments on various CLIP architectures validate the effectiveness
of our data generation pipelines in enhancing CLIP's ability to perceive
negation accurately. Additionally, NegationCLIP's enhanced negation awareness
has practical applications across various multimodal tasks, demonstrated by
performance gains in text-to-image generation and referring image segmentation.
♻ ☆ DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
The rapid development of multilingual large language models (LLMs) highlights
the need for high-quality, diverse, and clean multilingual datasets. In this
paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a
large-scale multilingual corpus built using newly extracted Common Crawl data
and existing multilingual datasets. DCAD-2000 includes over 2,282 languages,
46.72TB of data, and 8.63 billion documents, spanning 155 high- and
medium-resource languages and 159 writing scripts. To overcome the limitations
of current data cleaning methods, which rely on manual heuristic thresholds, we
propose reframing data cleaning as an anomaly detection task. This dynamic
filtering approach significantly enhances data quality by identifying and
removing noisy or anomalous content. We evaluate the quality of DCAD-2000 on
the FineTask benchmark, demonstrating substantial improvements in multilingual
dataset quality and task performance.
♻ ☆ Training-Free Exponential Context Extension via Cascading KV Cache
The transformer's context window is vital for tasks such as few-shot learning
and conditional generation as it preserves previous tokens for active memory.
However, as the context lengths increase, the computational costs grow
quadratically, hindering the deployment of large language models (LLMs) in
real-world, long sequence scenarios. Although some recent key-value caching (KV
Cache) methods offer linear inference complexity, they naively manage the
stored context, prematurely evicting tokens and losing valuable information.
Moreover, they lack an optimized prefill/prompt stage strategy, resulting in
higher latency than even quadratic attention for realistic context sizes. In
response, we introduce a novel mechanism that leverages cascading sub-cache
buffers to selectively retain the most relevant tokens, enabling the model to
maintain longer context histories without increasing the cache size. Our
approach outperforms linear caching baselines across key benchmarks, including
streaming perplexity, question answering, book summarization, and passkey
retrieval, where it retains better retrieval accuracy at 1M tokens after four
doublings of the cache size of 65K. Additionally, our method reduces prefill
stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
These innovations not only enhance the computational efficiency of LLMs but
also pave the way for their effective deployment in resource-constrained
environments, enabling large-scale, real-time applications with significantly
reduced latency.
♻ ☆ MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models ICLR 2025
Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao
Interleaved multimodal comprehension and generation, enabling models to
produce and interpret both images and text in arbitrary sequences, have become
a pivotal area in multimodal learning. Despite significant advancements, the
evaluation of this capability remains insufficient. Existing benchmarks suffer
from limitations in data scale, scope, and evaluation depth, while current
evaluation metrics are often costly or biased, lacking in reliability for
practical applications. To address these challenges, we introduce MMIE, a
large-scale knowledge-intensive benchmark for evaluating interleaved multimodal
comprehension and generation in Large Vision-Language Models (LVLMs). MMIE
comprises 20K meticulously curated multimodal queries, spanning 3 categories,
12 fields, and 102 subfields, including mathematics, coding, physics,
literature, health, and arts. It supports both interleaved inputs and outputs,
offering a mix of multiple-choice and open-ended question formats to evaluate
diverse competencies. Moreover, we propose a reliable automated evaluation
metric, leveraging a scoring model fine-tuned with human-annotated data and
systematic evaluation criteria, aimed at reducing bias and improving evaluation
accuracy. Extensive experiments demonstrate the effectiveness of our benchmark
and metrics in providing a comprehensive evaluation of interleaved LVLMs.
Specifically, we evaluate eight LVLMs, revealing that even the best models show
significant room for improvement, with most achieving only moderate results. We
believe MMIE will drive further advancements in the development of interleaved
LVLMs. We publicly release our benchmark and code in
https://mmie-bench.github.io/.
comment: ICLR 2025 Oral
♻ ☆ Self-Vocabularizing Training for Neural Machine Translation NAACL
Past vocabulary learning techniques identify relevant vocabulary before
training, relying on statistical and entropy-based assumptions that largely
neglect the role of model training. Empirically, we observe that trained
translation models are induced to use a byte-pair encoding (BPE) vocabulary
subset distinct from the original BPE vocabulary, leading to performance
improvements when retrained with the induced vocabulary. In this paper, we
analyze this discrepancy in neural machine translation by examining vocabulary
and entropy shifts during self-training--where each iteration generates a
labeled dataset by pairing source sentences with the model's predictions to
define a new vocabulary. Building on these insights, we propose
self-vocabularizing training, an iterative method that self-selects a smaller,
more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we
find that deeper model architectures lead to both an increase in unique token
usage and a 6-8% reduction in vocabulary size.
comment: Accepted to NAACL SRW 2025