Computation and Language 150
☆ Inference-Time Hyper-Scaling with KV Cache Compression
Inference-time scaling trades efficiency for increased reasoning accuracy by
generating longer or more parallel sequences. However, in Transformer LLMs,
generation cost is bottlenecked by the size of the key-value (KV) cache, rather
than the number of generated tokens. Hence, we explore inference-time
hyper-scaling: by compressing the KV cache, we can generate more tokens within
the same compute budget and further improve the accuracy of scaled inference.
The success of this approach, however, hinges on the ability of compression
methods to preserve accuracy even at high compression ratios. To make
hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a
novel method for sparsifying KV caches that only requires 1K training steps to
achieve 8$\times$ compression, while maintaining better accuracy than
training-free sparse attention. Instead of prematurely discarding cached
tokens, DMS delays token eviction, implicitly merging representations and
preserving critical information. We demonstrate the effectiveness of
inference-time hyper-scaling with DMS on multiple families of LLMs, showing
that it boosts accuracy for comparable inference runtime and memory load. For
instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on
GPQA, and 9.6 on LiveCodeBench across compute budgets.
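To make the delayed-eviction idea concrete, the following is a minimal PyTorch sketch of eviction-by-merging for a KV cache: instead of dropping evicted entries outright, their keys and values are folded into the nearest retained slot. The uniform keep pattern, the nearest-slot merge rule, and the alpha weight are illustrative assumptions; DMS itself learns its eviction decisions during the short training phase described above.

import torch

def evict_with_merge(keys, values, keep_idx, evict_idx, alpha=0.5):
    # Toy delayed eviction: fold each evicted key/value into the closest
    # retained position as a running weighted average, so some of its
    # information survives compression (illustrative only).
    merged_k, merged_v = keys.clone(), values.clone()
    for e in evict_idx:
        target = min(keep_idx, key=lambda k: abs(k - e))
        merged_k[target] = alpha * merged_k[target] + (1 - alpha) * keys[e]
        merged_v[target] = alpha * merged_v[target] + (1 - alpha) * values[e]
    return merged_k[keep_idx], merged_v[keep_idx]

# 8x compression of a 64-token cache: keep every 8th position.
keys, values = torch.randn(64, 128), torch.randn(64, 128)
keep = list(range(0, 64, 8))
evict = [i for i in range(64) if i not in keep]
k_small, v_small = evict_with_merge(keys, values, keep, evict)
print(k_small.shape)  # torch.Size([8, 128])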
☆ Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Recent advancements in large language models (LLMs) have underscored their
vulnerability to safety alignment jailbreaks, particularly when subjected to
downstream fine-tuning. However, existing mitigation strategies primarily focus
on reactively addressing jailbreak incidents after safety guardrails have been
compromised, removing harmful gradients during fine-tuning, or continuously
reinforcing safety alignment throughout fine-tuning. As such, they tend to
overlook a critical upstream factor: the role of the original safety-alignment
data. This paper therefore investigates the degradation of safety guardrails
through the lens of representation similarity between upstream alignment
datasets and downstream fine-tuning tasks. Our experiments demonstrate that
high similarity between these datasets significantly weakens safety guardrails,
making models more susceptible to jailbreaks. Conversely, low similarity
between these two types of datasets yields substantially more robust models,
reducing the harmfulness score by up to 10.33%. By highlighting the importance
of upstream dataset design in the building of durable safety guardrails and
reducing real-world vulnerability to jailbreak attacks, these findings offer
actionable insights for fine-tuning service providers.
comment: Project Page: https://hsiung.cc/llm-similarity-risk/
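A minimal sketch of the dataset-similarity measurement this analysis revolves around, assuming mean-pooled sentence embeddings and cosine similarity; the authors' exact embedding model and aggregation may differ, and all-MiniLM-L6-v2 is only a convenient stand-in.

from sentence_transformers import SentenceTransformer
import numpy as np

def dataset_similarity(alignment_texts, finetune_texts,
                       model_name="sentence-transformers/all-MiniLM-L6-v2"):
    # Cosine similarity between the mean embeddings of two datasets,
    # used here as a rough proxy for representation similarity.
    model = SentenceTransformer(model_name)
    a = model.encode(alignment_texts, normalize_embeddings=True).mean(axis=0)
    b = model.encode(finetune_texts, normalize_embeddings=True).mean(axis=0)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

High values would flag fine-tuning sets that sit close to the upstream alignment data and, per the findings above, predict weaker guardrails.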
☆ Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
Language models serve as proxies for human preference judgements in alignment
and evaluation, yet they exhibit systematic miscalibration, prioritizing
superficial patterns over substantive qualities. This bias manifests as
overreliance on features like length, structure, and style, leading to issues
like reward hacking and unreliable evaluations. Evidence suggests these biases
originate in artifacts in human training data. In this work, we systematically
investigate the relationship between training data biases and preference model
miscalibration across five idiosyncratic features of language model
generations: length, structure, jargon, sycophancy and vagueness. Using
controlled counterfactual pairs, we first quantify the extent to which
preference models favor responses with magnified biases (skew), finding this
preference occurs in >60% of instances, and model preferences show high
miscalibration (~40%) compared to human preferences. Notably, bias features
only show mild negative correlations to human preference labels (mean r_human =
-0.12) but show moderately strong positive correlations with labels from a
strong reward model (mean r_model = +0.36), suggesting that models may overrely
on spurious cues. To mitigate these issues, we propose a simple post-training
method based on counterfactual data augmentation (CDA) using synthesized
contrastive examples. Finetuning models with CDA reduces average miscalibration
from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%,
while maintaining overall RewardBench performance, showing that targeted
debiasing is effective for building reliable preference models.
comment: Code and data available at
https://github.com/anirudhb123/preference-model-biases
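As a sketch of what counterfactual data augmentation can look like for a single bias feature (verbosity/fluff), the snippet below builds a pair in which the rejected answer is the same response padded with content-free filler; training on such pairs pushes the preference model away from rewarding the spurious feature. The filler text and pair format are illustrative assumptions, not the paper's synthesized data.

FLUFF = ("It is worth noting that this is a truly excellent question, "
         "and there are many important considerations to keep in mind. ")

def make_cda_pair(response: str, n_fluff: int = 3) -> dict:
    # Counterfactual pair: identical content, but the rejected side has the
    # bias feature (length/fluff) magnified, so verbosity alone should not
    # be rewarded.
    return {"chosen": response, "rejected": FLUFF * n_fluff + response}

print(make_cda_pair("The capital of France is Paris."))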
☆ Search Arena: Analyzing Search-Augmented LLMs
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Search-augmented language models combine web search with Large Language
Models (LLMs) to improve response groundedness and freshness. However,
analyzing these systems remains challenging: existing datasets are limited in
scale and narrow in scope, often constrained to static, single-turn,
fact-checking questions. In this work, we introduce Search Arena, a
crowd-sourced, large-scale, human-preference dataset of over 24,000 paired
multi-turn user interactions with search-augmented LLMs. The dataset spans
diverse intents and languages, and contains full system traces with around
12,000 human preference votes. Our analysis reveals that user preferences are
influenced by the number of citations, even when the cited content does not
directly support the attributed claims, uncovering a gap between perceived and
actual credibility. Furthermore, user preferences vary across cited sources,
revealing that community-driven platforms are generally preferred, while static
encyclopedic sources are not always appropriate or reliable. To assess
performance across different settings, we conduct cross-arena analyses by
testing search-augmented LLMs in a general-purpose chat environment and
conventional LLMs in search-intensive settings. We find that web search does
not degrade, and may even improve, performance in non-search settings; however,
quality in search-intensive settings suffers significantly when the model relies
solely on its parametric knowledge. We open-source the dataset to support future
research in this direction. Our dataset and code are available at:
https://github.com/lmarena/search-arena.
comment: Preprint. Code: https://github.com/lmarena/search-arena. Dataset:
https://huggingface.co/datasets/lmarena-ai/search-arena-24k
☆ Kinetics: Rethinking Test-Time Scaling Laws
We rethink test-time scaling laws from a practical efficiency perspective,
revealing that the effectiveness of smaller models is significantly
overestimated. Prior work, grounded in compute-optimality, overlooks critical
memory access bottlenecks introduced by inference-time strategies (e.g.,
Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to
32B parameters, reveals a new Kinetics Scaling Law that better guides resource
allocation by incorporating both computation and memory access costs. The
Kinetics Scaling Law suggests that test-time compute is more effective when
spent on models above a size threshold than on smaller ones. A key reason is
that in test-time scaling (TTS), attention, rather than parameter count,
emerges as the dominant cost factor.
Motivated by this, we propose a new scaling paradigm centered on sparse
attention, which lowers per-token cost and enables longer generations and more
parallel samples within the same resource budget. Empirically, we show that
sparse attention models consistently outperform dense counterparts, achieving
gains of over 60 points in low-cost regimes and over 5 points in high-cost
regimes for problem-solving accuracy on AIME, encompassing evaluations on
state-of-the-art MoEs. These results suggest that sparse attention is essential
for realizing the full potential of test-time scaling because, unlike training,
where parameter scaling saturates, test-time accuracy continues to improve
through increased generation. The code is available at
https://github.com/Infini-AI-Lab/Kinetics.
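As a rough illustration of why memory access, not parameter count, dominates in this regime (notation and constants are mine, not the paper's): the per-token decode cost can be modeled as a weight term plus a KV-cache term that grows with the context length $L$,
$$ C_{\text{token}}(L) \;\approx\; \underbrace{2\,N_{\text{params}}}_{\text{weight FLOPs/reads}} \;+\; \underbrace{2\, n_{\text{layers}}\, n_{\text{kv}}\, d_{\text{head}}\, L}_{\text{KV-cache access}}, $$
so for long CoTs or many parallel samples the attention term takes over, which is exactly the cost that sparse attention reduces.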
☆ Unleashing Hour-Scale Video Training for Long Video-Language Understanding
Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum
Recent long-form video-language understanding benchmarks have driven progress
in video large multimodal models (Video-LMMs). However, the scarcity of
well-annotated long videos has left the training of hour-long Video-LMMs
underexplored. To close this gap, we present VideoMarathon, a large-scale
hour-long video instruction-following dataset. This dataset includes around
9,700 hours of long videos sourced from diverse domains, with durations ranging
from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs,
spanning six fundamental topics: temporality, spatiality, object, action,
scene, and event. Compared to existing video instruction datasets,
VideoMarathon significantly extends training video durations up to 1 hour, and
supports 22 diverse tasks requiring both short- and long-term video
comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and
efficient Video-LMM for hour-scale video-language modeling. It enables
hour-long video training and inference at 1-FPS sampling by leveraging a memory
augmentation module, which adaptively integrates user question-relevant and
spatiotemporal-informative semantics from a cached full video context. In our
experiments, Hour-LLaVA achieves the best performance on multiple long
video-language benchmarks, demonstrating the high quality of the VideoMarathon
dataset and the superiority of the Hour-LLaVA model.
comment: Project page: https://videomarathon.github.io/
☆ Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Reinforcement learning (RL) has become an effective approach for fine-tuning
large language models (LLMs), particularly to enhance their reasoning
capabilities. However, RL fine-tuning remains highly resource-intensive, and
existing work has largely overlooked the problem of data efficiency. In this
paper, we propose two techniques to improve data efficiency in LLM RL
fine-tuning: difficulty-targeted online data selection and rollout replay. We
introduce the notion of adaptive difficulty to guide online data selection,
prioritizing questions of moderate difficulty that are more likely to yield
informative learning signals. To estimate adaptive difficulty efficiently, we
develop an attention-based framework that requires rollouts for only a small
reference set of questions. The adaptive difficulty of the remaining questions
is then estimated based on their similarity to this set. To further reduce
rollout cost, we introduce a rollout replay mechanism that reuses recent
rollouts, lowering per-step computation while maintaining stable updates.
Extensive experiments across 6 LLM-dataset combinations show that our method
reduces the RL fine-tuning time needed to reach the same level of performance as
the original GRPO algorithm by 25% to 65%.
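A small sketch of the two ideas under simplifying assumptions: questions are ranked by how close their estimated pass rate is to a moderate-difficulty target, and recent rollouts are kept in a replay buffer for reuse. The paper estimates difficulty with an attention-based similarity model over a reference set; the dictionary of pass rates below stands in for that estimator.

import random
from collections import deque

def select_moderate(questions, est_pass_rate, batch_size, target=0.5):
    # Prefer questions whose estimated pass rate is near `target`: prompts
    # that are always solved or never solved carry little learning signal.
    ranked = sorted(questions, key=lambda q: abs(est_pass_rate[q] - target))
    return ranked[:batch_size]

replay_buffer = deque(maxlen=256)  # keep recent rollouts for reuse

def build_update_batch(new_rollouts, reuse_ratio=0.5):
    # Mix fresh rollouts with a sample of recent ones to cut generation cost.
    replay_buffer.extend(new_rollouts)
    n_reuse = min(int(len(new_rollouts) * reuse_ratio), len(replay_buffer))
    return list(new_rollouts) + random.sample(list(replay_buffer), n_reuse)

questions = [f"q{i}" for i in range(10)]
rates = {q: random.random() for q in questions}
print(select_moderate(questions, rates, batch_size=4))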
☆ Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models
Large Language Models (LLMs) deployed in real-world settings increasingly
face the need to unlearn sensitive, outdated, or proprietary information.
Existing unlearning methods typically formulate forgetting and retention as a
regularized trade-off, combining both objectives into a single scalarized loss.
This often leads to unstable optimization and degraded performance on retained
data, especially under aggressive forgetting. We propose a new formulation of
LLM unlearning as a constrained optimization problem: forgetting is enforced
via a novel logit-margin flattening loss that explicitly drives the output
distribution toward uniformity on a designated forget set, while retention is
preserved through a hard constraint on a separate retain set. Compared to
entropy-based objectives, our loss is softmax-free, numerically stable, and
maintains non-vanishing gradients, enabling more efficient and robust
optimization. We solve the constrained problem using a scalable primal-dual
algorithm that exposes the trade-off between forgetting and retention through
the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks
across diverse LLM architectures demonstrate that our approach consistently
matches or exceeds state-of-the-art baselines, effectively removing targeted
information while preserving downstream utility.
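In symbols (my formulation of the constrained setup described above, with $\epsilon$ and $\eta$ as illustrative hyperparameters):
$$ \min_{\theta}\; \mathcal{L}_{\text{forget}}(\theta) \quad \text{s.t.} \quad \mathcal{L}_{\text{retain}}(\theta) \le \epsilon, $$
solved by alternating primal descent on the Lagrangian $\mathcal{L}_{\text{forget}}(\theta) + \lambda\,(\mathcal{L}_{\text{retain}}(\theta) - \epsilon)$ with dual ascent $\lambda \leftarrow \big[\lambda + \eta\,(\mathcal{L}_{\text{retain}}(\theta) - \epsilon)\big]_{+}$; the trajectory of the dual variable $\lambda$ is what exposes the forgetting-retention trade-off.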
☆ Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games
LLMs are used predominantly in synchronous communication, where a human user
and a model communicate in alternating turns. In contrast, many real-world
settings are inherently asynchronous. For example, in group chats, online team
meetings, or social games, there is no inherent notion of turns; therefore, the
decision of when to speak forms a crucial part of the participant's decision
making. In this work, we develop an adaptive asynchronous LLM-agent which, in
addition to determining what to say, also decides when to say it. To evaluate
our agent, we collect a unique dataset of online Mafia games that includes both
human participants and our asynchronous agent. Overall, our agent performs on
par with human players, both in game performance and in its ability to blend in
with the other human players. Our analysis shows that the
agent's behavior in deciding when to speak closely mirrors human patterns,
although differences emerge in message content. We release all our data and
code to support and encourage further research into more realistic asynchronous
communication between LLM agents. This work paves the way for the integration of
LLMs into realistic human group settings, from assistance in team discussions
to educational and professional environments where complex social dynamics must
be navigated.
☆ ProRefine: Inference-time Prompt Refinement with Textual Feedback
Agentic workflows, where multiple AI agents collaborate to accomplish complex
tasks like reasoning or planning, are becoming increasingly prevalent. However,
these workflows often suffer from error propagation and sub-optimal
performance, largely due to poorly designed prompts that fail to effectively
guide individual agents. This is a critical problem because it limits the
reliability and scalability of these powerful systems. We introduce ProRefine,
an innovative inference-time prompt optimization method that leverages textual
feedback from large language models (LLMs) to address this challenge. ProRefine
dynamically refines prompts for multi-step reasoning tasks without additional
training or ground truth labels. Evaluated on five benchmark mathematical
reasoning datasets, ProRefine significantly surpasses zero-shot
Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only
boosts accuracy but also allows smaller models to match the performance of
larger ones, highlighting its potential for efficient and scalable AI
deployment, and democratizing access to high-performing AI.
☆ Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning ACL 2025
Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge
Conflicts, where retrieved external knowledge contradicts the inherent,
parametric knowledge of large language models (LLMs). These conflicts adversely
affect performance on downstream tasks such as question answering (QA). Existing
approaches often attempt to mitigate conflicts by directly comparing two
knowledge sources in a side-by-side manner, but this can overwhelm LLMs with
extraneous or lengthy contexts, ultimately hindering their ability to identify
and mitigate inconsistencies. To address this issue, we propose Micro-Act, a
framework with a hierarchical action space that automatically perceives context
complexity and adaptively decomposes each knowledge source into a sequence of
fine-grained comparisons. These comparisons are represented as actionable
steps, enabling reasoning beyond the superficial context. Through extensive
experiments on five benchmark datasets, Micro-Act consistently achieves
significant increases in QA accuracy over state-of-the-art baselines across all
5 datasets and 3 conflict types, especially in the temporal and semantic conflict
types where all baselines fail significantly. More importantly, Micro-Act exhibits
robust performance on non-conflict questions simultaneously, highlighting its
practical value in real-world RAG applications.
comment: Accepted by ACL 2025 Main
☆ CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection
A common approach to hallucination detection casts it as a natural language
inference (NLI) task, often using LLMs to classify whether the generated text
is entailed by corresponding reference texts. Since entailment classification
is a complex reasoning task, one would expect that LLMs could benefit from
generating an explicit reasoning process, as in CoT reasoning or the explicit
``thinking'' of recent reasoning models. In this work, we propose that guiding
such models to perform a systematic and comprehensive reasoning process -- one
that both decomposes the text into smaller facts and also finds evidence in the
source for each fact -- allows models to execute much finer-grained and
accurate entailment decisions, leading to increased performance. To that end,
we define a 3-step reasoning process, consisting of (i) claim decomposition,
(ii) sub-claim attribution and entailment classification, and (iii) aggregated
classification, showing that such guided reasoning indeed yields improved
hallucination detection. Following this reasoning framework, we introduce an
analysis scheme, consisting of several metrics that measure the quality of the
intermediate reasoning steps, which provides additional empirical evidence for
the improved quality of our guided reasoning scheme.
☆ Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language
Syntactic discontinuity is a grammatical phenomenon in which a constituent is
split into more than one part because of the insertion of an element which is
not part of the constituent. This is observed in many languages across the
world such as Turkish, Russian, Japanese, Warlpiri, Navajo, Hopi, Dyirbal,
and Yidiny. Different formalisms/frameworks in current linguistic theory
approach the problem of discontinuous structures in different ways. Each
framework/formalism has widely been viewed as an independent and non-converging
system of analysis. In this paper, we propose a unified system of
representation for both continuity and discontinuity in structures of natural
languages by taking into account three formalisms, in particular, Phrase
Structure Grammar (PSG) for its widely used notion of constituency, Dependency
Grammar (DG) for its head-dependent relations, and Categorial Grammar (CG) for
its focus on functor-argument relations. We attempt to show that discontinuous
expressions as well as continuous structures can be analysed through a unified
mathematical derivation incorporating the representations of linguistic
structure in these three grammar formalisms.
☆ MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento
Sequence modeling is currently dominated by causal transformer architectures
that use softmax self-attention. Although widely adopted, transformers require
memory and compute that scale linearly with sequence length during inference. A recent stream of work
linearized the softmax operation, resulting in powerful recurrent neural
network (RNN) models with constant memory and compute costs such as DeltaNet,
Mamba or xLSTM. These models can be unified by noting that their recurrent
layer dynamics can all be derived from an in-context regression objective,
approximately optimized through an online learning rule. Here, we join this
line of work and introduce a numerically stable, chunkwise parallelizable
version of the recently proposed Mesa layer (von Oswald et al., 2024), and
study it in language modeling at the billion-parameter scale. This layer again
stems from an in-context loss, but which is now minimized to optimality at
every time point using a fast conjugate gradient solver. Through an extensive
suite of experiments, we show that optimal test-time training enables reaching
lower language modeling perplexity and higher downstream benchmark performance
than previous RNNs, especially on tasks requiring long context understanding.
This performance gain comes at the cost of additional FLOPs spent during
inference time. Our results are therefore intriguingly related to recent trends
of increasing test-time compute to improve performance -- here by spending
compute to solve sequential optimization problems within the neural network
itself.
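For intuition, here is a tiny NumPy sketch of the kind of per-time-step solve involved: an in-context ridge-regression objective over the keys/values seen so far, solved to optimality with conjugate gradients. The shapes, the ridge coefficient, and the column-wise solve are assumptions for illustration, not the Mesa layer's actual parameterization.

import numpy as np

def conjugate_gradient(A, b, iters=20, tol=1e-8):
    # Standard CG for Ax = b with A symmetric positive definite.
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# In-context ridge regression: W minimizes ||K W - V||^2 + lam ||W||^2
# over the 16 key/value pairs observed so far.
rng = np.random.default_rng(0)
K, V, lam = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), 1e-2
A = K.T @ K + lam * np.eye(8)
W = np.stack([conjugate_gradient(A, K.T @ V[:, j]) for j in range(8)], axis=1)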
☆ Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
Transformer models struggle with long-context inference due to their
quadratic time and linear memory complexity. Recurrent Memory Transformers
(RMTs) offer a solution by reducing the asymptotic cost to linear time and
constant memory usage. However, their memory update mechanism leads to
sequential execution, causing a performance bottleneck.
We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism
across segments in RMTs while preserving exact recurrence. This approach
eliminates the sequential constraint, enabling efficient GPU inference even for
single long-context inputs without complex batching and pipelining techniques.
Because the technique is purely a run-time computation reordering, existing RMT
models can adopt it with no retraining.
Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup
over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential
RMT implementation on 131,072-token sequences. By removing the sequential
bottleneck, Diagonal Batching reduces inference cost and latency, thereby
strengthening RMTs as a practical solution for real-world, long-context
applications.
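A toy sketch of the reordering, under the standard RMT assumption that cell (s, l), segment s at layer l, needs the layer l-1 activations of the same segment and the memory produced by segment s-1 at the same layer: all cells on one anti-diagonal are then independent and can run as a single batched call. This only illustrates the schedule, not the GPU kernel batching.

# Wavefront schedule over a (segment, layer) grid.
num_segments, num_layers = 4, 3
for diag in range(num_segments + num_layers - 1):
    wave = [(s, diag - s) for s in range(num_segments)
            if 0 <= diag - s < num_layers]
    print(f"step {diag}: run in parallel -> {wave}")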
☆ Improving Low-Resource Morphological Inflection via Self-Supervised Objectives ACL 2025
Self-supervised objectives have driven major advances in NLP by leveraging
large-scale unlabeled data, but such resources are scarce for many of the
world's languages. Surprisingly, they have not been explored much for
character-level tasks, where smaller amounts of data have the potential to be
beneficial. We investigate the effectiveness of self-supervised auxiliary tasks
for morphological inflection -- a character-level task highly relevant for
language documentation -- in extremely low-resource settings, training
encoder-decoder transformers for 19 languages and 13 auxiliary objectives.
Autoencoding yields the best performance when unlabeled data is very limited,
while character masked language modeling (CMLM) becomes more effective as data
availability increases. Though objectives with stronger inductive biases
influence model predictions in intuitive ways, they rarely outperform standard CMLM.
However, sampling masks based on known morpheme boundaries consistently
improves performance, highlighting a promising direction for low-resource
morphological modeling.
comment: ACL 2025 main
☆ Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning
Graph Neural Networks (GNNs) often suffer from degree bias in node
classification tasks, where prediction performance varies across nodes with
different degrees. Several approaches, which adopt Graph Contrastive Learning
(GCL), have been proposed to mitigate this bias. However, the limited number of
positive pairs and the equal weighting of all positives and negatives in GCL
still lead to low-degree nodes acquiring insufficient and noisy information.
This paper proposes the Hardness Adaptive Reweighted (HAR) contrastive loss to
mitigate degree bias. It adds more positive pairs by leveraging node labels and
adaptively weights positive and negative pairs based on their learning
hardness. In addition, we develop an experimental framework named SHARP to
extend HAR to a broader range of scenarios. Both our theoretical analysis and
experiments validate the effectiveness of SHARP. The experimental results
across four datasets show that SHARP achieves better performance than
baselines at both the global and the degree level.
☆ LLM-First Search: Self-Guided Exploration of the Solution Space
Large Language Models (LLMs) have demonstrated remarkable improvements in
reasoning and planning through increased test-time compute, often by framing
problem-solving as a search process. While methods like Monte Carlo Tree Search
(MCTS) have proven effective in some domains, their reliance on fixed
exploration hyperparameters limits their adaptability across tasks of varying
difficulty, rendering them impractical or expensive in certain settings. In
this paper, we propose \textbf{LLM-First Search (LFS)}, a novel \textit{LLM
Self-Guided Search} method that removes the need for pre-defined search
strategies by empowering the LLM to autonomously control the search process via
self-guided exploration. Rather than relying on external heuristics or
hardcoded policies, the LLM evaluates whether to pursue the current search path
or explore alternative branches based on its internal scoring mechanisms. This
enables more flexible and context-sensitive reasoning without requiring manual
tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku
against three classic, widely used search algorithms, Tree-of-Thoughts' Breadth
First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which
has been used to achieve SotA results on a range of challenging reasoning
tasks. We found that LFS (1) performs better on more challenging tasks without
additional tuning, (2) is more computationally efficient compared to the other
methods, especially when powered by a stronger model, (3) scales better with
stronger models, due to its LLM-First design, and (4) scales better with
increased compute budget. Our code is publicly available at
\href{https://github.com/NathanHerr/LLM-First-Search}{LLM-First-Search}.
comment: 9 main pages, 2 figures, 2 tables, 36 appendix pages
☆ The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
Large language models (LLMs) are typically trained on enormous quantities of
unlicensed text, a practice that has led to scrutiny due to possible
intellectual property infringement and ethical concerns. Training LLMs on
openly licensed text presents a first step towards addressing these issues, but
prior data collection efforts have yielded datasets too small or low-quality to
produce performant LLMs. To address this gap, we collect, curate, and release
the Common Pile v0.1, an eight terabyte collection of openly licensed text
designed for LLM pretraining. The Common Pile comprises content from 30 sources
that span diverse domains including research papers, code, books,
encyclopedias, educational materials, audio transcripts, and more. Crucially,
we validate our efforts by training two 7 billion parameter LLMs on text from
the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion
tokens respectively. Both models attain performance competitive with LLMs trained
on unlicensed text with similar computational budgets, such as Llama 1 7B and
Llama 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the
code used in its creation as well as the training mixture and checkpoints for
the Comma v0.1 models.
☆ RELIC: Evaluating Compositional Instruction Following via Language Recognition
Large language models (LLMs) are increasingly expected to perform tasks based
only on a specification of the task provided in context, without examples of
inputs and outputs; this ability is referred to as instruction following. We
introduce the Recognition of Languages In-Context (RELIC) framework to evaluate
instruction following using language recognition: the task of determining if a
string is generated by a given formal grammar. Unlike many standard evaluations of
LLMs' ability to use their context, this task requires composing together a
large number of instructions (grammar productions) retrieved from the context.
Because the languages are synthetic, the task can be increased in complexity as
LLMs' skills improve, and new instances can be automatically generated,
mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and
find that their accuracy can be reliably predicted from the complexity of the
grammar and the individual example strings, and that even the most advanced
LLMs currently available show near-chance performance on more complex grammars
and samples, in line with theoretical expectations. We also use RELIC to
diagnose how LLMs attempt to solve increasingly difficult reasoning tasks,
finding that as the complexity of the language recognition task increases,
models switch to relying on shallow heuristics instead of following complex
instructions.
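To give the flavour of language recognition as a task, the toy sketch below samples positive strings from a small context-free grammar (the language a^n b^n) and builds easy negatives by perturbation. RELIC's grammars are synthetic, far larger, and are presented to the model in-context as productions; this snippet is only illustrative.

import random

GRAMMAR = {  # S -> a S b | a b
    "S": [["a", "S", "b"], ["a", "b"]],
}

def sample(symbol="S", depth=0, max_depth=6):
    # Expand nonterminals recursively; force termination past max_depth.
    if symbol not in GRAMMAR:
        return [symbol]
    rule = GRAMMAR[symbol][1] if depth >= max_depth else random.choice(GRAMMAR[symbol])
    return [tok for s in rule for tok in sample(s, depth + 1, max_depth)]

positive = "".join(sample())
negative = positive[::-1]  # reversal always breaks a^n b^n membership
print(positive, negative)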
☆ Counterfactual reasoning: an analysis of in-context emergence
Large-scale neural language models (LMs) exhibit remarkable performance in
in-context learning: the ability to learn from and reason about the input context
on the fly without parameter updates. This work studies in-context counterfactual
reasoning in language models, that is, to predict the consequences of changes
under hypothetical scenarios. We focus on studying a well-defined synthetic
setup: a linear regression task that requires noise abduction, where accurate
prediction is based on inferring and copying the contextual noise from factual
observations. We show that language models are capable of counterfactual
reasoning in this controlled setup and provide insights that counterfactual
reasoning for a broad class of functions can be reduced to a transformation on
in-context observations; we find self-attention, model depth, and data
diversity in pre-training drive performance in Transformers. More
interestingly, our findings extend beyond regression tasks and show that
Transformers can perform noise abduction on sequential data, providing
preliminary evidence on the potential for counterfactual story generation. Our
code is available under
https://github.com/moXmiller/counterfactual-reasoning.git .
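The controlled setup above reduces to a few lines of NumPy: counterfactual prediction in a linear model with additive noise amounts to abducing the noise from the factual observation and reusing it under the hypothetical input (the dimensions and the intervention below are arbitrary illustrations).

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x = rng.normal(size=3)
eps = 0.1 * rng.normal()
y = w @ x + eps                       # factual observation
eps_hat = y - w @ x                   # abduction: recover the contextual noise
x_cf = x + np.array([1.0, 0.0, 0.0])  # hypothetical change to the input
y_cf = w @ x_cf + eps_hat             # counterfactual prediction reuses the noise
print(y, y_cf)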
☆ Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou
In this work, we introduce the Qwen3 Embedding series, a significant
advancement over its predecessor, the GTE-Qwen series, in text embedding and
reranking capabilities, built upon the Qwen3 foundation models. Leveraging the
Qwen3 LLMs' robust capabilities in multilingual text understanding and
generation, our innovative multi-stage training pipeline combines large-scale
unsupervised pre-training with supervised fine-tuning on high-quality datasets.
Effective model merging strategies further ensure the robustness and
adaptability of the Qwen3 Embedding series. During the training process, the
Qwen3 LLMs serve not only as backbone models but also play a crucial role in
synthesizing high-quality, rich, and diverse training data across multiple
domains and languages, thus enhancing the training pipeline. The Qwen3
Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both
embedding and reranking tasks, addressing diverse deployment scenarios where
users can optimize for either efficiency or effectiveness. Empirical
evaluations demonstrate that the Qwen3 Embedding series achieves
state-of-the-art results across diverse benchmarks. Notably, it excels on the
multilingual evaluation benchmark MTEB for text embedding, as well as in
various retrieval tasks, including code retrieval, cross-lingual retrieval and
multilingual retrieval. To facilitate reproducibility and promote
community-driven research and development, the Qwen3 Embedding models are
publicly available under the Apache 2.0 license.
☆ ECoRAG: Evidentiality-guided Compression for Long Context RAG
Large Language Models (LLMs) have shown remarkable performance in Open-Domain
Question Answering (ODQA) by leveraging external documents through
Retrieval-Augmented Generation (RAG). To reduce the overhead that RAG incurs from
longer contexts, context compression is necessary. However, prior compression
methods do not focus on filtering out non-evidential information, which limits
performance in LLM-based RAG. We thus propose the Evidentiality-guided RAG
(\textbf{ECoRAG}) framework. ECoRAG improves LLM performance by compressing
retrieved documents based on evidentiality, ensuring that answer generation
is supported by the correct evidence. As an additional step, ECoRAG reflects on
whether the compressed content provides sufficient evidence and, if not,
retrieves more documents until it does. Experiments show that ECoRAG improves LLM
performance on ODQA tasks, outperforming existing compression methods.
Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency
but also minimizes token usage by retaining only the necessary information to
generate the correct answer. Code is available at
https://github.com/ldilab/ECoRAG.
☆ Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Large Language Models (LLMs) are known to exhibit social, demographic, and
gender biases, often as a consequence of the data on which they are trained. In
this work, we adopt a mechanistic interpretability approach to analyze how such
biases are structurally represented within models such as GPT-2 and Llama2.
Focusing on demographic and gender biases, we explore different metrics to
identify the internal edges responsible for biased behavior. We then assess the
stability, localization, and generalizability of these components across
dataset and linguistic variations. Through systematic ablations, we demonstrate
that bias-related computations are highly localized, often concentrated in a
small subset of layers. Moreover, the identified components change across
fine-tuning settings, including those unrelated to bias. Finally, we show that
removing these components not only reduces biased outputs but also affects
other NLP tasks, such as named entity recognition and linguistic acceptability
judgment, because those tasks share important components with the bias-related ones.
☆ Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) is a mainstream method for improving
performance on knowledge-intensive tasks. However, current RAG systems often
place too much emphasis on retrieved contexts. This can lead to reliance on
inaccurate sources and overlook the model's inherent knowledge, especially when
dealing with misleading or excessive information. To resolve this imbalance, we
propose Knowledgeable-r1, which uses joint sampling and defines multiple policy
distributions for knowledge-capability exploration, stimulating large language
models' self-integrated use of parametric and contextual knowledge.
Experiments show that Knowledgeable-r1 significantly enhances robustness and
reasoning accuracy on both parametric-contextual knowledge-conflict tasks and
general RAG tasks, especially outperforming baselines by 17.07% in counterfactual
scenarios and demonstrating consistent gains across RAG tasks. Our code is
available at https://github.com/lcy80366872/knowledgeable-r1.
☆ CIVET: Systematic Evaluation of Understanding in VLMs
Massimo Rizzoli, Simone Alghisi, Olha Khomyn, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
While Vision-Language Models (VLMs) have achieved competitive performance in
various tasks, their comprehension of the underlying structure and semantics of
a scene remains understudied. To investigate the understanding of VLMs, we
study their capability regarding object properties and relations in a
controlled and interpretable manner. To this end, we introduce CIVET, a novel
and extensible framework for systematiC evaluatIon Via controllEd sTimuli.
CIVET addresses the lack of standardized systematic evaluation for assessing
VLMs' understanding, enabling researchers to test hypotheses with statistical
rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of
stimuli, free from annotation noise, dataset-specific biases, and uncontrolled
scene complexity. Our findings reveal that 1) current VLMs can accurately
recognize only a limited set of basic object properties; 2) their performance
heavily depends on the position of the object in the scene; 3) they struggle to
understand basic relations among objects. Furthermore, a comparative evaluation
with human annotators reveals that VLMs still fall short of achieving
human-level accuracy.
☆ Do Large Language Models Judge Error Severity Like Humans?
Large Language Models (LLMs) are increasingly used as automated evaluators in
natural language generation, yet it remains unclear whether they can accurately
replicate human judgments of error severity. In this study, we systematically
compare human and LLM assessments of image descriptions containing controlled
semantic errors. We extend the experimental framework of van Miltenburg et al.
(2020) to both unimodal (text-only) and multimodal (text + image) settings,
evaluating four error types: age, gender, clothing type, and clothing colour.
Our findings reveal that humans assign varying levels of severity to different
error types, with visual context significantly amplifying perceived severity
for colour and type errors. Notably, most LLMs assign low scores to gender
errors but disproportionately high scores to colour errors, unlike humans, who
judge both as highly severe but for different reasons. This suggests that these
models may have internalised social norms influencing gender judgments but lack
the perceptual grounding to emulate human sensitivity to colour, which is
shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao,
replicates the human-like ranking of error severity, but it fails to
distinguish between error types as clearly as humans. Surprisingly,
DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human
judgments across both unimodal and multimodal conditions, outperforming even
state-of-the-art multimodal models.
☆ AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models
Understanding the internal mechanisms of large audio-language models (LALMs)
is crucial for interpreting their behavior and improving performance. This work
presents the first in-depth analysis of how LALMs internally perceive and
recognize auditory attributes. By applying vocabulary projection on three
state-of-the-art LALMs, we track how attribute information evolves across
layers and token positions. We find that attribute information generally
decreases with layer depth when recognition fails, and that resolving
attributes at earlier layers correlates with better accuracy. Moreover, LALMs
heavily rely on querying auditory inputs for predicting attributes instead of
aggregating necessary information in hidden states at attribute-mentioning
positions. Based on our findings, we demonstrate a method to enhance LALMs. Our
results offer insights into auditory attribute processing, paving the way for
future improvements.
comment: 8 pages, 5 figures, 3 tables
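Vocabulary projection as used above is essentially the logit-lens idea: project an intermediate hidden state through the model's output embedding to see which tokens it currently favors. A minimal, model-agnostic sketch (the optional final norm and the tensor shapes are assumptions; each LALM applies this slightly differently):

import torch

def vocab_projection(hidden_state, unembed_weight, final_norm=None, top_k=5):
    # hidden_state: [d_model]; unembed_weight: [vocab, d_model].
    # Returns the top-k token ids the model leans toward at this layer.
    h = final_norm(hidden_state) if final_norm is not None else hidden_state
    logits = h @ unembed_weight.T
    return torch.topk(logits, top_k).indices

print(vocab_projection(torch.randn(64), torch.randn(1000, 64)))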
☆ Information Locality as an Inductive Bias for Neural Language Models
Inductive biases are inherent in every machine learning system, shaping how
models generalize from finite data. In the case of neural language models
(LMs), debates persist as to whether these biases align with or diverge from
human processing constraints. To address this issue, we propose a quantitative
framework that allows for controlled investigations into the nature of these
biases. Within our framework, we introduce $m$-local entropy, an
information-theoretic measure derived from average lossy-context
surprisal that captures the local uncertainty of a language by
quantifying how effectively the $m-1$ preceding symbols disambiguate the next
symbol. In experiments on both perturbed natural language corpora and languages
defined by probabilistic finite-state automata (PFSAs), we show that languages
with higher $m$-local entropy are more difficult for Transformer and LSTM LMs
to learn. These results suggest that neural LMs, much like humans, are highly
sensitive to the local statistical structure of a language.
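One plausible formalization consistent with this description (the paper's exact definition may differ in conditioning and normalization): for a symbol process $X$, the $m$-local entropy is the average surprisal of the next symbol given only an $(m-1)$-symbol context,
$$ H_m \;=\; \mathbb{E}\big[\,-\log p\big(X_t \mid X_{t-m+1},\dots,X_{t-1}\big)\,\big], $$
so larger $H_m$ means the local window disambiguates the next symbol less effectively, which the experiments link to harder learning for both Transformer and LSTM LMs.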
☆ DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning ACL
Zero-shot Event Detection (ED), the task of identifying event mentions in
natural language text without any training data, is critical for document
understanding in specialized domains. Understanding the complex event ontology,
extracting domain-specific triggers from the passage, and structuring them
appropriately overload and limit the utility of Large Language Models (LLMs)
for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent
reasoning framework that decouples the task of ED using Dreamer and Grounder.
Dreamer encourages divergent reasoning through open-ended event discovery,
which helps to boost event coverage. Conversely, Grounder introduces convergent
reasoning to align the free-form predictions with the task-specific
instructions using finite-state machine guided constrained decoding.
Additionally, an LLM-Judge verifies the final outputs to ensure high precision.
Through extensive experiments on six datasets across five domains and nine
LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot,
transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains
over the best baseline -- establishing DiCoRe as a strong zero-shot ED
framework.
comment: Submitted at ACL ARR May 2025
☆ The NTNU System at the S&I Challenge 2025 SLA Open Track ISCA
A recent line of research on spoken language assessment (SLA) employs neural
models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency
across linguistic and acoustic modalities. Although both models effectively
capture features relevant to oral competence, each exhibits modality-specific
limitations. BERT-based methods rely on ASR transcripts, which often fail to
capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods
excel at modeling acoustic features but lack semantic interpretability. To
overcome these limitations, we propose a system that integrates W2V with the Phi-4
multimodal large language model (MLLM) through a score fusion strategy. The
proposed system achieves a root mean square error (RMSE) of 0.375 on the
official test set of the Speak & Improve Challenge 2025, securing second place
in the competition. For comparison, the RMSEs of the top-ranked, third-ranked,
and official baseline systems are 0.364, 0.384, and 0.444, respectively.
comment: submitted to the ISCA SLaTE-2025 Workshop
☆ CL-ISR: A Contrastive Learning and Implicit Stance Reasoning Framework for Misleading Text Detection on Social Media
Misleading text detection on social media platforms is a critical research
area, as these texts can lead to public misunderstanding, social panic and even
economic losses. This paper proposes a novel framework - CL-ISR (Contrastive
Learning and Implicit Stance Reasoning), which combines contrastive learning
and implicit stance reasoning, to improve the detection accuracy of misleading
texts on social media. First, we use contrastive learning to
improve the model's ability to learn the semantic differences between truthful
and misleading texts. By constructing positive and negative sample pairs,
contrastive learning helps the model capture the distinguishing features
between categories more effectively, particularly in linguistically
complicated situations. Second, we introduce an implicit stance
reasoning module to explore the potential stance tendencies in the text and
their relationships with related topics. This module is effective for
identifying content that misleads through stance shifting or emotional
manipulation, because it can capture the implicit information behind the text.
Finally, we integrate the two components into a new framework,
CL-ISR, which leverages the discriminative power of contrastive learning and
the interpretive depth of stance reasoning to significantly improve detection
performance.
comment: 6 pages, 2 figures
☆ Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
While objective street metrics derived from imagery or GIS have become
standard in urban analytics, they remain insufficient to capture subjective
perceptions essential to inclusive urban design. This study introduces a novel
Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer
(VisualGLM-6B) with a large language model (GPT-4), enabling interpretable
dual-output assessment of streetscapes. Leveraging over 15,000 annotated
street-view images from Harbin, China, we fine-tune the framework using LoRA
and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1
score of 0.84 on objective features and 89.3 percent agreement with aggregated
resident perceptions, validated across stratified socioeconomic geographies.
Beyond classification accuracy, MSEF captures context-dependent contradictions:
for instance, informal commerce boosts perceived vibrancy while simultaneously
reducing pedestrian comfort. It also identifies nonlinear and semantically
contingent patterns -- such as the divergent perceptual effects of
architectural transparency across residential and commercial zones -- revealing
the limits of universal spatial heuristics. By generating natural-language
rationales grounded in attention mechanisms, the framework bridges sensory data
with socio-affective inference, enabling transparent diagnostics aligned with
SDG 11. This work offers both methodological innovation in urban perception
modeling and practical utility for planning systems seeking to reconcile
infrastructural precision with lived experience.
comment: 24 pages, 10 figures
☆ Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin
The commercial vitality of community-scale streets in Chinese cities is
shaped by complex interactions between vehicular accessibility, environmental
quality, and pedestrian perception. This study proposes an interpretable,
image-based framework to examine how street-level features -- including parked
vehicle density, greenery, cleanliness, and street width -- impact retail
performance and user satisfaction in Harbin, China. Leveraging street view
imagery and a multimodal large language model (VisualGLM-6B), we construct a
Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and
analyze its relationship with spatial attributes extracted via GPT-4-based
perception modeling. Our findings reveal that while moderate vehicle presence
may enhance commercial access, excessive on-street parking -- especially in
narrow streets -- erodes walkability and reduces both satisfaction and
shop-level pricing. In contrast, streets with higher perceived greenery and
cleanliness show significantly greater satisfaction scores but only weak
associations with pricing. Street width moderates the effects of vehicle
presence, underscoring the importance of spatial configuration. These results
demonstrate the value of integrating AI-assisted perception with urban
morphological analysis to capture non-linear and context-sensitive drivers of
commercial success. This study advances both theoretical and methodological
frontiers by highlighting the conditional role of vehicle activity in
neighborhood commerce and demonstrating the feasibility of multimodal AI for
perceptual urban diagnostics. The implications extend to urban design, parking
management, and scalable planning tools for community revitalization.
comment: 22 pages,5 figures
☆ Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation ACL 2025
Self-harm detection on social media is critical for early intervention and
mental health support, yet remains challenging due to the subtle,
context-dependent nature of such expressions. Identifying self-harm intent aids
suicide prevention by enabling timely responses, but current large language
models (LLMs) struggle to interpret implicit cues in casual language and
emojis. This work enhances LLMs' comprehension of self-harm by distinguishing
intent through nuanced language-emoji interplay. We present the Centennial
Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with
contextual self-harm interpretations, and the Self-Harm Identification aNd
intent Extraction with Supportive emoji sensitivity (SHINES) dataset, which offers
detailed annotations for self-harm labels, casual mentions (CMs), and serious
intents (SIs). Our unified framework: a) enriches inputs using CESM-100; b)
fine-tunes LLMs for multi-task learning: self-harm detection (primary) and
CM/SI span detection (auxiliary); c) generates explainable rationales for
self-harm predictions. We evaluate the framework on three state-of-the-art
LLMs-Llama 3, Mental-Alpaca, and MentalLlama, across zero-shot, few-shot, and
fine-tuned scenarios. By coupling intent differentiation with contextual cues,
our approach markedly enhances LLM performance in both detection and
explanation tasks, effectively addressing the inherent ambiguity in self-harm
signals. The SHINES dataset, CESM-100 and codebase are publicly available at:
https://www.iitp.ac.in/~ai-nlp-ml/resources.html#SHINES .
comment: To be published in the Proceedings of the 63rd Annual Meeting of the
Association for Computational Linguistics (ACL 2025 Main)
☆ RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation
Tianjiao Li, Mengran Yu, Chenyu Shi, Yanjun Zhao, Xiaojing Liu, Qiang Zhang, Qi Zhang, Xuanjing Huang, Jiayin Wang
Large language models (LLMs) possess strong multilingual capabilities, and
combining Reinforcement Learning from Human Feedback (RLHF) with translation
tasks has shown great potential. However, we observe that this paradigm
performs unexpectedly poorly when applied to colloquial subtitle translation
tasks. In this work, we investigate this issue and find that the offline reward
model (RM) gradually diverges from the online LLM due to distributional shift,
ultimately leading to undesirable training outcomes. To address this, we
propose RIVAL, an adversarial training framework that formulates the process as
a min-max game between the RM and the LLM. RIVAL iteratively updates both
models, with the RM trained to distinguish strong from weak translations
(qualitative preference reward) and the LLM trained to improve its translations
to close this gap. To stabilize training and improve generalizability, we
also incorporate quantitative preference reward (e.g., BLEU) into the RM,
enabling reference-free quality modeling aligned with human evaluation. Through
extensive experiments, we demonstrate that the proposed adversarial training
framework significantly improves upon translation baselines.
☆ Does It Make Sense to Speak of Introspection in Large Language Models?
Large language models (LLMs) exhibit compelling linguistic behaviour, and
sometimes offer self-reports, that is to say statements about their own nature,
inner workings, or behaviour. In humans, such reports are often attributed to a
faculty of introspection and are typically linked to consciousness. This raises
the question of how to interpret self-reports produced by LLMs, given their
increasing linguistic fluency and cognitive capabilities. To what extent (if
any) can the concept of introspection be meaningfully applied to LLMs? Here, we
present and critique two examples of apparent introspective self-report from
LLMs. In the first example, an LLM attempts to describe the process behind its
own ``creative'' writing, and we argue this is not a valid example of
introspection. In the second example, an LLM correctly infers the value of its
own temperature parameter, and we argue that this can be legitimately
considered a minimal example of introspection, albeit one that is (presumably)
not accompanied by conscious experience.
☆ Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
We introduce Debate Speech Evaluation as a novel and challenging benchmark
for assessing LLM judges. Evaluating debate speeches requires a deep
understanding of the speech at multiple levels, including argument strength and
relevance, the coherence and organization of the speech, the appropriateness of
its style and tone, and so on. This task involves a unique set of cognitive
abilities that have previously received limited attention in systematic LLM
benchmarking. To explore such skills, we leverage a dataset of over 600
meticulously annotated debate speeches and present the first in-depth analysis
of how state-of-the-art LLMs compare to human judges on this task. Our findings
reveal a nuanced picture: while larger models can approximate individual human
judgments in some respects, they differ substantially in their overall judgment
behavior. We also investigate the ability of frontier LLMs to generate
persuasive, opinionated speeches, showing that models may perform at a human
level on this task.
comment: Code: https://github.com/noy-sternlicht/Debatable-Intelligence
☆ TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages
Large Language Models (LLMs) excel in high-resource languages but struggle
with low-resource languages due to limited training data. This paper presents
TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource
Languages), which integrates an LLM with two bilingual translation models. TALL
transforms low-resource inputs into high-resource representations, leveraging
the LLM's capabilities while preserving linguistic features through dimension
alignment layers and custom transformers. Our experiments on Hebrew demonstrate
significant improvements over several baselines, including direct use, naive
translation, and fine-tuning approaches. The architecture employs a
parameter-efficient strategy, freezing pre-trained components while training
only lightweight adapter modules, balancing computational efficiency with
performance gains.
☆ Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers
Large language models (LLMs) have achieved distinguished performance on
various reasoning-intensive tasks. However, LLMs may still face
robustness issues and fail unexpectedly on some simple reasoning
tasks. Previous works evaluate LLM robustness with hand-crafted templates
or a limited set of perturbation rules, leaving open the possibility of data
contamination in pre-training or fine-tuning datasets. In this work, inspired by stress
testing in software engineering, we propose a novel framework, Automatic
Robustness Checker (AR-Checker), to generate mathematical problem variants that
maintain the semantic meaning of the original problem but might fail the LLMs. The
AR-Checker framework generates mathematical problem variants through
multi-round parallel streams of LLM-based rewriting and verification. Our
framework can generate benchmark variants dynamically for each LLM, thus
minimizing the risk of data contamination. Experiments on GSM8K and MATH-500
demonstrate the strong performance of AR-Checker on mathematical tasks. We also
evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro,
and CommonsenseQA, where it also achieves strong performance, further proving
the effectiveness of AR-Checker.
☆ Controlling Summarization Length Through EOS Token Weighting
Controlling the length of generated text can be crucial in various
text-generation tasks, including summarization. Existing methods often require
complex model alterations, limiting compatibility with pre-trained models. We
address these limitations by developing a simple approach for controlling the
length of automatic text summaries by increasing the importance of correctly
predicting the EOS token in the cross-entropy loss computation. The proposed
methodology is agnostic to architecture and decoding algorithms and orthogonal
to other inference-time techniques to control the generation length. We test
it with encoder-decoder and modern GPT-style LLMs and show that this method
can control generation length, often without affecting the quality of the
summary.
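A minimal sketch of the idea, assuming a PyTorch-style training loop and a hypothetical eos_weight hyperparameter (the paper's exact weighting scheme is not specified here): the cross-entropy loss is computed with a per-class weight vector that up-weights the EOS token.

```python
import torch
import torch.nn.functional as F

def eos_weighted_loss(logits, targets, eos_token_id, eos_weight=5.0):
    """Cross-entropy with the EOS class up-weighted.

    logits:  (batch * seq_len, vocab_size) unnormalized scores
    targets: (batch * seq_len,) gold token ids
    eos_weight is a hypothetical hyperparameter; larger values make
    mispredicting the EOS position costlier, encouraging the model to
    stop at the desired length.
    """
    vocab_size = logits.size(-1)
    class_weights = torch.ones(vocab_size, device=logits.device)
    class_weights[eos_token_id] = eos_weight  # up-weight the EOS token
    return F.cross_entropy(logits, targets, weight=class_weights)
```

Because the change lives entirely in the loss, it is compatible with any architecture or decoding algorithm, as the abstract notes.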
☆ ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development ACL 2025
Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
We introduce ComfyUI-Copilot, a large language model-powered plugin designed
to enhance the usability and efficiency of ComfyUI, an open-source platform for
AI-driven art creation. Despite its flexibility and user-friendly interface,
ComfyUI can present challenges to newcomers, including limited documentation,
model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot
addresses these challenges by offering intelligent node and model
recommendations, along with automated one-click workflow construction. At its
core, the system employs a hierarchical multi-agent framework comprising a
central assistant agent for task delegation and specialized worker agents for
different usages, supported by our curated ComfyUI knowledge bases to
streamline debugging and deployment. We validate the effectiveness of
ComfyUI-Copilot through both offline quantitative evaluations and online user
feedback, showing that it accurately recommends nodes and accelerates workflow
development. Additionally, use cases illustrate that ComfyUI-Copilot lowers
entry barriers for beginners and enhances workflow efficiency for experienced
users. The ComfyUI-Copilot installation package and a demo video are available
at https://github.com/AIDC-AI/ComfyUI-Copilot.
comment: ACL 2025 Demo. Github: https://github.com/AIDC-AI/ComfyUI-Copilot
☆ SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View
Despite the great potential of large language models (LLMs) in machine
comprehension, it remains risky to rely fully on them in real-world
scenarios. This is probably because there is no clear explanation of whether
the comprehension process of LLMs is aligned with that of experts. In
this paper, we propose SCOP to carefully examine how LLMs perform during the
comprehension process from a cognitive view. Specifically, it is equipped with
a systematic definition of five requisite skills during the comprehension
process, a strict framework to construct testing data for these skills, and a
detailed analysis of advanced open-sourced and closed-sourced LLMs using the
testing data. With SCOP, we find that it is still challenging for LLMs to
perform an expert-level comprehension process. Even so, we notice that LLMs
share some similarities with experts, e.g., performing better at comprehending
local information than global information. Further analysis reveals that LLMs
can be somewhat unreliable -- they might reach correct answers through flawed
comprehension processes. Based on SCOP, we suggest that one direction for
improving LLMs is to focus more on the comprehension process, ensuring all
comprehension skills are thoroughly developed during training.
comment: arXiv admin note: text overlap with arXiv:2004.14535 by other authors
☆ Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings ACL 2025
Yubo Ma, Jinsong Li, Yuhang Zang, Xiaobao Wu, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Jiaqi Wang, Yixin Cao, Aixin Sun
Despite the strong performance of ColPali/ColQwen2 in Visualized Document
Retrieval (VDR), it encodes each page into multiple patch-level embeddings and
leads to excessive memory usage. This empirical study investigates methods to
reduce the number of patch embeddings per page with minimal performance degradation. We
evaluate two token-reduction strategies: token pruning and token merging.
Regarding token pruning, we surprisingly observe that a simple random strategy
outperforms other sophisticated pruning methods, though still far from
satisfactory. Further analysis reveals that pruning is inherently unsuitable
for VDR as it requires removing certain page embeddings without query-specific
information. Turning to token merging (more suitable for VDR), we search for
the optimal combination of merging strategies across three dimensions and
develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance
with only 11.8% of original memory usage, and preserves 94.6% effectiveness at
2.8% memory footprint. We expect our empirical findings and resulting
Light-ColPali/ColQwen2 offer valuable insights and establish a competitive
baseline for future research towards efficient VDR.
comment: Accepted by ACL 2025 findings
☆ Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering
Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
Fine-tuning pretrained ASR models for specific domains is challenging when
labeled data is scarce. However, unlabeled audio and labeled data from related
domains are often available. We propose an incremental semi-supervised learning
pipeline that first integrates a small in-domain labeled set and an auxiliary
dataset from a closely related domain, achieving a relative improvement of 4%
over no auxiliary data. Filtering based on multi-model consensus or named
entity recognition (NER) is then applied to select and iteratively refine
pseudo-labels, showing slower performance saturation compared to random
selection. Evaluated on the multi-domain Wow call center and Fisher English
corpora, it outperforms single-step fine-tuning. Consensus-based filtering
outperforms other methods, providing up to 22.3% relative improvement on Wow
and 24.8% on Fisher over single-step fine-tuning with random selection. NER is
the second-best filter, providing competitive performance at a lower
computational cost.
comment: Accepted at Interspeech 2025, Netherlands
☆ From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation
This paper presents a comprehensive evaluation of the performance of
state-of-the-art Large Language Models (LLMs) on challenging university-level
algorithms exams. By testing multiple models on both a Romanian exam and its
high-quality English translation, we analyze LLMs' problem-solving
capabilities, consistency, and multilingual performance. Our empirical study
reveals that the most recent models not only achieve scores comparable to
top-performing students but also demonstrate robust reasoning skills on
complex, multi-step algorithmic challenges, even though difficulties remain
with graph-based tasks. Building on these findings, we explore the potential of
LLMs to support educational environments through the generation of high-quality
editorial content, offering instructors a powerful tool to enhance student
feedback. The insights and best practices discussed herein pave the way for
further integration of generative AI in advanced algorithm education.
comment: 15 pages Pre-print Paper accepted to ITS 2025
☆ ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT ACL 2025
Neural Machine Translation (NMT) has improved translation by using
Transformer-based models, but it still struggles with word ambiguity and
context. This problem is especially important in domain-specific applications,
which often have problems with unclear sentences or poor data quality. Our
research explores how adding contextual information to models can improve
translations in the context of e-commerce data. To this end, we create ConECT -- a new
Czech-to-Polish e-commerce product translation dataset coupled with images and
product metadata consisting of 11,400 sentence pairs. We then investigate and
compare different methods that are applicable to context-aware translation. We
test a vision-language model (VLM), finding that visual context aids
translation quality. Additionally, we explore the incorporation of contextual
information into text-to-text models, such as the product's category path or
image descriptions. The results of our study demonstrate that the incorporation
of contextual information leads to an improvement in the quality of machine
translation. We make the new dataset publicly available.
comment: Accepted at ACL 2025 (The 63rd Annual Meeting of the Association for
Computational Linguistics)
☆ Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback
Large language models (LLMs) have demonstrated the ability to generate
formative feedback and instructional hints in English, making them increasingly
relevant for AI-assisted education. However, their ability to provide effective
instructional support across different languages, especially for mathematically
grounded reasoning tasks, remains largely unexamined. In this work, we present
the first large-scale simulation of multilingual tutor-student interactions
using LLMs. A stronger model plays the role of the tutor, generating feedback
in the form of hints, while a weaker model simulates the student. We explore
352 experimental settings across 11 typologically diverse languages, four
state-of-the-art LLMs, and multiple prompting strategies to assess whether
language-specific feedback leads to measurable learning gains. Our study
examines how student input language, teacher feedback language, model choice,
and language resource level jointly influence performance. Results show that
multilingual hints can significantly improve learning outcomes, particularly in
low-resource languages when feedback is aligned with the student's native
language. These findings offer practical insights for developing multilingual,
LLM-based educational tools that are both effective and inclusive.
comment: Preprint, in submission
☆ A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic
An effective approach to the development of ASR systems for low-resource
languages is to fine-tune an existing multilingual end-to-end model. When the
original model has been trained on large quantities of data from many
languages, fine-tuning can be effective with limited training data, even when
the language in question was not present in the original training data. The
fine-tuning approach has been encouraged by the availability of public-domain
E2E models and is widely believed to lead to state-of-the-art results. This
paper, however, challenges that belief. We show that an approach combining
hybrid HMMs with self-supervised models can yield substantially better
performance with limited training data. This combination allows better
utilisation of all available speech and text data through continued
self-supervised pre-training and semi-supervised training. We benchmark our
approach on Scottish Gaelic, achieving WER reductions of 32% relative over our
best fine-tuned Whisper model.
comment: Accepted to Interspeech 2025
☆ Dissecting Long Reasoning Models: An Empirical Study
Despite recent progress in training long-context reasoning models via
reinforcement learning (RL), several open questions and counterintuitive
behaviors remain. This work focuses on three key aspects: (1) We systematically
analyze the roles of positive and negative samples in RL, revealing that
positive samples mainly facilitate data fitting, whereas negative samples
significantly enhance generalization and robustness. Interestingly, training
solely on negative samples can rival standard RL training performance. (2) We
identify substantial data inefficiency in group relative policy optimization,
where over half of the samples yield zero advantage. To address this, we
explore two straightforward strategies, including relative length rewards and
offline sample injection, to better leverage these data and enhance reasoning
efficiency and capability. (3) We investigate unstable performance across
various reasoning models and benchmarks, attributing instability to uncertain
problems with ambiguous outcomes, and demonstrate that multiple evaluation runs
mitigate this issue.
comment: Work in progress
☆ When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
The honesty of large language models (LLMs) is a critical alignment
challenge, especially as advanced systems with chain-of-thought (CoT) reasoning
may strategically deceive humans. Unlike traditional honesty issues in LLMs,
which can often be attributed to some form of hallucination, these models'
explicit thought paths enable us to study strategic deception--goal-driven,
intentional misinformation where reasoning contradicts outputs. Using
representation engineering, we systematically induce, detect, and control such
deception in CoT-enabled LLMs, extracting "deception vectors" via Linear
Artificial Tomography (LAT) for 89% detection accuracy. Through activation
steering, we achieve a 40% success rate in eliciting context-appropriate
deception without explicit prompts, unveiling the specific honesty-related
issue of reasoning models and providing tools for trustworthy AI alignment.
☆ Verbose ListOps (VLO): Beyond Long Context -- Unmasking LLM's Reasoning Blind Spots
Large Language Models (LLMs), whilst great at extracting facts from text,
struggle with nested narrative reasoning. Existing long context and multi-hop
QA benchmarks inadequately test this, lacking realistic distractors or failing
to decouple context length from reasoning complexity, masking a fundamental LLM
limitation. We introduce Verbose ListOps, a novel benchmark that
programmatically transposes ListOps computations into lengthy, coherent
stories. This uniquely forces internal computation and state management of
nested reasoning problems by withholding intermediate results, and offers
fine-grained controls for both narrative size \emph{and} reasoning difficulty.
Whilst benchmarks like LongReason (2025) advance approaches for synthetically
expanding the context size of multi-hop QA problems, Verbose ListOps pinpoints
a specific LLM vulnerability: difficulty in state management for nested
sub-reasoning amongst semantically-relevant, distracting narrative. Our
experiments show that leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) collapse
in performance on Verbose ListOps at modest (~10k token) narrative lengths,
despite effortlessly solving raw ListOps equations. Addressing this failure is
paramount for real-world text interpretation which requires identifying key
reasoning points, tracking conceptual intermediate results, and filtering
irrelevant information. Verbose ListOps, and its extensible generation
framework thus enables targeted reasoning enhancements beyond mere
context-window expansion; a critical step to automating the world's knowledge
work.
☆ ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests
With the significant progress of large reasoning models in complex coding and
reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are
insufficient to evaluate the coding capabilities of large language models
(LLMs) in real competition environments. Moreover, current evaluation metrics
such as Pass@K fail to capture the reflective abilities of reasoning models. To
address these challenges, we propose \textbf{ICPC-Eval}, a top-level
competitive coding benchmark designed to probe the frontiers of LLM
reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent
ICPC contests held in various regions of the world, offering three key
contributions: 1) A challenging realistic ICPC competition scenario, featuring
a problem type and difficulty distribution consistent with actual contests. 2)
A robust test case generation method and a corresponding local evaluation
toolkit, enabling efficient and accurate local evaluation. 3) An effective
test-time scaling evaluation metric, Refine@K, which allows iterative repair of
solutions based on execution feedback. The results underscore the significant
challenge in evaluating complex reasoning abilities: top-tier reasoning models
like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their
in-context reasoning potential when compared to non-reasoning counterparts.
Furthermore, despite recent advancements in code generation, these models still
lag behind top-performing human teams. We release the benchmark at:
https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
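Refine@K is only described at a high level above; a plausible sketch of such an iterative-repair evaluation loop is shown below, where model.generate and run_tests are hypothetical helpers standing in for whatever generation and local-evaluation tooling is actually used.

```python
def refine_at_k(model, problem, k):
    """Count a problem as solved if any of up to k attempts passes the
    test cases, where each attempt after the first may use execution
    feedback from the previous one. All helpers here are hypothetical."""
    feedback = None
    for attempt in range(k):
        code = model.generate(problem, feedback=feedback)   # hypothetical API
        passed, feedback = run_tests(code, problem.tests)    # hypothetical API
        if passed:
            return True, attempt + 1  # solved within attempt + 1 tries
    return False, k
```

Unlike Pass@K, this style of metric rewards a model's ability to reflect on execution feedback rather than only its first-shot samples.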
☆ Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies
Universal Dependencies (UD), while widely regarded as the most successful
linguistic framework for cross-lingual syntactic representation, remains
underexplored in terms of its effectiveness. This paper addresses this gap by
integrating UD into pretrained language models and assessing whether UD can improve
their performance on a cross-lingual adversarial paraphrase identification
task. Experimental results show that incorporation of UD yields significant
improvements in accuracy and $F_1$ scores, with average gains of 3.85\% and
6.08\% respectively. These enhancements reduce the performance gap between
pretrained models and large language models in some language pairs, and even
outperform the latter in some others. Furthermore, the UD-based similarity
score between a given language and English is positively correlated with the
performance of models in that language. Both findings highlight the validity
and potential of UD in out-of-domain tasks.
☆ Prompting LLMs: Length Control for Isometric Machine Translation
In this study, we explore the effectiveness of isometric machine translation
across multiple language pairs (En$\to$De, En$\to$Fr, and En$\to$Es) under the
conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source
large language models (LLMs) of varying sizes, we investigate how different
prompting strategies, varying numbers of few-shot examples, and demonstration
selection influence translation quality and length control. We discover that
the phrasing of instructions, when aligned with the properties of the provided
demonstrations, plays a crucial role in controlling the output length. Our
experiments show that LLMs tend to produce shorter translations only when
presented with extreme examples, while isometric demonstrations often lead to
the models disregarding length constraints. While few-shot prompting generally
enhances translation quality, further improvements are marginal across 5, 10,
and 20-shot settings. Finally, considering multiple outputs allows us to notably
improve the overall trade-off between length and quality, yielding
state-of-the-art performance for some language pairs.
comment: Accepted to IWSLT 2025
☆ Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights
Integrating Artificial Intelligence (AI) in educational settings has brought
new learning approaches, transforming the practices of both students and
educators. Among the various technologies driving this transformation, Large
Language Models (LLMs) have emerged as powerful tools for creating educational
materials and question answering, but there is still space for new
applications. Educators commonly use Multiple-Choice Questions (MCQs) to assess
student knowledge, but manually generating these questions is
resource-intensive and requires significant time and cognitive effort. In our
opinion, LLMs offer a promising solution to these challenges. This paper
presents a novel comparative analysis of three widely known LLMs - Llama 2,
Mistral, and GPT-3.5 - to explore their potential for creating informative and
challenging MCQs. In our approach, we do not rely on the knowledge of the LLM,
but instead inject the knowledge into the prompt to counteract hallucinations,
also giving educators control over the test's source text. Our experiment
involving 21 educators shows that GPT-3.5 generates the most effective MCQs
across several known metrics. Additionally, it shows that there is still some
reluctance to adopt AI in the educational field. This study sheds light on the
potential of LLMs to generate MCQs and improve the educational experience,
providing valuable insights for the future.
comment: Copyright ACM 2024. This is the author's version of the work. It is
posted here for your personal use. Not for redistribution. The definitive
Version of Record was published in Adjunct Proceedings of the 32nd ACM
Conference on User Modeling, Adaptation and Personalization (UMAP Adjunct
'24), http://dx.doi.org/10.1145/3631700.3665233
☆ MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines ACL 2025
In simultaneous interpreting, an interpreter renders a source speech into
another language with a very short lag, much sooner than sentences are
finished. In order to understand and later reproduce this dynamic and complex
task automatically, we need dedicated datasets and tools for analysis,
monitoring, and evaluation, such as parallel speech corpora, and tools for
their automatic annotation. Existing parallel corpora of translated texts and
associated alignment algorithms hardly fill this gap, as they fail to model
long-range interactions between speech segments or specific types of
divergences (e.g., shortening, simplification, functional generalization)
between the original and interpreted speeches. In this work, we introduce
MockConf, a student interpreting dataset that was collected from Mock
Conferences run as part of the students' curriculum. This dataset contains 7
hours of recordings in 5 European languages, transcribed and aligned at the
level of spans and words. We further implement and release InterAlign, a modern
web-based annotation tool for parallel word and span annotations on long
inputs, suitable for aligning simultaneous interpreting. We propose metrics for
the evaluation and a baseline for automatic alignment. Dataset and tools are
released to the community.
comment: Accepted to ACL 2025 Main Conference
☆ Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models
Large Reasoning Models (LRMs) extend large language models with explicit,
multi-step reasoning traces to enhance transparency and performance on complex
tasks. However, these reasoning traces can be redundant or logically
inconsistent, making them a new source of hallucination that is difficult to
detect. Existing hallucination detection methods focus primarily on
answer-level uncertainty and often fail to detect hallucinations or logical
inconsistencies arising from the model's reasoning trace. This oversight is
particularly problematic for LRMs, where the explicit thinking trace is not
only an important support to the model's decision-making process but also a key
source of potential hallucination. To this end, we propose RACE (Reasoning and
Answer Consistency Evaluation), a novel framework specifically tailored for
hallucination detection in LRMs. RACE operates by extracting essential
reasoning steps and computing four diagnostic signals: inter-sample consistency
of reasoning traces, entropy-based answer uncertainty, semantic alignment
between reasoning and answers, and internal coherence of reasoning. This joint
analysis enables fine-grained hallucination detection even when the final
answer appears correct. Experiments across datasets and different LLMs
demonstrate that RACE outperforms existing hallucination detection baselines,
offering a robust and generalizable solution for evaluating LRMs. Our code is
available at: https://github.com/bebr2/RACE.
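Of the four diagnostic signals, the entropy-based answer uncertainty is the most standard; a minimal sketch of that one signal is given below, assuming answers have already been sampled and normalized into comparable strings. The other three signals and their exact aggregation are specific to the paper and not reproduced here.

```python
import math
from collections import Counter

def answer_entropy(sampled_answers):
    """Shannon entropy over the empirical distribution of sampled final
    answers. Higher entropy suggests greater answer-level uncertainty."""
    counts = Counter(sampled_answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Example: consistent answers give low entropy, scattered answers high entropy.
print(answer_entropy(["42", "42", "42", "42"]))  # 0.0
print(answer_entropy(["42", "7", "13", "42"]))   # > 0
```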
☆ From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs
Healthcare systems face significant challenges in managing and interpreting
vast, heterogeneous patient data for personalized care. Existing approaches
often focus on narrow use cases with a limited feature space, overlooking the
complex, longitudinal interactions needed for a holistic understanding of
patient health. In this work, we propose a novel approach to patient pathway
modeling by transforming diverse electronic health record (EHR) data into a
structured representation and designing a holistic pathway prediction model,
EHR2Path, optimized to predict future health trajectories. Further, we
introduce a novel summary mechanism that embeds long-term temporal context into
topic-specific summary tokens, improving performance over text-only models,
while being much more token-efficient. EHR2Path demonstrates strong performance
in both next time-step prediction and longitudinal simulation, outperforming
competitive baselines. It enables detailed simulations of patient trajectories,
inherently targeting diverse evaluation tasks, such as forecasting vital signs,
lab test results, or length-of-stay, opening a path towards predictive and
personalized healthcare.
☆ A Reasoning-Based Approach to Cryptic Crossword Clue Solving ICML 2025
Cryptic crossword clues are challenging language tasks for which new test
sets are released daily by major newspapers on a global basis. Each cryptic
clue contains both the definition of the answer to be placed in the crossword
grid (in common with regular crosswords), and 'wordplay' that proves that the
answer is correct (i.e. a human solver can be confident that an answer is
correct without needing crossing words as confirmation). This work describes an
LLM-based reasoning system built from open-licensed components that solves
cryptic clues by (i) hypothesising answers; (ii) proposing wordplay
explanations; and (iii) using a verifier system that operates on codified
reasoning steps. Overall, this system establishes a new state-of-the-art
performance on the challenging Cryptonite dataset of clues from The Times and
The Telegraph newspapers in the UK. Because each proved solution is expressed
in Python, interpretable wordplay reasoning for proven answers is available for
inspection.
comment: 9 page paper plus Appendices. Accepted to ICML 2025
☆ Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms
Although vision-language and large language models (VLM and LLM) offer
promising opportunities for AI-driven educational assessment, their
effectiveness in real-world classroom settings, particularly in
underrepresented educational contexts, remains underexplored. In this study, we
evaluated the performance of a state-of-the-art VLM and several LLMs on 646
handwritten exam responses from grade 4 students in six Indonesian schools,
covering two subjects: Mathematics and English. These sheets contain more than
14K student answers that span multiple choice, short answer, and essay
questions. Assessment tasks include grading these responses and generating
personalized feedback. Our findings show that the VLM often struggles to
accurately recognize student handwriting, leading to error propagation in
downstream LLM grading. Nevertheless, LLM-generated feedback retains some
utility, even when derived from imperfect input, although limitations in
personalization and contextual relevance persist.
☆ Design of intelligent proofreading system for English translation based on CNN and BERT
Since automatic translations can contain errors that require substantial
human post-editing, machine translation proofreading is essential for improving
quality. This paper proposes a novel hybrid approach for robust proofreading
that combines convolutional neural networks (CNN) with Bidirectional Encoder
Representations from Transformers (BERT). In order to extract semantic
information from phrases and expressions, CNN uses a variety of convolution
kernel filters to capture local n-gram patterns. Meanwhile, BERT creates
context-rich representations of whole sequences by utilizing stacked
bidirectional transformer encoders. Using BERT's attention mechanisms, the
integrated error detection component relates tokens to one another to spot translation
irregularities including word order problems and omissions. The correction
module then uses parallel English-German alignment and GRU decoder models in
conjunction with translation memory to propose logical modifications that
maintain original meaning. A unified end-to-end training process optimized for
post-editing performance is applied to the whole pipeline. The multi-domain
collection of WMT and the conversational dialogues of Open-Subtitles are two of
the English-German parallel corpora used to train the model. Multiple loss
functions supervise detection and correction capabilities. Experiments attain a
90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent proofreading
techniques by over 10% overall. Comparative benchmarking demonstrates
state-of-the-art performance in identifying and coherently rectifying
mistranslations and omissions.
☆ Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study
Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang
Logical reasoning is a core capability for many applications of large
language models (LLMs), yet existing benchmarks often rely solely on
final-answer accuracy, failing to capture the quality and structure of the
reasoning process. We propose FineLogic, a fine-grained evaluation framework
that assesses logical reasoning across three dimensions: overall benchmark
accuracy, stepwise soundness, and representation-level alignment. In addition,
to better understand how reasoning capabilities emerge, we conduct a
comprehensive study on the effects of supervision format during fine-tuning. We
construct four supervision styles (one natural language and three symbolic
variants) and train LLMs under each. Our findings reveal that natural language
supervision yields strong generalization even on out-of-distribution and
long-context tasks, while symbolic reasoning styles promote more structurally
sound and atomic inference chains. Further, our representation-level probing
shows that fine-tuning primarily improves reasoning behaviors through
step-by-step generation, rather than enhancing shortcut prediction or
internalized correctness. Together, our framework and analysis provide a more
rigorous and interpretable lens for evaluating and improving logical reasoning
in LLMs.
☆ Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques
The rapid progress of Multimodal Large Language Models (MLLMs) has transformed
the AI landscape. These models combine pre-trained LLMs with various modality
encoders. This integration requires a systematic understanding of how different
modalities connect to the language backbone. Our survey presents an LLM-centric
analysis of current approaches. We examine methods for transforming and
aligning diverse modal inputs into the language embedding space. This addresses
a significant gap in existing literature. We propose a classification framework
for MLLMs based on three key dimensions. First, we examine architectural
strategies for modality integration. This includes both the specific
integration mechanisms and the fusion level. Second, we categorize
representation learning techniques as either joint or coordinate
representations. Third, we analyze training paradigms, including training
strategies and objective functions. By examining 125 MLLMs developed between
2021 and 2025, we identify emerging patterns in the field. Our taxonomy
provides researchers with a structured overview of current integration
techniques. These insights aim to guide the development of more robust
multimodal integration strategies for future models built on pre-trained
foundations.
comment: 18 pages, 3 figures, 3 tables
☆ MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Speech inherently contains rich acoustic information that extends far beyond
the textual language. In real-world spoken language understanding, effective
interpretation often requires integrating semantic meaning (e.g., content),
paralinguistic features (e.g., emotions, speed, pitch) and phonological
characteristics (e.g., prosody, intonation, rhythm), which are embedded in
speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have
demonstrated remarkable capabilities in processing audio information, their
ability to perform fine-grained perception and complex reasoning in natural
speech remains largely unexplored. To address this gap, we introduce MMSU, a
comprehensive benchmark designed specifically for understanding and reasoning
in spoken language. MMSU comprises 5,000 meticulously curated
audio-question-answer triplets across 47 distinct tasks. To ground our
benchmark in linguistic theory, we systematically incorporate a wide range of
linguistic phenomena, including phonetics, prosody, rhetoric, syntactics,
semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced
SpeechLLMs, we identify substantial room for improvement in existing models,
highlighting meaningful directions for future optimization. MMSU establishes a
new standard for comprehensive assessment of spoken language understanding,
providing valuable insights for developing more sophisticated human-AI speech
interaction systems. MMSU benchmark is available at
https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available
at https://github.com/dingdongwang/MMSU_Bench.
comment: MMSU benchmark is available at
https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available
at https://github.com/dingdongwang/MMSU_Bench
☆ Fine-Grained Interpretation of Political Opinions in Large Language Models
Studies of LLMs' political opinions mainly rely on evaluations of their
open-ended responses. Recent work indicates that there is a misalignment
between LLMs' responses and their internal intentions. This motivates us to
probe LLMs' internal mechanisms and help uncover their internal political
states. Additionally, we find that the analysis of LLMs' political opinions
often relies on single-axis concepts, which can lead to concept confounds. In
this work, we extend the single axis to multiple dimensions and apply
interpretable representation engineering techniques for more transparent LLM
political concept learning. Specifically, we designed a four-dimensional
political learning framework and constructed a corresponding dataset for
fine-grained political concept vector learning. These vectors can be used to
detect and intervene in LLM internals. Experiments are conducted on eight
open-source LLMs with three representation engineering techniques. Results show
these vectors can disentangle political concept confounds. Detection tasks
validate the semantic meaning of the vectors and show good generalization and
robustness in OOD settings. Intervention Experiments show these vectors can
intervene in LLMs to generate responses with different political leanings.
☆ Identifying Reliable Evaluation Metrics for Scientific Text Revision ACL 2025
Evaluating text revision in scientific writing remains a challenge, as
traditional metrics such as ROUGE and BERTScore primarily focus on similarity
rather than capturing meaningful improvements. In this work, we analyse and
identify the limitations of these metrics and explore alternative evaluation
methods that better align with human judgments. We first conduct a manual
annotation study to assess the quality of different revisions. Then, we
investigate reference-free evaluation metrics from related NLP domains.
Additionally, we examine LLM-as-a-judge approaches, analysing their ability to
assess revisions with and without a gold reference. Our results show that LLMs
effectively assess instruction-following but struggle with correctness, while
domain-specific metrics provide complementary insights. We find that a hybrid
approach combining LLM-as-a-judge evaluation and task-specific metrics offers
the most reliable assessment of revision quality.
comment: Accepted to ACL 2025 main
☆ GOLFer: Smaller LM-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval
Large language model (LLM)-based query expansion for information retrieval
augments queries with hypothetical documents generated by LLMs. However, its
performance relies heavily on the scale of the language models (LMs),
necessitating larger, more advanced LLMs. This approach is costly,
computationally intensive, and often has limited accessibility. To address
these limitations, we introduce GOLFer - Smaller LMs-Generated Documents
Hallucination Filter & Combiner - a novel method leveraging smaller open-source
LMs for query expansion. GOLFer comprises two modules: a hallucination filter
and a documents combiner. The former detects and removes non-factual and
inconsistent sentences in generated documents, a common issue with smaller LMs,
while the latter combines the filtered content with the query using a weight
vector to balance their influence. We evaluate GOLFer alongside dominant
LLM-based query expansion methods on three web search and ten low-resource
datasets. Experimental results demonstrate that GOLFer consistently outperforms
other methods using smaller LMs, and maintains competitive performance against
methods using large-size LLMs, demonstrating its effectiveness.
☆ Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion
Large Language Models (LLMs) have shown potential in generating hypothetical
documents for query expansion, thereby enhancing information retrieval
performance. However, the efficacy of this method is highly dependent on the
quality of the generated documents, which often requires complex prompt
strategies and the integration of advanced dense retrieval techniques. This can
be both costly and computationally intensive. To mitigate these limitations, we
explore the use of zero-shot LLM-based query expansion to improve sparse
retrieval, particularly for learned sparse retrievers. We introduce a novel
fusion ranking framework, Exp4Fuse, which enhances the performance of sparse
retrievers through an indirect application of zero-shot LLM-based query
expansion. Exp4Fuse operates by simultaneously considering two retrieval
routes: one based on the original query and the other on the LLM-augmented
query. It then generates two ranked lists using a sparse retriever and fuses
them using a modified reciprocal rank fusion method. We conduct extensive
evaluations of Exp4Fuse against leading LLM-based query expansion methods and
advanced retrieval techniques on three MS MARCO-related datasets and seven
low-resource datasets. Experimental results reveal that Exp4Fuse not only
surpasses existing LLM-based query expansion methods in enhancing sparse
retrievers but also, when combined with advanced sparse retrievers, achieves
SOTA results on several benchmarks. This highlights the superior performance
and effectiveness of Exp4Fuse in improving query expansion for sparse
retrieval.
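The fusion step builds on reciprocal rank fusion (RRF); the sketch below shows only the standard formulation over the two retrieval routes, since the paper's modified variant is not detailed here.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document ids with the vanilla RRF score
    sum(1 / (k + rank)). Exp4Fuse uses a modified variant; this sketch
    illustrates only the standard method."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two routes: sparse retrieval with the original query and with the
# LLM-expanded query (document ids here are illustrative).
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```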
☆ Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection
The widespread dissemination of fake news on social media has significantly
impacted society, resulting in serious consequences. Conventional deep learning
methodologies employing small language models (SLMs) suffer from extensive
supervised training requirements and difficulties adapting to evolving news
environments due to data scarcity and distribution shifts. Large language
models (LLMs), despite robust zero-shot capabilities, fall short in accurately
detecting fake news owing to outdated knowledge and the absence of suitable
demonstrations. In this paper, we propose a novel Continuous Collaborative
Emergent Fake News Detection (C$^2$EFND) framework to address these challenges.
The C$^2$EFND framework strategically leverages both LLMs' generalization power
and SLMs' classification expertise via a multi-round collaborative learning
framework. We further introduce a lifelong knowledge editing module based on a
Mixture-of-Experts architecture to incrementally update LLMs and a replay-based
continual learning method to ensure SLMs retain prior knowledge without
retraining entirely. Extensive experiments on the Pheme and Twitter16 datasets
demonstrate that C$^2$EFND significantly outperforms existing methods,
effectively improving detection accuracy and adaptability in continuous
emergent fake news scenarios.
☆ Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design
Lin Sun, Weihong Lin, Jinzhu Wu, Yongfu Zhu, Xiaoqi Jian, Guangxiang Zhao, Change Jia, Linglin Zhang, Sai-er Hu, Yuhan Wu, Xiangzheng Zhang
Reasoning models represented by the Deepseek-R1-Distill series have been
widely adopted by the open-source community due to their strong performance in
mathematics, science, programming, and other domains. However, our study
reveals that their benchmark evaluation results are subject to significant
fluctuations caused by various factors. Subtle differences in evaluation
conditions can lead to substantial variations in results. Similar phenomena are
observed in other open-source inference models fine-tuned based on the
Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their
claimed performance improvements difficult to reproduce reliably. Therefore, we
advocate for the establishment of a more rigorous paradigm for model
performance evaluation and present our empirical assessments of the
Deepseek-R1-Distill series models.
☆ SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat
We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs
through competition and combat. To complement a single model's lack of
diversity in generation and biases in evaluation, multiple LLMs form a "sparta
tribe" to compete against each other in fulfilling instructions while serving
as judges for the competition of others. For each iteration, one instruction
and two models are selected for a duel, the other models evaluate the two
responses, and their evaluation scores are aggregated through an adapted
Elo-ranking-based reputation system, where winners/losers of combat gain/lose
weight in evaluating others. The peer-evaluated combat results then become
preference pairs where the winning response is preferred over the losing one,
and all models learn from these preferences at the end of each iteration.
SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative
and collective competition process. Extensive experiments demonstrate that
SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines
across 10 out of 12 tasks and datasets with 7.0% average improvement. Further
analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen
tasks and leverages the expertise diversity of participating models to produce
more logical, direct and informative outputs.
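A minimal sketch of the Elo-style reputation update implied by the description is given below, with a hypothetical k_factor; the paper's adapted scheme, and how reputation maps to judging weight, may differ.

```python
def elo_update(rating_winner, rating_loser, k_factor=32):
    """Standard Elo update: the winner gains and the loser loses rating,
    scaled by how surprising the outcome was."""
    expected_win = 1.0 / (1.0 + 10 ** ((rating_loser - rating_winner) / 400))
    delta = k_factor * (1.0 - expected_win)
    return rating_winner + delta, rating_loser - delta

# One plausible (assumed) mapping from reputation to judging weight:
# weight_i = rating_i / sum(ratings over the "sparta tribe").
```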
☆ IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation
This paper presents the submission of IIITH-BUT to the IWSLT 2025 shared task
on speech translation for the low-resource Bhojpuri-Hindi language pair. We
explored the impact of hyperparameter optimisation and data augmentation
techniques on the performance of the SeamlessM4T model fine-tuned for this
specific task. We systematically investigated a range of hyperparameters
including learning rate schedules, number of update steps, warm-up steps, label
smoothing, and batch sizes; and report their effect on translation quality. To
address data scarcity, we applied speed perturbation and SpecAugment and
studied their effect on translation quality. We also examined the use of
cross-lingual signal through joint training with Marathi and Bhojpuri speech
data. Our experiments reveal that careful selection of hyperparameters and the
application of simple yet effective augmentation techniques significantly
improve performance in low-resource settings. We also analysed the translation
hypotheses to understand various kinds of errors that impacted the translation
quality in terms of BLEU.
comment: Paper is accepted to IWSLT2025
☆ LLM-based phoneme-to-grapheme for phoneme-based speech recognition
In automatic speech recognition (ASR), phoneme-based multilingual
pre-training and crosslingual fine-tuning is attractive for its high data
efficiency and competitive results compared to subword-based models. However,
Weighted Finite State Transducer (WFST) based decoding is limited by its
complex pipeline and inability to leverage large language models (LLMs).
Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for
phoneme-based ASR, consisting of speech-to-phoneme (S2P) and
phoneme-to-grapheme (P2G). A challenge is that there appears to be information
loss when cascading S2P and P2G. To address this challenge, we propose two
training strategies: data augmentation with noisy phonemes (DANP), and
randomized top-$K$ marginalized (TKM) training and decoding. Our experimental
results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR
for Polish and German, by relative WER reductions of 3.6% and 6.9%
respectively.
comment: Interspeech 2025
☆ Accelerated Test-Time Scaling with Model-Free Speculative Sampling
Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati
Language models have demonstrated remarkable capabilities in reasoning tasks
through test-time scaling techniques like best-of-N sampling and tree search.
However, these approaches often demand substantial computational resources,
creating a critical trade-off between performance and efficiency. We introduce
STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative
decoding approach that leverages the inherent redundancy in reasoning
trajectories to achieve significant acceleration without compromising accuracy.
Our analysis reveals that reasoning paths frequently reuse similar reasoning
patterns, enabling efficient model-free token prediction without requiring
separate draft models. By introducing stochastic drafting and preserving
probabilistic information through a memory-efficient logit-based N-gram module,
combined with optimized Gumbel-Top-K sampling and data-driven tree
construction, STAND significantly improves token acceptance rates. Extensive
evaluations across multiple models and reasoning tasks (AIME-2024,
GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference
latency by 60-65% compared to standard autoregressive decoding while
maintaining accuracy. Furthermore, STAND outperforms state-of-the-art
speculative decoding methods by 14-28% in throughput and shows strong
performance even in single-trajectory scenarios, reducing inference latency by
48-58%. As a model-free approach, STAND can be applied to any existing language
model without additional training, being a powerful plug-and-play solution for
accelerating language model reasoning.
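A minimal sketch of the core model-free N-gram drafting idea follows; the stochastic Gumbel-Top-K sampling, logit preservation, and tree construction that distinguish STAND are omitted, and the class below is an illustrative assumption rather than the authors' implementation.

```python
from collections import defaultdict, Counter

class NGramDrafter:
    """Propose draft tokens from n-gram statistics of previously generated
    text, so no separate draft model is needed; the target model verifies
    the drafted tokens in a single forward pass."""

    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(Counter)

    def update(self, tokens):
        # Record which token followed each n-token context.
        for i in range(len(tokens) - self.n):
            ctx = tuple(tokens[i:i + self.n])
            self.table[ctx][tokens[i + self.n]] += 1

    def draft(self, tokens, num_draft=4):
        drafts = []
        ctx = tuple(tokens[-self.n:])
        for _ in range(num_draft):
            if ctx not in self.table:
                break
            nxt = self.table[ctx].most_common(1)[0][0]  # greedy; STAND drafts stochastically
            drafts.append(nxt)
            ctx = ctx[1:] + (nxt,)
        return drafts
```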
☆ Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification
The internet has become a hotspot for hate speech (HS), threatening societal
harmony and individual well-being. While automatic detection methods perform
well in identifying explicit hate speech (ex-HS), they struggle with more
subtle forms, such as implicit hate speech (im-HS). We tackle this problem by
introducing a new taxonomy for im-HS detection, defining six encoding
strategies named codetypes. We present two methods for integrating codetypes
into im-HS detection: 1) prompting large language models (LLMs) directly to
classify sentences based on generated responses, and 2) using LLMs as encoders
with codetypes embedded during the encoding process. Experiments show that the
use of codetypes improves im-HS detection in both Chinese and English datasets,
validating the effectiveness of our approach across different languages.
comment: Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025),
112-126
☆ Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Scaling laws predict that the performance of large language models improves
with increasing model size and data size. In practice, pre-training has been
relying on massive web crawls, using almost all data sources publicly available
on the internet so far. However, this pool of natural data does not grow at the
same rate as the compute supply. Furthermore, the availability of high-quality
texts is even more limited: data filtering pipelines often remove up to 99% of
the initial web scrapes to achieve state-of-the-art performance. To address the "data wall"
of pre-training scaling, our work explores ways to transform and recycle data
discarded in existing filtering processes. We propose REWIRE, REcycling the Web
with guIded REwrite, a method to enrich low-quality documents so that they
could become useful for training. This in turn allows us to increase the
representation of synthetic data in the final pre-training set. Experiments at
1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw
texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5
percentage points respectively across 22 diverse tasks, compared to training on only
filtered web data. Training on the raw-synthetic data mix is also more
effective than having access to 2x web data. Through further analysis, we
demonstrate that about 82% of the mixed in texts come from transforming
lower-quality documents that would otherwise be discarded. REWIRE also
outperforms related approaches of generating synthetic data, including
Wikipedia-style paraphrasing, question-answer synthesizing and knowledge
extraction. These results suggest that recycling web texts holds the potential
for being a simple and effective approach for scaling pre-training data.
☆ MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models ACL
This paper introduces MMRefine, a MultiModal Refinement benchmark designed to
evaluate the error refinement capabilities of Multimodal Large Language Models
(MLLMs). As the emphasis shifts toward enhancing reasoning during inference,
MMRefine provides a framework that evaluates MLLMs' abilities to detect and
correct errors across six distinct scenarios beyond just comparing final
accuracy before and after refinement. Furthermore, the benchmark analyzes the
refinement performance by categorizing errors into six error types. Experiments
with various open and closed MLLMs reveal bottlenecks and factors impeding
refinement performance, highlighting areas for improvement in effective
reasoning enhancement. Our code and dataset are publicly available at
https://github.com/naver-ai/MMRefine.
comment: ACL Findings 2025
☆ Urania: Differentially Private Insights into AI Use
Daogao Liu, Edith Cohen, Badih Ghazi, Peter Kairouz, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Adam Sealfon, Da Yu, Chiyuan Zhang
We introduce $Urania$, a novel framework for generating insights about LLM
chatbot interactions with rigorous differential privacy (DP) guarantees. The
framework employs a private clustering mechanism and innovative keyword
extraction methods, including frequency-based, TF-IDF-based, and LLM-guided
approaches. By leveraging DP tools such as clustering, partition selection, and
histogram-based summarization, $Urania$ provides end-to-end privacy protection.
Our evaluation assesses lexical and semantic content preservation, pair
similarity, and LLM-based metrics, benchmarking against a non-private
Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple
empirical privacy evaluation that demonstrates the enhanced robustness of our
DP pipeline. The results show the framework's ability to extract meaningful
conversational insights while maintaining stringent user privacy, effectively
balancing data utility with privacy preservation.
☆ Normative Conflicts and Shallow AI Alignment
The progress of AI systems such as large language models (LLMs) raises
increasingly pressing concerns about their safe deployment. This paper examines
the value alignment problem for LLMs, arguing that current alignment strategies
are fundamentally inadequate to prevent misuse. Despite ongoing efforts to
instill norms such as helpfulness, honesty, and harmlessness in LLMs through
fine-tuning based on human preferences, they remain vulnerable to adversarial
attacks that exploit conflicts between these norms. I argue that this
vulnerability reflects a fundamental limitation of existing alignment methods:
they reinforce shallow behavioral dispositions rather than endowing LLMs with a
genuine capacity for normative deliberation. Drawing on research in moral
psychology, I show how humans' ability to engage in deliberative reasoning
enhances their resilience against similar adversarial tactics. LLMs, by
contrast, lack a robust capacity to detect and rationally resolve normative
conflicts, leaving them susceptible to manipulation; even recent advances in
reasoning-focused LLMs have not addressed this vulnerability. This ``shallow
alignment'' problem carries significant implications for AI safety and
regulation, suggesting that current approaches are insufficient for mitigating
potential harms posed by increasingly capable AI systems.
comment: Published in Philosophical Studies
☆ EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition
Speech emotion recognition (SER) systems often exhibit gender bias. However,
the effectiveness and robustness of existing debiasing methods in multi-label
SER scenarios remain underexplored. To address this gap, we present
EMO-Debias, a large-scale comparison of 13 debiasing methods applied to
multi-label SER. Our study encompasses techniques from pre-processing,
regularization, adversarial learning, biased learners, and distributionally
robust optimization. Experiments conducted on acted and naturalistic emotion
datasets, using WavLM and XLSR representations, evaluate each method under
conditions of gender imbalance. Our analysis quantifies the trade-offs between
fairness and accuracy, identifying which approaches consistently reduce gender
performance gaps without compromising overall model performance. The findings
provide actionable insights for selecting effective debiasing strategies and
highlight the impact of dataset distributions.
comment: 8 pages
☆ Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents
Real-world planning problems require constant adaptation to changing
requirements and balancing of competing constraints. However, current
benchmarks for evaluating LLMs' planning capabilities primarily focus on
static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark
that evaluates language models' ability to reason flexibly in dynamic planning
scenarios. Building on the TravelPlanner dataset~\citep{xie2024travelplanner},
we introduce two novel evaluation settings: (1) sequential constraint
introduction across multiple turns, and (2) scenarios with explicitly
prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B
reveals several key findings: models' performance on single-turn tasks poorly
predicts their ability to adapt plans across multiple turns; constraint
introduction order significantly affects performance; and models struggle with
constraint prioritization, often incorrectly favoring newly introduced lower
priority preferences over existing higher-priority constraints. These findings
highlight the importance of evaluating LLMs in more realistic, dynamic planning
scenarios and suggest specific directions for improving model performance on
complex planning tasks. The code and dataset for our framework are publicly
available at https://github.com/juhyunohh/FlexTravelBench.
☆ TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering ACL-2025
The key-value (KV) cache in transformer models is a critical component for
efficient decoding or inference, yet its memory demands scale poorly with
sequence length, posing a major challenge for scalable deployment of large
language models. Among several approaches to KV cache compression, quantization
of key and value activations has been widely explored. Most KV cache
quantization methods still need to manage sparse and noncontiguous outliers
separately. To address this, we introduce TaDA, a training-free recipe for KV
cache compression that adapts quantization precision to per-layer error
sensitivity and applies mean-centering to eliminate separate outlier handling. Our
approach yields substantial accuracy improvements for multiple models
supporting various context lengths. Moreover, our approach does not need to
separately manage outlier elements -- a persistent hurdle in most traditional
quantization methods. Experiments on standard benchmarks demonstrate that our
technique reduces KV cache memory footprint to 27% of the original 16-bit
baseline while achieving comparable accuracy. Our method paves the way for
scalable and high-performance reasoning in language models by potentially
enabling inference for longer context length models, reasoning models, and
longer chain of thoughts.
comment: ACL-2025 industry-track accepted
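To make the mean-centering idea concrete, the sketch below quantizes one head's cached keys after subtracting a per-channel mean, so that channel-wise offsets (a common source of outliers) no longer consume quantization range. It is a minimal NumPy illustration, not the paper's recipe; TaDA additionally adapts the bit precision per layer, which is omitted here, and the shapes and bit width are placeholders.

```python
import numpy as np

def quantize_mean_centered(x, n_bits=4):
    """Per-channel mean-centering followed by symmetric low-bit quantization.

    x: (seq_len, head_dim) array, e.g. one attention head's cached keys.
    Returns integer codes plus the statistics needed for dequantization.
    """
    mean = x.mean(axis=0, keepdims=True)          # per-channel offset
    centered = x - mean                           # removes channel-wise bias that often drives outliers
    scale = np.abs(centered).max(axis=0, keepdims=True) / (2 ** (n_bits - 1) - 1)
    scale = np.where(scale == 0, 1.0, scale)      # guard against constant channels
    q = np.round(centered / scale).astype(np.int8)
    return q, scale, mean

def dequantize(q, scale, mean):
    return q * scale + mean

# toy check: a channel with a large systematic offset is handled without special casing
keys = np.random.randn(128, 64).astype(np.float32)
keys[:, 3] += 8.0
q, scale, mean = quantize_mean_centered(keys, n_bits=4)
print(np.abs(dequantize(q, scale, mean) - keys).mean())
```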
☆ ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition
Audio-Visual Speech Recognition (AVSR) has gained significant attention
recently due to its robustness against noise, which often challenges
conventional speech recognition systems that rely solely on audio features.
Despite this advantage, AVSR models remain limited by the scarcity of extensive
datasets, especially for most languages beyond English. Automated data
collection offers a promising solution. This work presents a practical approach
to generate AVSR datasets from raw video, refining existing techniques for
improved efficiency and accessibility. We demonstrate its broad applicability
by developing a baseline AVSR model for Vietnamese. Experiments show the
automatically collected dataset enables a strong baseline, achieving
performance competitive with robust ASR in clean conditions and significantly
outperforming it in noisy environments such as cocktail parties. This efficient
method provides a pathway to expand AVSR to more languages, particularly
under-resourced ones.
comment: Accepted at Interspeech 2025
♻ ☆ AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Despite recent progress in large-scale reinforcement learning (RL) for
reasoning, the training recipe for building high-performing reasoning models
remains elusive. Key implementation details of frontier models, such as
DeepSeek-R1, including data curation strategies and RL training recipe, are
often omitted. Moreover, recent research indicates distillation remains more
effective than RL for smaller models. In this work, we demonstrate that
large-scale RL can significantly enhance the reasoning capabilities of strong,
small- and mid-sized models, achieving results that surpass those of
state-of-the-art distillation-based models. We systematically study the RL
training process through extensive ablations and propose a simple yet effective
approach: first training on math-only prompts, then on code-only prompts.
Notably, we find that math-only RL not only significantly enhances the
performance of strong distilled models on math benchmarks (e.g., +14.6% /
+17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks
(e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition,
extended code-only RL iterations further improve performance on code benchmarks
with minimal or no degradation in math results. We develop a robust data
curation pipeline to collect challenging prompts with high-quality, verifiable
answers and test cases to enable verification-based RL across both domains.
Finally, we identify key experimental insights, including curriculum learning
with progressively increasing response lengths and the stabilizing effect of
on-policy parameter updates. We find that RL not only elicits the foundational
reasoning capabilities acquired during pretraining and supervised fine-tuning
(e.g., distillation), but also pushes the limits of the model's reasoning
ability, enabling it to solve problems that were previously unsolvable.
comment: Add pass@1024 evaluation results for LiveCodeBench v6. We release the
models at:
https://huggingface.co/collections/nvidia/acereason-682f4e1261dc22f697fd1485
♻ ☆ The broader spectrum of in-context learning
The ability of language models to learn a task from a few examples in context
has generated substantial interest. Here, we provide a perspective that
situates this type of supervised few-shot learning within a much broader
spectrum of meta-learned in-context learning. Indeed, we suggest that any
distribution of sequences in which context non-trivially decreases loss on
subsequent predictions can be interpreted as eliciting a kind of in-context
learning. We suggest that this perspective helps to unify the broad set of
in-context abilities that language models exhibit -- such as adapting to tasks
from instructions or role play, or extrapolating time series. This perspective
also sheds light on potential roots of in-context learning in lower-level
processing of linguistic dependencies (e.g. coreference or parallel
structures). Finally, taking this perspective highlights the importance of
generalization, which we suggest can be studied along several dimensions: not
only the ability to learn something novel, but also flexibility in learning
from different presentations, and in applying what is learned. We discuss
broader connections to past literature in meta-learning and goal-conditioned
agents, and other perspectives on learning and adaptation. We close by
suggesting that research on in-context learning should consider this broader
spectrum of in-context capabilities and types of generalization.
♻ ☆ Is LLM the Silver Bullet to Low-Resource Languages Machine Translation?
Yewei Song, Lujun Li, Cedric Lothritz, Saad Ezzini, Lama Sleem, Niccolo Gentile, Radu State, Tegawendé F. Bissyandé, Jacques Klein
Low-Resource Languages (LRLs) present significant challenges in natural
language processing due to their limited linguistic resources and
underrepresentation in standard datasets. While recent advances in Large
Language Models (LLMs) and Neural Machine Translation have substantially
improved translation capabilities for high-resource languages, performance
disparities persist for LRLs, particularly impacting privacy-sensitive and
resource-constrained scenarios. This paper systematically evaluates current
LLMs in 200 languages using the FLORES-200 benchmark and demonstrates their
limitations in LRL translation capability. We also explore alternative data
sources, including news articles and bilingual dictionaries, and demonstrate
how knowledge distillation from large pre-trained teacher models can
significantly improve the performance of small LLMs on LRL translation tasks.
For example, this approach increases the EN->LB LLM-as-a-Judge score on
the validation set from 0.36 to 0.89 for Llama-3.2-3B. Furthermore, we examine
different fine-tuning configurations, providing practical insights on optimal
data scale, training efficiency, and the preservation of generalization
capabilities of models under study.
♻ ☆ ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL
Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Dai Qi, Jianmin Bao, Dongdong Chen, Chong Luo, Lili Qiu
Although chain-of-thought reasoning and reinforcement learning (RL) have
driven breakthroughs in NLP, their integration into generative vision models
remains underexplored. We introduce ReasonGen-R1, a two-stage framework that
first imbues an autoregressive image generator with explicit text-based
"thinking" skills via supervised fine-tuning on a newly generated reasoning
dataset of written rationales, and then refines its outputs using Group
Relative Policy Optimization. To enable the model to reason through text before
generating images, we automatically generate and release a corpus of
model-crafted rationales paired with visual prompts, enabling controlled planning of
object layouts, styles, and scene compositions. Our GRPO algorithm uses reward
signals from a pretrained vision language model to assess overall visual
quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and
the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong
baselines and prior state-of-the-art models. More: aka.ms/reasongen.
♻ ☆ GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering ACL 2025
Pre-trained Language Models (PLMs) have shown remarkable performances in
recent years, setting a new paradigm for NLP research and industry. The legal
domain has received some attention from the NLP community partly due to its
textual nature, with question answering (QA) among its representative tasks.
This work explores legal-domain Multiple-Choice QA (MCQA) for a low-resource
language. The contribution of this work is multi-fold. We first introduce
JuRO, the first openly available Romanian legal MCQA dataset, comprising three
different examinations and 10,836 questions in total. Along with this dataset,
we introduce CROL, an organized corpus of laws comprising 93 distinct documents
and their modifications across 763 time spans, which we leverage in this work
for Information Retrieval (IR) techniques. Moreover, we are the first to propose
Law-RoG, a Knowledge Graph (KG) for the Romanian language, and this KG is
derived from the aforementioned corpus. Lastly, we propose a novel approach for
MCQA, Graph Retrieval Augmented by Facts (GRAF), which achieves competitive
results with generally accepted SOTA methods and even exceeds them in most
settings.
comment: Accepted to ACL 2025 Findings
♻ ☆ From Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Current studies have exposed the risk of Large Language Models (LLMs)
generating harmful content by jailbreak attacks. However, they overlook that
the direct generation of harmful content from scratch is more difficult than
inducing an LLM to calibrate benign content into harmful forms. In our study, we
introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR)
to induce the LLM to calibrate malicious metaphors for jailbreaking.
Specifically, to answer harmful queries, AVATAR adaptively identifies a set of
benign but logically related metaphors as the initial seed. Then, driven by
these metaphors, the target LLM is induced to reason about and calibrate the
metaphorical content, and is thus jailbroken by either directly outputting
harmful responses or calibrating residuals between metaphorical and
professional harmful content. Experimental results demonstrate that AVATAR can
effectively and transferably jailbreak LLMs, achieving a state-of-the-art
attack success rate across multiple advanced LLMs.
comment: arXiv admin note: substantial text overlap with arXiv:2412.12145
♻ ☆ DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts
The proliferation of disinformation demands reliable and scalable
fact-checking solutions. We present Dynamic Evidence-based FAct-checking with
Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for
open-domain, text-image claim verification. DEFAME operates in a six-stage
process, dynamically selecting the tools and search depth to extract and
evaluate textual and visual evidence. Unlike prior approaches that are
text-only, lack explainability, or rely solely on parametric knowledge, DEFAME
performs end-to-end verification, accounting for images in claims and evidence
while generating structured, multimodal reports. Evaluation on the popular
benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all
previous methods, establishing itself as the new state-of-the-art fact-checking
system for uni- and multimodal fact-checking. Moreover, we introduce a new
multimodal benchmark, ClaimReview2024+, featuring claims after the knowledge
cutoff of GPT-4o, avoiding data leakage. Here, DEFAME drastically outperforms
the GPT-4o baselines, showing temporal generalizability and the potential for
real-time fact-checking.
♻ ☆ UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
Although existing unified models achieve strong performance in
vision-language understanding and text-to-image generation, they remain limited
in addressing image perception and manipulation -- capabilities increasingly
demanded in practical applications. Recently, OpenAI introduced the powerful
GPT-4o-Image model, which showcases advanced capabilities in comprehensive
image perception and manipulation, sparking widespread interest. Through
carefully designed experiments, we observe that GPT-4o-Image likely relies on
semantic encoders rather than VAEs for feature extraction, despite VAEs being
commonly regarded as crucial for image manipulation tasks. Inspired by this
insight, we propose UniWorld-V1, a unified generative framework built upon
semantic features extracted from powerful multimodal large language models and
contrastive semantic encoders. Using only 2.7M training samples, UniWorld-V1
achieves impressive performance across diverse tasks, including image
understanding, generation, manipulation, and perception. We fully open-source
the UniWorld-V1 framework, including model weights, training and evaluation
scripts, and datasets to promote reproducibility and further research.
♻ ☆ The Lessons of Developing Process Reward Models in Mathematical Reasoning
Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Process Reward Models (PRMs) emerge as a promising approach for process
supervision in mathematical reasoning of Large Language Models (LLMs), which
aim to identify and mitigate intermediate errors in the reasoning processes.
However, the development of effective PRMs faces significant challenges,
particularly in data annotation and evaluation methodologies. In this paper,
through extensive experiments, we demonstrate that commonly used Monte Carlo
(MC) estimation-based data synthesis for PRMs typically yields inferior
performance and generalization compared to LLM-as-a-judge and human annotation
methods. MC estimation relies on completion models to evaluate current-step
correctness, leading to inaccurate step verification. Furthermore, we identify
potential biases in conventional Best-of-N (BoN) evaluation strategies for
PRMs: (1) The unreliable policy models generate responses with correct answers
but flawed processes, leading to a misalignment between the evaluation criteria
of BoN and the PRM objectives of process verification. (2) The tolerance of
PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a
significant proportion of minimum scores concentrated on the final answer
steps, revealing the shift from process to outcome-based assessment in BoN
Optimized PRMs. To address these challenges, we develop a consensus filtering
mechanism that effectively integrates MC estimation with LLM-as-a-judge and
advocates a more comprehensive evaluation framework that combines
response-level and step-level metrics. Based on the mechanisms, we
significantly improve both model performance and data efficiency in the BoN
evaluation and the step-wise error identification task. Finally, we release a
new state-of-the-art PRM that outperforms existing open-source alternatives and
provides practical guidelines for future research in building process
supervision models.
♻ ☆ CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks
Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler
Large Language Models (LLMs) are transforming a wide range of domains, yet
verifying their outputs remains a significant challenge, especially for complex
open-ended tasks such as consolidation, summarization, and knowledge
extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable,
and accurate verification method. CE reduces each LLM answer to a single
embedding vector using powerful modern embedding models such as
SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied
on weaker encoders like BERT, forcing them to operate at token or sentence
granularity. In contrast, CE performs fast, semantically rich comparisons
directly at the whole-answer level, overcoming key limitations in both accuracy
and scalability. We conduct a comprehensive design and time complexity analysis
across 13 verification baselines, including classical text scorers (e.g.,
BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators
(e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency,
versatility, and simplicity of CE. Empirical results show that CE reliably
detects hallucinations in both closed and open-ended tasks. We further present
evidence that CE generalizes beyond text to other modalities such as vision,
establishing it as a practical and versatile verification framework.
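As a rough illustration of the whole-answer comparison, the sketch below embeds several sampled answers to the same prompt and uses their mean pairwise cosine similarity as an agreement score; low agreement would flag a possible hallucination. It substitutes a small sentence-transformers encoder for the large embedding models used by CE, and the scoring without a calibrated threshold is only illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def checkembed_agreement(answers, model_name="all-MiniLM-L6-v2"):
    """Score the mutual agreement of several LLM answers to the same prompt.

    Each whole answer is reduced to one embedding; a high mean pairwise cosine
    similarity suggests consistent answers, a low one flags possible
    hallucination. The small encoder here is a stand-in for the large
    embedding models used by CE.
    """
    model = SentenceTransformer(model_name)
    embs = model.encode(answers, normalize_embeddings=True)  # one vector per answer
    sims = embs @ embs.T                                     # cosine similarities
    mask = ~np.eye(len(answers), dtype=bool)
    return sims[mask].mean()

score = checkembed_agreement([
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Completed in 1889, the Eiffel Tower stands in Paris.",
    "The Eiffel Tower is located in Berlin.",
])
print(f"mean pairwise similarity: {score:.3f}")
```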
♻ ☆ MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration ACL 2025
In recent years, multimodal large language models (MLLMs) have made
significant progress but continue to face inherent challenges in multimodal
reasoning, which requires multi-level (e.g., perception, reasoning) and
multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior
work on estimating model confidence tends to focus on the overall response for
training and calibration, but fails to assess confidence in each reasoning
step, leading to undesirable hallucination snowballing. In this work, we
present MMBoundary, a novel framework that advances the knowledge boundary
awareness of MLLMs through reasoning step confidence calibration. To achieve
this, we propose to incorporate complementary textual and cross-modal
self-rewarding signals to estimate confidence at each step of the MLLM
reasoning process. In addition to supervised fine-tuning of the MLLM on this
set of self-rewarded confidence estimation signals for an initial
confidence-expression warm-up, we introduce a reinforcement learning stage with multiple reward
functions for further aligning model knowledge and calibrating confidence at
each reasoning step, enhancing reasoning chain self-correction. Empirical
results show that MMBoundary significantly outperforms existing methods across
diverse domain datasets and metrics, achieving an average of 7.5% reduction in
multimodal confidence calibration errors and up to 8.3% improvement in task
performance.
comment: 18 pages, ACL 2025
♻ ☆ DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models NAACL 2025
Jianyu Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, Shilong Li, Hui Huang, Jiaheng Liu, Yucheng Wang, Chenchen Jing, Xingwei Qu, Xiao Zhang, Yingshui Tan, Yanan Wu, Jihao Gu, Yangguang Li, Jianke Zhu
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to
their integration of visual and textual data, thereby introducing new
dimensions of potential attacks and complex risk combinations. In this paper,
we begin with a detailed analysis aimed at disentangling risks through
step-by-step reasoning within multimodal inputs. We find that systematic
multimodal risk disentanglement substantially enhances the risk awareness of
MLLMs. By leveraging the strong discriminative abilities of multimodal risk
disentanglement, we further introduce DREAM (Disentangling Risks to Enhance
Safety Alignment in MLLMs), a novel approach that enhances safety
alignment in MLLMs through supervised fine-tuning and iterative Reinforcement
Learning from AI Feedback (RLAIF). Experimental results show that DREAM
significantly boosts safety during both inference and training phases without
compromising performance on normal tasks (namely oversafety), achieving a
16.17% improvement in the SIUO safe&effective score compared to GPT-4V. The
data and code are available at https://github.com/Kizna1ver/DREAM.
comment: [NAACL 2025] The first four authors contribute equally, 23 pages,
repo at https://github.com/Kizna1ver/DREAM
♻ ☆ Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler
Retrieval Augmented Generation (RAG) enhances the abilities of Large Language
Models (LLMs) by enabling the retrieval of documents into the LLM context to
provide more accurate and relevant responses. Existing RAG solutions do not
focus on queries that may require fetching multiple documents with
substantially different contents. Such queries occur frequently, but are
challenging because the embeddings of these documents may be distant in the
embedding space, making it hard to retrieve them all. This paper introduces
Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a
simple yet powerful idea: leveraging activations of the Transformer's multi-head
attention layer, instead of the decoder layer, as keys for fetching
multi-aspect documents. The driving observation is that different attention
heads learn to capture different data aspects. Harnessing the corresponding
activations results in embeddings that represent various facets of data items
and queries, improving the retrieval accuracy for complex queries. We provide
an evaluation methodology and metrics, multi-aspect datasets, and real-world
use cases to demonstrate MRAG's effectiveness. We show MRAG's design advantages
over 18 RAG baselines, empirical improvements of up to 20% in retrieval success
ratios, and benefits for downstream LLM generation. MRAG can be seamlessly
integrated with existing RAG frameworks and benchmarks.
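The head-wise retrieval idea can be sketched as follows: an activation vector is treated as the concatenation of per-head sub-vectors, each document receives one similarity per head, and a simple aggregation (here a max over heads) rewards documents that match the query on any single aspect. Random vectors stand in for real attention-head activations, and MRAG's actual aggregation and voting scheme is more elaborate than this toy version.

```python
import numpy as np

def mrag_scores(query_vec, doc_vecs, n_heads):
    """Head-wise retrieval scoring.

    query_vec: (d,) activation vector viewed as the concatenation of n_heads
    sub-vectors; doc_vecs: (n_docs, d). Each head contributes its own cosine
    similarity, and a document is rewarded for matching the query on any
    single aspect (here via a max over heads).
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    q_heads = normalize(query_vec.reshape(n_heads, -1))
    d_heads = normalize(doc_vecs.reshape(len(doc_vecs), n_heads, -1))
    per_head = (d_heads * q_heads).sum(-1)          # (n_docs, n_heads)
    return per_head.max(axis=1), per_head

# random vectors stand in for real attention-head activations
rng = np.random.default_rng(0)
scores, per_head = mrag_scores(rng.normal(size=512), rng.normal(size=(1000, 512)), n_heads=8)
print("top-3 documents:", np.argsort(-scores)[:3])
```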
♻ ☆ Can Large Language Models Understand Intermediate Representations in Compilers?
Intermediate Representations (IRs) play a critical role in compiler design
and program analysis, yet their comprehension by Large Language Models (LLMs)
remains underexplored. In this paper, we present an explorative empirical study
evaluating the capabilities of six state-of-the-art LLMs: GPT-4, GPT-3,
DeepSeek, Gemma 2, Llama 3, and Code Llama, in understanding IRs. Specifically,
we assess model performance across four core tasks: control flow graph
reconstruction, decompilation, code summarization, and execution reasoning.
While LLMs exhibit competence in parsing IR syntax and identifying high-level
structures, they consistently struggle with instruction-level reasoning,
especially in control flow reasoning, loop handling, and dynamic execution.
Common failure modes include misinterpreting branching instructions, omitting
critical operations, and relying on heuristic reasoning rather than precise
instruction-level logic. Our findings highlight the need for IR-specific
enhancements in LLM design. We recommend fine-tuning on structured IR datasets
and integrating control-flow-sensitive architectures to improve model
effectiveness. All experimental data and source code are publicly available at
♻ ☆ SNaRe: Domain-aware Data Generation for Low-Resource Event Detection ACL
Event Detection (ED) -- the task of identifying event mentions from natural
language text -- is critical for enabling reasoning in highly specialized
domains such as biomedicine, law, and epidemiology. Data generation has proven
to be effective in broadening its utility to wider applications without
requiring expensive expert annotations. However, when existing generation
approaches are applied to specialized domains, they struggle with label noise,
where annotations are incorrect, and domain drift, characterized by a
distributional mismatch between generated sentences and the target domain. To
address these issues, we introduce SNaRe, a domain-aware synthetic data
generation framework composed of three components: Scout, Narrator, and
Refiner. Scout extracts triggers from unlabeled target domain data and curates
a high-quality domain-specific trigger list using corpus-level statistics to
mitigate domain drift. Narrator, conditioned on these triggers, generates
high-quality domain-aligned sentences, and Refiner identifies additional event
mentions, ensuring high annotation quality. Experimentation on three diverse
domain ED datasets reveals how SNaRe outperforms the best baseline, achieving
average F1 gains of 3-7% in the zero-shot/few-shot settings and 4-20% F1
improvement for multilingual generation. Analysis of the generated trigger hit
rate and human evaluation substantiate SNaRe's stronger annotation quality and
reduced domain drift.
comment: Under review at ACL ARR May 2025
♻ ☆ ValueSim: Generating Backstories to Model Individual Value Systems
As Large Language Models (LLMs) continue to exhibit increasingly human-like
capabilities, aligning them with human values has become critically important.
Contemporary advanced techniques, such as prompt learning and reinforcement
learning, are being deployed to better align LLMs with human values. However,
while these approaches address broad ethical considerations and helpfulness,
they rarely focus on simulating individualized human value systems. To address
this gap, we present ValueSim, a framework that simulates individual values
through the generation of personal backstories reflecting past experiences and
demographic information. ValueSim converts structured individual data into
narrative backstories and employs a multi-module architecture inspired by the
Cognitive-Affective Personality System to simulate individual values based on
these narratives. Testing ValueSim on a self-constructed benchmark derived from
the World Values Survey demonstrates an improvement in top-1 accuracy by over
10% compared to retrieval-augmented generation methods. Further analysis
reveals that performance improves as additional user interaction history
becomes available, indicating the model's ability to refine its persona
simulation capabilities over time.
comment: 8 pages main paper + 13 pages appendix, 3 figures, 2 tables
♻ ☆ Explainability in Practice: A Survey of Explainable NLP Across Various Domains
Natural Language Processing (NLP) has become a cornerstone in many critical
sectors, including healthcare, finance, and customer relationship management.
This is especially true with the development and use of advanced models such as
GPT-based architectures and BERT, which are widely used in decision-making
processes. However, the black-box nature of these advanced NLP models has
created an urgent need for transparency and explainability. This review
explores explainable NLP (XNLP) with a focus on its practical deployment and
real-world applications, examining its implementation and the challenges faced
in domain-specific contexts. The paper underscores the importance of
explainability in NLP and provides a comprehensive perspective on how XNLP can
be designed to meet the unique demands of various sectors, from healthcare's
need for clear insights to finance's emphasis on fraud detection and risk
assessment. Additionally, this review aims to bridge the knowledge gap in XNLP
literature by offering a domain-specific exploration and discussing
underrepresented areas such as real-world applicability, metric evaluation, and
the role of human interaction in model assessment. The paper concludes by
suggesting future research directions that could enhance the understanding and
broader application of XNLP.
♻ ☆ CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature
A hallmark of human innovation is the process of recombination -- creating
original ideas by integrating elements of existing mechanisms and concepts. In
this work, we automatically mine the scientific literature and build CHIMERA: a
large-scale knowledge base (KB) of recombination examples. CHIMERA can be used
to empirically explore at scale how scientists recombine concepts and take
inspiration from different areas, or to train supervised machine learning
models that learn to predict new creative cross-domain directions. To build
this KB, we present a novel information extraction task of extracting
recombination from scientific paper abstracts, collect a high-quality corpus of
hundreds of manually annotated abstracts, and use it to train an LLM-based
extraction model. The model is applied to a large corpus of papers in the AI
domain, yielding a KB of over 28K recombination examples. We analyze CHIMERA to
explore the properties of recombination in different subareas of AI. Finally,
we train a scientific hypothesis generation model using the KB, which predicts
new recombination directions that real-world researchers find inspiring. Our
data and code are available at https://github.com/noy-sternlicht/CHIMERA-KB
comment: Project page: https://noy-sternlicht.github.io/CHIMERA-Web
♻ ☆ LLM Social Simulations Are a Promising Research Method ICML 2025
Jacy Reese Anthis, Ryan Liu, Sean M. Richardson, Austin C. Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, Michael Bernstein
Accurate and verifiable large language model (LLM) simulations of human
research subjects promise an accessible data source for understanding human
behavior and training new AI systems. However, results to date have been
limited, and few social scientists have adopted this method. In this position
paper, we argue that the promise of LLM social simulations can be achieved by
addressing five tractable challenges. We ground our argument in a review of
empirical comparisons between LLMs and human research subjects, commentaries on
the topic, and related work. We identify promising directions, including
context-rich prompting and fine-tuning with social science datasets. We believe
that LLM social simulations can already be used for pilot and exploratory
studies, and more widespread use may soon be possible with rapidly advancing
LLM capabilities. Researchers should prioritize developing conceptual models
and iterative evaluations to make the best use of new AI systems.
comment: Published at ICML 2025
♻ ☆ OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song, Fei Huang, Yongbin Li
Role-Playing Agents (RPAs), which benefit from large language models, are an
emerging class of interactive AI systems that simulate roles or characters with
diverse personalities. However, existing methods primarily focus on mimicking
dialogues among roles in textual form, neglecting the roles' voice traits
(e.g., voice style and emotions), which play a crucial role in interaction and
make the experience more immersive in realistic scenarios. Towards this goal,
we propose OmniCharacter, the first seamless speech-language personality
interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter
enables agents to consistently exhibit role-specific personality traits and
vocal traits throughout the interaction, enabling a mixture of speech and
language responses. To align the model with speech-language scenarios, we
construct a dataset named OmniCharacter-10K, which involves more distinctive
characters (20), richly contextualized multi-round dialogue (10K), and dynamic
speech response (135K). Experimental results showcase that our method yields
better responses in terms of both content and style compared to existing RPAs
and mainstream speech-language models, with a response latency as low as 289ms.
Code and dataset are available at
https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
comment: 14 pages, 6 figures
♻ ☆ The Impossibility of Fair LLMs ACL 2025
The rise of general-purpose artificial intelligence (AI) systems,
particularly large language models (LLMs), has raised pressing moral questions
about how to reduce bias and ensure fairness at scale. Researchers have
documented a sort of "bias" in the significant correlations between
demographics (e.g., race, gender) in LLM prompts and responses, but it remains
unclear how LLM fairness could be evaluated with more rigorous definitions,
such as group fairness or fair representations. We analyze a variety of
technical fairness frameworks and find inherent challenges in each that make
the development of a fair LLM intractable. We show that each framework either
does not logically extend to the general-purpose AI context or is infeasible in
practice, primarily due to the large amounts of unstructured training data and
the many potential combinations of human populations, use cases, and sensitive
attributes. These inherent challenges would persist for general-purpose AI,
including LLMs, even if empirical challenges, such as limited participatory
input and limited measurement methods, were overcome. Nonetheless, fairness
will remain an important type of model evaluation, and there are still
promising research directions, particularly the development of standards for
the responsibility of LLM developers, context-specific evaluations, and methods
of iterative, participatory, and AI-assisted evaluation that could scale
fairness across the diverse contexts of modern human-AI interaction.
comment: Published in ACL 2025
♻ ☆ Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation ACL 2025
Standard benchmarks of bias and fairness in large language models (LLMs)
measure the association between the user attributes stated or implied by a
prompt and the LLM's short text response, but human-AI interaction increasingly
requires long-form and context-specific system output to solve real-world
tasks. In the commonly studied domain of gender-occupation bias, we test
whether these benchmarks are robust to lengthening the LLM responses as a
measure of Realistic Use and Tangible Effects (i.e., RUTEd evaluations). From
the current literature, we adapt three standard bias metrics (neutrality, skew,
and stereotype) and develop analogous RUTEd evaluations from three contexts of
real-world use: children's bedtime stories, user personas, and English language
learning exercises. We find that standard bias metrics have no significant
correlation with the more realistic bias metrics. For example, selecting the
least biased model based on the standard "trick tests" coincides with selecting
the least biased model as measured in more realistic use no more often than
random chance. We suggest that there is not yet evidence to justify standard
benchmarks as reliable proxies of real-world AI biases, and we encourage
further development of evaluations grounded in particular contexts.
comment: Published in ACL 2025
♻ ☆ Leveraging LLMs for Bangla Grammar Error Correction: Error Categorization, Synthetic Data, and Model Evaluation ACL
Large Language Models (LLMs) perform exceedingly well in Natural Language
Understanding (NLU) tasks for many languages including English. However,
despite being the fifth most-spoken language globally, Grammatical Error
Correction (GEC) in Bangla remains underdeveloped. In this work, we investigate
how LLMs can be leveraged for improving Bangla GEC. For that, we first do an
extensive categorization of 12 error classes in Bangla, and take a survey of
native Bangla speakers to collect real-world errors. We next devise a
rule-based noise injection method to create grammatically incorrect sentences
corresponding to correct ones. The Vaiyakarana dataset, thus created, consists
of 567,422 sentences, of which 227,119 are erroneous. This dataset is then
used to instruction-tune LLMs for the task of GEC in Bangla. Evaluations show
that instruction-tuning with Vaiyakarana improves the GEC performance of LLMs by 3-7
percentage points as compared to the zero-shot setting, and makes them achieve
human-like performance in grammatical error identification. Humans, though,
remain superior in error correction.
comment: Accepted at ACL Findings, 2025
♻ ☆ GoRA: Gradient-driven Adaptive Low Rank Adaptation
Low-Rank Adaptation (LoRA) is a crucial method for efficiently fine-tuning
large language models (LLMs), with its effectiveness influenced by two key
factors: rank selection and weight initialization. While numerous LoRA variants
have been proposed to improve performance by addressing one of these aspects,
they often compromise usability or computational efficiency. In this paper, we
analyze and identify the core limitations of existing approaches and propose a
novel framework -- GoRA (Gradient-driven Adaptive Low Rank Adaptation) -- that
simultaneously adapts both the rank and initialization strategy within a
unified framework. GoRA leverages gradient information during training to
dynamically assign optimal ranks and initialize low-rank adapter weights in an
adaptive manner. To our knowledge, GoRA is the first method that not only
addresses the limitations of prior approaches -- which often focus on either
rank selection or initialization in isolation -- but also unifies both aspects
within a single framework, enabling more effective and efficient adaptation.
Extensive experiments across various architectures and modalities show that
GoRA consistently outperforms existing LoRA-based methods while preserving the
efficiency of vanilla LoRA. For example, when fine-tuning Llama3.1-8B-Base for
mathematical reasoning, GoRA achieves a 5.13-point improvement over standard
LoRA and even outperforms full fine-tuning by 2.05 points under high-rank
settings.
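The gradient-driven allocation can be caricatured as a budgeting step: measure how strongly each frozen weight matrix is pulled by the task gradient over a few warm-up batches, then hand out more LoRA rank where the pull is larger. The sketch below uses plain gradient norms and a proportional rule; GoRA's actual criterion and its gradient-based adapter initialization are more sophisticated, so treat the numbers and bounds as placeholders.

```python
import numpy as np

def allocate_ranks(grad_norms, total_rank_budget, r_min=2, r_max=32):
    """Hand out a global LoRA rank budget in proportion to each layer's
    gradient norm (a rough proxy for how much adaptation the layer needs).
    Rounding and clipping mean the allocated ranks only approximately sum
    to the budget."""
    weights = np.asarray(grad_norms, dtype=float)
    weights = weights / weights.sum()
    return np.clip(np.round(weights * total_rank_budget), r_min, r_max).astype(int)

# gradient norms gathered from a few warm-up batches (hypothetical values)
print(allocate_ranks([0.4, 2.1, 1.3, 0.2, 3.0], total_rank_budget=64))
```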
♻ ☆ Argument Summarization and its Evaluation in the Era of Large Language Models
Moritz Altemeyer, Steffen Eger, Johannes Daxenberger, Yanran Chen, Tim Altendorf, Philipp Cimiano, Benjamin Schiller
Large Language Models (LLMs) have revolutionized various Natural Language
Generation (NLG) tasks, including Argument Summarization (ArgSum), a key
subfield of Argument Mining (AM). This paper investigates the integration of
state-of-the-art LLMs into ArgSum, including for its evaluation. In particular,
we propose a novel prompt-based evaluation scheme, and validate it through a
novel human benchmark dataset. Our work makes three main contributions: (i) the
integration of LLMs into existing ArgSum frameworks, (ii) the development of a
new LLM-based ArgSum system, benchmarked against prior methods, and (iii) the
introduction of an advanced LLM-based evaluation scheme. We demonstrate that
the use of LLMs substantially improves both the generation and evaluation of
argument summaries, achieving state-of-the-art results and advancing the field
of ArgSum. We also show that among the four LLMs integrated in (i) and (ii),
Qwen-3-32B, despite having the fewest parameters, performs best, even
surpassing GPT-4o, while LLaMA-3.3-70B consistently underperforms.
♻ ☆ Optimizing Anytime Reasoning via Budget Relative Policy Optimization
Scaling test-time compute is crucial for enhancing the reasoning capabilities
of large language models (LLMs). Existing approaches typically employ
reinforcement learning (RL) to maximize a verifiable reward obtained at the end
of reasoning traces. However, such methods optimize only the final performance
under a large and fixed token budget, which hinders efficiency in both training
and deployment. In this work, we present a novel framework, AnytimeReasoner, to
optimize anytime reasoning performance, which aims to improve token efficiency
and the flexibility of reasoning under varying token budget constraints. To
achieve this, we truncate the complete thinking process to fit within sampled
token budgets from a prior distribution, compelling the model to summarize the
optimal answer for each truncated thinking for verification. This introduces
verifiable dense rewards into the reasoning process, facilitating more
effective credit assignment in RL optimization. We then optimize the thinking
and summary policies in a decoupled manner to maximize the cumulative reward.
Additionally, we introduce a novel variance reduction technique, Budget
Relative Policy Optimization (BRPO), to enhance the robustness and efficiency
of the learning process when reinforcing the thinking policy. Empirical results
in mathematical reasoning tasks demonstrate that our method consistently
outperforms GRPO across all thinking budgets under various prior distributions,
enhancing both training and token efficiency.
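A minimal sketch of the data-collection loop is shown below: one full thinking trace is truncated at several sampled token budgets, each truncation is forced to commit to an answer via the summary policy, and every answer is verified, producing the dense (budget, reward) pairs used as the RL signal. The three callables are placeholders for the policy model and the task verifier, and the budget prior, sampling scheme, and BRPO baseline itself are not reproduced here.

```python
import random

def anytime_rollout(generate_thinking, summarize_answer, verify, prompt,
                    budgets=(256, 512, 1024, 2048)):
    """Collect dense anytime-reasoning rewards for one prompt.

    One full thinking trace is truncated at several sampled token budgets; each
    truncation is forced to commit to an answer via the summary policy, and the
    verifier scores every answer, yielding (budget, reward) pairs.
    The three callables are placeholders for the policy model and task verifier.
    """
    trace = generate_thinking(prompt)                   # list of thinking tokens
    sampled = sorted(random.sample(budgets, k=min(3, len(budgets))))
    rewards = []
    for b in sampled:
        partial = trace[:b]                             # truncate to the sampled budget
        answer = summarize_answer(prompt, partial)      # summary policy on the truncated trace
        rewards.append((b, float(verify(prompt, answer))))
    return rewards

# toy stubs so the sketch runs end to end
print(anytime_rollout(
    generate_thinking=lambda p: ["step"] * 3000,
    summarize_answer=lambda p, t: f"answer after {len(t)} thinking tokens",
    verify=lambda p, a: 1.0,
    prompt="2+2?",
))
```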
♻ ☆ GRU: Mitigating the Trade-off between Unlearning and Retention for LLMs
Large language model (LLM) unlearning has demonstrated its essential role in
removing privacy and copyright-related responses, crucial for their legal and
safe applications. However, the pursuit of complete unlearning often comes with
substantial costs due to its compromises in their general functionality,
leading to a notorious trade-off between unlearning and retention. It motivates
this paper to explore enhanced unlearning schemes that can mitigate this
trade-off. Specifically, we propose Gradient Rectified Unlearning (GRU), an
improved framework that regulates the directions of gradient updates during the
unlearning procedure such that their side impacts on other, unrelated responses
can be minimized. GRU is easy and general to implement, demonstrating practical
effectiveness across a variety of well-established unlearning benchmarks.
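One way to regulate conflicting gradient directions, in the spirit of GRU, is a PCGrad-style rectification: if the unlearning gradient points against the retention gradient (negative inner product), remove the conflicting component before taking the step. The sketch below operates on flattened gradients; the abstract does not spell out GRU's exact rule, so this should be read as an illustration of the general idea rather than the paper's method.

```python
import torch

def rectify_gradient(g_unlearn, g_retain, eps=1e-12):
    """Project away the component of the unlearning gradient that conflicts
    with the retention gradient (a PCGrad-style rectification; GRU's exact
    rule may differ). Both inputs are flattened gradients of the same shape.
    """
    dot = torch.dot(g_unlearn, g_retain)
    if dot < 0:  # gradients conflict: the unlearning step would hurt retention
        g_unlearn = g_unlearn - dot / (g_retain.norm() ** 2 + eps) * g_retain
    return g_unlearn

# toy check: the rectified gradient no longer opposes the retention direction
g_u = torch.tensor([1.0, -2.0, 0.5])
g_r = torch.tensor([0.0, 1.0, 0.0])
g_fixed = rectify_gradient(g_u, g_r)
print(g_fixed, torch.dot(g_fixed, g_r))
```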
♻ ☆ Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks ACL 2025
While Vision-Language Models (VLMs) have shown remarkable abilities in visual
and language reasoning tasks, they invariably generate flawed responses.
Self-correction that instructs models to refine their outputs presents a
promising solution to this issue. Previous studies have mainly concentrated on
Large Language Models (LLMs), while the self-correction abilities of VLMs,
particularly concerning both visual and linguistic information, remain largely
unexamined. This study investigates the self-correction capabilities of VLMs
during both inference and fine-tuning stages. We introduce a Self-Correction
Learning (SCL) approach that enables VLMs to learn from their self-generated
self-correction data through Direct Preference Optimization (DPO) without
relying on external feedback, facilitating self-improvement. Specifically, we
collect preferred and disfavored samples based on the correctness of initial
and refined responses, which are obtained by two-turn self-correction with VLMs
during the inference stage. Experimental results demonstrate that although VLMs
struggle to self-correct effectively during iterative inference without
additional fine-tuning and external feedback, they can enhance their
performance and avoid previous mistakes through preference fine-tuning when
their self-generated self-correction data are categorized into preferred and
disfavored samples. This study emphasizes that self-correction is not merely a
refinement process; rather, it should enhance the reasoning abilities of models
through additional training, enabling them to generate high-quality responses
directly without further refinement.
comment: Accepted by ACL 2025 Findings
♻ ☆ Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity ACL 2025
A range of recent works addresses the problem of compressing a sequence of
tokens into a shorter sequence of real-valued vectors to be used as inputs
instead of token embeddings or a key-value cache. These approaches focus on
reducing the amount of compute in existing language models rather than
minimizing the number of bits needed to store text. Despite relying on
powerful models as encoders, the maximum attainable lossless compression ratio
is typically not higher than x10. This fact is highly intriguing because, in
theory, the maximum information capacity of large real-valued vectors is far
beyond the presented rates even for 16-bit precision and a modest vector size.
In this work, we explore the limits of compression by replacing the encoder
with a per-sample optimization procedure. We show that vectors with compression
ratios up to x1500 exist, which highlights a two-orders-of-magnitude gap between
existing and practically attainable solutions. Furthermore, we empirically show
that the compression limits are determined not by the length of the input but
by the amount of uncertainty to be reduced, namely, the cross-entropy loss on
this sequence without any conditioning. The obtained limits highlight the
substantial gap between the theoretical capacity of input embeddings and their
practical utilization, suggesting significant room for optimization in model
design.
comment: ACL 2025 (main conference)
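A minimal sketch of the per-sample optimization is given below: a handful of trainable input vectors are prepended to a frozen language model's token embeddings and optimized until the model can regenerate the target text, with the loss masked on the memory positions. GPT-2 stands in for the larger models studied in the paper, and the learning rate, step count, and number of memory vectors are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Per-sample "cramming": optimize a few input vectors so a frozen LM can
# regenerate a target text. GPT-2 stands in for the larger models in the paper.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)

text = "The quick brown fox jumps over the lazy dog."
ids = tok(text, return_tensors="pt").input_ids                  # (1, T)
tok_emb = model.get_input_embeddings()(ids)                     # frozen token embeddings

m = 2                                                           # number of trainable memory vectors
mem = torch.nn.Parameter(torch.randn(1, m, model.config.n_embd) * 0.02)
opt = torch.optim.Adam([mem], lr=0.05)
labels = torch.cat([torch.full((1, m), -100, dtype=torch.long), ids], dim=1)  # no loss on memory slots

for step in range(300):                                         # illustrative step count
    inputs_embeds = torch.cat([mem, tok_emb], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final LM loss on the target text: {loss.item():.3f}")
```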
♻ ☆ Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)
System prompts in Large Language Models (LLMs) are predefined directives that
guide model behaviour, taking precedence over user inputs in text processing
and generation. LLM deployers increasingly use them to ensure consistent
responses across contexts. While model providers set a foundation of system
prompts, deployers and third-party developers can append additional prompts
without visibility into others' additions, and this layered implementation
remains entirely hidden from end-users. As system prompts become more complex,
they can directly or indirectly introduce unaccounted-for side effects. This
lack of transparency raises fundamental questions about how the position of
information in different directives shapes model outputs. As such, this work
examines how the placement of information affects model behaviour. To this end,
we compare how models process demographic information in system versus user
prompts across six commercially available LLMs and 50 demographic groups. Our
analysis reveals significant biases, manifesting in differences in user
representation and decision-making scenarios. Since these variations stem from
inaccessible and opaque system-level configurations, they risk
representational, allocative, and potentially other biases and downstream harms
beyond the user's ability to detect or correct. Our findings draw attention to
these critical issues, which have the potential to perpetuate harms if left
unexamined. Further, we argue that system prompt analysis must be incorporated
into AI auditing processes, particularly as customisable system prompts become
increasingly prevalent in commercial AI deployments.
comment: Forthcoming in Proceedings of ACM FAccT 2025
♻ ☆ Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons ACL 2025
Large Language Models (LLMs) have shown to be effective evaluators across
various domains such as machine translation or the scientific domain. Current
LLM-as-a-Judge approaches rely mostly on individual assessments or a single
round of pairwise assessments, preventing the judge LLM from developing a
global ranking perspective. To address this, we present Knockout Assessment, an
LLM-as-a-Judge method using a knockout tournament system with iterative pairwise
comparisons. Experiments across three LLMs on two datasets show that knockout
assessment improves scoring accuracy, increasing Pearson correlation with
expert evaluations by 0.07 on average for university-level exam scoring and
machine translation evaluations, aligning LLM assessments more closely with
human scoring.
comment: Accepted to GEM @ ACL 2025
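The tournament itself is straightforward to sketch: candidates are paired off, the judge keeps the preferred answer of each pair, byes handle odd pools, and the process repeats until a champion remains, yielding an approximate global ordering. In the sketch below, the judge is a trivial stand-in (it prefers the longer string); in the paper it is an LLM prompted for a pairwise comparison, and ties or repeated matches are not modeled here.

```python
import random

def knockout_rank(candidates, judge, seed=0):
    """Rank candidate answers with a single-elimination tournament.

    judge(a, b) returns the preferred answer; here it is a trivial stand-in for
    an LLM prompted with a pairwise comparison. The returned list is ordered
    roughly from the champion down to the earliest eliminations.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    eliminated = []
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winner = judge(a, b)
            next_round.append(winner)
            eliminated.append(b if winner == a else a)
        if len(pool) % 2 == 1:          # odd one out advances with a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool + eliminated[::-1]

# toy judge: prefer the longer answer (stand-in for an LLM pairwise judgment)
print(knockout_rank(["A", "BBB", "CC", "DDDD"], judge=lambda a, b: max(a, b, key=len)))
```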
♻ ☆ EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving
Shihan Dou, Ming Zhang, Chenhao Huang, Jiayi Chen, Feng Chen, Shichun Liu, Yan Liu, Chenxiao Liu, Cheng Zhong, Zongzhang Zhang, Tao Gui, Chao Xin, Wei Chengzhi, Lin Yan, Qi Zhang, Yonghui Wu, Xuanjing Huang
We introduce EvaLearn, a pioneering benchmark designed to evaluate large
language models (LLMs) on their learning capability and efficiency in
challenging tasks, a critical, yet underexplored aspect of model potential.
EvaLearn contains 648 challenging problems across six task types, grouped into
182 sequences, each sequence dedicated to one task type. Diverging from most
existing benchmarks that evaluate models in parallel, EvaLearn requires models
to solve problems sequentially, allowing them to leverage the experience gained
from previous solutions. EvaLearn provides five comprehensive automated metrics
to evaluate models and quantify their learning capability and efficiency. We
extensively benchmark nine frontier models and observe varied performance
profiles: some models, such as Claude-3.7-sonnet, start with moderate initial
performance but exhibit strong learning ability, while some models struggle to
benefit from experience and may even show negative transfer. Moreover, we
investigate model performance under two learning settings and find that
instance-level rubrics and teacher-model feedback further facilitate model
learning. Importantly, we observe that current LLMs with stronger static
abilities do not show a clear advantage in learning capability across all
tasks, highlighting that EvaLearn evaluates a new dimension of model
performance. We hope EvaLearn provides a novel evaluation perspective for
assessing LLM potential and understanding the gap between models and human
capabilities, promoting the development of deeper and more dynamic evaluation
approaches. All datasets, the automatic evaluation framework, and the results
studied in this paper are available at the GitHub repository.
comment: 47 pages, 24 figures
♻ ☆ Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents
Chaoran Chen, Zhiping Zhang, Ibrahim Khalilov, Bingcan Guo, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, Toby Jia-Jun Li
The rise of Large Language Models (LLMs) has revolutionized Graphical User
Interface (GUI) automation through LLM-powered GUI agents, yet their ability to
process sensitive data with limited human oversight raises significant privacy
and security risks. This position paper identifies three key risks of GUI
agents and examines how they differ from traditional GUI automation and general
autonomous agents. Despite these risks, existing evaluations focus primarily on
performance, leaving privacy and security assessments largely unexplored. We
review current evaluation metrics for both GUI and general LLM agents and
outline five key challenges in integrating human evaluators for GUI agent
assessments. To address these gaps, we advocate for a human-centered evaluation
framework that incorporates risk assessments, enhances user awareness through
in-context consent, and embeds privacy and security considerations into GUI
agent design and evaluation.
♻ ☆ Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models
Adversarial attacks on Large Language Models (LLMs) via jailbreaking
techniques-methods that circumvent their built-in safety and ethical
constraints-have emerged as a critical challenge in AI security. These attacks
compromise the reliability of LLMs by exploiting inherent weaknesses in their
comprehension capabilities. This paper investigates the efficacy of
jailbreaking strategies that are specifically adapted to the diverse levels of
understanding exhibited by different LLMs. We propose the Adaptive Jailbreaking
Strategies Based on the Semantic Understanding Capabilities of Large Language
Models, a novel framework that classifies LLMs into Type I and Type II
categories according to their semantic comprehension abilities. For each
category, we design tailored jailbreaking strategies aimed at leveraging their
vulnerabilities to facilitate successful attacks. Extensive experiments
conducted on multiple LLMs demonstrate that our adaptive strategy markedly
improves the success rate of jailbreaking. Notably, our approach achieves an
exceptional 98.9% success rate in jailbreaking GPT-4o (29 May 2025 release).
♻ ☆ Towards Robust ESG Analysis Against Greenwashing Risks: Aspect-Action Analysis with Cross-Category Generalization ACL 2025
Sustainability reports are key for evaluating companies' environmental,
social, and governance (ESG) performance, but their content is increasingly
obscured by greenwashing -- sustainability claims that are misleading,
exaggerated, and fabricated. Yet, existing NLP approaches for ESG analysis lack
robustness against greenwashing risks, often extracting insights that reflect
misleading or exaggerated sustainability claims rather than objective ESG
performance. To bridge this gap, we introduce A3CG (Aspect-Action Analysis
with Cross-Category Generalization), a novel dataset to improve the
robustness of ESG analysis amid the prevalence of greenwashing. By explicitly
linking sustainability aspects with their associated actions, A3CG facilitates
a more fine-grained and transparent evaluation of sustainability claims,
ensuring that insights are grounded in verifiable actions rather than vague or
misleading rhetoric. Additionally, A3CG emphasizes cross-category
generalization. This ensures robust model performance in aspect-action analysis
even when companies change their reports to selectively favor certain
sustainability areas. Through experiments on A3CG, we analyze state-of-the-art
supervised models and LLMs, uncovering their limitations and outlining key
directions for future research.
comment: Proceedings of the Association for Computational Linguistics Main
Conference (ACL 2025)
♻ ☆ Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
Counterfactual reasoning typically involves considering alternatives to
actual events. While often applied to understand past events, a distinct
form-forward counterfactual reasoning-focuses on anticipating plausible future
developments. This type of reasoning is invaluable in dynamic financial
markets, where anticipating market developments can powerfully unveil potential
risks and opportunities for stakeholders, guiding their decision-making.
However, performing this at scale is challenging due to the cognitive demands
involved, underscoring the need for automated solutions. Large Language Models
(LLMs) offer promise, but remain unexplored for this application. To address
this gap, we introduce a novel benchmark, Fin-Force (FINancial FORward
Counterfactual Evaluation). By curating financial news headlines and providing
structured evaluation, Fin-Force supports LLM-based forward counterfactual
generation. This paves the way for scalable and automated solutions for
exploring and anticipating future market developments, thereby providing
structured insights for decision-making. Through experiments on Fin-Force, we
evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing
their limitations and proposing insights for future research.
♻ ☆ In-context Language Learning for Endangered Languages in Speech Recognition
With approximately 7,000 languages spoken worldwide, current large language
models (LLMs) support only a small subset. Prior research indicates LLMs can
learn new languages for certain tasks without supervised data. We extend this
investigation to speech recognition, investigating whether LLMs can learn
unseen, low-resource languages through in-context learning (ICL). With
experiments on four diverse endangered languages that LLMs have not been
trained on, we find that providing more relevant text samples enhances
performance in both language modelling and Automatic Speech Recognition (ASR)
tasks. Furthermore, we show that the probability-based approach outperforms the
traditional instruction-based approach in language learning. Lastly, we show
ICL enables LLMs to achieve ASR performance that is comparable to or even
surpasses dedicated language models trained specifically for these languages,
while preserving the original capabilities of the LLMs.
comment: Interspeech2025
♻ ☆ MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
LLM-Core Xiaomi, :, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, Kai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
We present MiMo-7B, a large language model born for reasoning tasks, with
optimization across both pre-training and post-training stages. During
pre-training, we enhance the data preprocessing pipeline and employ a
three-stage data mixing strategy to strengthen the base model's reasoning
potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional
Multi-Token Prediction objective for enhanced performance and accelerated
inference speed. During post-training, we curate a dataset of 130K verifiable
mathematics and programming problems for reinforcement learning, integrating a
test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and
employing strategic data resampling to stabilize training. Extensive
evaluations show that MiMo-7B-Base possesses exceptional reasoning potential,
outperforming even much larger 32B models. The final RL-tuned model,
MiMo-7B-RL, achieves superior performance on mathematics, code and general
reasoning tasks, surpassing the performance of OpenAI o1-mini. The model
checkpoints are available at https://github.com/xiaomimimo/MiMo.
♻ ☆ Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge
In this technical report, we empirically investigate the relationship between
linguistic fluency and domain knowledge in the context of continual learning
with large language models (LLMs). Specifically, we enhance the linguistic
fluency of the Gemma2 LLM for the Lithuanian language by autoregressively
pretraining its full parameter set on the first 10\% of the Lithuanian language
component of the CulturaX dataset. To prevent catastrophic forgetting of the
model's existing domain knowledge, we apply Elastic Weight Consolidation (EWC),
leveraging Fisher information estimated using data from the Massive Multitask
Language Understanding (MMLU) benchmark. In the post-training evaluations, we
assess linguistic fluency through perplexity and evaluate domain knowledge
using accuracy on a suite of language understanding benchmarks, including
ARC-Easy, Belebele, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande, in both
English and Lithuanian. The empirical results demonstrate that EWC not only
mitigates catastrophic forgetting by preserving the model's performance in
terms of both linguistic fluency and domain knowledge but also improves or
maintains these capabilities for the newly added Lithuanian language. These
findings highlight the potential for more efficient adaptation of
general-purpose LLMs to under-represented languages without requiring access to
the original training data. The accompanying codebase is openly accessible at
https://github.com/Neurotechnology/LLM_EWC.
comment: 9 pages, 3 figures, 1 table
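A minimal PyTorch-style sketch of the EWC objective described above follows: a diagonal Fisher estimate from reference batches (standing in for the MMLU data mentioned in the abstract) plus a quadratic penalty added to the new-language LM loss. The HF-style model(**batch).loss interface, the lambda_ewc weight, and the batch count are illustrative assumptions, not the authors' released code.

```python
import torch

def estimate_diag_fisher(model, data_loader, n_batches=100):
    """Diagonal Fisher information approximated by squared gradients of the
    model's loss on reference batches (e.g., MMLU-style data)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    for i, batch in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        model(**batch).loss.backward()   # assumes an HF-style forward pass
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / n_batches for n, f in fisher.items()}

def ewc_objective(model, lm_loss, fisher, anchor_params, lambda_ewc=1.0):
    """Continual-pretraining loss: new-language LM loss plus an EWC penalty
    that anchors parameters deemed important for previously acquired knowledge."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - anchor_params[n]) ** 2).sum()
    return lm_loss + lambda_ewc * penalty
```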
♻ ☆ NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction ICML 2025
Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao
Inspired by the impressive capabilities of GPT-4o, there is growing interest
in enabling speech language models (SLMs) to engage in natural, fluid spoken
interactions with humans. Recent advancements have led to the development of
several SLMs that demonstrate promising results in this area. However, current
approaches have yet to fully exploit dual-channel speech data, which inherently
captures the structure and dynamics of human conversation. In this work, we
systematically explore the use of dual-channel speech data in the context of
modern large language models, and introduce a novel generative modeling
paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent
dual-channel spoken dialogue learning using decoder-only architectures for the
first time. We evaluate our approach on standard benchmarks, and empirical
results show that our proposed method, NTPP, significantly improves the
conversational abilities of SLMs in terms of turn-taking prediction, response
coherence, and naturalness. Moreover, compared to existing methods, NTPP
achieves substantially lower inference latency, highlighting its practical
efficiency for real-time applications.
comment: Accepted by ICML 2025
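One natural reading of next-token-pair prediction is that the two speaker channels are tokenized separately and consumed as aligned pairs, with the decoder predicting both channels' next tokens at each step. The sketch below illustrates only that pairing idea; the padding scheme and pairing function are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

PAD = 0  # hypothetical padding id for a silent channel

def to_token_pairs(channel_a: List[int], channel_b: List[int]) -> List[Tuple[int, int]]:
    """Interleave two per-speaker token streams into aligned pairs.
    At each timestep the model would be trained to predict the pair
    (a_t, b_t) given all earlier pairs, keeping both channels in sync."""
    length = max(len(channel_a), len(channel_b))
    a = channel_a + [PAD] * (length - len(channel_a))
    b = channel_b + [PAD] * (length - len(channel_b))
    return list(zip(a, b))

# Example: speaker B falls silent (padded) while speaker A keeps talking.
pairs = to_token_pairs([11, 12, 13, 14], [21, 22])
print(pairs)  # [(11, 21), (12, 22), (13, 0), (14, 0)]
```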
♻ ☆ NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark ACL 2025
Vladislav Mikhailov, Tita Enstad, David Samuel, Hans Christian Farsethås, Andrey Kutuzov, Erik Velldal, Lilja Øvrelid
This paper introduces NorEval, a new and comprehensive evaluation suite for
large-scale standardized benchmarking of Norwegian generative language models
(LMs). NorEval consists of 24 high-quality human-created datasets -- of which
five are created from scratch. In contrast to existing benchmarks for
Norwegian, NorEval covers a broad spectrum of task categories targeting
Norwegian language understanding and generation, establishes human baselines,
and focuses on both of the official written standards of the Norwegian
language: Bokmål and Nynorsk. All our datasets and a collection of over 100
human-written prompts are integrated into LM Evaluation Harness, ensuring
flexible and reproducible evaluation. We describe the NorEval design and
present the results of benchmarking 19 open-source pre-trained and
instruction-tuned LMs for Norwegian in various scenarios. Our benchmark,
evaluation framework, and annotation materials are publicly available.
comment: Accepted for Findings of the Association for Computational
Linguistics: ACL 2025
♻ ☆ Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition
Recent work suggests that large language models (LLMs) can improve the
performance of speech tasks compared to existing systems. To support their
claims, results on LibriSpeech and Common Voice are often quoted. However, this
work finds that a substantial amount of the LibriSpeech and Common Voice
evaluation sets appear in public LLM pretraining corpora. This calls into
question the reliability of findings drawn from these two datasets. To measure
contamination impact, LLMs trained with/without contamination are compared. A
contaminated LLM is more likely to generate test sentences it has seen during
training. Then, speech recognisers based on LLMs are compared. They show only
subtle error rate differences if the LLM is contaminated, but assign
significantly higher probabilities to transcriptions seen during LLM training.
Results show that LLM outputs can be biased by tiny amounts of data
contamination, highlighting the importance of evaluating LLM-based speech
systems with held-out data.
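One way to operationalize the probability comparison above is to score candidate test transcriptions under a language model and compare their mean per-token log-likelihoods against freshly written control sentences; memorized material tends to score conspicuously high. A minimal sketch with Hugging Face Transformers follows; the gpt2 placeholder model and the scoring choice are illustrative assumptions, not the paper's protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with the same API works
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def mean_logprob(sentence: str) -> float:
    """Average per-token log-likelihood of a sentence under the LM.
    Memorized (contaminated) test sentences tend to score noticeably higher."""
    ids = tok(sentence, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)
    return -out.loss.item()  # loss is the mean NLL, so negate it

test_sentence = "mister quilter is the apostle of the middle classes"
control_sentence = "a freshly written control sentence not found in any corpus"
print(mean_logprob(test_sentence), mean_logprob(control_sentence))
```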
♻ ☆ MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
In this work, we explore a cost-effective framework for multilingual image
generation. We find that, unlike models tuned on high-quality images with
multilingual annotations, leveraging text encoders pre-trained on widely
available, noisy Internet image-text pairs significantly enhances data
efficiency in text-to-image (T2I) generation across multiple languages. Based on
this insight, we introduce MuLan (Multi-Language adapter), a lightweight
language adapter with fewer than 20M parameters, trained alongside a frozen
text encoder and image diffusion model. Compared to previous multilingual T2I
models, this framework offers: (1) Cost efficiency. Using readily accessible
English data and off-the-shelf multilingual text encoders minimizes the
training cost; (2) High performance. Achieving comparable generation
capabilities in over 110 languages with CLIP similarity scores nearly matching
those in English (39.57 for English vs. 39.61 for other languages); and (3)
Broad applicability. Seamlessly integrating with compatible community tools
like LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.
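As described, only a small adapter is trained between a frozen multilingual text encoder and a frozen diffusion model. The sketch below shows one plausible shape for such a bridge, a two-layer projection into the conditioning space; the dimensions and layer choices are assumptions for illustration, not MuLan's released architecture.

```python
import torch
import torch.nn as nn

class LanguageAdapter(nn.Module):
    """Lightweight bridge from a frozen multilingual text encoder to the
    conditioning space expected by a frozen text-to-image diffusion model."""
    def __init__(self, enc_dim: int = 768, cond_dim: int = 1024, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, cond_dim),
            nn.LayerNorm(cond_dim),
        )

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, seq_len, enc_dim) from the frozen encoder
        return self.net(encoder_states)

adapter = LanguageAdapter()
n_params = sum(p.numel() for p in adapter.parameters())
print(f"trainable adapter parameters: {n_params / 1e6:.1f}M")  # well under 20M
```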
♻ ☆ When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits ACL 2025
Online misinformation remains a critical challenge, and fact-checkers
increasingly rely on claim matching systems that use sentence embedding models
to retrieve relevant fact-checks. However, as users interact with claims
online, they often introduce edits, and it remains unclear whether current
embedding models used in retrieval are robust to such edits. To investigate
this, we introduce a perturbation framework that generates valid and natural
claim variations, enabling us to assess the robustness of a wide range of
sentence embedding models in a multi-stage retrieval pipeline and evaluate the
effectiveness of various mitigation approaches. Our evaluation reveals that
standard embedding models exhibit notable performance drops on edited claims,
while LLM-distilled embedding models offer improved robustness at a higher
computational cost. Although a strong reranker helps to reduce the performance
drop, it cannot fully compensate for first-stage retrieval gaps. To address
these retrieval gaps, we evaluate train- and inference-time mitigation
approaches, demonstrating that they can improve in-domain robustness by up to
17 percentage points and boost out-of-domain generalization by 10 percentage
points. Overall, our findings provide practical improvements to claim-matching
systems, enabling more reliable fact-checking of evolving misinformation. Code
and data are available at
https://github.com/JabezNzomo99/claim-matching-robustness.
comment: Accepted to ACL 2025 Findings
♻ ☆ Evaluating Morphological Compositional Generalization in Large Language Models NAACL 2025
Mete Ismayilzada, Defne Circi, Jonne Sälevä, Hale Sirin, Abdullatif Köksal, Bhuwan Dhingra, Antoine Bosselut, Duygu Ataman, Lonneke van der Plas
Large language models (LLMs) have demonstrated significant progress in
various natural language generation and understanding tasks. However, their
linguistic generalization capabilities remain questionable, raising doubts
about whether these models learn language similarly to humans. While humans
exhibit compositional generalization and linguistic creativity in language use,
the extent to which LLMs replicate these abilities, particularly in morphology,
is under-explored. In this work, we systematically investigate the
morphological generalization abilities of LLMs through the lens of
compositionality. We define morphemes as compositional primitives and design a
novel suite of generative and discriminative tasks to assess morphological
productivity and systematicity. Focusing on agglutinative languages such as
Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned
multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs
struggle with morphological compositional generalization, particularly when
applied to novel word roots, with performance declining sharply as
morphological complexity increases. While models can identify individual
morphological combinations better than chance, their performance lacks
systematicity, leading to significant accuracy gaps compared to humans.
comment: Accepted to NAACL 2025
♻ ☆ Rethinking Text-based Protein Understanding: Retrieval or LLM?
In recent years, protein-text models have gained significant attention for
their potential in protein generation and understanding. Current approaches
focus on integrating protein-related knowledge into large language models
through continued pretraining and multi-modal alignment, enabling simultaneous
comprehension of textual descriptions and protein sequences. Through a thorough
analysis of existing model architectures and text-based protein understanding
benchmarks, we identify significant data leakage issues present in current
benchmarks. Moreover, conventional metrics derived from natural language
processing fail to accurately assess the model's performance in this domain. To
address these limitations, we reorganize existing datasets and introduce a
novel evaluation framework based on biological entities. Motivated by these
observations, we propose a retrieval-enhanced method that significantly
outperforms fine-tuned LLMs for protein-to-text generation and demonstrates
both accuracy and efficiency in training-free scenarios. Our code and data are
available at https://github.com/IDEA-XL/RAPM.
♻ ☆ FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
Large vision-language models (LVLMs) excel at multimodal understanding but
suffer from high computational costs due to redundant vision tokens. To address
this inefficiency, existing pruning methods typically rely on single-layer
attention scores to rank and prune redundant visual tokens. However, as the
interaction between tokens and layers is complicated, this raises a basic
question: Is such a simple single-layer criterion sufficient to identify
redundancy? To answer this question, we rethink the emergence of redundant
visual tokens from a fundamental perspective: information flow, which models
the interaction between tokens and layers by capturing how information moves
between tokens across layers. We find (1) the CLS token acts as an information
relay, which can simplify the complicated flow analysis; (2) the redundancy
emerges progressively and dynamically via layer-wise attention concentration;
and (3) relying solely on attention scores from single layers can lead to
contradictory redundancy identification. Based on this, we propose FlowCut, an
information-flow-aware pruning framework, mitigating the insufficiency of the
current criterion for identifying redundant tokens and better aligning with the
model's inherent behaviors. Extensive experiments show that FlowCut achieves
superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token
reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x
speed-up in the prefilling stage. Our code is available at
https://github.com/TungChintao/FlowCut
comment: 19 pages, 11 figures
♻ ☆ Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction
Automatic Speech Recognition (ASR) error correction aims to correct
recognition errors while preserving accurate text. Although traditional
approaches demonstrate moderate effectiveness, LLMs offer a paradigm that
eliminates the need for training and labeled data. However, directly applying LLMs
is prone to hallucinations, which may lead to the modification of otherwise
correct text. To address this problem, we propose the Reliable LLM
Correction Framework (RLLM-CF), which consists of three stages: (1) error
pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3)
reasoning process verification. Our method requires no additional information
or model fine-tuning and ensures the correctness of the LLM's corrections under
multi-pass programming. Experiments on
AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by
our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
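The three stages named above translate naturally into a small pipeline: only hypotheses flagged by pre-detection are sent to iterative correction, and a final verification pass can reject corrections that fail consistency checks. The skeleton below is an illustrative arrangement of those stages; the call_llm function and every prompt string are placeholders, not the paper's prompts.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion request)."""
    raise NotImplementedError

def correct_asr_hypothesis(hypothesis: str, max_rounds: int = 2) -> str:
    # Stage 1: error pre-detection: skip correction if the text looks clean.
    verdict = call_llm("Does this transcript contain recognition errors? "
                       f"Answer yes or no.\n{hypothesis}")
    if verdict.strip().lower().startswith("no"):
        return hypothesis

    # Stage 2: chain-of-thought, sub-task iterative correction.
    corrected = hypothesis
    for _ in range(max_rounds):
        corrected = call_llm(
            "Step by step, locate likely ASR errors and rewrite only the "
            f"erroneous spans, keeping correct words unchanged:\n{corrected}")

    # Stage 3: verification of the reasoning and result; fall back if it fails.
    check = call_llm(f"Original: {hypothesis}\nCorrected: {corrected}\n"
                     "Is the correction faithful and free of new content? yes/no")
    return corrected if check.strip().lower().startswith("yes") else hypothesis
```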
♻ ☆ Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning
Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng
Distilling reasoning paths from teacher to student models via supervised
fine-tuning (SFT) provides a shortcut for improving the reasoning ability of
smaller Large Language Models (LLMs). However, the reasoning paths generated by
teacher models often reflect only surface-level traces of their underlying
authentic reasoning. Insights from cognitive neuroscience suggest that
authentic reasoning involves a complex interweaving between meta-reasoning
(which selects appropriate sub-problems from multiple candidates) and solving
(which addresses the sub-problem). This implies authentic reasoning has an
implicit multi-branch structure. Supervised fine-tuning collapses this rich
structure into a flat sequence of token prediction in the teacher's reasoning
path, preventing effective distillation of this structure to students. To
address this limitation, we propose RLKD, a reinforcement learning (RL)-based
distillation framework guided by a novel Generative Structure Reward Model
(GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving
steps and computes rewards to measure structural alignment between student and
teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to
internalize the teacher's implicit multi-branch reasoning structure rather than
merely mimicking fixed output paths. Experiments show RLKD surpasses standard
SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime,
unlocking greater student reasoning potential than SFT-based distillation.
comment: 15 pages
♻ ☆ Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Large Language Models (LLMs) have shown remarkable capabilities across
various tasks, but their deployment in high-stakes domains requires consistent
and coherent behavior across multiple rounds of user interaction. This paper
introduces a comprehensive framework for evaluating and improving LLM response
consistency, making three key contributions. First, we introduce
Position-Weighted Consistency (PWC), a metric designed to capture both the
importance of early-stage stability and recovery patterns in multi-turn
interactions. Second, we present MT-Consistency, a carefully curated benchmark
dataset spanning diverse domains and difficulty levels, specifically designed
to evaluate LLM consistency under various challenging follow-up scenarios.
Third, we introduce Confidence-Aware Response Generation (CARG), a framework
that significantly improves response stability by explicitly integrating
internal model confidence scores during the generation process. Experimental
results demonstrate that CARG significantly improves response stability without
sacrificing accuracy, offering a practical path toward more dependable LLM
behavior in critical, real-world deployments. Code and data are available at:
https://github.com/yubol-bobo/MT-Consistency.
comment: 8 pages, 5 figures
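The abstract does not give the exact PWC formula, so the sketch below is only one way to realize a position-weighted consistency score: agreement with the initial answer at each follow-up turn, weighted so that early-turn stability counts more. Treat the decay-based weighting as an assumption, not the paper's definition.

```python
from typing import List

def position_weighted_consistency(answers: List[str], decay: float = 0.8) -> float:
    """Score in [0, 1]: 1.0 if the model never deviates from its first answer.
    Earlier follow-up turns get larger weights (decay ** t), so early flips
    are penalized more heavily than late ones, one plausible reading of PWC."""
    if len(answers) < 2:
        return 1.0
    reference = answers[0]
    weights = [decay ** t for t in range(1, len(answers))]
    agree = [w for w, a in zip(weights, answers[1:]) if a == reference]
    return sum(agree) / sum(weights)

print(position_weighted_consistency(["B", "B", "A", "B"]))  # flip at turn 2 lowers the score
```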
♻ ☆ Focus On This, Not That! Steering LLMs with Adaptive Feature Specification
Despite the success of Instruction Tuning (IT) in training large language
models (LLMs), such models often leverage spurious or biased features learnt
from their training data and can become misaligned, leading to undesired
behaviours. While existing techniques can steer model behaviour at
inference-time, they are often post-hoc and do not embed steering as an
intrinsic model feature. In this work, we introduce Focus Instruction Tuning
(FIT), which trains LLMs to condition their responses by focusing on specific
features whilst ignoring others, leading to different behaviours based on what
features are specified. Across diverse benchmarks, we demonstrate that FIT: (i)
successfully steers behaviour at inference time; (ii) increases robustness by
amplifying core task signals and down-weighting spurious cues; (iii) mitigates
social bias by suppressing demographic attributes; and (iv) generalises under
distribution shifts and to previously unseen focus features. FIT therefore
offers a lightweight, intrinsic mechanism for building more robust, fair, and
easily controllable LLMs.
comment: 36 pages, 19 figures
♻ ☆ The Role of Diversity in In-Context Learning for Large Language Models
In-context learning (ICL) is a crucial capability of current large language
models (LLMs), where the selection of examples plays a key role in performance.
While most existing approaches focus on selecting the most similar examples to
the query, the impact of diversity in example selection remains underexplored.
We systematically investigate the role of diversity in in-context example
selection through experiments across a range of tasks, from sentiment
classification to more challenging math and code problems. Experiments on
Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that
diversity-aware selection methods improve performance, particularly on complex
tasks like math and code, and enhance robustness to out-of-distribution
queries. To support these findings, we introduce a theoretical framework that
explains the benefits of incorporating diversity in in-context example
selection.
comment: 30 pages
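A common way to operationalize diversity-aware example selection is maximal-marginal-relevance style greedy selection over embeddings, trading off similarity to the query against similarity to already-chosen examples. The sketch below shows that generic recipe; it is not necessarily the specific selection method studied in the paper.

```python
import numpy as np

def select_diverse_examples(query_emb, example_embs, k=4, lam=0.5):
    """Greedy MMR-style selection: each step picks the example most similar
    to the query while least redundant with examples already selected."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    selected, remaining = [], list(range(len(example_embs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cos(query_emb, example_embs[i])
            redundancy = max((cos(example_embs[i], example_embs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
pool = rng.normal(size=(20, 64))          # embeddings of the candidate examples
print(select_diverse_examples(rng.normal(size=64), pool, k=3))
```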
♻ ☆ Scaling Trends in Language Model Robustness ICML
Nikolaus Howe, Ian McKenzie, Oskar Hollinsworth, Michał Zajac, Tom Tseng, Aaron Tucker, Pierre-Luc Bacon, Adam Gleave
Increasing model size has unlocked a dazzling array of capabilities in modern
language models. At the same time, even frontier models remain vulnerable to
jailbreaks and prompt injections, despite concerted efforts to make them
robust. As both attack and defense gain access to more compute, and as models
become larger, what happens to robustness? We argue that to answer this
question requires a \emph{scaling} approach, which we employ in an extensive
study of language model robustness across several classification tasks, model
families, and adversarial attacks. We find that in the absence of explicit
safety training, larger models are not consistently more robust; however, scale
improves sample efficiency in adversarial training, though it worsens compute
efficiency. Further, we find that increasing attack compute smoothly improves
attack success rate against both undefended and adversarially trained models.
Finally, after exploring robustness transfer across attacks and threat models,
we combine attack and defense scaling rates to study the offense-defense
balance. We find that while attack scaling outpaces adversarial training across
all models studied, larger adversarially trained models might give defense the
advantage in the long run. These results underscore the utility of the scaling
lens, and provide a paradigm for evaluating future attacks and defenses on
frontier models.
comment: 59 pages; updated to ICML version
♻ ☆ FastDraft: How to Train Your Draft ACL 2025
Speculative Decoding has gained popularity as an effective technique for
accelerating the auto-regressive inference process of Large Language Models.
However, Speculative Decoding entirely relies on the availability of efficient
draft models, which are often lacking for many existing language models due to
a stringent constraint of vocabulary compatibility. In this work we introduce
FastDraft, a novel and efficient approach for pre-training and aligning a draft
model to any large language model by incorporating efficient pre-training,
followed by fine-tuning over synthetic datasets generated by the target model.
We demonstrate FastDraft by training two highly parameter-efficient drafts for
the popular Phi-3-mini and Llama-3.1-8B models. Using FastDraft, we were able
to produce a draft model by training on approximately 10 billion tokens on a
single server with 8 Intel® Gaudi® 2 accelerators in under 24 hours. Our
results show that the draft model achieves strong acceptance rates and block
efficiency, with up to a 3x memory-bound speedup on code completion and up to
2x on summarization, text completion, and instruction tasks. We validate our
theoretical findings through benchmarking on the latest Intel® Core™ Ultra,
achieving a wall-clock time speedup of up to 2x, indicating a significant
reduction in runtime. Due to its high quality, FastDraft unlocks large language
model inference on AI PCs and other edge devices.
comment: Accepted at ACL 2025
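For context, acceptance rate and block efficiency matter because of the standard speculative-decoding loop: the draft proposes a block of tokens and the target model verifies them, committing the longest agreeing prefix. A simplified greedy-verification sketch follows (the canonical method uses probabilistic acceptance; this version only illustrates the control flow, and the toy models are placeholders).

```python
def speculative_decode_step(draft_next, target_next, context, block_size=4):
    """One speculative block with greedy verification.
    draft_next / target_next: callables mapping a token list to the next token.
    Returns the tokens actually committed this step."""
    # 1) The draft proposes a block of candidate tokens cheaply.
    proposal, ctx = [], list(context)
    for _ in range(block_size):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) The target verifies the block: keep the longest agreeing prefix,
    #    then append the target's own token at the first disagreement.
    committed, ctx = [], list(context)
    for t in proposal:
        expected = target_next(ctx)
        if expected == t:
            committed.append(t)
            ctx.append(t)
        else:
            committed.append(expected)
            break
    else:
        committed.append(target_next(ctx))  # bonus token when the whole block is accepted
    return committed

# Toy check: the models agree on the first two steps, then diverge.
draft = lambda ctx: len(ctx) + 1
target = lambda ctx: len(ctx) + 1 if len(ctx) < 3 else 99
print(speculative_decode_step(draft, target, [0]))  # [2, 3, 99]
```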
♻ ☆ The Vector Grounding Problem
The remarkable performance of large language models (LLMs) on complex
linguistic tasks has sparked debate about their capabilities. Unlike humans,
these models learn language solely from textual data without directly
interacting with the world. Yet they generate seemingly meaningful text on
diverse topics. This achievement has renewed interest in the classical `Symbol
Grounding Problem' -- the question of whether the internal representations and
outputs of symbolic AI systems can possess intrinsic meaning that is not
parasitic on external interpretation. Although modern LLMs compute over vectors
rather than symbols, an analogous problem arises for these systems, which we
call the Vector Grounding Problem. This paper has two main goals. First, we
distinguish five main notions of grounding that are often conflated in the
literature, and argue that only one of them, which we call referential
grounding, is relevant to the Vector Grounding Problem. Second, drawing on
philosophical theories of representational content, we provide two arguments
for the claim that LLMs and related systems can achieve referential grounding:
(1) through preference fine-tuning methods that explicitly establish
world-involving functions, and (2) through pre-training alone, which in limited
domains may select for internal states with world-involving content, as
mechanistic interpretability research suggests. Through these pathways, LLMs
can establish connections to the world sufficient for intrinsic meaning. One
potentially surprising implication of our discussion is that multimodality
and embodiment are neither necessary nor sufficient to overcome the Grounding
Problem.
♻ ☆ EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents ICML 2025
Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
agents offers a promising avenue for tackling real-world tasks. While
language-centric embodied agents have garnered substantial attention,
MLLM-based embodied agents remain underexplored due to the lack of
comprehensive evaluation frameworks. To bridge this gap, we introduce
EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
tasks across four environments, ranging from high-level semantic tasks (e.g.,
household) to low-level tasks involving atomic actions (e.g., navigation and
manipulation); and (2) six meticulously curated subsets evaluating essential
agent capabilities like commonsense reasoning, complex instruction
understanding, spatial awareness, visual perception, and long-term planning.
Through extensive experiments, we evaluated 24 leading proprietary and
open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel
at high-level tasks but struggle with low-level manipulation, with the best
model, GPT-4o, scoring only 28.9\% on average. EmbodiedBench provides a
multifaceted standardized evaluation platform that not only highlights existing
challenges but also offers valuable insights to advance MLLM-based embodied
agents. Our code and dataset are available at https://embodiedbench.github.io.
comment: Accepted to ICML 2025
♻ ☆ Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
Large Language Models (LLMs) show remarkable proficiency in natural language
tasks, yet their frequent overconfidence (misalignment between predicted
confidence and true correctness) poses significant risks in critical
decision-making applications. We present a comprehensive analysis of
calibration across nine LLMs and three factual question-answering (QA)
datasets, systematically comparing standard free-generation settings against
structured distractor-augmented prompts. Our evaluation reveals that explicitly
incorporating distractors can substantially mitigate miscalibration, achieving
relative accuracy improvements up to 460% and ECE reductions up to 90%. Despite
general trends, we uncover nuanced findings: large RLHF-tuned models display
inherent calibration strengths but can paradoxically suffer increased
miscalibration on easier queries, whereas smaller models benefit
disproportionately from distractor prompts but remain significantly
miscalibrated. Through detailed analyses across question types, we identify
persistent calibration failures, particularly in person-based queries. We
conclude with concrete recommendations (targeted fine-tuning, structured
prompting, and strategic model choice) to ensure reliable, trustworthy LLM
deployments.
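Expected calibration error, the metric the reductions above refer to, bins predictions by confidence and averages the gap between confidence and accuracy across bins. A standard sketch, not tied to the paper's exact setup:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| over equal-width
    confidence bins. Lower values indicate better calibration."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Overconfident toy model: high stated confidence, mediocre accuracy.
print(expected_calibration_error([0.95, 0.90, 0.92, 0.88], [1, 0, 0, 1]))
```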
♻ ☆ Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models
(LLMs) by incorporating external retrieved information, mitigating issues such
as hallucination and outdated knowledge. However, RAG systems are highly
sensitive to retrieval noise prevalent in real-world scenarios. Existing
benchmarks fail to emulate the complex and heterogeneous noise distributions
encountered in real-world retrieval environments, undermining reliable
robustness assessment. In this paper, we define four categories of retrieval
noise based on linguistic properties and noise characteristics, aiming to
reflect the heterogeneity of noise in real-world scenarios. Building on this,
we introduce Magic Mushroom, a benchmark for replicating "magic mushroom"
noise: contexts that appear relevant on the surface but covertly mislead RAG
systems. Magic Mushroom comprises 7,468 single-hop and 3,925 multi-hop
question-answer pairs. More importantly, Magic Mushroom enables researchers to
flexibly configure combinations of retrieval noise according to specific
research objectives or application scenarios, allowing for highly controlled
evaluation setups. We evaluate LLM generators of varying parameter scales and
classic RAG denoising strategies under diverse noise distributions to
investigate their performance dynamics during progressive noise encroachment.
Our analysis reveals that both generators and denoising strategies have
significant room for improvement and exhibit extreme sensitivity to noise
distributions. Magic Mushroom emerges as a promising tool for evaluating and
advancing noise-robust RAG systems, accelerating their widespread deployment in
real-world applications. The Magic Mushroom benchmark is available at
https://drive.google.com/file/d/1aP5kyPuk4L-L_uoI6T9UhxuTyt8oMqjT/view?usp=sharing.
♻ ☆ Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL ACL
Wichayaporn Wongkamjan, Yanze Wang, Feng Gu, Denis Peskoff, Jonathan K. Kummerfeld, Jonathan May, Jordan Lee Boyd-Graber
An increasingly common socio-technical problem is people being taken in by
offers that sound "too good to be true", where persuasion and trust shape
decision-making. This paper investigates how AI can help detect these
deceptive scenarios. We analyze how humans strategically deceive each other in
Diplomacy, a board game that requires both natural language
communication and strategic reasoning. This requires extracting logical forms
of proposed agreements in player communications and computing the relative
rewards of the proposals using agents' value functions. Combined with text-based
features, this can improve our deception detection. Our method detects human
deception with high precision compared to a Large Language Model
approach that flags many true messages as deceptive. Future human-AI
interaction tools can build on our methods for deception detection by
triggering friction to give users a chance to interrogate suspicious
proposals.
comment: ACL Findings 2025
♻ ☆ Propaganda and Information Dissemination in the Russo-Ukrainian War: Natural Language Processing of Russian and Western Twitter Narratives
The conflict in Ukraine has been not only characterised by military
engagement but also by a significant information war, with social media
platforms like X, formerly known as Twitter, playing an important role in
shaping public perception. This article provides an analysis of tweets from
propaganda accounts and trusted accounts collected from the onset of the war in
February 2022 until mid-May 2022 (n = 40,000 tweets in total). We
utilise natural language processing and machine learning algorithms to assess
the sentiment and identify key themes, topics and narratives across the dataset
with human-in-the-loop (HITL) analysis throughout. Our findings indicate
distinct strategies in how information is created, spread, and targeted at
different audiences by both sides. Propaganda accounts frequently employ
emotionally charged language and disinformation to evoke fear and distrust,
whereas other accounts, primarily Western tend to focus on factual reporting
and humanitarian aspects of the conflict. Clustering analysis reveals groups of
accounts with similar behaviours, which we suspect indicates the presence of
coordinated efforts. This research attempts to contribute to our understanding
of the dynamics of information warfare and offers techniques for future studies
on social media influence in military conflicts.
comment: 7 pages; 6 figures
♻ ☆ An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals ACL 2025
Deep Reinforcement Learning (DRL) is widely used in task-oriented dialogue
systems to optimize dialogue policy, but it struggles to balance exploration
and exploitation due to the high dimensionality of state and action spaces.
This challenge often results in local optima or poor convergence. Evolutionary
Algorithms (EAs) have been proven to effectively explore the solution space of
neural networks by maintaining population diversity. Inspired by this, we
innovatively combine the global search capabilities of EA with the local
optimization of DRL to achieve a balance between exploration and exploitation.
Nevertheless, the inherent flexibility of natural language in dialogue tasks
complicates this direct integration, leading to prolonged evolutionary times.
Thus, we further propose an elite individual injection (EII) mechanism to enhance
the EA's search efficiency by adaptively introducing best-performing individuals
into the population. Experiments across four datasets show that our approach
significantly improves the balance between exploration and exploitation,
boosting performance. Moreover, we demonstrate that the EII mechanism reduces
exploration time, enabling an efficient integration of EA and DRL for
task-oriented dialogue policy learning.
comment: Accepted to ACL 2025 (Main Track)
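The elite-injection idea can be illustrated independently of the dialogue setting: periodically copy the best-performing policy (for example, the DRL-trained one or the top evaluated individual) into the evolutionary population, replacing its weakest member. The sketch below shows that mechanism in generic form; the encoding, mutation, and fitness function are placeholders, not the paper's dialogue-policy setup.

```python
import random

def inject_elite(population, fitnesses, elite):
    """Replace the worst individual with a copy of the elite individual,
    steering the EA toward high-performing regions of the search space."""
    worst = min(range(len(population)), key=lambda i: fitnesses[i])
    population[worst] = list(elite)
    return population

def evolve(population, fitness_fn, elite_fn, generations=50, inject_every=5):
    for gen in range(generations):
        fitnesses = [fitness_fn(ind) for ind in population]
        # Standard EA step: tournament-select parents, then mutate offspring.
        parents = [max(random.sample(list(zip(population, fitnesses)), 3),
                       key=lambda x: x[1])[0] for _ in population]
        population = [[g + random.gauss(0, 0.1) for g in p] for p in parents]
        # Elite individual injection at a fixed (or adaptive) interval.
        if gen % inject_every == 0:
            population = inject_elite(population,
                                      [fitness_fn(ind) for ind in population],
                                      elite_fn())
    return population

# Toy usage: maximize -sum(x^2); the "elite" comes from an external optimizer.
fit = lambda ind: -sum(g * g for g in ind)
elite = lambda: [0.0] * 4
pop = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
best = max(evolve(pop, fit, elite), key=fit)
print(round(fit(best), 3))
```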
♻ ☆ Inducing lexicons of in-group language with socio-temporal context ACL 2025
In-group language is an important signifier of group dynamics. This paper
proposes a novel method for inducing lexicons of in-group language, which
incorporates its socio-temporal context. Existing methods for lexicon induction
do not capture the evolving nature of in-group language, nor the social
structure of the community. Using dynamic word and user embeddings trained on
conversations from online anti-women communities, our approach outperforms
prior methods for lexicon induction. We develop a test set for the task of
lexicon induction and a new lexicon of manosphere language, validated by human
experts, which quantifies the relevance of each term to a specific
sub-community at a given point in time. Finally, we present novel insights on
in-group language which illustrate the utility of this approach.
comment: Accepted to ACL 2025
♻ ☆ From Intention To Implementation: Automating Biomedical Research via LLMs SC
Conventional biomedical research is increasingly labor-intensive due to the
exponential growth of scientific literature and datasets. Artificial
intelligence (AI), particularly Large Language Models (LLMs), has the potential
to revolutionize this process by automating various steps. Still, significant
challenges remain, including the need for multidisciplinary expertise,
logicality of experimental design, and performance measurements. This paper
introduces BioResearcher, the first end-to-end automated system designed to
streamline the entire biomedical research process involving dry lab
experiments. BioResearcher employs a modular multi-agent architecture,
integrating specialized agents for search, literature processing, experimental
design, and programming. By decomposing complex tasks into logically related
sub-tasks and utilizing a hierarchical learning approach, BioResearcher
effectively addresses the challenges of multidisciplinary requirements and
logical complexity. Furthermore, BioResearcher incorporates an LLM-based
reviewer for in-process quality control and introduces novel evaluation metrics
to assess the quality and automation of experimental protocols. BioResearcher
successfully achieves an average execution success rate of 63.07% across eight
previously unmet research objectives. The generated protocols, on average,
outperform typical agent systems by 22.0% on five quality metrics. The system
demonstrates significant potential to reduce researchers' workloads and
accelerate biomedical discoveries, paving the way for future innovations in
automated research systems.
comment: To appear in SCIENCE CHINA Information Sciences. If you find our work
useful, please cite us as: @article{ BioResearcher, author = "Yi Luo and
Linghang Shi and Yihao Li and Aobo Zhuang and Yeyun Gong and Ling Liu and
Chen Lin", title = "From Intention To Implementation: Automating Biomedical
Research via LLMs", journal = "SCIENCE CHINA Information Sciences", year =
"2025" }
♻ ☆ Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing
Large Language Models (LLMs) have recently emerged as promising tools for
knowledge tracing (KT) due to their strong reasoning and generalization
abilities. While recent LLM-based KT methods have proposed new prompt formats,
they struggle to represent the full interaction histories of example learners
within a single prompt during in-context learning (ICL), resulting in limited
scalability and high computational cost under token constraints. In this work,
we present LLM-based Option-weighted Knowledge Tracing (LOKT), a
simple yet effective framework that encodes the interaction histories of
example learners in context as textual categorical option weights
(TCOW). TCOW are semantic labels (e.g., "inadequate") assigned to the
options selected by learners when answering questions, enhancing the
interpretability of LLMs. Experiments on multiple-choice datasets show that
LOKT outperforms existing non-LLM and LLM-based KT models in both cold-start
and warm-start settings. Moreover, LOKT enables scalable and cost-efficient
inference, achieving strong performance even under strict token constraints.
Our code is available at
https://anonymous.4open.science/r/LOKT_model-3233.
comment: 11 pages
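The core idea, as described, is to compress each past interaction into a short textual label for the chosen option rather than serializing the full record. The sketch below shows one hypothetical mapping from option scores to labels and the resulting compact prompt; the label names, thresholds, and prompt wording are illustrative, not the paper's.

```python
def option_label(score: float) -> str:
    """Map an option's estimated quality to a coarse textual weight."""
    if score >= 0.75:
        return "adequate"
    if score >= 0.4:
        return "partial"
    return "inadequate"

def encode_history(history):
    """history: list of (question_id, chosen_option, option_score).
    Returns a compact, token-efficient description of a learner."""
    parts = [f"Q{qid}: chose {opt} ({option_label(score)})"
             for qid, opt, score in history]
    return "; ".join(parts)

history = [(1, "B", 0.9), (2, "D", 0.2), (3, "A", 0.55)]
prompt = ("Learner history: " + encode_history(history) +
          "\nPredict whether the learner answers Q4 correctly (yes/no).")
print(prompt)
```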
♻ ☆ Rectified Sparse Attention
Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei
Efficient long-sequence generation is a critical challenge for Large Language
Models. While recent sparse decoding methods improve efficiency, they suffer
from KV cache misalignment, where approximation errors accumulate and degrade
generation quality. In this work, we propose Rectified Sparse Attention (ReSA),
a simple yet effective method that combines block-sparse attention with
periodic dense rectification. By refreshing the KV cache at fixed intervals
using a dense forward pass, ReSA bounds error accumulation and preserves
alignment with the pretraining distribution. Experiments across math reasoning,
language modeling, and retrieval tasks demonstrate that ReSA achieves
near-lossless generation quality with significantly improved efficiency.
Notably, ReSA delivers up to 2.42$\times$ end-to-end speedup under decoding at
256K sequence length, making it a practical solution for scalable long-context
inference. Code is available at https://aka.ms/ReSA-LM.
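The control flow of ReSA, as summarized, alternates cheap sparse decoding steps with an occasional dense pass that recomputes the KV cache so approximation error cannot accumulate indefinitely. The sketch below captures only that scheduling; sparse_decode_step and dense_refresh are placeholders for the block-sparse attention kernel and the dense forward pass, and the interval is illustrative.

```python
def generate_with_rectified_sparsity(prompt_ids, sparse_decode_step, dense_refresh,
                                     max_new_tokens=1024, rectify_every=64):
    """Sparse decoding with periodic dense rectification of the KV cache.
    sparse_decode_step(tokens, kv) -> (next_token, kv)   # approximate attention
    dense_refresh(tokens) -> kv                          # exact recomputation"""
    tokens = list(prompt_ids)
    kv = dense_refresh(tokens)  # start from an exact cache
    for step in range(max_new_tokens):
        next_token, kv = sparse_decode_step(tokens, kv)
        tokens.append(next_token)
        # Periodically rebuild the cache densely so sparse-attention
        # approximation errors are bounded rather than compounding.
        if (step + 1) % rectify_every == 0:
            kv = dense_refresh(tokens)
    return tokens

# Toy stubs so the control flow can be exercised end to end.
toy_sparse = lambda toks, kv: (len(toks), kv)
toy_dense = lambda toks: {"cache_len": len(toks)}
print(generate_with_rectified_sparsity([1, 2, 3], toy_sparse, toy_dense, max_new_tokens=5))
```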
♻ ☆ MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as
a powerful paradigm for post-training large language models (LLMs), achieving
state-of-the-art performance on tasks with structured, verifiable answers.
Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but
is complicated by the broader, heterogeneous nature of vision-language tasks
that demand nuanced visual, logical, and spatial capabilities. As such,
training MLLMs using RLVR on multiple datasets could be beneficial but creates
challenges with conflicting objectives from interaction among diverse datasets,
highlighting the need for optimal dataset mixture strategies to improve
generalization and reasoning. We introduce a systematic post-training framework
for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation
and benchmark implementation. Specifically, (1) We developed a multimodal RLVR
framework for multi-dataset post-training by curating a dataset that contains
different verifiable vision-language problems and enabling multi-domain online
RL learning with different verifiable rewards; (2) We proposed a data mixture
strategy that learns to predict the RL fine-tuning outcome from the data
mixture distribution, and consequently selects the best-performing mixture.
Comprehensive experiments showcase that multi-domain RLVR training, when
combined with mixture prediction strategies, can significantly boost MLLM
general reasoning capacities. Our best mixture improves the post-trained
model's accuracy on out-of-distribution benchmarks by an average of 5.24%
compared to the same model post-trained with uniform data mixture, and by a
total of 20.74% compared to the pre-finetuning baseline.
comment: Project Webpage: https://modomodo-rl.github.io/
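The mixture-optimization step described above amounts to a small surrogate model: fit a predictor from mixture proportions to observed post-training accuracy on a handful of trial runs, then search the simplex for the best predicted mixture. A minimal ridge-regression sketch under those assumptions follows; the trial data, domain count, and predictor choice are illustrative, and the paper's actual predictor may differ.

```python
import numpy as np

# Observed trial runs: each row is a mixture over 4 hypothetical domains
# (rows sum to 1) paired with the accuracy measured after RLVR post-training.
trial_mixtures = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
])
trial_scores = np.array([0.42, 0.45, 0.40, 0.47, 0.38])

# Ridge regression as the outcome predictor f(mixture) -> score.
X = np.hstack([trial_mixtures, np.ones((len(trial_mixtures), 1))])
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ trial_scores)

# Search the simplex (random Dirichlet candidates) for the best predicted mixture.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(4), size=10_000)
preds = np.hstack([candidates, np.ones((len(candidates), 1))]) @ w
best = candidates[np.argmax(preds)]
print("best predicted mixture:", np.round(best, 3))
```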