Computation and Language 104
☆ Do Large Language Model Benchmarks Test Reliability?
When deploying large language models (LLMs), it is important to ensure that
these models are not only capable, but also reliable. Many benchmarks have been
created to track LLMs' growing capabilities; however, there has been no similar
focus on measuring their reliability. To understand the potential ramifications
of this gap, we investigate how well current benchmarks quantify model
reliability. We find that pervasive label errors can compromise these
evaluations, obscuring lingering model failures and hiding unreliable behavior.
Motivated by this gap in the evaluation of reliability, we then propose the
concept of platinum benchmarks, i.e., benchmarks carefully curated to
minimize label errors and ambiguity. As a first attempt at constructing such
benchmarks, we revise examples from fifteen existing popular benchmarks. We
evaluate a wide range of models on these platinum benchmarks and find that,
indeed, frontier LLMs still exhibit failures on simple tasks such as
elementary-level math word problems. Analyzing these failures further reveals
previously unidentified patterns of problems on which frontier models
consistently struggle. We provide code at
https://github.com/MadryLab/platinum-benchmarks
☆ Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training
Small language models (SLMs) have attracted considerable attention from both
academia and industry due to their broad range of applications in edge devices.
To obtain SLMs with strong performance, conventional approaches either
pre-train the models from scratch, which incurs substantial computational
costs, or compress/prune existing large language models (LLMs), which results
in performance drops and falls short in comparison to pre-training. In this
paper, we investigate the family of acceleration methods that involve both
structured pruning and model training. We find that 1) layer-wise adaptive pruning
(Adapt-Pruner) is extremely effective in LLMs and yields significant
improvements over existing pruning techniques, 2) adaptive pruning equipped
with further training leads to models comparable to those pre-trained from
scratch, 3) incremental pruning brings non-trivial performance gain by
interleaving pruning with training and only removing a small portion of neurons
($\sim$5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that
Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner,
FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense
benchmarks. Additionally, Adapt-Pruner restores the performance of
MobileLLM-125M to 600M on the MMLU benchmark with 200$\times$ fewer tokens via
pruning from its larger counterparts, and discovers a new 1B model that
surpasses LLaMA-3.2-1B in multiple benchmarks.
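As a rough illustration of the incremental schedule described above, the
following sketch removes a small fraction of the least important hidden
neurons from one MLP block; in practice such a step would be interleaved with
short recovery-training phases. The importance heuristic and the two-matrix
MLP layout are simplifying assumptions, not the authors' implementation.

    import torch.nn as nn

    def prune_mlp_neurons(up: nn.Linear, down: nn.Linear, frac: float = 0.05):
        """Drop the `frac` least important hidden neurons of a two-layer MLP."""
        # Assumed heuristic: a neuron matters little if both its input weights
        # (a row of `up`) and output weights (a column of `down`) are small.
        importance = up.weight.norm(dim=1) * down.weight.norm(dim=0)
        keep = importance.topk(int(up.out_features * (1 - frac))).indices.sort().values
        new_up = nn.Linear(up.in_features, len(keep), bias=up.bias is not None)
        new_down = nn.Linear(len(keep), down.out_features, bias=down.bias is not None)
        new_up.weight.data = up.weight.data[keep].clone()
        new_down.weight.data = down.weight.data[:, keep].clone()
        if up.bias is not None:
            new_up.bias.data = up.bias.data[keep].clone()
        if down.bias is not None:
            new_down.bias.data = down.bias.data.clone()
        return new_up, new_down

    # Incremental pruning then alternates: prune ~5% of neurons, train briefly
    # to recover, and repeat until the target model size is reached.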
☆ On Fairness of Unified Multimodal Large Language Model for Image Generation
Unified multimodal large language models (U-MLLMs) have demonstrated
impressive performance in visual understanding and generation in an end-to-end
pipeline. Compared with generation-only models (e.g., Stable Diffusion),
U-MLLMs may raise new questions about bias in their outputs, which can be
affected by their unified capabilities. This gap is particularly concerning
given the under-explored risk of propagating harmful stereotypes. In this
paper, we benchmark the latest U-MLLMs and find that most exhibit significant
demographic biases, such as gender and race bias. To better understand and
mitigate this issue, we propose a locate-then-fix strategy, in which we audit
each model component and show how it is affected by bias. Our analysis shows
that bias originates primarily from the language model. More interestingly, we
observe a "partial alignment" phenomenon in U-MLLMs, where understanding bias
appears minimal, but generation bias remains substantial. Thus, we propose a
novel balanced preference model to balance the demographic distribution with
synthetic data. Experiments demonstrate that our approach reduces demographic
bias while preserving semantic fidelity. We hope our findings underscore the
need for more holistic interpretation and debiasing strategies of U-MLLMs in
the future.
☆ Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts
Zero-shot prompting techniques have significantly improved the performance of
Large Language Models (LLMs). However, we lack a clear understanding of why
zero-shot prompts are so effective. For example, in the prompt "Let's think
step-by-step," is "think" or "step-by-step" more crucial to its success?
Existing interpretability methods, such as gradient-based and attention-based
approaches, are computationally intensive and restricted to open-source models.
We introduce the ZIP score (Zero-shot Importance of Perturbation score), a
versatile metric applicable to both open and closed-source models, based on
systematic input word perturbations. Our experiments across four recent LLMs,
seven widely-used prompts, and several tasks reveal interesting patterns in
word importance. For instance, while both "step-by-step" and "think" show high
ZIP scores, which one is more influential depends on the model and task. We
validate our method using controlled experiments and compare our results with
human judgments, finding that proprietary models align more closely with human
intuition regarding word significance. These findings enhance our understanding
of LLM behavior and contribute to developing more effective zero-shot prompts
and improved model analysis.
comment: 8 pages (excluding references)
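The core measurement behind the ZIP score is easy to sketch: perturb one word
at a time and record how task performance changes. The deletion perturbation
and the `evaluate` callable below are illustrative assumptions; the paper's
exact perturbations and normalization may differ.

    from typing import Callable, Dict

    def zip_scores(prompt: str, evaluate: Callable[[str], float]) -> Dict[str, float]:
        """Score each prompt word by the accuracy drop when it is removed."""
        base = evaluate(prompt)  # task accuracy with the full prompt
        words = prompt.split()
        scores = {}
        for i, word in enumerate(words):
            perturbed = " ".join(words[:i] + words[i + 1:])
            scores[word] = base - evaluate(perturbed)  # larger drop = more important
        return scores

    # Usage: zip_scores("Let's think step-by-step.", my_benchmark_accuracy)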
☆ SPRI: Aligning Large Language Models with Context-Situated Principles
Aligning Large Language Models to integrate and reflect human values,
especially for tasks that demand intricate human oversight, is arduous since it
is resource-intensive and time-consuming to depend on human expertise for
context-specific guidance. Prior work has utilized predefined sets of rules or
principles to steer the behavior of models (Bai et al., 2022; Sun et al.,
2023). However, these principles tend to be generic, making it challenging to
adapt them to each individual input query or context. In this work, we present
Situated-PRInciples (SPRI), a framework requiring minimal or no human effort
that is designed to automatically generate guiding principles in real-time for
each input query and utilize them to align each response. We evaluate SPRI on
three tasks, and show that 1) SPRI can derive principles for a complex
domain-specific task that lead to performance on par with expert-crafted ones;
2) SPRI-generated principles lead to instance-specific rubrics that outperform
prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data
leads to substantial improvement on truthfulness. We release our code and model
generations at https://github.com/honglizhan/SPRI-public.
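At its simplest, real-time principle generation can be pictured as two chained
model calls, as in the sketch below. The `llm` function and the prompt wording
are assumptions; the actual framework additionally uses the generated
principles to critique and refine responses.

    def spri_respond(query: str, llm) -> str:
        """Derive context-situated principles for this query, then follow them."""
        principles = llm(
            "Write three concise guiding principles for producing a good "
            f"response to this specific input:\n\n{query}"
        )
        return llm(
            f"Principles:\n{principles}\n\n"
            f"Following these principles, respond to:\n{query}"
        )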
☆ LIMO: Less is More for Reasoning
We present a fundamental discovery that challenges our understanding of how
complex reasoning emerges in large language models. While conventional wisdom
suggests that sophisticated reasoning tasks demand extensive training data
(>100,000 examples), we demonstrate that complex mathematical reasoning
abilities can be effectively elicited with surprisingly few examples. Through
comprehensive experiments, our proposed model LIMO demonstrates unprecedented
performance in mathematical reasoning. With merely 817 curated training
samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from
previous SFT-based models' 6.5% and 59.2%, respectively, while using only 1% of
the training data required by previous approaches. LIMO demonstrates
exceptional out-of-distribution generalization, achieving 40.5% absolute
improvement across 10 diverse benchmarks, outperforming models trained on 100x
more data, challenging the notion that SFT leads to memorization rather than
generalization. Based on these results, we propose the Less-Is-More Reasoning
Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has
been comprehensively encoded during pre-training, sophisticated reasoning
capabilities can emerge through minimal but precisely orchestrated
demonstrations of cognitive processes. This hypothesis posits that the
elicitation threshold for complex reasoning is determined by two key factors:
(1) the completeness of the model's encoded knowledge foundation during
pre-training, and (2) the effectiveness of post-training examples as "cognitive
templates" that show the model how to utilize its knowledge base to solve
complex reasoning tasks. To facilitate reproducibility and future research in
data-efficient reasoning, we release LIMO as a comprehensive open-source suite
at https://github.com/GAIR-NLP/LIMO.
comment: 17 pages
☆ High-Fidelity Simultaneous Speech-To-Speech Translation
We introduce Hibiki, a decoder-only model for simultaneous speech
translation. Hibiki leverages a multistream language model to synchronously
process source and target speech, and jointly produces text and audio tokens to
perform speech-to-text and speech-to-speech translation. We furthermore address
the fundamental challenge of simultaneous interpretation: unlike its
consecutive counterpart, where one waits for the end of the source utterance
before translating, a simultaneous interpreter must adapt its flow,
accumulating just enough context to produce a correct translation in
real-time, chunk by chunk. To do so, we introduce a
weakly-supervised method that leverages the perplexity of an off-the-shelf text
translation system to identify optimal delays on a per-word basis and create
aligned synthetic data. After supervised training, Hibiki performs adaptive,
simultaneous speech translation with vanilla temperature sampling. On a
French-English simultaneous speech translation task, Hibiki demonstrates
state-of-the-art performance in translation quality, speaker fidelity and
naturalness. Moreover, the simplicity of its inference process makes it
compatible with batched translation and even real-time on-device deployment. We
provide examples as well as models and inference code.
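The delay-labeling idea can be sketched as a scan over source prefixes: a
target word is emitted as soon as an off-the-shelf translation model becomes
confident enough to produce it. The `word_logprob` scorer and the fixed
threshold are illustrative assumptions; the paper derives delays from
perplexity and uses them to build aligned synthetic training data.

    def optimal_delays(src_words, tgt_words, word_logprob, threshold=-1.5):
        """For each target word, find the shortest source prefix that suffices."""
        delays = []
        for t in range(len(tgt_words)):
            for s in range(1, len(src_words) + 1):
                # Log-probability of the next target word given a source prefix
                # and the target words produced so far (assumed MT scorer).
                lp = word_logprob(src_words[:s], tgt_words[:t], tgt_words[t])
                if lp >= threshold or s == len(src_words):
                    delays.append(s)  # emit target word t after s source words
                    break
        return delays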
☆ Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality
This paper reports on the results from a pilot study investigating the impact
of automatic speech recognition (ASR) technology on interpreting quality in
remote healthcare interpreting settings. Employing a within-subjects experiment
design with four randomised conditions, this study utilises scripted medical
consultations to simulate dialogue interpreting tasks. It involves four trainee
interpreters with a language combination of Chinese and English. It also
gathers participants' experience and perceptions of ASR support through cued
retrospective reports and semi-structured interviews. Preliminary data suggest
that the availability of ASR, specifically access to full ASR transcripts
and to ChatGPT-generated summaries based on ASR, effectively improved
interpreting quality. Varying types of ASR output had different impacts on the
distribution of interpreting error types. Participants reported similar
interactive experiences with the technology, expressing their preference for
full ASR transcripts. This pilot study shows encouraging results of applying
ASR to dialogue-based healthcare interpreting and offers insights into the
optimal ways to present ASR output to enhance interpreter experience and
performance. However, it should be emphasised that the main purpose of this
study was to validate the methodology and that further research with a larger
sample size is necessary to confirm these findings.
comment: to appear in the Proceedings of Translation and the Computer (TC46)
☆ Demystifying Long Chain-of-Thought Reasoning in LLMs
Scaling inference compute enhances reasoning in large language models (LLMs),
with long chains-of-thought (CoTs) enabling strategies like backtracking and
error correction. Reinforcement learning (RL) has emerged as a crucial method
for developing these capabilities, yet the conditions under which long CoTs
emerge remain unclear, and RL training requires careful design choices. In this
study, we systematically investigate the mechanics of long CoT reasoning,
identifying the key factors that enable models to generate long CoT
trajectories. Through extensive supervised fine-tuning (SFT) and RL
experiments, we present four main findings: (1) While SFT is not strictly
necessary, it simplifies training and improves efficiency; (2) Reasoning
capabilities tend to emerge with increased training compute, but their
development is not guaranteed, making reward shaping crucial for stabilizing
CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We
find that leveraging noisy, web-extracted solutions with filtering mechanisms
shows strong potential, particularly for out-of-distribution (OOD) tasks such
as STEM reasoning; and (4) Core abilities like error correction are inherently
present in base models, but incentivizing these skills effectively for complex
tasks via RL demands significant compute, and measuring their emergence
requires a nuanced approach. These insights provide practical guidance for
optimizing training strategies to enhance long CoT reasoning in LLMs. Our code
is available at: https://github.com/eddycmu/demystify-long-cot.
comment: Preprint, under review
☆ Minerva: A Programmable Memory Test Benchmark for Language Models
How effectively can LLM-based AI assistants utilize their memory (context) to
perform various tasks? Traditional data benchmarks, which are often manually
crafted, suffer from several limitations: they are static, susceptible to
overfitting, difficult to interpret, and lack actionable insights, failing to
pinpoint the specific capabilities a model lacks when it does not pass a test.
In this paper, we present a framework for automatically generating a
comprehensive set of tests to evaluate models' abilities to use their memory
effectively. Our framework extends the range of capability tests beyond the
commonly explored search tasks (passkey, key-value, needle-in-a-haystack) that
dominate the literature. Specifically, we evaluate models on atomic
tasks such as searching, recalling, editing, matching, comparing information in
context memory, and performing basic operations when inputs are structured into
distinct blocks, simulating real-world data. Additionally, we design composite
tests to investigate the models' ability to maintain state while operating on
memory. Our benchmark enables an interpretable, detailed assessment of memory
capabilities of LLMs.
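Because tests are generated programmatically, gold answers come for free. A
minimal example of one atomic task, key-value recall over structured blocks,
might look like the following; the phrasing and block format are assumptions.

    import random
    import string

    def make_kv_test(n_pairs=50, seed=0):
        """Synthesize a key-value context, a query, and its known answer."""
        rng = random.Random(seed)
        rand = lambda: "".join(rng.choices(string.ascii_lowercase, k=6))
        pairs = {rand(): rand() for _ in range(n_pairs)}
        key = rng.choice(list(pairs))
        context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
        question = f"What value is associated with the key '{key}'?"
        return context, question, pairs[key]  # (memory, query, gold answer)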
☆ ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiaqi Wang, Mengkang Hu, Zhi Chen, Wanxiang Che, Ting Liu
Recent advancements in large language models (LLMs) have led to significant
successes across various applications, most notably a series of emergent
capabilities, particularly in the areas of In-Context Learning
(ICL) and Chain-of-Thought (CoT). To better understand and control model
performance, many studies have begun investigating the underlying causes of
these phenomena and their impact on task outcomes. However, existing
explanatory frameworks predominantly focus on isolating and explaining ICL and
CoT independently, leading to an incomplete understanding of their combined
influence on model performance. To address this gap, we propose the Electronic
Circuit Model (ECM), which provides a foundation for developing scalable,
learnable policies and improving the management of AI-generated content.
Specifically, ECM conceptualizes model behavior as an electronic circuit: ICL
is represented as a semantic magnetic field that induces an additional voltage,
following Faraday's law, while CoT is modeled as series resistors that
constrain the model's output performance, following Ohm's law. Experimental results
demonstrate that the ECM effectively predicts and explains LLM performance
across a variety of prompting strategies. Furthermore, we apply ECM to advanced
reasoning strategy optimization on a series of tasks, such as the International
Olympiad in Informatics (IOI) and the International Mathematical Olympiad
(IMO), achieving competitive performance that surpasses nearly 80% of top human
competitors.
comment: Manuscript
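One schematic reading of the circuit analogy, with symbols chosen here for
illustration rather than taken from the paper: task performance plays the role
of current, ICL contributes a field-induced source voltage, and each CoT step
adds a series resistance.

    % Illustrative notation only; not the paper's formulas.
    \[
      I_{\mathrm{perf}} \,=\, \frac{V_{\mathrm{base}} + V_{\mathrm{ICL}}}{\sum_{i=1}^{n} R_{\mathrm{CoT},i}},
      \qquad
      V_{\mathrm{ICL}} \,\propto\, \frac{\mathrm{d}\Phi_{\mathrm{semantic}}}{\mathrm{d}t}
    \]

Under this reading, the ICL voltage follows the Faraday analogy (induced by a
changing semantic flux) and the CoT resistances combine in series per Ohm's law.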
☆ Out-of-Distribution Detection using Synthetic Data Generation
Distinguishing in- and out-of-distribution (OOD) inputs is crucial for
reliable deployment of classification systems. However, OOD data is typically
unavailable or difficult to collect, posing a significant challenge for
accurate OOD detection. In this work, we present a method that harnesses the
generative capabilities of Large Language Models (LLMs) to create high-quality
synthetic OOD proxies, eliminating the dependency on any external OOD data
source. We study the efficacy of our method on classical text classification
tasks such as toxicity detection and sentiment classification as well as
classification tasks arising in LLM development and deployment, such as
training a reward model for RLHF and detecting misaligned generations.
Extensive experiments on nine InD-OOD dataset pairs and various model sizes
show that our approach dramatically lowers false positive rates (achieving a
perfect zero in some cases) while maintaining high accuracy on in-distribution
tasks, outperforming baseline methods by a significant margin.
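The recipe reduces to prompting an LLM for OOD proxies and training an
ordinary detector on them. In the sketch below, the prompt, the embedding
function, and the logistic-regression detector are illustrative assumptions,
not the paper's exact pipeline.

    from sklearn.linear_model import LogisticRegression

    def build_ood_detector(ind_texts, llm, embed, n_synthetic=200):
        """Train an InD-vs-OOD classifier using LLM-generated OOD proxies."""
        prompt = ("Write one short text that is clearly OUTSIDE the domain of "
                  "the following examples:\n" + "\n".join(ind_texts[:5]))
        ood_texts = [llm(prompt) for _ in range(n_synthetic)]
        X = [embed(t) for t in ind_texts + ood_texts]
        y = [0] * len(ind_texts) + [1] * len(ood_texts)
        return LogisticRegression(max_iter=1000).fit(X, y)  # predicts p(OOD)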
☆ Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Large language models (LLMs) excel across various tasks, but standard
first-order (FO) fine-tuning demands considerable memory, significantly
limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood
out as a promising memory-efficient training paradigm, avoiding backward passes
and relying solely on forward passes for gradient estimation, making it
attractive for resource-constrained scenarios. However, ZO methods lag far
behind FO methods in both convergence speed and accuracy. To bridge this gap,
we introduce a novel layer-wise divergence analysis that uncovers the distinct
update patterns of FO and ZO optimization. Aiming to recover the learning
capacity of FO methods based on these findings, we propose \textbf{Di}vergence-driven
\textbf{Z}eroth-\textbf{O}rder (\textbf{DiZO}) optimization. DiZO conducts
divergence-driven layer adaptation by incorporating projections to ZO updates,
generating diverse-magnitude updates precisely scaled to layer-wise individual
optimization needs. Our results demonstrate that DiZO significantly reduces the
needed iterations for convergence without sacrificing throughput, cutting
training GPU hours by up to 48\% on various datasets. Moreover, DiZO
consistently outperforms the representative ZO baselines in fine-tuning
RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some
cases, even surpasses memory-intensive FO fine-tuning.
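For context, the forward-only gradient estimation that ZO fine-tuning builds
on fits in a few lines. This is a generic MeZO/SPSA-style step under assumed
interfaces; DiZO's divergence-driven layer-wise projections are not reproduced
here.

    import torch

    @torch.no_grad()
    def zo_step(params, loss_fn, lr=1e-6, eps=1e-3):
        """One zeroth-order update from two forward passes (no backward pass)."""
        z = [torch.randn_like(p) for p in params]          # shared perturbation
        for p, zi in zip(params, z): p.add_(eps * zi)      # theta + eps*z
        loss_plus = loss_fn()
        for p, zi in zip(params, z): p.add_(-2 * eps * zi) # theta - eps*z
        loss_minus = loss_fn()
        for p, zi in zip(params, z): p.add_(eps * zi)      # restore theta
        g = (loss_plus - loss_minus) / (2 * eps)           # directional derivative
        for p, zi in zip(params, z): p.add_(-lr * g * zi)  # SGD-style update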
☆ MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters
Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
While increasing patients' access to medical documents improves medical care,
this benefit is limited by varying health literacy levels and complex medical
terminology. Large language models (LLMs) offer solutions by simplifying
medical information. However, evaluating LLMs for safe and patient-friendly
text generation is difficult due to the lack of standardized evaluation
resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
created from MIMIC-IV discharge summaries through an automated pipeline
combining LLM-based question-answer generation with manual quality checks. We
use this dataset to evaluate various LLMs on patient-oriented
question-answering. Our findings reveal that general-purpose LLMs frequently
surpass biomedical-adapted models, while automated metrics correlate with human
judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
development of LLMs to enhance patient understanding and ultimately improve
care outcomes.
☆ ALPET: Active Few-shot Learning for Citation Worthiness Detection in Low-Resource Wikipedia Languages
Citation Worthiness Detection (CWD) is the task of determining which
sentences, within an article or collection, should be backed up with a citation
to validate the information they provide. This study introduces ALPET, a framework
combining Active Learning (AL) and Pattern-Exploiting Training (PET), to
enhance CWD for languages with limited data resources. Applied to Catalan,
Basque, and Albanian Wikipedia datasets, ALPET outperforms the existing CCW
baseline while reducing the amount of labeled data required, in some cases by
more than 80\%. ALPET's performance plateaus after 300 labeled samples, showing
its suitability for low-resource scenarios where large labeled datasets are not common. While
specific active learning query strategies, like those employing K-Means
clustering, can offer advantages, their effectiveness is not universal and
often yields marginal gains over random sampling, particularly with smaller
datasets. This suggests that random sampling, despite its simplicity, remains a
strong baseline for CWD in resource-constrained environments. Overall, ALPET's
ability to achieve high performance with fewer labeled samples makes it a
promising tool for enhancing the verifiability of online content in
low-resource language settings.
comment: 24 pages, 8 figures, 4 tables
☆ SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs
Recent advancements have highlighted that Large Language Models (LLMs) are
prone to hallucinations when solving complex reasoning problems, leading to
erroneous results. To tackle this issue, researchers incorporate Knowledge
Graphs (KGs) to improve the reasoning ability of LLMs. However, existing
methods face two limitations: 1) they typically assume that all answers to the
questions are contained in KGs, neglecting the incompleteness issue of KGs, and
2) they treat the KG as a static repository and overlook the implicit logical
reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an
innovative neural-symbolic agent framework that achieves collaborative
augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments
and transform complex reasoning tasks into a multi-step interactive process,
enabling KGs to participate deeply in the reasoning process. SymAgent consists
of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages
the LLM's inductive reasoning capability to extract symbolic rules from KGs,
guiding efficient question decomposition. The Agent-Executor autonomously
invokes predefined action tools to integrate information from KGs and external
documents, addressing the issue of KG incompleteness. Furthermore, we design a
self-learning framework comprising online exploration and offline iterative
policy updating phases, enabling the agent to automatically synthesize
reasoning trajectories and improve performance. Experimental results
demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields
better or comparable performance compared to various strong baselines. Further
analysis reveals that our agent can identify missing triples, facilitating
automatic KG updates.
☆ Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning
Large Language Models (LLMs) excel at reasoning and planning when trained on
chain-of-thought (CoT) data, where the step-by-step thought process is
explicitly outlined by text tokens. However, this results in lengthy inputs
where many words support textual coherence rather than core reasoning
information, and processing these inputs consumes substantial computation
resources. In this work, we propose a hybrid representation of the reasoning
process, where we partially abstract away the initial reasoning steps using
latent discrete tokens generated by VQ-VAE, significantly reducing the length
of reasoning traces. We explore the use of latent trace abstractions in two
scenarios: 1) training the model from scratch for the Keys-Finding Maze
problem, 2) fine-tuning LLMs on this hybrid data with an extended vocabulary
including unseen latent tokens, for both logical and mathematical reasoning
problems. To facilitate effective learning, we introduce a simple training
procedure that randomly mixes latent and text tokens, which enables fast
adaptation to new latent tokens. Our approach consistently outperforms the
baseline methods across various benchmarks.
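The random mixing procedure can be sketched as replacing a random-length
prefix of the text CoT with its latent abstraction. The `encode_latent`
compressor (standing in for the trained VQ-VAE) and the uniform cut point are
assumptions.

    import random

    def mix_trace(cot_tokens, encode_latent):
        """Abstract a random prefix of a reasoning trace into latent token ids."""
        cut = random.randint(0, len(cot_tokens))
        # A few latent ids from the extended vocabulary stand in for many
        # text tokens, shortening the reasoning trace.
        return encode_latent(cot_tokens[:cut]) + cot_tokens[cut:]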
☆ Efficient extraction of medication information from clinical notes: an evaluation in two languages
Thibaut Fabacher, Erik-André Sauleau, Emmanuelle Arcay, Bineta Faye, Maxime Alter, Archia Chahard, Nathan Miraillet, Adrien Coulet, Aurélie Névéol
Objective: To evaluate the accuracy, computational cost and portability of a
new Natural Language Processing (NLP) method for extracting medication
information from clinical narratives. Materials and Methods: We propose an
original transformer-based architecture for the extraction of entities and
their relations pertaining to patients' medication regimen. First, we used this
approach to train and evaluate a model on French clinical notes, using a newly
annotated corpus from Hôpitaux Universitaires de Strasbourg. Second, the
portability of the approach was assessed by conducting an evaluation on
clinical documents in English from the 2018 n2c2 shared task. Information
extraction accuracy and computational cost were assessed by comparison with an
available method using transformers. Results: On the relation extraction task
itself, the proposed architecture achieves performance competitive with the
state of the art on both French and English (F-measures 0.82 and 0.96 vs 0.81
and 0.95) while reducing the computational cost by a factor of 10. End-to-end
(Named Entity Recognition and Relation Extraction) F1 performance is 0.69 and
0.82 for the French and English corpora, respectively. Discussion: While an existing
system developed for English notes was deployed in a French hospital setting
with reasonable effort, we found that an alternative architecture offered
end-to-end drug information extraction with comparable extraction performance
and lower computational impact for both French and English clinical text
processing. Conclusion: The proposed architecture can be used to
extract medication information from clinical text with high performance and low
computational cost, making it well suited to the typically limited IT
resources of hospitals.
comment: Submitted to JAMIA, 17 pages, 3 figures, 2 tables and 5 supplementary
tables
☆ How do Humans and Language Models Reason About Creativity? A Comparative Analysis
Antonio Laverghetta Jr., Tuhin Chakrabarty, Tom Hope, Jimmy Pronchick, Krupa Bhawsar, Roger E. Beaty
Creativity assessment in science and engineering is increasingly based on
both human and AI judgment, but the cognitive processes and biases behind these
evaluations remain poorly understood. We conducted two experiments examining
how including example solutions with ratings impacts creativity evaluation,
using a fine-grained annotation protocol where raters were tasked with
explaining their originality scores and rating the facets of remoteness
(whether the response is "far" from everyday ideas), uncommonness (whether the
response is rare), and cleverness. In Study 1, we analyzed creativity ratings
from 72 experts with formal science or engineering training, comparing those
who received example solutions with ratings (example) to those who did not (no
example). Computational text analysis revealed that, compared to experts with
examples, no-example experts used more comparative language (e.g.,
"better/worse") and emphasized solution uncommonness, suggesting they may have
relied more on memory retrieval for comparisons. In Study 2, parallel analyses
with state-of-the-art LLMs revealed that models prioritized uncommonness and
remoteness of ideas when rating originality, suggesting an evaluative process
rooted around the semantic similarity of ideas. In the example condition, while
LLM accuracy in predicting the true originality scores improved, the
correlations of remoteness, uncommonness, and cleverness with originality also
increased substantially (to upwards of 0.99), suggesting a homogenization in
the LLMs' evaluation of the individual facets. These findings highlight
important implications for how humans and AI reason about creativity and
suggest diverging preferences for what different populations prioritize when
rating originality.
comment: CogSci 2025
☆ A scale of conceptual orality and literacy: Automatic text categorization in the tradition of "Nähe und Distanz"
Koch and Oesterreicher's model of "Nähe und Distanz" (Nähe = immediacy,
conceptual orality; Distanz = distance, conceptual literacy) is widely used in
German linguistics. However, even as the model increasingly enters empirical
corpus linguistics, it has no statistical foundation for use in
corpus-linguistic analyses. Theoretically, it is stipulated, among other
things, that
written texts can be rated on a scale of conceptual orality and literacy by means of
linguistic features. This article establishes such a scale based on PCA and
combines it with automatic analysis. Two corpora of New High German serve as
examples. When evaluating established features, a central finding is that
features of conceptual orality and literacy must be distinguished in order to
rank texts in a differentiated manner. The scale is also discussed with a view
to its use in corpus compilation and as a guide for analyses in larger corpora.
With its theory-driven starting point and "tailored" dimension, the approach
is, compared to Biber's Dimension 1, particularly suitable for these
supporting and controlling tasks.
☆ Mitigating Language Bias in Cross-Lingual Job Retrieval: A Recruitment Platform Perspective AAAI 2025
Understanding the textual components of resumes and job postings is critical
for improving job-matching accuracy and optimizing job search systems in online
recruitment platforms. However, existing works primarily focus on analyzing
individual components within this information, requiring multiple specialized
tools to analyze each aspect. Such disjointed methods could potentially hinder
overall generalizability in recruitment-related text processing. Therefore, we
propose a unified sentence encoder trained with a multi-task dual-encoder
framework that jointly learns multiple components within a single
encoder. The results show that our method outperforms other state-of-the-art
models, despite its smaller model size. Moreover, we propose a novel metric,
Language Bias Kullback-Leibler Divergence (LBKL), to evaluate language bias in
the encoder, demonstrating significant bias reduction and superior
cross-lingual performance.
comment: To be published in CompJobs Workshop at AAAI 2025
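The abstract does not spell out the LBKL formula; one plausible instantiation,
shown purely for illustration, compares the similarity-score distributions the
encoder assigns to the same query-document pairs expressed in two languages
via a symmetrized KL divergence.

    import numpy as np

    def lbkl(scores_lang_a, scores_lang_b, bins=20):
        """Symmetrized KL between score histograms (assumed cosine scores in [-1, 1])."""
        p, edges = np.histogram(scores_lang_a, bins=bins, range=(-1, 1))
        q, _ = np.histogram(scores_lang_b, bins=edges)
        p = p + 1e-9
        q = q + 1e-9
        p, q = p / p.sum(), q / q.sum()
        kl = lambda a, b: float(np.sum(a * np.log(a / b)))
        return 0.5 * (kl(p, q) + kl(q, p))  # 0 means no measurable language bias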
☆ iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs
Vision-Language Models (VLMs) are known to struggle with spatial reasoning
and visual alignment. To help overcome these limitations, we introduce iVISPAR,
an interactive multi-modal benchmark designed to evaluate the spatial reasoning
capabilities of VLMs acting as agents. iVISPAR is based on a variant of the
sliding tile puzzle, a classic problem that demands logical planning, spatial
awareness, and multi-step reasoning. The benchmark supports visual 2D, 3D, and
text-based input modalities, enabling comprehensive assessments of VLMs'
planning and reasoning skills. We evaluate a broad suite of state-of-the-art
open-source and closed-source VLMs, comparing their performance while also
providing optimal path solutions and a human baseline to assess the task's
complexity and feasibility for humans. Results indicate that while some VLMs
perform well on simple spatial tasks, they encounter difficulties with more
complex configurations and problem properties. Notably, while VLMs generally
perform better in 2D vision compared to 3D or text-based representations, they
consistently fall short of human performance, illustrating the persistent
challenge of visual alignment. This highlights critical gaps in current VLM
capabilities, underscoring their limitations in achieving human-level
cognition.
☆ Improve Decoding Factuality by Token-wise Cross Layer Entropy of Large Language Models NAACL 2025
Despite their impressive capabilities, large language models (LLMs) often
struggle with the hallucination issue of generating inaccurate or fabricated
content even when they possess correct knowledge. In this paper, we extend the
exploration of the correlation between hidden-state prediction changes and
output factuality to a deeper, token-wise level. Based on these insights, we
propose cross-layer Entropy eNhanced Decoding (END), a decoding method that
mitigates hallucinations without requiring extra training. END leverages inner
probability changes across layers to individually quantify the factual
knowledge required for each candidate token, and adjusts the final prediction
distribution to prioritize tokens with higher factuality. Experiments on both
hallucination and QA benchmarks demonstrate that END significantly enhances the
truthfulness and informativeness of generated content while maintaining robust
QA accuracy. Moreover, our work provides a deeper perspective on understanding
the correlations between inherent knowledge and output factuality.
comment: NAACL 2025 Findings
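A logit-lens-style sketch of the idea: track each candidate token's
probability across layers and boost tokens whose cross-layer profile is
concentrated rather than diffuse. The entropy scoring and the mixing weight
`alpha` are assumptions, not the paper's exact formulation.

    import torch

    def end_adjust(hidden_states, unembed, final_logits, k=10, alpha=1.0):
        """Re-rank top-k candidates by the entropy of their cross-layer profile.

        hidden_states: per-layer residual vectors (each of shape (d,)) at the
        current decoding position; unembed: (V, d); final_logits: (V,).
        """
        cand = final_logits.topk(k).indices
        probs = torch.stack([torch.softmax(unembed @ h, dim=-1)[cand]
                             for h in hidden_states])    # (n_layers, k)
        traj = probs / probs.sum(dim=0, keepdim=True)    # per-token layer profile
        entropy = -(traj * traj.clamp_min(1e-9).log()).sum(dim=0)
        adjusted = final_logits.clone()
        adjusted[cand] -= alpha * entropy  # prefer low cross-layer entropy
        return adjusted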
☆ EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching
Code-switching (CS) remains a significant challenge in Natural Language
Processing (NLP), mainly due to a lack of relevant data. In the context of the
contact between the Basque and Spanish languages in the north of the Iberian
Peninsula, CS frequently occurs in both formal and informal spontaneous
interactions. However, resources to analyse this phenomenon and support the
development and evaluation of models capable of understanding and generating
code-switched language for this language pair are almost non-existent. We
introduce a first approach to develop a naturally sourced corpus for
Basque-Spanish code-switching. Our methodology consists of identifying CS texts
from previously available corpora using language identification models, which
are then manually validated to obtain a reliable subset of CS instances. We
present the properties of our corpus and make it available under the name
EuskañolDS.
☆ Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models
Recent studies have shown that large language models (LLMs), when customized
with post-training on tabular data, can acquire general tabular in-context
learning (TabICL) capabilities. These models are able to transfer effectively
across diverse data schemas and different task domains. However, existing
LLM-based TabICL approaches are constrained to few-shot scenarios due to the
sequence length limitations of LLMs, as tabular instances represented in plain
text consume substantial tokens. To address this limitation and enable scalable
TabICL for any data size, we propose retrieval-augmented LLMs tailored to
tabular data. Our approach incorporates a customized retrieval module, combined
with retrieval-guided instruction-tuning for LLMs. This enables LLMs to
effectively leverage larger datasets, achieving significantly improved
performance across 69 widely recognized datasets and demonstrating promising
scaling behavior. Extensive comparisons with state-of-the-art tabular models
reveal that, while LLM-based TabICL still lags behind well-tuned numeric models
in overall performance, it uncovers powerful algorithms under limited contexts,
enhances ensemble diversity, and excels on specific datasets. These unique
properties underscore the potential of language as a universal and accessible
interface for scalable tabular data learning.
comment: Preprint
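The scalability trick is to place only the most relevant training rows into
the prompt rather than the full dataset. Below is a bare-bones version with a
generic kNN retriever and an assumed serialization format; the paper's
retrieval module is customized and tuned jointly with instruction-tuning.

    from sklearn.neighbors import NearestNeighbors

    def build_prompt(X_train, y_train, x_query, feature_names, k=16):
        """Serialize the k nearest training rows as in-context examples."""
        nn = NearestNeighbors(n_neighbors=k).fit(X_train)
        idx = nn.kneighbors([x_query], return_distance=False)[0]
        row = lambda x: ", ".join(f"{n}={v}" for n, v in zip(feature_names, x))
        shots = "\n".join(f"{row(X_train[i])} -> label: {y_train[i]}" for i in idx)
        return f"{shots}\n{row(x_query)} -> label:"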
☆ Teaching Large Language Models Number-Focused Headline Generation With Key Element Rationales NAACL 2025
Number-focused headline generation is a summarization task requiring both
high textual quality and precise numerical accuracy, which poses a unique
challenge for Large Language Models (LLMs). Existing studies in the literature
focus only on either textual quality or numerical reasoning and thus are
inadequate to address this challenge. In this paper, we propose a novel
chain-of-thought framework for using rationales comprising key elements of the
Topic, Entities, and Numerical reasoning (TEN) in news articles to enhance the
capability for LLMs to generate topic-aligned high-quality texts with precise
numerical accuracy. Specifically, a teacher LLM is employed to generate TEN
rationales as supervision data, which are then used to teach and fine-tune a
student LLM. Our approach teaches the student LLM automatic generation of
rationales with enhanced capability for numerical reasoning and topic-aligned
numerical headline generation. Experiments show that our approach achieves
superior performance in both textual quality and numerical accuracy.
comment: Pre-print for a paper accepted to findings of NAACL 2025
☆ Policies and Evaluation for Online Meeting Summarization
With more and more meetings moving to a digital domain, meeting summarization
has recently gained interest in both academic and commercial research. However,
prior academic research focuses on meeting summarization as an offline task,
performed after the meeting concludes. In this paper, we perform the first
systematic study of online meeting summarization. For this purpose, we propose
several policies for conducting online summarization. We discuss the unique
challenges of this task compared to the offline setting and define novel
metrics to evaluate latency and partial summary quality. The experiments on the
AutoMin dataset show that 1) online models can produce strong summaries, 2) our
metrics allow a detailed analysis of different systems' quality-latency
trade-off, also taking into account intermediate outputs and 3) adaptive
policies perform better than fixed scheduled ones. These findings provide a
starting point for the wider research community to explore this important task.
comment: 8 pages, 1 figure
☆ Structured Token Retention and Computational Memory Paths in Large Language Models
Memory retention mechanisms play a central role in determining the efficiency
of computational architectures designed for processing extended sequences.
Conventional methods for token management often impose fixed retention
thresholds or rely on uniform attention weight distributions, leading to
inefficient memory utilization and premature information loss in extended
sequence modeling. Structured Token Retention (STR) introduces a probabilistic
selection framework that dynamically adjusts token persistence based on
contextual significance, ensuring that computational resources are allocated to
semantically relevant elements. Computational Memory Paths (CMP) extend this
framework through hierarchical memory allocation, refining retention efficiency
through structured reallocation of token embeddings. Comparative assessments
against baseline models demonstrate that STR and CMP improve token survival
rates across long input sequences while reducing cumulative error propagation
across processing layers. Experimental results further indicate reductions in
computational overhead, improving inference speed without degrading contextual
coherence. Token distribution analyses reveal that structured memory allocation
prevents excessive redundancy in attention weight calculations, optimizing
information retrieval efficiency in large-scale generative architectures. The
integration of STR and CMP into an open-source model illustrates the
adaptability of structured memory retention methodologies, highlighting their
applicability in generative text processing, long-context comprehension, and
scalable sequence modeling.
☆ IAO Prompting: Making Knowledge Flow Explicit in LLMs through Structured Reasoning Templates AAAI 2025
While Large Language Models (LLMs) demonstrate impressive reasoning
capabilities, understanding and validating their knowledge utilization remains
challenging. Chain-of-thought (CoT) prompting partially addresses this by
revealing intermediate reasoning steps, but the knowledge flow and application
remain implicit. We introduce IAO (Input-Action-Output) prompting, a structured
template-based method that explicitly models how LLMs access and apply their
knowledge during complex reasoning tasks. IAO decomposes problems into
sequential steps, each clearly identifying the input knowledge being used, the
action being performed, and the resulting output. This structured decomposition
enables us to trace knowledge flow, verify factual consistency, and identify
potential knowledge gaps or misapplications. Through experiments across diverse
reasoning tasks, we demonstrate that IAO not only improves zero-shot
performance but also provides transparency in how LLMs leverage their stored
knowledge. Human evaluation confirms that this structured approach enhances our
ability to verify knowledge utilization and detect potential hallucinations or
reasoning errors. Our findings provide insights into both knowledge
representation within LLMs and methods for more reliable knowledge application.
comment: Accepted as Oral at KnowFM @ AAAI 2025
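The abstract describes the template's structure but not its wording; the
following is an assumed rendering of an Input-Action-Output prompt consistent
with that description.

    IAO_TEMPLATE = """Solve the problem as a sequence of steps.
    For each step, state:
    Input: the knowledge or intermediate result you are using
    Action: the operation you perform on it
    Output: the result of that action
    Finish with: Final answer: <answer>

    Problem: {question}"""

    def iao_prompt(question: str) -> str:
        return IAO_TEMPLATE.format(question=question)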
☆ DOLFIN -- Document-Level Financial test set for Machine Translation NAACL 2025
Despite the strong research interest in document-level Machine Translation
(MT), the test sets dedicated to this task are still scarce. The existing test
sets mainly cover topics from the general domain and fall short on specialised
domains, such as legal and financial. Also, in spite of their document-level
aspect, they still follow a sentence-level logic that does not allow for
including certain linguistic phenomena such as information reorganisation. In
this work, we aim to fill this gap by proposing a novel test set: DOLFIN. The
dataset is built from specialised financial documents, and it makes a step
towards true document-level MT by abandoning the paradigm of perfectly aligned
sentences, presenting data in units of sections rather than sentences. The test
set consists of an average of 1950 aligned sections for five language pairs. We
present a detailed data collection pipeline that can serve as inspiration for
aligning new document-level datasets. We demonstrate the usefulness and quality
of this test set by evaluating a number of models. Our results show that the
test set is able to discriminate between context-sensitive and context-agnostic
models and exposes their weaknesses when they fail to accurately translate
financial texts. The test set is made public for the community.
comment: To be published in NAACL 2025 Findings
☆ Knowledge Distillation from Large Language Models for Household Energy Modeling
Machine learning (ML) is increasingly vital for smart-grid research, yet
restricted access to realistic, diverse data - often due to privacy concerns -
slows progress and fuels doubts within the energy sector about adopting
ML-based strategies. We propose integrating Large Language Models (LLMs) in
energy modeling to generate realistic, culturally sensitive, and
behavior-specific data for household energy usage across diverse geographies.
In this study, we employ and compare five different LLMs to systematically
produce family structures, weather patterns, and daily consumption profiles for
households in six distinct countries. A four-stage methodology synthesizes
contextual daily data, including culturally nuanced activities, realistic
weather ranges, HVAC operations, and distinct 'energy signatures' that capture
unique consumption footprints. Additionally, we explore an alternative strategy
where external weather datasets can be directly integrated, bypassing
intermediate weather modeling stages while ensuring physically consistent data
inputs. The resulting dataset provides insights into how cultural, climatic,
and behavioral factors converge to shape carbon emissions, offering a
cost-effective avenue for scenario-based energy optimization. This approach
underscores how prompt engineering, combined with knowledge distillation, can
advance sustainable energy research and climate mitigation efforts. Source code
is available at
https://github.com/Singularity-AI-Lab/LLM-Energy-Knowledge-Distillation .
comment: Source code is available at
https://github.com/Singularity-AI-Lab/LLM-Energy-Knowledge-Distillation
☆ Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
We introduce a new approach to systematically map features discovered by
sparse autoencoders across consecutive layers of large language models,
extending earlier work that examined inter-layer feature links. By using a
data-free cosine similarity technique, we trace how specific features persist,
transform, or first appear at each stage. This method yields granular flow
graphs of feature evolution, enabling fine-grained interpretability and
mechanistic insights into model computations. Crucially, we demonstrate how
these cross-layer feature maps facilitate direct steering of model behavior by
amplifying or suppressing chosen features, achieving targeted thematic control
in text generation. Together, our findings highlight the utility of a causal,
cross-layer interpretability framework that not only clarifies how features
develop through forward passes but also provides new means for transparent
manipulation of large language models.
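The data-free matching step can be sketched directly: compare
sparse-autoencoder decoder directions in adjacent layers by cosine similarity
and link each feature to its best-matching successor. The threshold and the
one-to-one matching rule are simplifying assumptions.

    import torch
    import torch.nn.functional as F

    def match_features(dec_a, dec_b, threshold=0.7):
        """Link features across layers; dec_*: (n_features, d_model) decoders."""
        sim = F.normalize(dec_a, dim=1) @ F.normalize(dec_b, dim=1).T
        best, idx = sim.max(dim=1)
        # Feature i in layer A "persists" as feature idx[i] in layer B when the
        # decoder directions are sufficiently aligned; otherwise it ends here.
        return [(i, int(idx[i])) for i in range(len(dec_a)) if best[i] >= threshold]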
☆ Scaling Laws for Upcycling Mixture-of-Experts Language Models
Pretraining large language models (LLMs) is resource-intensive, often
requiring months of training time even with high-end GPU clusters. There are
two approaches of mitigating such computational demands: reusing smaller models
to train larger ones (upcycling), and training computationally efficient models
like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to
MoE models, of which the scaling behavior remains underexplored. Through
extensive experiments, we identify empirical scaling laws that describe how
performance depends on dataset size and model configuration. Particularly, we
show that, while scaling these factors improves performance, there is a novel
interaction term between the dense and upcycled training dataset that limits
the efficiency of upcycling at large computational budgets. Based on these
findings, we provide guidance to scale upcycling, and establish conditions
under which upcycling outperforms from-scratch training within budget
constraints.
comment: 15 figures, 8 tables
☆ MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation
Large Language Models (LLMs) have demonstrated impressive capabilities across
natural language processing tasks. However, their application to specialized
domains such as medicine and biology requires further optimization to ensure
factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
domain-adapted biomedical question-answering model designed to enhance both
short-form and long-form queries. By integrating fine-tuning and
retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
domain-specific knowledge, improving reasoning abilities and factual accuracy.
To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
datasets, covering structured multiple-choice assessments and complex clinical
reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
datasets, while RAG enhances factual consistency. These results highlight the
potential of domain-optimized LLMs in advancing biomedical research, medical
education, and clinical decision support.
☆ Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons WWW'25
The rapid advancement of large language models (LLMs) has opened new
possibilities for their adoption as evaluative judges. This paper introduces
Themis, a fine-tuned LLM judge that delivers sophisticated context-aware
evaluations. We provide a comprehensive overview of the development pipeline
for Themis, highlighting its scenario-dependent evaluation prompts and two
novel methods for controlled instruction generation. These designs enable
Themis to effectively distill evaluative skills from teacher models, while
retaining flexibility for continuous development. We introduce two
human-labeled benchmarks for meta-evaluation, demonstrating that Themis can
achieve high alignment with human preferences in an economical manner.
Additionally, we explore insights into the LLM-as-a-judge paradigm, revealing
nuances in performance and the varied effects of reference answers. Notably, we
observe that pure knowledge distillation from strong LLMs, though common, does
not guarantee performance improvement through scaling. We propose a mitigation
strategy based on instruction-following difficulty. Furthermore, we provide
practical guidelines covering data balancing, prompt customization,
multi-objective training, and metric aggregation. We aim for our method and
findings, along with the fine-tuning data, benchmarks, and model checkpoints,
to support future research and development in this area.
comment: accepted at WWW'25 (Industrial Track), extended version
☆ Position: Editing Large Language Models Poses Serious Safety Risks
Large Language Models (LLMs) contain large amounts of facts about the world.
These facts can become outdated over time, which has led to the development of
knowledge editing methods (KEs) that can change specific facts in LLMs with
limited side effects. This position paper argues that editing LLMs poses
serious safety risks that have been largely overlooked. First, we note the fact
that KEs are widely available, computationally inexpensive, highly performant,
and stealthy makes them an attractive tool for malicious actors. Second, we
discuss malicious use cases of KEs, showing how KEs can be easily adapted for a
variety of malicious purposes. Third, we highlight vulnerabilities in the AI
ecosystem that allow unrestricted uploading and downloading of updated models
without verification. Fourth, we argue that a lack of social and institutional
awareness exacerbates this risk, and discuss the implications for different
stakeholders. We call on the community to (i) research tamper-resistant models
and countermeasures against malicious model editing, and (ii) actively engage
in securing the AI ecosystem.
☆ ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation
Recently, mobile AI agents have gained increasing attention. Given a task,
mobile AI agents can interact with mobile devices in multiple steps and finally
form a GUI flow that solves the task. However, existing agents tend to focus on
most task-relevant elements at each step, leading to local optimal solutions
and ignoring the overall GUI flow. To address this issue, we constructed a
training dataset called MobileReach, which breaks the task into page reaching
and operation subtasks. Furthermore, we propose ReachAgent, a two-stage
framework that focuses on improving the agent's task-completion abilities. It utilizes
the page reaching and page operation subtasks, along with reward-based
preference GUI flows, to further enhance the agent. Experimental results show
that ReachAgent significantly improves the IoU Acc and Text Acc by 7.12% and
7.69% on the step-level and 4.72% and 4.63% on the task-level compared to the
SOTA agent. Our data and code will be released upon acceptance.
☆ LLM-KT: Aligning Large Language Models with Knowledge Tracing using a Plug-and-Play Instruction
The knowledge tracing (KT) problem is an extremely important topic in
personalized education, which aims to predict whether students can correctly
answer the next question based on their past question-answer records. Prior
work on this task mainly focused on learning the sequence of behaviors based on
the IDs or textual information. However, these studies usually fail to capture
students' sufficient behavioral patterns without reasoning with rich world
knowledge about questions. In this paper, we propose a large language models
(LLMs)-based framework for KT, named \texttt{\textbf{LLM-KT}}, to integrate the
strengths of LLMs and traditional sequence interaction models. For task-level
alignment, we design Plug-and-Play instruction to align LLMs with KT,
leveraging LLMs' rich knowledge and powerful reasoning capacity. For
modality-level alignment, we design the plug-in context and sequence to
integrate multiple modalities learned by traditional methods. To capture the
long context of history records, we present a plug-in context to flexibly
insert the compressed context embedding into LLMs using question-specific and
concept-specific tokens. Furthermore, we introduce a plug-in sequence to
enhance LLMs with sequence interaction behavior representation learned by
traditional sequence models using a sequence adapter. Extensive experiments
show that \texttt{\textbf{LLM-KT}} obtains state-of-the-art performance on four
typical datasets in comparison with approximately 20 strong baselines.
☆ LLaVAC: Fine-tuning LLaVA as a Multimodal Sentiment Classifier
We present LLaVAC, a method for constructing a classifier for multimodal
sentiment analysis. This method leverages fine-tuning of the Large Language and
Vision Assistant (LLaVA) to predict sentiment labels across both image and text
modalities. Our approach involves designing a structured prompt that
incorporates both unimodal and multimodal labels to fine-tune LLaVA, enabling
it to perform sentiment classification effectively. Experiments on the
MVSA-Single dataset demonstrate that LLaVAC outperforms existing methods in
multimodal sentiment analysis across three data processing procedures. The
implementation of LLaVAC is publicly available at
https://github.com/tchayintr/llavac.
☆ SPARC: Subspace-Aware Prompt Adaptation for Robust Continual Learning in LLMs
We propose SPARC, a lightweight continual learning framework for large
language models (LLMs) that enables efficient task adaptation through prompt
tuning in a lower-dimensional space. By leveraging principal component analysis
(PCA), we identify a compact subspace of the training data. Optimizing prompts
in this lower-dimensional space enhances training efficiency, as it focuses
updates on the most relevant features while reducing computational overhead.
Furthermore, since the model's internal structure remains unaltered, the
extensive knowledge gained from pretraining is fully preserved, ensuring that
previously learned information is not compromised during adaptation. Our method
achieves high knowledge retention in both task-incremental and
domain-incremental continual learning setups while fine-tuning only 0.04% of
the model's parameters. Additionally, by integrating LoRA, we enhance
adaptability to computational constraints, allowing for a tradeoff between
accuracy and training cost. Experiments on the SuperGLUE benchmark demonstrate
that our PCA-based prompt tuning combined with LoRA maintains full knowledge
retention while improving accuracy, utilizing only 1% of the model's
parameters. These results establish our approach as a scalable and
resource-efficient solution for continual learning in LLMs.
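The core idea, optimizing soft prompts only within a PCA subspace of the
training data, can be sketched as follows; tensor shapes and the frozen-model
interface are assumptions.

    import torch

    def make_subspace_prompt(train_embeds, n_tokens=20, rank=16):
        """Soft prompt parameterized by coefficients in a PCA subspace."""
        X = train_embeds - train_embeds.mean(dim=0)  # (N, d) centered embeddings
        _, _, Vt = torch.linalg.svd(X, full_matrices=False)
        basis = Vt[:rank]                            # (rank, d) top components
        coeffs = torch.zeros(n_tokens, rank, requires_grad=True)  # trainable part
        soft_prompt = lambda: coeffs @ basis         # (n_tokens, d), in-subspace
        return coeffs, soft_prompt

Only `coeffs` would be optimized; the frozen basis confines every update to
the task-relevant subspace, which is what keeps pretrained knowledge intact.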
☆ ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
Writing is a cognitively demanding task involving continuous decision-making,
heavy use of working memory, and frequent switching between multiple
activities. Scholarly writing is particularly complex as it requires authors to
coordinate many pieces of multiform knowledge. To fully understand writers'
cognitive thought process, one should fully decode the end-to-end writing data
(from individual ideas to final manuscript) and understand their complex
cognitive mechanisms in scholarly writing. We introduce ScholaWrite dataset,
the first-of-its-kind keystroke logs of an end-to-end scholarly writing process
for complete manuscripts, with thorough annotations of cognitive writing
intentions behind each keystroke. Our dataset includes LaTeX-based keystroke
data from five preprints with nearly 62K total text changes and annotations
across 4 months of paper writing. ScholaWrite shows promising usability and
applications (e.g., iterative self-writing) for the future development of AI
writing assistants for academic research, which necessitate complex methods
beyond LLM prompting. Our experiments clearly demonstrate the importance of
collection of end-to-end writing data, rather than the final manuscript, for
the development of future writing assistants to support the cognitive thinking
process of scientists. Our de-identified dataset, demo, and code repository are
available on our project page.
comment: Equal contribution: Linghe Wang, Minhwa Lee | project page:
https://minnesotanlp.github.io/scholawrite/
☆ What is in a name? Mitigating Name Bias in Text Embeddings via Anonymization
Text-embedding models often exhibit biases arising from the data on which
they are trained. In this paper, we examine a hitherto unexplored bias in
text-embeddings: bias arising from the presence of $\textit{names}$ of
persons, locations, organizations, etc. in the text. Our study shows how the
presence of $\textit{name-bias}$ in text-embedding models can potentially lead
to erroneous conclusions when assessing thematic similarity. Text-embeddings
can mistakenly indicate similarity between texts based on the names they
contain, even when their actual semantic content is unrelated, or indicate
dissimilarity simply because of differing names, even when the texts match
semantically. We first demonstrate the presence of name bias in different
text-embedding models and then propose $\textit{text-anonymization}$ during
inference which involves removing references to names, while preserving the
core theme of the text. The efficacy of the anonymization approach is
demonstrated on two downstream NLP tasks, achieving significant performance
gains. Our simple and training-optimization-free approach offers a practical
and easily implementable solution to mitigate name bias.
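As a rough illustration of inference-time anonymization (one possible
instantiation, not necessarily the paper's exact procedure), the sketch below
replaces named entities with generic placeholders using spaCy NER before the
text would be embedded; the placeholder strings and entity types are chosen
for illustration.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
MASK = {"PERSON": "[PERSON]", "ORG": "[ORG]", "GPE": "[LOC]", "LOC": "[LOC]"}

def anonymize(text: str) -> str:
    # Replace each recognized name span with a type placeholder, preserving
    # the surrounding text (and hence the core theme) untouched.
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in MASK:
            out.append(text[last:ent.start_char])
            out.append(MASK[ent.label_])
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(anonymize("Alice from Microsoft met Bob in Paris."))
# e.g. "[PERSON] from [ORG] met [PERSON] in [LOC]."
```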
☆ A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs ISWC 2024
Evaluating large language models (LLMs) for tasks like fact extraction in
support of knowledge graph construction frequently involves computing accuracy
metrics using a ground truth benchmark based on a knowledge graph (KG). These
evaluations assume that errors represent factual disagreements. However, human
discourse frequently features metalinguistic disagreement, where agents differ
not on facts but on the meaning of the language used to express them. Given the
complexity of natural language processing and generation using LLMs, we ask: do
metalinguistic disagreements occur between LLMs and KGs? Based on an
investigation using the T-REx knowledge alignment dataset, we hypothesize that
metalinguistic disagreement does in fact occur between LLMs and KGs, with
potential relevance for the practice of knowledge graph engineering. We propose
a benchmark for evaluating the detection of factual and metalinguistic
disagreements between LLMs and KGs. An initial proof of concept of such a
benchmark is available on Github.
comment: 6 pages, 2 tables, to appear in Reham Alharbi, Jacopo de Berardinis,
Paul Groth, Albert Mero\~no-Pe\~nuela, Elena Simperl, Valentina Tamma (eds.),
ISWC 2024 Special Session on Harmonising Generative AI and Semantic Web
Technologies. CEUR-WS.org (forthcoming), for associated code and data see
https://github.com/bradleypallen/trex-metalinguistic-disagreement
☆ Lowering the Barrier of Machine Learning: Achieving Zero Manual Labeling in Review Classification Using LLMs
As the internet evolves, consumers increasingly rely on online reviews when
choosing services or products, requiring businesses to analyze extensive
customer feedback to enhance their offerings. While machine learning-based
sentiment classification shows promise in this realm, its technical complexity
often bars small businesses and individuals from leveraging such advancements,
potentially widening the competitive gap between small and large businesses in
improving customer satisfaction. This paper introduces
an approach that integrates large language models (LLMs), specifically
Generative Pre-trained Transformer (GPT) and Bidirectional Encoder
Representations from Transformers (BERT)-based models, making it accessible to
a wider audience. Our experiments across various datasets confirm that our
approach retains high classification accuracy without the need for manual
labeling, expert knowledge in tuning and data annotation, or substantial
computational power. By significantly lowering the barriers to applying
sentiment classification techniques, our methodology enhances competitiveness
and paves the way for making machine learning technology accessible to a
broader audience.
comment: Accepted to 2025 11th International Conference on Computing and
Artificial Intelligence (ICCAI 2025)
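A hedged sketch of this kind of zero-manual-labeling pipeline: an LLM produces
sentiment pseudo-labels, which then train a lightweight classifier. Here
`llm_label` is a hypothetical stand-in for a GPT API call (replaced by a toy
heuristic so the snippet runs), and the prompts and models in the paper may
differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def llm_label(review: str) -> int:
    # Placeholder: in practice, prompt an LLM, e.g.
    # "Classify this review's sentiment as positive (1) or negative (0)."
    return int("good" in review.lower())  # toy heuristic, illustration only

reviews = ["Good food, great service.", "Terrible wait times.",
           "Good value for the price.", "Not worth it."]
pseudo_labels = [llm_label(r) for r in reviews]

# Train a cheap classifier on the LLM-generated labels: no manual annotation.
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(reviews), pseudo_labels)
print(clf.predict(vec.transform(["Good atmosphere."])))
```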
☆ Achieving Operational Universality through a Turing Complete Chemputer
The most fundamental abstraction underlying all modern computers is the
Turing Machine: if a computer can simulate a Turing Machine, an equivalence
called Turing completeness, it is theoretically possible to achieve any task
that can be algorithmically described by executing a series of discrete unit
operations. In chemistry, programming chemical processes is demanding because
it is hard to ensure that a process can be understood at a high level of
abstraction and then reduced to practice.
Herein we exploit the concept of Turing completeness applied to robotic
platforms for chemistry that can be used to synthesise complex molecules
through unit operations that execute chemical processes using a
chemically-aware programming language, XDL. We extend the concept of
computability by computers to the synthesizability of chemical compounds by
automated synthesis machines. The results of an interactive demonstration of
Turing completeness using the colour gamut and conditional logic are presented
and examples of chemical use-cases are discussed. Over 16.7 million
combinations of Red, Green, Blue (RGB) colour space were binned into 5 discrete
values and measured over 10 regions of interest (ROIs), affording 78 million
possible states per step, and served as a proxy for conceptual chemical-space
exploration. This description establishes a formal framework for future
chemical programming languages to ensure that complex logic operations are
expressed and executed correctly, with the possibility of error correction, in
the automated and autonomous pursuit of increasingly complex molecules.
comment: 18 pages, 7 figures, 28 references
☆ Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Yibo Yan, Shen Wang, Jiahao Huo, Jingheng Ye, Zhendong Chu, Xuming Hu, Philip S. Yu, Carla Gomes, Bart Selman, Qingsong Wen
Scientific reasoning, the process through which humans apply logic, evidence,
and critical thinking to explore and interpret scientific phenomena, is
essential for advancing knowledge across diverse fields. However,
despite significant progress, current scientific reasoning models still
struggle with generalization across domains and often fall short of multimodal
perception. Multimodal Large Language Models (MLLMs), which integrate text,
images, and other modalities, present an exciting opportunity to overcome these
limitations and enhance scientific reasoning. Therefore, this position paper
argues that MLLMs can significantly advance scientific reasoning across
disciplines such as mathematics, physics, chemistry, and biology. First, we
propose a four-stage research roadmap of scientific reasoning capabilities, and
highlight the current state of MLLM applications in scientific reasoning,
noting their ability to integrate and reason over diverse data types. Second,
we summarize the key challenges that remain obstacles to achieving MLLM's full
potential. To address these challenges, we propose actionable insights and
suggestions for the future. Overall, our work offers a novel perspective on
MLLM integration with scientific reasoning, providing the LLM community with a
valuable vision for achieving Artificial General Intelligence (AGI).
☆ CAMI: A Counselor Agent Supporting Motivational Interviewing through State Inference and Topic Exploration
Yizhe Yang, Palakorn Achananuparp, Heyan Huang, Jing Jiang, Kit Phey Leng, Nicholas Gabriel Lim, Cameron Tan Shi Ern, Ee-peng Lim
Conversational counselor agents have become essential tools for addressing
the rising demand for scalable and accessible mental health support. This paper
introduces CAMI, a novel automated counselor agent grounded in Motivational
Interviewing (MI) -- a client-centered counseling approach designed to address
ambivalence and facilitate behavior change. CAMI employs a novel STAR
framework, consisting of client's state inference, motivation topic
exploration, and response generation modules, leveraging large language models
(LLMs). These components work together to evoke change talk, aligning with MI
principles and improving counseling outcomes for clients from diverse
backgrounds. We evaluate CAMI's performance through both automated and manual
evaluations, utilizing simulated clients to assess MI skill competency,
client's state inference accuracy, topic exploration proficiency, and overall
counseling success. Results show that CAMI not only outperforms several
state-of-the-art methods but also shows more realistic counselor-like behavior.
Additionally, our ablation study underscores the critical roles of state
inference and topic exploration in achieving this performance.
☆ Consistent Client Simulation for Motivational Interviewing-based Counseling
Yizhe Yang, Palakorn Achananuparp, Heyan Huang, Jing Jiang, John Pinto, Jenny Giam, Kit Phey Leng, Nicholas Gabriel Lim, Cameron Tan Shi Ern, Ee-peng Lim
Simulating human clients in mental health counseling is crucial for training
and evaluating counselors (both human and simulated) in a scalable manner.
Nevertheless, past research on client simulation did not focus on complex
conversation tasks such as mental health counseling. In these tasks, the
challenge is to ensure that the client's actions (i.e., interactions with the
counselor) are consistent with its stipulated profile and negative behavior
settings. In this paper, we propose a novel framework that supports
consistent client simulation for mental health counseling. Our framework tracks
the mental state of a simulated client, controls its state transitions, and
generates for each state behaviors consistent with the client's motivation,
beliefs, preferred plan to change, and receptivity. By varying the client
profile and receptivity, we demonstrate that consistent simulated clients for
different counseling scenarios can be effectively created. Both our automatic
and expert evaluations on the generated counseling sessions also show that our
client simulation method achieves higher consistency than previous methods.
☆ Leveraging the true depth of LLMs
Large Language Models demonstrate remarkable capabilities at the cost of high
compute requirements. While recent research has shown that intermediate layers
can be removed or have their order shuffled without impacting performance
significantly, these findings have not been employed to reduce the
computational cost of inference. We investigate several potential ways to
reduce the depth of pre-trained LLMs without significantly affecting
performance. Leveraging our insights, we present a novel approach that exploits
this decoupling between layers by grouping some of them into pairs that can be
evaluated in parallel.
This modification of the computational graph, enabled by better parallelism,
yields an average improvement of around 1.20x in the number of tokens
generated per second, without re-training or fine-tuning, while retaining
95%-99% of the original accuracy. Empirical evaluation demonstrates that this
approach significantly improves serving efficiency while maintaining model
performance, offering a practical improvement for large-scale LLM deployment.
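A rough sketch of the pairing idea (a hypothetical module, not necessarily the
paper's exact combination rule): two consecutive transformer layers read the
same input and their residual updates are combined, so both can be evaluated
in parallel instead of sequentially as y = layer2(layer1(x)).

```python
import torch
import torch.nn as nn

class ParallelPair(nn.Module):
    def __init__(self, layer1: nn.Module, layer2: nn.Module):
        super().__init__()
        self.layer1, self.layer2 = layer1, layer2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assuming each layer returns x + f(x), keep one copy of the skip
        # connection and add the two residual contributions:
        # (x + f1(x)) + (x + f2(x)) - x = x + f1(x) + f2(x).
        return self.layer1(x) + self.layer2(x) - x

# Toy demonstration with linear layers standing in for transformer blocks.
pair = ParallelPair(nn.Linear(16, 16), nn.Linear(16, 16))
print(pair(torch.randn(2, 16)).shape)  # (2, 16)
```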
☆ Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
Improving time-to-first-token (TTFT) is an essential objective in modern
large language model (LLM) inference engines, because optimizing TTFT directly
yields higher maximal QPS and meets the requirements of many critical
applications. However, improving TTFT is notoriously challenging, since the
prefill stage is purely compute-bound and its performance bottleneck shifts
from self-attention to the MLP part. We present SpecPrefill, a training-free
framework that accelerates the inference TTFT for both long and medium context
queries based on the following insight: LLMs are general enough to preserve
output quality given only a carefully chosen subset of prompt tokens. At
its core, SpecPrefill leverages a lightweight model to speculate locally
important tokens based on the context. These tokens, along with the necessary
positional information, are then sent to the main model for processing. We
evaluate SpecPrefill with a diverse set of tasks, followed by a comprehensive
benchmarking of performance improvement both in a real end-to-end setting and
ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with
up to $7\times$ maximal end-to-end QPS on real downstream tasks and
$7.66\times$ TTFT improvement during benchmarking.
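A hedged sketch of the selection step: a lightweight scorer ranks prompt
tokens by importance, and only the top fraction, together with its original
positions, would reach the main model. The scoring rule and the commented
`main_model.prefill` call are illustrative stand-ins, not the paper's exact
interface.

```python
import torch

def select_tokens(token_ids: torch.Tensor, scores: torch.Tensor,
                  keep: float = 0.3):
    # Keep the top `keep` fraction of tokens, restoring original order so the
    # positional information stays meaningful.
    k = max(1, int(keep * token_ids.numel()))
    top = torch.topk(scores, k).indices.sort().values
    return token_ids[top], top  # pruned ids plus their positions

ids = torch.arange(100)    # stand-in prompt token ids
scores = torch.rand(100)   # stand-in importance scores from a small model
kept_ids, positions = select_tokens(ids, scores)
# main_model.prefill(kept_ids, position_ids=positions)  # hypothetical call
```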
☆ SimMark: A Robust Sentence-Level Similarity-Based Watermarking Algorithm for Large Language Models
The rapid proliferation of large language models (LLMs) has created an urgent
need for reliable methods to detect whether a text is generated by such models.
In this paper, we propose SimMark, a post-hoc watermarking algorithm that makes
LLMs' outputs traceable without requiring access to the model's internal
logits, enabling compatibility with a wide range of LLMs, including API-only
models. By leveraging the similarity of semantic sentence embeddings and
rejection sampling to impose detectable statistical patterns imperceptible to
humans, and employing a soft counting mechanism, SimMark achieves robustness
against paraphrasing attacks. Experimental results demonstrate that SimMark
sets a new benchmark for robust watermarking of LLM-generated content,
surpassing prior sentence-level watermarking techniques in robustness, sampling
efficiency, and applicability across diverse domains, all while preserving the
text quality.
comment: 15 pages, 5 tables, 6 figures
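An illustrative sketch of similarity-based watermarking via rejection
sampling: candidate sentences are regenerated until the similarity between the
new sentence's embedding and the previous one falls inside a secret interval.
Here `embed` and `gen` are hypothetical stand-ins for a sentence encoder and
an LLM sampler, and the interval bounds are illustrative, not the paper's
values.

```python
import numpy as np

LOW, HIGH = 0.55, 0.75  # secret acceptance interval (illustrative)
rng = np.random.default_rng(0)

def embed(sentence: str) -> np.ndarray:
    v = rng.standard_normal(16)  # stand-in encoder output
    return v / np.linalg.norm(v)

def watermark_next(prev_emb: np.ndarray, gen, max_tries: int = 50) -> str:
    for _ in range(max_tries):
        cand = gen()
        if LOW <= float(embed(cand) @ prev_emb) <= HIGH:
            return cand  # accepted: similarity lies in the secret interval
    return cand  # give up after too many rejections

sent = watermark_next(embed("Previous sentence."), lambda: "A candidate.")
```

Detection would then softly count how many consecutive sentence pairs fall
inside the interval, the imposed statistical pattern the detector tests for.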
♻ ☆ NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild
We introduce NNetNav, a method for unsupervised interaction with websites
that generates synthetic demonstrations for training browser agents. Given any
website, NNetNav produces these demonstrations by retroactively labeling action
sequences from an exploration policy. Most work on training browser agents has
relied on expensive human supervision, and the limited prior work on such
interaction-based techniques has failed to provide effective search through the
exponentially large space of exploration. In contrast, NNetNav exploits the
hierarchical structure of language instructions to make this search more
tractable: Complex instructions are typically decomposable into simpler
sub-tasks, allowing NNetNav to automatically prune interaction episodes when an
intermediate trajectory cannot be annotated with a meaningful sub-task.
\texttt{LLama-3.1-8b} finetuned on 10k NNetNav self-generated demonstrations
obtains over 16\% success rate on WebArena, and 35\% on WebVoyager, an
improvement of 15pts and 31pts respectively over zero-shot
\texttt{LLama-3.1-8b}, outperforming zero-shot GPT-4 and reaching the
state-of-the-art among unsupervised methods, for both benchmarks.
comment: Code, Data and Models available at https://www.nnetnav.dev
♻ ☆ S2-Attention: Hardware-Aware Context Sharding Among Attention Heads
Sparse attention, which selectively attends to a subset of tokens in the
context, promises efficiency. However, its theoretical reduction in FLOPs has
rarely translated into wall-clock speed-ups over dense attention counterparts,
due to the lack of hardware-aware optimizations like FlashAttention.
Meanwhile, it remains unclear whether, and how, sparse attention can maintain
model quality at the scale of today's large language models (LLMs). This paper
presents Sparsely-Sharded (S2) Attention, a Triton library
that provides kernel optimization for sparse attention customizable at both
per-head and per-context-range levels. S2-Attention enables the exploration of
novel and high-performance sparse attention techniques, which we demonstrate
through extensive ablations across a wide range of sparse attention designs at
various model scales. From these insights, we present several basic guidelines
to design sparse attention that can achieve not only practical efficiency
improvements, but also strong downstream performance. To achieve high
parallelization and optimized memory IO, sparse attention should shard the
context heterogeneously across attention heads, where each head attends to a
different subset of tokens while collectively covering the full context.
Meanwhile, we find hybrid architectures combining sparse and dense attention
particularly beneficial in practice. S2-Attention achieves wall-clock speedup
of 8.79X, 15.87X, 25.3X compared to the strong FlashAttention-2 baseline with
strong downstream performance on-par with full attention and perfect retrieval
performance at a 128k context length. At inference, 7B models equipped with
our S2-Attention kernel achieve a 4.5x speed-up compared to their dense
counterparts. S2-Attention is released with easy-to-customize APIs for
direct usage in Megatron and vLLM.
comment: 10 pages
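A conceptual sketch of heterogeneous context sharding: each attention head
attends to a different strided shard of the context so the heads jointly cover
all tokens. Real S2-Attention implements this as fused Triton kernels; the
boolean mask below only conveys the semantics, not the optimized kernel, and
the strided pattern is one illustrative choice.

```python
import torch

def shard_masks(seq_len: int, n_heads: int) -> torch.Tensor:
    # masks[h, t] is True iff head h is allowed to attend to token t.
    masks = torch.zeros(n_heads, seq_len, dtype=torch.bool)
    for h in range(n_heads):
        masks[h, h::n_heads] = True  # head h sees tokens h, h+H, h+2H, ...
    return masks

m = shard_masks(seq_len=16, n_heads=4)
assert m.any(dim=0).all()  # collectively, the heads cover the full context
```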
♻ ☆ Unanswerability Evaluation for Retrieval Augmented Generation
Existing evaluation frameworks for retrieval-augmented generation (RAG)
systems focus on answerable queries, but they overlook the importance of
appropriately rejecting unanswerable requests. In this paper, we introduce
UAEval4RAG, a framework designed to evaluate whether RAG systems can handle
unanswerable queries effectively. We define a taxonomy with six unanswerable
categories, and UAEval4RAG automatically synthesizes diverse and challenging
queries for any given knowledge base, scoring systems with unanswered-ratio
and acceptable-ratio metrics. We conduct experiments with various RAG
components, including
retrieval models, rewriting methods, rerankers, language models, and prompting
strategies, and reveal hidden trade-offs in performance of RAG systems. Our
findings highlight the critical role of component selection and prompt design
in optimizing RAG systems to balance the accuracy of answerable queries with
high rejection rates of unanswerable ones. UAEval4RAG provides valuable
insights and tools for developing more robust and reliable RAG systems.
♻ ☆ Simple Is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation ICLR 2025
Large Language Models (LLMs) demonstrate strong reasoning abilities but face
limitations such as hallucinations and outdated knowledge. Knowledge Graph
(KG)-based Retrieval-Augmented Generation (RAG) addresses these issues by
grounding LLM outputs in structured external knowledge from KGs. However,
current KG-based RAG frameworks still struggle to optimize the trade-off
between retrieval effectiveness and efficiency in identifying a suitable amount
of relevant graph information for the LLM to digest. We introduce SubgraphRAG,
which extends the KG-based RAG framework by retrieving subgraphs and
leveraging LLMs for reasoning and answer prediction. Our approach integrates
a lightweight multilayer perceptron with a parallel triple-scoring mechanism
for efficient and flexible subgraph retrieval while encoding directional
structural distances to enhance retrieval effectiveness. The size of retrieved
subgraphs can be flexibly adjusted to match the query's need and the downstream
LLM's capabilities. This design strikes a balance between model complexity and
reasoning power, enabling scalable and generalizable retrieval processes.
Notably, based on our retrieved subgraphs, smaller LLMs like
Llama3.1-8B-Instruct deliver competitive results with explainable reasoning,
while larger models like GPT-4o achieve state-of-the-art accuracy compared with
previous baselines -- all without fine-tuning. Extensive evaluations on the
WebQSP and CWQ benchmarks highlight SubgraphRAG's strengths in efficiency,
accuracy, and reliability by reducing hallucinations and improving response
grounding.
comment: Accepted by ICLR 2025; Code available at
https://github.com/Graph-COM/SubgraphRAG
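A hedged sketch of parallel triple scoring: each KG triple is embedded (here
simply as concatenated head/relation/tail vectors plus a query vector) and
scored by a small MLP in one batched pass, with the top-k triples forming the
retrieved subgraph. The feature construction and sizes are illustrative, not
the paper's exact design (which also encodes directional structural
distances).

```python
import torch
import torch.nn as nn

d = 64  # illustrative embedding size
scorer = nn.Sequential(nn.Linear(4 * d, 128), nn.ReLU(), nn.Linear(128, 1))

def retrieve_subgraph(query_emb: torch.Tensor,
                      triple_embs: torch.Tensor, k: int = 8):
    # triple_embs: (n_triples, 3*d) for (head, relation, tail) embeddings.
    q = query_emb.expand(triple_embs.size(0), -1)
    scores = scorer(torch.cat([q, triple_embs], dim=-1)).squeeze(-1)
    return torch.topk(scores, min(k, scores.numel())).indices

idx = retrieve_subgraph(torch.randn(1, d), torch.randn(100, 3 * d))
print(idx)  # indices of the top-scoring triples, i.e. the subgraph
```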
♻ ☆ ExploreSelf: Fostering User-driven Exploration and Reflection on Personal Challenges with Adaptive Guidance by Large Language Models
Expressing stressful experiences in words is proven to improve mental and
physical health, but individuals often disengage from writing interventions as
they struggle to organize their thoughts and emotions. Reflective prompts have
been used to provide direction, and large language models (LLMs) have
demonstrated the potential to provide tailored guidance. However, current
systems often limit users' flexibility to direct their reflections. We thus
present ExploreSelf, an LLM-driven application designed to empower users to
control their reflective journey, providing adaptive support through
dynamically generated questions. Through an exploratory study with 19
participants, we examine how participants explore and reflect on personal
challenges using ExploreSelf. Our findings demonstrate that participants valued
the flexible navigation of adaptive guidance to control their reflective
journey, leading to deeper engagement and insight. Building on our findings, we
discuss the implications of designing LLM-driven tools that facilitate
user-driven and effective reflection of personal challenges.
comment: 17 pages excluding reference and appendix. Accepted at ACM CHI 2025.
https://naver-ai.github.io/exploreself
♻ ☆ Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation
Retrieval, re-ranking, and retrieval-augmented generation (RAG) are critical
components of modern natural language processing (NLP) applications in
information retrieval, question answering, and knowledge-based text generation.
However, existing solutions are often fragmented, lacking a unified framework
that easily integrates these essential processes. The absence of a standardized
implementation, coupled with the complexity of retrieval and re-ranking
workflows, makes it challenging for researchers to compare and evaluate
different approaches in a consistent environment. While existing toolkits such
as Rerankers and RankLLM provide general-purpose reranking pipelines, they
often lack the flexibility required for fine-grained experimentation and
benchmarking. In response to these challenges, we introduce \textbf{Rankify}, a
powerful and modular open-source toolkit designed to unify retrieval,
re-ranking, and RAG within a cohesive framework. Rankify supports a wide range
of retrieval techniques, including dense and sparse retrievers, while
incorporating state-of-the-art re-ranking models to enhance retrieval quality.
Additionally, Rankify includes a collection of pre-retrieved datasets to
facilitate benchmarking, available at Huggingface
(https://huggingface.co/datasets/abdoelsayed/reranking-datasets). To encourage
adoption and ease of integration, we provide comprehensive documentation
(http://rankify.readthedocs.io/), an open-source implementation on GitHub
(https://github.com/DataScienceUIBK/rankify), and a PyPI package for
effortless installation (https://pypi.org/project/rankify/). By providing a
unified and lightweight framework, Rankify allows researchers and practitioners
to advance retrieval and re-ranking methodologies while ensuring consistency,
scalability, and ease of use.
comment: Work in Progress
♻ ☆ CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao
Large language models have achieved remarkable success in various tasks but
suffer from high computational costs during inference, limiting their
deployment in resource-constrained applications. To address this issue, we
propose a novel CITER (\textbf{C}ollaborative \textbf{I}nference with
\textbf{T}oken-l\textbf{E}vel \textbf{R}outing) framework that enables
efficient collaboration between small and large language models (SLMs & LLMs)
through a token-level routing strategy. Specifically, CITER routes non-critical
tokens to an SLM for efficiency and routes critical tokens to an LLM for
generalization quality. We formulate router training as a policy optimization,
where the router receives rewards based on both the quality of predictions and
the inference costs of generation. This allows the router to learn to predict
token-level routing scores and make routing decisions based on both the current
token and the future impact of its decisions. To further accelerate the
reward evaluation process, we introduce a shortcut that significantly reduces
the cost of reward estimation and improves the practicality of our approach.
Extensive experiments on five benchmark datasets demonstrate that CITER reduces
the inference costs while preserving high-quality generation, offering a
promising solution for real-time and resource-constrained applications.
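A conceptual sketch of token-level routing (all components below are
stand-ins; the real router is trained with the policy-optimization objective
described above): each next token comes from the small model unless the router
deems it critical, in which case the large model is used.

```python
import torch

def route_generate(ids, slm, llm, router, steps=8, tau=0.5):
    for _ in range(steps):
        p = router(ids)                       # P(current token is critical)
        logits = (llm if p > tau else slm)(ids)
        next_id = logits[..., -1, :].argmax(-1, keepdim=True)  # greedy token
        ids = torch.cat([ids, next_id], dim=-1)
    return ids

vocab = 100
toy = lambda ids: torch.randn(*ids.shape, vocab)  # stand-in LM: (B,T,V) logits
out = route_generate(torch.zeros(1, 4, dtype=torch.long), toy, toy,
                     router=lambda ids: torch.rand(()).item())
print(out.shape)  # (1, 12): 4 prompt tokens + 8 generated
```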
♻ ☆ Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different
ontologies and resolves their conceptual heterogeneity by aligning related
entities. OM systems currently have two prevailing design paradigms:
conventional knowledge-based expert systems and newer machine learning-based
predictive systems. While large language models (LLMs) and LLM agents have
revolutionised data engineering and have been applied creatively in many
domains, their potential for OM remains underexplored. This study introduces a
novel agent-powered LLM-based design paradigm for OM systems. With
consideration of several specific challenges in leveraging LLM agents for OM,
we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
consisting of two Siamese agents for retrieval and matching, with a set of OM
tools. Our framework is implemented in a proof-of-concept system. Evaluations
of three Ontology Alignment Evaluation Initiative (OAEI) tracks over
state-of-the-art OM systems show that our system can achieve results very close
to the long-standing best performance on simple OM tasks and can significantly
improve the performance on complex and few-shot OM tasks.
comment: 19 pages, 12 figures, 3 tables
♻ ☆ Distilling Implicit Multimodal Knowledge into Large Language Models for Zero-Resource Dialogue Generation
Integrating multimodal knowledge into large language models (LLMs) represents
a significant advancement in dialogue generation capabilities. However, the
effective incorporation of such knowledge in zero-resource scenarios remains a
substantial challenge due to the scarcity of diverse, high-quality dialogue
datasets. To address this, we propose the Visual Implicit Knowledge
Distillation Framework (VIKDF), an innovative approach aimed at enhancing LLMs
for enriched dialogue generation in zero-resource contexts by leveraging
implicit multimodal knowledge. VIKDF comprises two main stages: knowledge
distillation, using an Implicit Query Transformer to extract and encode visual
implicit knowledge from image-text pairs into knowledge vectors; and knowledge
integration, employing a novel Bidirectional Variational Information Fusion
technique to seamlessly integrate these distilled vectors into LLMs. This
enables the LLMs to generate dialogues that are not only coherent and engaging
but also exhibit a deep understanding of the context through implicit
multimodal cues, effectively overcoming the limitations of zero-resource
scenarios. Our extensive experimentation across two dialogue datasets shows
that VIKDF outperforms existing state-of-the-art models in generating
high-quality dialogues. The code is available at
https://github.com/zhangbo-nlp/VIKDF.
comment: Accepted by Information Fusion. The code is available at
https://github.com/zhangbo-nlp/VIKDF
♻ ☆ Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison NAACL 2025
Following the remarkable success of Large Language Models (LLMs) in NLP
tasks, there is increasing interest in extending their capabilities to speech
-- the most common form of communication. The most widespread approach to
integrating speech into LLMs is dense feature prepending (DFP), which prepends
the projected speech representations to the textual representations, allowing
end-to-end training with a speech encoder. This raises questions about the need
for a sophisticated speech encoder for DFP and how its performance compares
with a standard encoder-decoder (i.e., cross-attention) architecture. We
compare DFP and cross-attention under a variety of configurations, such as CTC
compression and sequence-level knowledge distillation, on monolingual, bilingual,
and multilingual models. To perform a controlled architectural comparison, we
train all models from scratch rather than using large pretrained models and use
comparable data and parameter settings, testing speech-to-text recognition
(ASR) and translation (ST) on MuST-C v1.0 and CoVoST2 datasets. Despite the
wide adoption of DFP, our results do not indicate a clear advantage of DFP over
cross-attention.
comment: Accepted at NAACL 2025
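A minimal sketch of dense feature prepending (DFP): speech-encoder outputs are
projected into the language model's embedding space and prepended to the text
embeddings before decoding. The dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

d_speech, d_model = 512, 768
proj = nn.Linear(d_speech, d_model)  # maps speech features to LM embeddings

def dfp_inputs(speech_feats: torch.Tensor,
               text_embs: torch.Tensor) -> torch.Tensor:
    # speech_feats: (T_speech, d_speech); text_embs: (T_text, d_model)
    return torch.cat([proj(speech_feats), text_embs], dim=0)

inputs = dfp_inputs(torch.randn(200, d_speech), torch.randn(12, d_model))
print(inputs.shape)  # (212, 768): projected speech tokens first, then text
```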
♻ ☆ Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems
This study introduces Conversation Routines (CR), a structured prompt
engineering framework for developing task-oriented dialog systems using Large
Language Models (LLMs). While LLMs demonstrate remarkable natural language
understanding capabilities, engineering them to reliably execute complex
business workflows remains challenging. The proposed CR framework enables the
development of Conversation Agentic Systems (CAS) through natural language
specifications, embedding task-oriented logic within LLM prompts. This approach
provides a systematic methodology for designing and implementing complex
conversational workflows while maintaining behavioral consistency. We
demonstrate the framework's effectiveness through two proof-of-concept
implementations: a Train Ticket Booking System and an Interactive
Troubleshooting Copilot. These case studies validate CR's capability to encode
sophisticated behavioral patterns and decision logic while preserving natural
conversational flexibility. Results show that CR enables domain experts to
design conversational workflows in natural language while leveraging custom
functions (tools) developed by software engineers, creating an efficient
division of responsibilities where developers focus on core API implementation
and domain experts handle conversation design. While the framework shows
promise in accessibility and adaptability, we identify key challenges including
computational overhead, non-deterministic behavior, and domain-specific logic
optimization. Future research directions include CR evaluation methods based on
prompt engineering frameworks driven by goal-oriented grading criteria,
improving scalability for complex multi-agent interactions, and enhancing
system robustness to address the identified limitations across diverse business
applications.
comment: In version 3, we added Subsection 1.2, "Single-Agent vs. Multi-Agent
Architectures," and Figure 1 to clarify CAS prompt composition. We also
refined code block and appendix log formatting for improved readability, with
minor formatting corrections throughout
♻ ☆ The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
The "LLM-as-a-judge" paradigm employs Large Language Models (LLMs) as
annotators and evaluators in tasks traditionally performed by humans. LLM
annotations are widely used, not only in NLP research but also in fields like
medicine, psychology, and social science. Despite their role in shaping study
results and insights, there is no standard or rigorous procedure to determine
whether LLMs can replace human annotators. In this paper, we propose a novel
statistical procedure -- the Alternative Annotator Test (alt-test) -- that
requires only a modest subset of annotated examples to justify using LLM
annotations. Additionally, we introduce a versatile and interpretable measure
for comparing LLM judges. To demonstrate our procedure, we curated a diverse
collection of ten datasets, consisting of language and vision-language tasks,
and conducted experiments with six LLMs and four prompting techniques. Our
results show that LLMs can sometimes replace humans, with closed-source LLMs
(such as GPT-4o) outperforming open-source ones, and that prompting techniques
yield judges of varying quality. We hope this study encourages more rigorous
and reliable practices.
♻ ☆ Leveraging Encoder-only Large Language Models for Mobile App Review Feature Extraction
Mobile app review analysis presents unique challenges due to the low quality,
subjective bias, and noisy content of user-generated documents. Extracting
features from these reviews is essential for tasks such as feature
prioritization and sentiment analysis, but remains challenging.
Meanwhile, encoder-only models based on the Transformer architecture have shown
promising results for classification and information extraction tasks for
multiple software engineering processes. This study explores the hypothesis
that encoder-only large language models can enhance feature extraction from
mobile app reviews. By leveraging crowdsourced annotations from an industrial
context, we redefine feature extraction as a supervised token classification
task. Our approach includes extending the pre-training of these models with a
large corpus of user reviews to improve contextual understanding and employing
instance selection techniques to optimize model fine-tuning. Empirical
evaluations demonstrate that this method improves the precision and recall of
extracted features and enhances performance efficiency. Key contributions
include a novel approach to feature extraction, annotated datasets, extended
pre-trained models, and an instance selection mechanism for cost-effective
fine-tuning. This research provides practical methods and empirical evidence in
applying large language models to natural language processing tasks within
mobile app reviews, offering improved performance in feature extraction.
comment: 46 pages, 7 tables, 11 figures
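A generic sketch of feature extraction framed as supervised token
classification with an encoder-only model and BIO tags (B-FEATURE, I-FEATURE,
O). The checkpoint and label set here are placeholders, not the study's
artifacts, and the untrained classification head would need fine-tuning on
annotated reviews before its tags mean anything.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-FEATURE", "I-FEATURE"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

review = "The app keeps crashing when I open the camera filter."
enc = tok(review, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0]  # one tag per subword token
for t, p in zip(tok.convert_ids_to_tokens(enc["input_ids"][0].tolist()), pred):
    print(t, labels[int(p)])
```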
♻ ☆ Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction
Alexei Figueroa, Justus Westerhoff, Golzar Atefi, Dennis Fast, Benjamin Winter, Felix Alexander Gers, Alexander Löser, Wolfgang Nejdl
Biologically inspired neural networks offer alternative avenues to model data
distributions. FlyVec is a recent example that draws inspiration from the fruit
fly's olfactory circuit to tackle the task of learning word embeddings.
Surprisingly, this model performs competitively even against deep learning
approaches specifically designed to encode text, and it does so with the
highest degree of computational efficiency. We pose the question of whether
this performance can be improved further. For this, we introduce Comply. By
incorporating positional information through complex weights, we enable a
single-layer neural network to learn sequence representations. Our experiments
show that Comply not only surpasses FlyVec but also performs on par with
significantly larger state-of-the-art models. We achieve this without
additional parameters. Comply yields sparse contextual representations of
sentences that can be interpreted explicitly from the neuron weights.
comment: Accepted at NICE2025
♻ ☆ SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights ICLR 2025
Large language models (LLMs) like GPT-4, PaLM, and LLaMA have shown
significant improvements in various reasoning tasks. However, smaller models
such as Llama-3-8B and DeepSeekMath-Base still struggle with complex
mathematical reasoning because they fail to effectively identify and correct
reasoning errors. Recent reflection-based methods aim to address these issues
by enabling self-reflection and self-correction, but they still face challenges
in independently detecting errors in their reasoning steps. To overcome these
limitations, we propose SuperCorrect, a novel two-stage framework that uses a
large teacher model to supervise and correct both the reasoning and reflection
processes of a smaller student model. In the first stage, we extract
hierarchical high-level and detailed thought templates from the teacher model
to guide the student model in eliciting more fine-grained reasoning thoughts.
In the second stage, we introduce cross-model collaborative direct preference
optimization (DPO) to enhance the self-correction abilities of the student
model by following the teacher's correction traces during training. This
cross-model DPO approach teaches the student model to effectively locate and
resolve erroneous thoughts with error-driven insights from the teacher model,
breaking the bottleneck of its thoughts and acquiring new skills and knowledge
to tackle challenging problems. Extensive experiments consistently demonstrate
our superiority over previous methods. Notably, our SuperCorrect-7B model
significantly surpasses powerful DeepSeekMath-7B by 7.8%/5.3% and
Qwen2.5-Math-7B by 15.1%/6.3% on MATH/GSM8K benchmarks, achieving new SOTA
performance among all 7B models. Code:
https://github.com/YangLing0818/SuperCorrect-llm
comment: ICLR 2025. Project: https://github.com/YangLing0818/SuperCorrect-llm
♻ ☆ Token-based Decision Criteria Are Suboptimal in In-context Learning NAACL 2025
In-Context Learning (ICL) typically utilizes classification criteria from
output probabilities of manually selected label tokens. However, we argue that
such token-based classification criteria lead to suboptimal decision
boundaries, even when delicate calibrations through translation and
constrained rotation are applied. To address this problem, we propose Hidden
Calibration, which
renounces token probabilities and uses the nearest centroid classifier on the
LM's last hidden states. In detail, we assign the label of the nearest centroid
previously estimated from a calibration set to the test sample as the predicted
label. Our experiments on 6 models and 10 classification datasets indicate that
Hidden Calibration consistently outperforms current token-based baselines by
about 20%~50%, achieving a strong state-of-the-art in ICL. Our further analysis
demonstrates that Hidden Calibration finds better classification criteria with
less inter-class overlap, and LMs provide linearly separable intra-class
clusters with the help of demonstrations, which supports Hidden Calibration and
gives new insights into the principle of ICL. Our official code implementation
can be found at https://github.com/hc495/Hidden_Calibration.
comment: 24 pages, 15 figures, 13 tables. NAACL 2025 Main Conference Accepted.
Camera-ready version
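A minimal sketch of the Hidden Calibration idea: estimate one centroid per
class from last-layer hidden states of a small calibration set, then label
test samples by the nearest centroid instead of comparing label-token
probabilities. The random tensors stand in for real hidden states.

```python
import torch

def centroids(hidden: torch.Tensor, labels: torch.Tensor, n_classes: int):
    # hidden: (n_samples, d) last hidden state per calibration example.
    return torch.stack([hidden[labels == c].mean(0) for c in range(n_classes)])

def predict(test_hidden: torch.Tensor, cents: torch.Tensor) -> torch.Tensor:
    # Nearest centroid by Euclidean distance; returns (n_test,) labels.
    return torch.cdist(test_hidden, cents).argmin(dim=-1)

cal_h = torch.randn(40, 768)            # stand-in calibration hidden states
cal_y = torch.randint(0, 2, (40,))      # stand-in calibration labels
print(predict(torch.randn(5, 768), centroids(cal_h, cal_y, 2)))
```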
♻ ☆ Compressing Large Language Models with Automated Sub-Network Search
Large Language Models (LLMs) demonstrate exceptional reasoning abilities,
enabling strong generalization across diverse tasks such as commonsense
reasoning and instruction following. However, as LLMs scale, inference costs
become increasingly prohibitive, accumulating significantly over their life
cycle. In this paper we consider model compression for LLMs to reduce model
size while improving downstream task performance. We phrase this as a neural
architecture search problem that automatically prunes structural components,
such as attention heads, neurons, and layers by searching for the
Pareto-optimal set of sub-networks balancing between performance and on-device
latency. Compared to state-of-the-art structural pruning approaches and
fine-tuned smaller sub-networks extracted from the pre-trained model, our
method achieves up to a 9.85% improvement on average across 11 diverse
downstream tasks, along with up to a 22% improvement in on-device latency.
♻ ☆ SimulPL: Aligning Human Preferences in Simultaneous Machine Translation ICLR 2025
Simultaneous Machine Translation (SiMT) generates translations while
receiving streaming source inputs. This requires the SiMT model to learn a
read/write policy, deciding when to translate and when to wait for more source
input. Numerous linguistic studies indicate that audiences in SiMT scenarios
have distinct preferences, such as accurate translations, simpler syntax, and
no unnecessary latency. Aligning SiMT models with these human preferences is
crucial to improving their performance. However, this issue remains
unexplored. Additionally, preference optimization for the SiMT task is also
challenging. Existing methods focus solely on optimizing the generated
responses, ignoring human preferences related to latency and the optimization
of read/write policy during the preference optimization phase. To address these
challenges, we propose Simultaneous Preference Learning (SimulPL), a preference
learning framework tailored for the SiMT task. In the SimulPL framework, we
categorize SiMT human preferences into five aspects: \textbf{translation
quality preference}, \textbf{monotonicity preference}, \textbf{key point
preference}, \textbf{simplicity preference}, and \textbf{latency preference}.
By leveraging the first four preferences, we construct human preference prompts
to efficiently guide GPT-4/4o in generating preference data for the SiMT task.
In the preference optimization phase, SimulPL integrates \textbf{latency
preference} into the optimization objective and enables SiMT models to improve
the read/write policy, thereby aligning with human preferences more
effectively. Experimental results indicate that SimulPL exhibits better
alignment with human preferences across all latency levels in
Zh$\rightarrow$En, De$\rightarrow$En and En$\rightarrow$Zh SiMT tasks. Our data
and code will be available at https://github.com/EurekaForNLP/SimulPL.
comment: Accepted to ICLR 2025. 23 pages,13 figures,11 tables
♻ ☆ Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency NAACL 2025
Code Language Models have been trained to generate accurate solutions,
typically with no regard for runtime. On the other hand, previous works that
explored execution optimisation have observed corresponding drops in functional
correctness. To that end, we introduce Code-Optimise, a framework that
incorporates both correctness (passed, failed) and runtime (quick, slow) as
learning signals via self-generated preference data. Our framework is both
lightweight and robust as it dynamically selects solutions to reduce
overfitting while avoiding a reliance on larger models for learning signals.
Code-Optimise achieves significant improvements in pass@k while decreasing the
competitive baseline runtimes by an additional 6% for in-domain data and up to
3% for out-of-domain data. As a by-product, the average length of the generated
solutions is reduced by up to 48% on MBPP and 23% on HumanEval, resulting in
faster and cheaper inference. The generated data and codebase are open-sourced
at https://github.com/huawei-noah/HEBO/tree/Code_Optimise.
comment: NAACL 2025 (Findings)
♻ ☆ Can Large Language Models Predict the Outcome of Judicial Decisions?
Large Language Models (LLMs) have shown exceptional capabilities in Natural
Language Processing (NLP) across diverse domains. However, their application in
specialized tasks such as Legal Judgment Prediction (LJP) for low-resource
languages like Arabic remains underexplored. In this work, we address this gap
by developing an Arabic LJP dataset, collected and preprocessed from Saudi
commercial court judgments. We benchmark state-of-the-art open-source LLMs,
including LLaMA-3.2-3B and LLaMA-3.1-8B, under varying configurations such as
zero-shot, one-shot, and fine-tuning using QLoRA. Additionally, we use a
comprehensive evaluation framework combining quantitative metrics (BLEU and
ROUGE) with qualitative assessments (coherence, legal language, clarity). Our
results demonstrate that fine-tuned smaller models achieve comparable
performance to larger models in task-specific contexts while offering
significant resource efficiency. Furthermore, we investigate the effects of
prompt engineering and fine-tuning on model outputs, providing insights into
performance variability and instruction sensitivity. By making the dataset,
implementation code, and models publicly available, we establish a robust
foundation for future research in Arabic legal NLP.
♻ ☆ Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Supervised Fine-Tuning (SFT) is commonly used to train language models to
imitate annotated responses for given instructions. In this paper, we challenge
this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models
learn to critique noisy responses rather than simply imitate correct ones.
Inspired by human learning processes that emphasize critical thinking, CFT
encourages deeper analysis and nuanced understanding, traits often overlooked
by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample
dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in
the form of ([query; noisy response], critique). CFT on this dataset yields a
consistent 4-10% improvement over SFT on six math benchmarks with different
base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to
MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably,
our model Qwen2.5-Math-CFT only requires 1 hour training on 8xH100 over the 50K
examples. It can match or outperform strong competitors like
Qwen2.5-Math-Instruct on most benchmarks, which use over 2M samples. Moreover,
it can match the performance of SimpleRL, which is a deepseek-r1 replication
trained with 140x more compute. Ablation studies show that CFT is robust to the
source of noisy response and teacher critique model. Through these findings, we
argue that CFT offers a more effective alternative to advance the reasoning of
language models.
♻ ☆ UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
Xin Xu, Qiyun Xu, Tong Xiao, Tianhao Chen, Yuchen Yan, Jiaxin Zhang, Shizhe Diao, Can Yang, Yang Wang
Large language models (LLMs) have demonstrated remarkable capabilities in
solving complex reasoning tasks, particularly in mathematics. However, the
domain of physics reasoning presents unique challenges that have received
significantly less attention. Existing benchmarks often fall short in
evaluating LLMs' abilities on the breadth and depth of undergraduate-level
physics, underscoring the need for a comprehensive evaluation. To fill this
gap, we introduce UGPhysics, a large-scale and comprehensive benchmark
specifically designed to evaluate UnderGraduate-level Physics (UGPhysics)
reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics
problems in both English and Chinese, covering 13 subjects with seven different
answer types and four distinct physics reasoning skills, all rigorously
screened for data leakage. Additionally, we develop a Model-Assistant
Rule-based Judgment (MARJ) pipeline specifically tailored for assessing answer
correctness of physics problems, ensuring accurate evaluation. Our evaluation
of 31 leading LLMs shows that the highest overall accuracy, 49.8% (achieved by
OpenAI-o1-mini), emphasizes the necessity for models with stronger physics
reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ,
will drive future advancements in AI for physics reasoning. Codes and data are
available at https://github.com/YangLabHKUST/UGPhysics .
comment: 9 pages
♻ ☆ Almost Surely Safe Alignment of Large Language Models at Inference-Time
Even highly capable large language models (LLMs) can produce biased or unsafe
responses, and alignment techniques, such as RLHF, aimed at mitigating this
issue, are expensive and prone to overfitting as they retrain the LLM. This
paper introduces a novel inference-time alignment approach that ensures LLMs
generate safe responses almost surely, i.e., with a probability approaching
one. We achieve this by framing the safe generation of inference-time responses
as a constrained Markov decision process within the LLM's latent space.
Crucially, we introduce a safety state that tracks the evolution of safety
constraints and enables us to demonstrate formal safety guarantees upon solving
the MDP in the latent space. Building on this foundation, we propose
InferenceGuard, a practical implementation that safely aligns LLMs without
modifying the model weights. Empirically, we demonstrate InferenceGuard
effectively balances safety and task performance, outperforming existing
inference-time alignment methods in generating safe and aligned responses.
♻ ☆ RuleRAG: Rule-Guided Retrieval-Augmented Generation with Language Models for Question Answering
Retrieval-augmented generation (RAG) has shown promising potential in
knowledge-intensive question answering (QA). However, existing approaches only
consider the query itself, neither specifying the retrieval preferences for the
retrievers nor informing the generators of how to refer to the retrieved
documents for the answers, which poses a significant challenge to the QA
performance. To address these issues, we propose Rule-guided
Retrieval-Augmented Generation with LMs, which explicitly introduces rules for
in-context learning (RuleRAG-ICL) to guide retrievers to recall related
documents in the directions of rules and uniformly guide generators to reason
attributed by the same rules. Moreover, most existing RAG datasets were
constructed without considering rules, while Knowledge Graphs (KGs) are
recognized as providing high-quality rules. Therefore, we construct five
rule-aware RAG
benchmarks for QA, RuleQA, based on KGs to stress the significance of retrieval
and reasoning with rules. Experiments on RuleQA demonstrate that RuleRAG-ICL
improves retrieval quality by +89.2% in Recall@10 and answer accuracy by
+103.1% in Exact Match, and that RuleRAG-FT yields further gains. In addition,
experiments on four existing RAG datasets show RuleRAG is also effective by
offering rules in RuleQA to them, further proving the generalization of rule
guidance in RuleRAG.
♻ ☆ Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference
Although applications involving long-context inputs are crucial for the
effective utilization of large language models (LLMs), they also result in
increased computational costs and reduced performance. To address this
challenge, we propose an efficient, training-free prompt compression method
that retains key information within compressed prompts. We identify specific
attention heads in transformer-based LLMs, which we designate as evaluator
heads, that are capable of selecting tokens in long inputs that are most
significant for inference. Building on this discovery, we develop EHPC, an
Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly
"skim through" input prompts by leveraging only the first few layers with
evaluator heads during the pre-filling stage, subsequently passing only the
important tokens to the model for inference. EHPC achieves state-of-the-art
results across two mainstream benchmarks: prompt compression and long-context
inference acceleration. Consequently, it effectively reduces the complexity and
costs associated with commercial API calls. We further demonstrate that EHPC
attains competitive results compared to key-value cache-based acceleration
methods, thereby highlighting its potential to enhance the efficiency of LLMs
for long-context tasks.
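A hedged sketch of the evaluator-head compression step: the attention a chosen
head (in one of the first few layers) pays from the final position to each
prompt token serves as an importance score, and only the highest-scoring
tokens are kept. The layer/head indices, the scoring rule, and the random
attention tensor below are illustrative stand-ins.

```python
import torch

def compress(token_ids: torch.Tensor, attn: torch.Tensor,
             layer: int, head: int, keep: float = 0.25) -> torch.Tensor:
    # attn: (n_layers, n_heads, seq, seq) attention weights from prefill.
    scores = attn[layer, head, -1]        # last query position over the prompt
    k = max(1, int(keep * token_ids.numel()))
    kept = torch.topk(scores, k).indices.sort().values
    return token_ids[kept]                # compressed prompt, order preserved

ids = torch.arange(64)
attn = torch.softmax(torch.randn(4, 12, 64, 64), dim=-1)  # stand-in weights
print(compress(ids, attn, layer=1, head=3).shape)  # kept 25% of 64 tokens
```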
♻ ☆ Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study
This study explores computational approaches for measuring moral foundations
(MFs) in non-English corpora. Since most resources are developed primarily for
English, cross-linguistic applications of moral foundation theory remain
limited. Using Chinese as a case study, this paper evaluates the effectiveness
of applying English resources to machine translated text, local language
lexicons, multilingual language models, and large language models (LLMs) in
measuring MFs in non-English texts. The results indicate that machine
translation and local lexicon approaches are insufficient for complex moral
assessments, frequently resulting in a substantial loss of cultural
information. In contrast, multilingual models and LLMs demonstrate reliable
cross-language performance with transfer learning, with LLMs excelling in terms
of data efficiency. Importantly, this study also underscores the need for
human-in-the-loop validation of automated MF assessment, as the most advanced
models may overlook cultural nuances in cross-language measurements. The
findings highlight the potential of LLMs for cross-language MF measurements and
other complex multilingual deductive coding tasks.
comment: 12 pages, 2 figures, 6 tables
♻ ☆ Transition Network Analysis: A Novel Framework for Modeling, Visualizing, and Identifying the Temporal Patterns of Learners and Learning Processes
This paper presents a novel learning analytics method: Transition Network
Analysis (TNA), a method that integrates Stochastic Process Mining and
probabilistic graph representation to model, visualize, and identify transition
patterns in the learning process data. Combining the relational and temporal
aspects into a single lens offers capabilities beyond either framework,
including centralities to capture important learning events, community
detection to identify behavior patterns, and clustering to reveal temporal
patterns. Furthermore, TNA introduces several significance tests that go beyond
either method and add rigor to the analysis. Here, we introduce the theoretical
and mathematical foundations of TNA and we demonstrate the functionalities of
TNA with a case study where students (n=191) engaged in small-group
collaboration to map patterns of group dynamics using the theories of
co-regulation and socially-shared regulated learning. The analysis revealed
that TNA can map the regulatory processes as well as identify important events,
patterns, and clusters. Bootstrap validation established the significant
transitions and eliminated spurious transitions. As such, TNA can capture
learning dynamics and provide a robust framework for investigating the temporal
evolution of learning processes. Future directions include -- inter alia --
expanding estimation methods, reliability assessment, and building longitudinal
TNA.
comment: Accepted at Learning Analytics & Knowledge (LAK '25)
♻ ☆ What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning COLING2025
Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan Wang, Mingchen Cai, Ruihua Song, Ji-Rong Wen
Visual instruction tuning is crucial for enhancing the zero-shot
generalization capability of Multi-modal Large Language Models (MLLMs). In this
paper, we aim to investigate a fundamental question: ''what makes for good
visual instructions''. Through a comprehensive empirical study, we find that
instructions focusing on complex visual reasoning tasks are particularly
effective in improving the performance of MLLMs, with results correlating to
instruction complexity. Based on this insight, we develop a systematic approach
to automatically create high-quality complex visual reasoning instructions. Our
approach employs a synthesize-complicate-reformulate paradigm, leveraging
multiple stages to gradually increase the complexity of the instructions while
guaranteeing quality. Based on this approach, we create the ComVint dataset
with 32K examples, and fine-tune four MLLMs on it. Experimental results
consistently demonstrate the enhanced performance of all compared MLLMs, such
as a 27.86% and 27.60% improvement for LLaVA on MME-Perception and
MME-Cognition, respectively. Our code and data are publicly available at the
link: https://github.com/RUCAIBox/ComVint.
comment: Accepted by COLING2025
♻ ☆ ColPali: Efficient Document Retrieval with Vision Language Models ICLR 2025
Documents are visually rich structures that convey information through text,
but also figures, page layouts, tables, or even fonts. Since modern retrieval
systems mainly rely on the textual information they extract from document pages
to index documents (often through lengthy and brittle processes), they struggle
to exploit key visual cues efficiently. This limits their capabilities in many
practical document retrieval applications such as Retrieval Augmented
Generation (RAG). To benchmark current systems on visually rich document
retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe,
composed of various page-level retrieval tasks spanning multiple domains,
languages, and practical settings. The inherent complexity and performance
shortcomings of modern systems motivate a new concept: doing document retrieval
by directly embedding the images of the document pages. We release ColPali, a
Vision Language Model trained to produce high-quality multi-vector embeddings
from images of document pages. Combined with a late interaction matching
mechanism, ColPali largely outperforms modern document retrieval pipelines
while being drastically simpler, faster and end-to-end trainable. We release
models, data, code and benchmarks under open licenses at
https://huggingface.co/vidore.
comment: Published as a conference paper at ICLR 2025
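The late interaction matching mechanism referenced here is ColBERT-style MaxSim scoring between multi-vector embeddings; below is a minimal sketch of how one query can be scored against one page. Shapes and the random tensors are illustrative only.
```python
# Sketch of ColBERT-style late interaction ("MaxSim") scoring: each query
# token embedding is matched to its best page-patch embedding, and the
# maxima are summed. Dimensions and random inputs are illustrative.
import torch

q = torch.randn(12, 128)    # 12 query-token embeddings, dim 128
d = torch.randn(1030, 128)  # ~1030 image-patch embeddings for one page

sim = q @ d.T                        # (12, 1030) token-to-patch similarities
score = sim.max(dim=1).values.sum()  # max over patches, summed over tokens
print(float(score))
```
Because pages are embedded independently of queries, the per-page patch embeddings can be precomputed offline, which is what makes the pipeline simple and fast.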
♻ ☆ Explanation Regularisation through the Lens of Attributions COLING 2025
Explanation regularisation (ER) has been introduced as a way to guide text
classifiers to form their predictions relying on input tokens that humans
consider plausible. This is achieved by introducing an auxiliary explanation
loss that measures how well the output of an input attribution technique for
the model agrees with human-annotated rationales. The guidance appears to
benefit performance in out-of-domain (OOD) settings, presumably due to an
increased reliance on "plausible" tokens. However, previous work has
under-explored the impact of guidance on that reliance, particularly when
reliance is measured using attribution techniques different from those used to
guide the model. In this work, we seek to close this gap, and also explore the
relationship between reliance on plausible features and OOD performance. We
find that the connection between ER and the ability of a classifier to rely on
plausible features has been overstated and that a stronger reliance on
plausible tokens does not seem to be the cause for OOD improvements.
comment: COLING 2025 Camera-ready
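As a reference point for the objective under discussion, here is a minimal sketch of an explanation-regularised loss: model attributions are pushed toward binary human rationales via an auxiliary term. The loss form and normalization are illustrative assumptions, not any specific paper's exact setup.
```python
import torch
import torch.nn.functional as F

def er_loss(task_logits, labels, attributions, rationale_mask, lam=0.5):
    """Joint objective: task loss plus an explanation loss measuring how
    well per-token attributions agree with human rationales.
    attributions: (batch, seq) importance scores from some attribution
    technique; rationale_mask: (batch, seq) 0/1 human annotations, assumed
    to contain at least one 1 per row."""
    task = F.cross_entropy(task_logits, labels)
    # Normalize both sides to distributions and compare with KL divergence.
    attr = torch.softmax(attributions, dim=-1)
    target = rationale_mask / rationale_mask.sum(dim=-1, keepdim=True)
    expl = F.kl_div(attr.log(), target, reduction="batchmean")
    return task + lam * expl
```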
♻ ☆ A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference NAACL2025
Recently, sharing key-value (KV) cache across layers has been found effective
in efficient inference of large language models (LLMs). To systematically
investigate different techniques of cross-layer KV sharing, we propose a
unified framework that covers several recent methods and their novel variants.
We conduct comprehensive experiments on all the configurations of the
framework, evaluating their generation throughput and performance in language
modeling and downstream tasks. We find that when reducing the size of the KV
cache by 2$\times$, most configurations can achieve higher throughput than
standard transformers while maintaining competitive performance. When further
reducing the size of the KV cache, however, pairing queries of all layers with
KVs of upper layers performs better, at the expense of additional training cost
and prefilling latency. We hope that this work will help users make more
informed choices of cross-layer KV sharing approaches and facilitate future
research on efficient LLM inference.
comment: Accepted to NAACL2025 main conference
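To make the design space concrete, here is a minimal sketch of cross-layer KV sharing: layers are mapped to shared KV groups, so a 2x cache reduction keeps one K/V pair per pair of layers. The adjacent-layer grouping shown is just one illustrative configuration of the framework discussed above.
```python
# Sketch of cross-layer KV-cache sharing: each transformer layer maps to a
# KV "group", and all layers in a group reuse one cached K/V pair.

def kv_group(layer_idx: int, share_factor: int = 2) -> int:
    return layer_idx // share_factor   # 2x cache reduction when factor=2

kv_cache = {}  # group_id -> (keys, values)

def get_or_compute_kv(layer_idx, hidden, compute_kv):
    group = kv_group(layer_idx)
    if group not in kv_cache:          # only the group's first layer computes
        kv_cache[group] = compute_kv(hidden)
    return kv_cache[group]             # later layers reuse the cached K/V
```
Variants that pair all layers' queries with upper-layer KVs would instead defer KV computation to the top of each group, which is the source of the extra training and prefilling cost the abstract mentions.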
♻ ☆ ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models
Masked Language Models (MLMs) have achieved remarkable success in many
self-supervised representation learning tasks. MLMs are trained by randomly
masking portions of the input sequences with [MASK] tokens and learning to
reconstruct the original content based on the remaining context. This paper
explores the impact of [MASK] tokens on MLMs. Analytical studies show that
masking tokens can introduce the corrupted semantics problem, wherein the
corrupted context may convey multiple, ambiguous meanings. This problem is also
a key factor affecting the performance of MLMs on downstream tasks. Based on
these findings, we propose a novel enhanced-context MLM, ExLM. Our approach
expands [MASK] tokens in the input context and models the dependencies between
these expanded states. This enhancement increases context capacity and enables
the model to capture richer semantic information, effectively mitigating the
corrupted semantics problem during pre-training. Experimental results
demonstrate that ExLM achieves significant performance improvements in both
text modeling and SMILES modeling tasks. Further analysis confirms that ExLM
enriches semantic representations through context enhancement, and effectively
reduces the semantic multimodality commonly observed in MLMs.
comment: 30 pages, 12 figures
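A minimal sketch of the context-expansion idea: each [MASK] in the input is replaced by several expanded mask states, giving the model more positions over which to spread ambiguous semantics. The expansion factor and token handling are illustrative assumptions.
```python
# Sketch of ExLM-style input expansion: every [MASK] token is expanded into
# k consecutive mask states before encoding. Token strings are illustrative.

def expand_masks(tokens, k=3, mask_token="[MASK]"):
    expanded = []
    for tok in tokens:
        if tok == mask_token:
            expanded.extend([mask_token] * k)  # k expanded states per mask
        else:
            expanded.append(tok)
    return expanded

print(expand_masks(["the", "[MASK]", "sat", "on", "the", "[MASK]"]))
# ['the', '[MASK]', '[MASK]', '[MASK]', 'sat', 'on', 'the',
#  '[MASK]', '[MASK]', '[MASK]']
```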
♻ ☆ Lost in Overlap: Exploring Logit-based Watermark Collision in LLMs NAACL 2025
The proliferation of large language models (LLMs) in generating content
raises concerns about text copyright. Watermarking methods, particularly
logit-based approaches, embed imperceptible identifiers into text to address
these challenges. However, the widespread usage of watermarking across diverse
LLMs has led to an inevitable issue known as watermark collision during common
tasks, such as paraphrasing or translation. In this paper, we introduce
watermark collision as a novel and general philosophy for watermark attacks,
aimed at enhancing attack performance on top of any other attack methods. We
also provide a comprehensive demonstration that watermark collision poses a
threat to all logit-based watermark algorithms, impacting not only specific
attack scenarios but also downstream applications.
comment: Long Paper, 9 pages, accepted at NAACL 2025 Findings
♻ ☆ Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection NAACL 2025
In human conversations, short backchannel utterances such as "yeah" and "oh"
play a crucial role in facilitating smooth and engaging dialogue. These
backchannels signal attentiveness and understanding without interrupting the
speaker, making their accurate prediction essential for creating more natural
conversational agents. This paper proposes a novel method for real-time,
continuous backchannel prediction using a fine-tuned Voice Activity Projection
(VAP) model. While existing approaches have relied on turn-based or
artificially balanced datasets, our approach predicts both the timing and type
of backchannels in a continuous and frame-wise manner on unbalanced, real-world
datasets. We first pre-train the VAP model on a general dialogue corpus to
capture conversational dynamics and then fine-tune it on a specialized dataset
focused on backchannel behavior. Experimental results demonstrate that our
model outperforms baseline methods in both timing and type prediction tasks,
achieving robust performance in real-time environments. This research offers a
promising step toward more responsive and human-like dialogue systems, with
implications for interactive spoken dialogue applications such as virtual
assistants and robots.
comment: This paper has been accepted for presentation at the main conference
of 2025 Annual Conference of the Nations of the Americas Chapter of the
Association for Computational Linguistics (NAACL 2025) and represents the
author's version of the work
♻ ☆ Mojito: Motion Trajectory and Intensity Control for Video Generation
Xuehai He, Shuohang Wang, Jianwei Yang, Xiaoxia Wu, Yiping Wang, Kuan Wang, Zheng Zhan, Olatunji Ruwase, Yelong Shen, Xin Eric Wang
Recent advancements in diffusion models have shown great promise in producing
high-quality video content. However, efficiently training video diffusion
models capable of integrating directional guidance and controllable motion
intensity remains a challenging and under-explored area. To tackle these
challenges, this paper introduces Mojito, a diffusion model that incorporates
both motion trajectory and intensity control for text-to-video generation.
Specifically, Mojito features a Directional Motion Control (DMC) module that
leverages cross-attention to efficiently direct the generated object's motion
without training, alongside a Motion Intensity Modulator (MIM) that uses
optical flow maps generated from videos to guide varying levels of motion
intensity. Extensive experiments demonstrate Mojito's effectiveness in
achieving precise trajectory and intensity control with high computational
efficiency, generating motion patterns that closely match specified directions
and intensities, providing realistic dynamics that align well with natural
motion in real-world scenarios.
♻ ☆ Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization ICLR 2025
Yuxin Jiang, Bo Huang, Yufei Wang, Xingshan Zeng, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang
Direct preference optimization (DPO), a widely adopted offline preference
optimization algorithm, aims to align large language models (LLMs) with
human-desired behaviors using pairwise preference data. However, the generation
of the winning response and the losing response within pairwise data are
typically isolated, leading to weak correlations between them as well as
suboptimal alignment performance. To address this issue, we propose an
effective framework for Bridging and Modeling Correlations in pairwise data,
named BMC. Firstly, we increase the consistency and informativeness of the
pairwise preference signals through targeted modifications, synthesizing a
pseudo-winning response by improving the losing response with the winning
response as a reference. Secondly, we identify that DPO alone is insufficient
to model these correlations and capture nuanced variations. Therefore, we
propose learning token-level correlations by dynamically leveraging the policy
model's confidence during training. Comprehensive experiments on QA, math, and
instruction-following tasks demonstrate the effectiveness of our approach,
significantly surpassing competitive baselines, including DPO. Additionally,
our in-depth quantitative analysis reveals the reasons behind our method's
superior performance over DPO and showcases its versatility to other DPO
variants. We release our repository at https://github.com/YJiangcm/BMC.
comment: 20 pages, 9 figures, 12 tables. Accepted at ICLR 2025
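For reference, a minimal sketch of the DPO objective that BMC builds on, with a token-level weighting function standing in for the confidence-based correlation modeling described above; the weighting scheme here is a simplified assumption, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla DPO: logp_* are summed log-probs of the winning/losing
    responses under the policy; ref_logp_* under the frozen reference."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

def weighted_logp(token_logps, token_weights):
    """BMC-style token-level emphasis (simplified sketch): weight each
    token's log-prob, e.g. by the policy's confidence, before summing."""
    return (token_weights * token_logps).sum(dim=-1)
```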
♻ ☆ Rationale Behind Essay Scores: Enhancing S-LLM's Multi-Trait Essay Scoring with Rationale Generated by LLMs
Existing automated essay scoring (AES) methods have relied solely on essay text
without using explanatory rationales for the scores, thereby forgoing an
opportunity to capture the specific aspects evaluated by rubric indicators in a
fine-grained manner. This paper introduces Rationale-based Multiple Trait
Scoring (RMTS), a novel approach for multi-trait essay scoring that integrates
prompt-engineering-based large language models (LLMs) with a fine-tuning-based
essay scoring model using a smaller large language model (S-LLM). RMTS uses an
LLM-based trait-wise rationale generation system where a separate LLM agent
generates trait-specific rationales based on rubric guidelines, which the
scoring model uses to accurately predict multi-trait scores. Extensive
experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize,
show that RMTS significantly outperforms state-of-the-art models and vanilla
S-LLMs in trait-specific scoring. By assisting quantitative assessment with
fine-grained qualitative rationales, RMTS enhances the trait-wise reliability,
providing partial explanations about essays. The code is available at
https://github.com/BBeeChu/RMTS.git.
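Mechanically, the pipeline pairs an LLM rationale generator with a smaller scoring model; a hedged sketch with hypothetical callables and illustrative trait names follows.
```python
# Sketch of rationale-based multi-trait scoring: an LLM agent produces a
# trait-specific rationale from the rubric, and a smaller scoring model
# consumes essay + rationale. `llm` and `score_model` are hypothetical.

TRAITS = ["content", "organization", "conventions"]  # illustrative traits

def score_essay(essay, rubric, llm, score_model):
    scores = {}
    for trait in TRAITS:
        rationale = llm(
            f"Rubric for {trait}: {rubric[trait]}\n"
            f"Essay: {essay}\n"
            f"Explain how well the essay meets this rubric."
        )
        scores[trait] = score_model(essay, trait, rationale)
    return scores
```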
♻ ☆ Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study NAACL2025
Large language models (LLMs) have shown continuously improving multilingual
capabilities, and even small-scale open-source models have demonstrated rapid
performance enhancement. In this paper, we systematically explore the abilities
of open LLMs with less than ten billion parameters to handle multilingual
machine translation (MT) tasks. We conduct comprehensive evaluations on six
popular LLMs and find that models like Gemma2-9B exhibit impressive
multilingual translation capabilities. We then introduce the Parallel-First
Monolingual-Second (PFMS) data mixing strategy in the continual pretraining
stage to further enhance the MT performance and present GemmaX2-28, a 9B model
achieving top-tier multilingual translation performance across 28 languages.
Specifically, GemmaX2-28 consistently outperforms the state-of-the-art (SOTA)
models such as TowerInstruct and XALMA and achieves competitive performance
with Google Translate and GPT-4-turbo.
comment: Accepted to NAACL2025 Main Conference
♻ ☆ Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models ICLR2025
As large language models (LLMs) become integral to various applications,
ensuring both their safety and utility is paramount. Jailbreak attacks, which
manipulate LLMs into generating harmful content, pose significant challenges to
this balance. Existing defenses, such as prompt engineering and safety
fine-tuning, often introduce computational overhead, increase inference
latency, and lack runtime flexibility. Moreover, overly restrictive safety
measures can degrade model utility by causing refusals of benign queries. In
this paper, we introduce Jailbreak Antidote, a method that enables real-time
adjustment of LLM safety preferences by manipulating a sparse subset of the
model's internal states during inference. By shifting the model's hidden
representations along a safety direction with varying strengths, we achieve
flexible control over the safety-utility balance without additional token
overhead or inference delays. Our analysis reveals that safety-related
information in LLMs is sparsely distributed; adjusting approximately 5% of the
internal state is as effective as modifying the entire state. Extensive
experiments on nine LLMs (ranging from 2 billion to 72 billion parameters),
evaluated against ten jailbreak attack methods and compared with six defense
strategies, validate the effectiveness and efficiency of our approach. By
directly manipulating internal states during reasoning, Jailbreak Antidote
offers a lightweight, scalable solution that enhances LLM safety while
preserving utility, opening new possibilities for real-time safety mechanisms
in widely-deployed AI systems.
comment: Accepted by ICLR2025. url: https://openreview.net/forum?id=s20W12XTF8
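A minimal sketch of the sparse representation adjustment described above: shift hidden states along a precomputed safety direction, restricted to the top ~5% of coordinates by direction magnitude. The direction vector and the selection rule are illustrative assumptions.
```python
import torch

def adjust_hidden(hidden, safety_dir, alpha=1.0, sparsity=0.05):
    """Shift hidden states along a safety direction, touching only the
    top-`sparsity` fraction of dimensions (sparse adjustment).
    hidden: (..., d); safety_dir: (d,) precomputed safety direction."""
    d = safety_dir.numel()
    k = max(1, int(sparsity * d))
    idx = safety_dir.abs().topk(k).indices       # most safety-relevant dims
    mask = torch.zeros(d, dtype=hidden.dtype)
    mask[idx] = 1.0
    return hidden + alpha * mask * safety_dir    # broadcast over batch/seq
```
Varying alpha at inference time gives the runtime safety-utility dial the abstract describes, with no extra tokens or latency.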
♻ ☆ MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders
Mental health disorders are among the most serious health conditions in the
world. Most people with such disorders lack access to adequate care, which highlights
the importance of training models for the diagnosis and treatment of mental
health disorders. However, in the mental health domain, privacy concerns limit
the accessibility of personalized treatment data, making it challenging to
build powerful models. In this paper, we introduce MentalArena, a self-play
framework to train language models by generating domain-specific personalized
data, where we obtain a better model capable of making a personalized diagnosis
and treatment (as a therapist) and providing information (as a patient). To
accurately model human-like mental health patients, we devise Symptom Encoder,
which simulates a real patient from both cognition and behavior perspectives.
To address intent bias during patient-therapist interactions, we propose
Symptom Decoder to compare diagnosed symptoms with encoded symptoms, and
dynamically manage the dialogue between patient and therapist according to the
identified deviations. We evaluated MentalArena on 6 benchmarks, including
biomedical QA and mental health tasks, against 6 advanced models. Our
models, fine-tuned on both GPT-3.5 and Llama-3-8b, significantly outperform
their counterparts, including GPT-4o. We hope that our work can inspire future
research on personalized care. Code is available at
https://github.com/Scarelette/MentalArena/tree/main
comment: Technical Report; 26 pages
♻ ☆ From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning ICML 2024
Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, Le Lu, Xinmei Tian, Deng Cai, Yonggang Zhang, Wenxiao Wang, Xu Shen, Jieping Ye
Large Language Models (LLMs) tend to prioritize adherence to user prompts
over providing veracious responses, leading to the sycophancy issue. When
challenged by users, LLMs tend to admit mistakes and provide inaccurate
responses even if they initially provided the correct answer. Recent works
propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy
issue, while it typically leads to the degeneration of LLMs' general
capability. To address the challenge, we propose a novel supervised pinpoint
tuning (SPT), where the region-of-interest modules are tuned for a given
objective. Specifically, SPT first reveals and verifies a small percentage
(<5%) of the basic modules, which significantly affect a particular behavior of
LLMs, i.e., sycophancy. Subsequently, SPT merely fine-tunes these identified
modules while freezing the rest. To verify the effectiveness of the proposed
SPT, we conduct comprehensive experiments, demonstrating that SPT significantly
mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT
introduces limited or even no side effects on the general capability of LLMs.
Our results shed light on how to precisely, effectively, and efficiently
explain and improve the targeted ability of LLMs. Code and data are available
at https://github.com/yellowtownhz/sycophancy-interpretability.
comment: accepted by ICML 2024, code and data are available at
https://github.com/yellowtownhz/sycophancy-interpretability
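Mechanically, pinpoint tuning reduces to freezing everything except the identified modules; a minimal PyTorch sketch follows, with the name-based filter standing in for the paper's module-identification step.
```python
import torch

def pinpoint_tune(model, target_module_names):
    """Freeze all parameters, then unfreeze only modules identified as
    driving the target behavior (matched here by name; the identification
    procedure itself is the paper's contribution and is not shown)."""
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        if any(t in name for t in target_module_names):
            p.requires_grad = True
    # Return the trainable parameter names for inspection.
    return [n for n, p in model.named_parameters() if p.requires_grad]
```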
♻ ☆ Mixture-of-Instructions: Aligning Large Language Models via Mixture Prompting
With the proliferation of large language models (LLMs), the comprehensive
alignment of such models across multiple tasks has emerged as a critical area
of research. Existing alignment methodologies primarily address a single task,
such as multi-turn dialogue, coding, mathematical problem-solving, or tool
usage. Although there is a large amount of high-quality data available for
those tasks, most of it provides only questions and answers without including
the system prompt. Through a detailed analysis of the Qwen language model, we
found that the system prompt has a significant impact on both the training and
inference processes of LLMs. We attribute this phenomenon to overfitting to the
system prompt. To address this issue, we introduce a novel technique termed
Mixture-of-Instructions (MoI), which employs a strategy of instruction packing
combined with diverse system prompts to boost the alignment efficiency of
language models. We have also compiled a diverse set of seven benchmark
datasets to rigorously evaluate the alignment efficacy of the MoI-enhanced
language model. Our methodology was applied to the open-source Qwen-7B-chat
model, culminating in the development of Qwen-SFT-MoI. This enhanced model
demonstrates significant advancements in generative capabilities across coding,
mathematics, and tool use tasks.
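Instruction packing with diverse system prompts, as described, can be sketched as follows; the separator markup, prompt pool, and packing size are all illustrative assumptions.
```python
import random

# Sketch of Mixture-of-Instructions packing: several instruction-answer
# pairs from different tasks are packed into one training sequence, each
# preceded by a system prompt sampled from a diverse pool.

SYSTEM_PROMPTS = [
    "You are a helpful assistant.",
    "You are a careful math tutor.",
    "You are an expert programmer.",
]

def pack_examples(examples, max_pack=3):
    batch = random.sample(examples, min(max_pack, len(examples)))
    segments = []
    for question, answer in batch:
        sys = random.choice(SYSTEM_PROMPTS)   # vary the system prompt
        segments.append(f"<system>{sys}</system>\n<user>{question}</user>\n"
                        f"<assistant>{answer}</assistant>")
    return "\n".join(segments)                # one packed training sequence
```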
♻ ☆ TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision WWW 2025
Hierarchical text classification aims to categorize each document into a set
of classes in a label taxonomy, which is a fundamental web text mining task
with broad applications such as web content analysis and semantic indexing.
Most earlier works focus on fully or semi-supervised methods that require a
large amount of human annotated data which is costly and time-consuming to
acquire. To alleviate human efforts, in this paper, we work on hierarchical
text classification with a minimal amount of supervision: using the sole class
name of each node as the only supervision. Recently, large language models
(LLMs) have shown competitive performance on various tasks through zero-shot
prompting, but this method performs poorly in the hierarchical setting because
it is ineffective to include the large and structured label space in a prompt.
On the other hand, previous weakly-supervised hierarchical text classification
methods only utilize the raw taxonomy skeleton and ignore the rich information
hidden in the text corpus that can serve as additional class-indicative
features. To tackle the above challenges, we propose TELEClass, Taxonomy
Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification,
which combines the general knowledge of LLMs and task-specific features mined
from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy
with class-indicative features for better label space understanding and
utilizes novel LLM-based data annotation and generation methods specifically
tailored for the hierarchical setting. Experiments show that TELEClass can
significantly outperform previous baselines while achieving comparable
performance to zero-shot prompting of LLMs with drastically less inference
cost.
comment: Accepted to WWW 2025 Research Track
♻ ☆ Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study COLING 2025
Utilizing Large Language Models (LLMs) as evaluators to assess the
performance of LLMs has garnered attention. However, this kind of evaluation
approach is affected by potential biases within LLMs, raising concerns about
the accuracy and reliability of the evaluation results of LLMs. To address this
problem, we propose and study two many-shot In-Context Learning (ICL) prompt
templates to help LLM evaluators mitigate potential biases: Many-Shot with
Reference (MSwR) and Many-Shot without Reference (MSoR). Specifically, the
former utilizes in-context examples with model-generated evaluation rationales
as references, while the latter does not include these references. Using these
prompt designs, we investigate the impact of increasing the number of
in-context examples on the consistency and quality of the evaluation results.
Experimental results show that advanced LLMs, such as GPT-4o, perform better in
the many-shot regime than in the zero-shot and few-shot regimes. Furthermore,
when using GPT-4o as an evaluator in the many-shot regime, adopting MSwR as the
prompt template performs better than MSoR.
comment: Accepted by COLING 2025
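For concreteness, a minimal sketch of the two prompt templates: MSwR includes a model-generated evaluation rationale with each in-context example, while MSoR omits it. Field names and the template wording are illustrative.
```python
# Sketch of the two many-shot evaluator templates. `examples` items are
# dicts with hypothetical fields: the text to judge, a score, and (for
# MSwR) a model-generated evaluation rationale used as a reference.

def build_prompt(examples, candidate, with_reference=True):
    parts = []
    for ex in examples:
        block = f"Response: {ex['text']}\n"
        if with_reference:               # MSwR: include rationale reference
            block += f"Rationale: {ex['rationale']}\n"
        block += f"Score: {ex['score']}\n"
        parts.append(block)
    parts.append(f"Response: {candidate}\nScore:")
    return "\n".join(parts)
```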
♻ ☆ Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning
Previous studies on continual knowledge learning (CKL) in large language
models (LLMs) have predominantly focused on approaches such as regularization,
architectural modifications, and rehearsal techniques to mitigate catastrophic
forgetting. However, these methods naively inherit the inefficiencies of
standard training procedures, indiscriminately applying uniform weight across
all tokens, which can lead to unnecessary parameter updates and increased
forgetting. To address these shortcomings, we propose a novel CKL approach
termed Train-Attention-Augmented Language Model (TAALM), which enhances
learning efficiency by dynamically predicting and applying weights to tokens
based on their usefulness. This method employs a meta-learning framework that
optimizes token importance predictions, facilitating targeted knowledge updates
and minimizing forgetting. Also, we observe that existing benchmarks do not
clearly exhibit the trade-off between learning and retaining; we therefore
propose a new benchmark, LAMA-ckl, to address this issue. Through
experiments conducted on both newly introduced and established CKL benchmarks,
TAALM achieves state-of-the-art performance over the baselines, and also
shows synergistic compatibility when integrated with previous CKL approaches.
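A minimal sketch of the token-weighted objective implied above: a learned usefulness score reweights the per-token cross-entropy, so updates concentrate on tokens predicted to matter. The weight predictor and the normalization are illustrative assumptions; the meta-learning loop that trains the predictor is omitted.
```python
import torch
import torch.nn.functional as F

def weighted_ckl_loss(logits, targets, token_weights):
    """Per-token cross-entropy reweighted by predicted token usefulness.
    logits: (batch, seq, vocab); targets: (batch, seq);
    token_weights: (batch, seq), e.g. from a small Train-Attention head."""
    ce = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )                                    # (batch, seq) per-token losses
    w = token_weights / token_weights.sum(dim=-1, keepdim=True)
    return (w * ce).sum(dim=-1).mean()
```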
♻ ☆ CoS: Enhancing Personalization and Mitigating Bias with Context Steering
When querying a large language model (LLM), the context, i.e. personal,
demographic, and cultural information specific to an end-user, can
significantly shape the response of the LLM. For example, asking the model to
explain Newton's second law with the context "I am a toddler" yields a
different answer compared to the context "I am a physics professor." Proper
usage of the context enables the LLM to generate personalized responses,
whereas inappropriate contextual influence can lead to stereotypical and
potentially harmful generations (e.g. associating "female" with "housekeeper").
In practice, striking the right balance when leveraging context is a nuanced
and challenging problem that is often situation-dependent. One common approach
to address this challenge is to fine-tune LLMs on contextually appropriate
responses. However, this approach is expensive, time-consuming, and not
controllable for end-users in different situations. In this work, we propose
Context Steering (CoS), a simple training-free method that can be easily
applied to autoregressive LLMs at inference time. By measuring the contextual
influence in terms of token prediction likelihood and modulating it, our method
enables practitioners to determine the appropriate level of contextual
influence based on their specific use case and end-user base. We showcase a
variety of applications of CoS including amplifying the contextual influence to
achieve better personalization and mitigating unwanted influence for reducing
model bias. In addition, we show that we can combine CoS with Bayesian
Inference to quantify the extent of hate speech on the internet. We demonstrate
the effectiveness of CoS on state-of-the-art LLMs and benchmarks.
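The mechanism lends itself to a very small sketch: compute next-token logits with and without the user context, then scale their difference by a user-chosen lambda. Decoding details are illustrative; the function works on any array-like logits.
```python
def context_steered_logits(logits_with_ctx, logits_no_ctx, lam=1.0):
    """CoS-style steering at inference time: the contextual influence is
    taken to be the next-token logit difference between prompts with and
    without the user context; lambda amplifies (>1), keeps (=1), or
    suppresses (<1) that influence before sampling."""
    return logits_no_ctx + lam * (logits_with_ctx - logits_no_ctx)
```
In this framing, lam=2.0 exaggerates the context's effect (stronger personalization) while lam=0.0 ignores it entirely (bias mitigation).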
♻ ☆ Spoken Language Intelligence of Large Language Models for Language Learning
People have long hoped for a conversational system that can assist in
real-life situations, and recent progress on large language models (LLMs) is
bringing this idea closer to reality. While LLMs are often impressive in
performance, their efficacy in real-world scenarios that demand expert
knowledge remains unclear. LLMs are believed to hold the most potential and
value in education, especially in the development of Artificial intelligence
(AI) based virtual teachers capable of facilitating language learning. Our
focus is centered on evaluating the efficacy of LLMs in the realm of education,
specifically in the areas of spoken language learning which encompass
phonetics, phonology, and second language acquisition. We introduce a new
multiple-choice question dataset to evaluate the effectiveness of LLMs in the
aforementioned scenarios, including understanding and application of spoken
language knowledge. In addition, we investigate the influence of various
prompting techniques such as zero- and few-shot methods (prepending the question
with question-answer exemplars), chain-of-thought (CoT, think step-by-step),
in-domain exemplars, and external tools (Google, Wikipedia). We conducted a
large-scale evaluation on popular LLMs (20 distinct models) using these
methods. We achieved significant performance improvements compared to the
zero-shot baseline on practical reasoning questions (GPT-3.5, 49.1% ->
63.1%; LLaMA2-70B-Chat, 42.2% -> 48.6%). We found that models of different
sizes have good understanding of concepts in phonetics, phonology, and second
language acquisition, but show limitations in reasoning for real-world
problems. Additionally, we also explore preliminary findings on conversational
communication.
comment: 28 pages, 7 figures, preprint. Feb 04, 2025 update: added DeepSeek R1
  performance
♻ ☆ BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation
Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Boris Hanin, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
As the demand for high-quality data in model training grows, researchers and
developers are increasingly generating synthetic data to tune and train LLMs. A
common assumption about synthetic data is that sampling from instruct-tuned
models is sufficient; however, these models struggle to produce diverse
outputs, a key requirement for generalization. Despite various prompting
methods, in this work we show that achieving meaningful diversity from
instruct-tuned models remains challenging. In contrast, we find base models
without post-training exhibit greater diversity, but are less capable at
instruction following and hence of lower quality. Leveraging this insight, we
propose Base-Refine (BARE), a synthetic data generation method that combines
the diversity of base models with the quality of instruct-tuned models through
a two-stage process. With minimal few-shot examples and curation, BARE
generates diverse and high-quality datasets, improving downstream task
performance. We show that fine-tuning with as few as 1,000 BARE-generated
samples can reach performance comparable to the best similarly sized models on
LiveCodeBench tasks. Furthermore, fine-tuning with BARE-generated data achieves
a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement
over SOTA methods on RAFT.
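The two-stage process can be sketched as follows, with the decoding wrappers standing in for calls to the respective models; the prompts and the model interface are illustrative assumptions.
```python
# Sketch of Base-Refine: draft diverse samples with a base model, then have
# an instruct-tuned model refine each draft for quality. `base_generate`
# and `instruct_generate` are hypothetical decoding wrappers.

def bare(seed_examples, n, base_generate, instruct_generate):
    drafts = [
        base_generate(  # base model: diverse but lower-quality drafts
            "Here are examples:\n" + "\n".join(seed_examples) + "\nNext:",
            temperature=1.0,
        )
        for _ in range(n)
    ]
    return [
        instruct_generate(  # instruct model: fix quality, keep content
            f"Improve this example for correctness and clarity:\n{d}"
        )
        for d in drafts
    ]
```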
♻ ☆ SymBa: Symbolic Backward Chaining for Structured Natural Language Reasoning
To improve the performance and explainability of LLM-based natural language
reasoning, structured reasoning can be applied to generate explicitly
structured proofs. Among different methods for structured reasoning, we
specifically focus on backward chaining, where the proof goal is recursively
decomposed into subgoals by searching and applying rules. We argue that current
LLM-based backward chaining systems (e.g. Least-to-most prompting and LAMBADA)
are incomplete, as they omit crucial algorithmic components identified from the
classic backward chaining algorithm in computational logic (SLD Resolution). To
this end, we propose a novel backward chaining system, SymBa (Symbolic Backward
Chaining), which integrates a symbolic solver and an LLM. In SymBa, the solver
controls the proof process, and the LLM is only called when the solver requires
new information to complete the proof. Empowered by completeness, SymBa
achieves a significant improvement in seven deductive, relational, and
arithmetic reasoning benchmarks compared to the baselines.
comment: 17 pages (8 pages for main text), 11 figures
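A minimal propositional sketch of the solver-controlled loop: a classic backward chainer proves subgoals from a rule base and consults an LLM only when no fact or rule covers a goal. The rules, facts, and `ask_llm` callable are all illustrative.
```python
# Propositional backward chaining with an LLM fallback, sketching SymBa's
# division of labor: the symbolic solver drives the proof and the LLM is
# queried only for missing single-step knowledge.

def prove(goal, facts, rules, ask_llm, depth=0, max_depth=10):
    if depth > max_depth:
        return False
    if goal in facts:
        return True
    for head, body in rules:            # SLD-style rule decomposition
        if head == goal and all(
            prove(b, facts, rules, ask_llm, depth + 1) for b in body
        ):
            return True
    if ask_llm(goal):                   # solver lacks knowledge: query LLM
        facts.add(goal)
        return True
    return False

facts = {"bird(tweety)"}
rules = [("flies(tweety)", ["bird(tweety)", "not_penguin(tweety)"])]
print(prove("flies(tweety)", facts, rules, ask_llm=lambda g: True))  # True
```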
♻ ☆ Decoding Speculative Decoding NAACL 2025
Speculative Decoding is a widely used technique to speed up inference for
Large Language Models (LLMs) without sacrificing quality. When performing
inference, speculative decoding uses a smaller draft model to generate
speculative tokens and then uses the target LLM to verify those draft tokens.
The speedup provided by speculative decoding heavily depends on the choice of
the draft model. In this work, we perform a detailed study comprising over 350
experiments with LLaMA-65B and OPT-66B using speculative decoding and delineate
the factors that affect the performance gain provided by speculative decoding.
Our experiments indicate that the performance of speculative decoding depends
heavily on the latency of the draft model, and the draft model's capability in
language modeling does not correlate strongly with its performance in
speculative decoding. Based on these insights we explore a new design space for
draft models and design hardware-efficient draft models for speculative
decoding. Our newly designed draft model can provide 111% higher throughput
than existing draft models and our approach generalizes further to all LLaMA
models (1/2/3.1) and supervised fine-tuned models.
comment: Proceedings of the 2025 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies
(NAACL 2025)
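The draft-verify loop itself is compact; below is a greedy-verification sketch in which both models are abstracted as next-token callables, so all interfaces are illustrative.
```python
# Greedy speculative decoding sketch: the draft model proposes k tokens,
# the target model checks them, and the longest agreeing prefix is kept.
# `draft_next` / `target_next` are hypothetical greedy next-token callables.

def speculative_step(prefix, draft_next, target_next, k=4):
    proposal = list(prefix)
    for _ in range(k):                  # cheap draft pass proposes k tokens
        proposal.append(draft_next(proposal))
    accepted = list(prefix)
    # In practice the target scores all k positions in one forward pass;
    # the sequential calls here only illustrate the acceptance logic.
    for i in range(len(prefix), len(proposal)):
        t = target_next(accepted)
        accepted.append(t)              # the target's token always stands
        if t != proposal[i]:
            break                       # first mismatch ends the step
    return accepted
```
The study's conclusion follows directly from this structure: the draft model is called k times per step, so its latency, not its language-modeling quality, dominates the achievable speedup.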
♻ ☆ Can LLMs Assist Annotators in Identifying Morality Frames? -- Case Study on Vaccination Debate on Social Media
Nowadays, social media is pivotal in shaping public discourse, especially on
polarizing issues like vaccination, where diverse moral perspectives influence
individual opinions. In NLP, data scarcity and complexity of psycholinguistic
tasks, such as identifying morality frames, make relying solely on human
annotators costly, time-consuming, and prone to inconsistency due to cognitive
load. To address these issues, we leverage large language models (LLMs), which
are adept at adapting new tasks through few-shot learning, utilizing a handful
of in-context examples coupled with explanations that connect examples to task
principles. Our research explores LLMs' potential to assist human annotators in
identifying morality frames within vaccination debates on social media. We
employ a two-step process: generating concepts and explanations with LLMs,
followed by human evaluation using a "think-aloud" tool. Our study shows that
integrating LLMs into the annotation process enhances accuracy, reduces task
difficulty, and lowers cognitive load, suggesting a promising avenue for human-AI
collaboration in complex psycholinguistic tasks.
comment: Accepted at 17th ACM Web Science Conference 2025 (WebSci'25)
♻ ☆ LMFusion: Adapting Pretrained Language Models for Multimodal Generation
We present LMFusion, a framework for empowering pretrained text-only large
language models (LLMs) with multimodal generative capabilities, enabling them
to understand and generate both text and images in arbitrary sequences.
LMFusion leverages existing Llama-3's weights for processing texts
autoregressively while introducing additional and parallel transformer modules
for processing images with diffusion. During training, the data from each
modality is routed to its dedicated modules: modality-specific feedforward
layers, query-key-value projections, and normalization layers process each
modality independently, while the shared self-attention layers allow
interactions across text and image features. By freezing the text-specific
modules and only training the image-specific modules, LMFusion preserves the
language capabilities of text-only LLMs while developing strong visual
understanding and generation abilities. Compared to methods that pretrain
multimodal generative models from scratch, our experiments demonstrate that
LMFusion improves image understanding by 20% and image generation by 3.6% using
only 50% of the FLOPs while maintaining Llama-3's language capabilities. We
also demonstrate that this framework can adapt existing vision-language models
with multimodal generation ability. Overall, this framework not only leverages
existing computational investments in text-only LLMs but also enables the
parallel development of language and vision capabilities, presenting a
promising direction for efficient multimodal model development.
comment: Name change: LlamaFusion to LMFusion
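A minimal sketch of the modality routing described above: tokens share self-attention but are dispatched to modality-specific feedforward layers by a boolean mask. The block is heavily simplified (no norms, no diffusion branch) and all shapes are illustrative.
```python
import torch
import torch.nn as nn

class ModalityRoutedBlock(nn.Module):
    """Shared self-attention with modality-specific FFNs (LMFusion-style
    sketch). `is_image` is a (batch, seq) boolean mask of image positions."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn_text = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ffn_image = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x, is_image):
        h, _ = self.attn(x, x, x)       # shared attention mixes modalities
        h = x + h
        out = torch.empty_like(h)
        out[~is_image] = self.ffn_text(h[~is_image])   # text-specific FFN
        out[is_image] = self.ffn_image(h[is_image])    # image-specific FFN
        return h + out
```
Freezing the text-side modules while training only the image-side ones is what lets such a design preserve the base LLM's language capabilities.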
♻ ☆ TableMaster: A Recipe to Advance Table Understanding with Language Models
Tables serve as a fundamental format for representing structured relational
data. While current language models (LMs) excel at many text-based tasks, they
still face challenges in table understanding due to the complex characteristics
of tabular data, such as their structured nature. In this paper, we aim to
enhance LMs for improved table understanding. We identify four key challenges:
1) difficulty in locating target data, 2) deficiency in table semantics, 3)
numerical inaccuracies in textual reasoning, and 4) semantic inflexibility in
symbolic reasoning. To address these issues, we propose TableMaster, a recipe
and comprehensive framework that integrates multiple solutions to overcome
these obstacles. TableMaster first extracts relevant table content and
verbalizes it with enriched semantic context. Additionally, we introduce
adaptive reasoning, a flexible approach that dynamically adjusts between
textual and symbolic reasoning, tailoring the reasoning process to each query.
Extensive analyses and experiments demonstrate our findings and the
effectiveness of TableMaster. On the WikiTQ dataset, TableMaster achieves an
accuracy of 78.13% using GPT-4o-mini, surpassing existing baselines.
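The adaptive reasoning step can be sketched as a simple router between textual and symbolic paths; the keyword heuristic and the `llm` callable below are illustrative assumptions, not the paper's routing criterion.
```python
# Sketch of adaptive reasoning over a table: numeric aggregation queries
# are routed to symbolic (pandas) execution, others to textual reasoning
# with an LLM. The routing heuristic and `llm` are illustrative.
import pandas as pd

def answer(table: pd.DataFrame, query: str, llm):
    numeric_cues = ("sum", "average", "how many", "max", "min")
    if any(cue in query.lower() for cue in numeric_cues):
        # Symbolic path: have the LLM emit a pandas expression to evaluate.
        code = llm(f"Columns: {list(table.columns)}\n"
                   f"Write a pandas expression over `table` answering: {query}")
        return eval(code, {"table": table, "pd": pd})  # sketch only
    # Textual path: verbalize the relevant content and reason in text.
    return llm(f"Table:\n{table.to_string()}\nQuestion: {query}")
```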