Computation and Language 57
☆ The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Sparse attention offers a promising strategy for extending the long-context
capabilities of Transformer LLMs, yet its viability, its efficiency-accuracy
trade-offs, and its behavior under systematic scaling remain unexplored. To
address this gap, we perform a careful comparison of training-free sparse
attention methods at varying model scales, sequence lengths, and sparsity
levels on a diverse collection of long-sequence tasks, including novel ones
that rely on natural language while remaining controllable and easy to
evaluate. Based on our experiments, we report a series of key findings: 1) An
isoFLOPS analysis reveals that for very long sequences, larger and highly
sparse models are preferable to smaller, dense ones. 2) The level of sparsity
attainable while
statistically guaranteeing accuracy preservation is higher during decoding than
prefilling, and correlates with model size in the former. 3) There is no clear
strategy that performs best across tasks and phases, with different units of
sparsification or budget adaptivity needed for different scenarios. Even
moderate sparsity levels often result in significant performance degradation on
at least one task, highlighting that sparse attention is not a universal
solution. 4) We introduce and validate novel scaling laws specifically tailored
for sparse attention, providing evidence that our findings are likely to hold
true beyond our range of experiments. Through these insights, we demonstrate
that sparse attention is a key tool to enhance the capabilities of Transformer
LLMs for processing longer sequences, but requires careful evaluation of
trade-offs for performance-sensitive applications.
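To make the object of study concrete, here is a minimal sketch of one common
training-free sparsification unit, top-k key selection per query; this is an
illustration of the general technique, not any of the specific methods compared
in the paper:

    import numpy as np

    def topk_sparse_attention(q, K, V, k):
        # Single-query attention restricted to the k highest-scoring keys.
        # q: (d,), K: (n, d), V: (n, d); k is the key budget (sparsity knob).
        scores = K @ q / np.sqrt(q.shape[0])           # (n,) dot-product scores
        keep = np.argpartition(scores, -k)[-k:]        # indices of top-k keys
        w = np.exp(scores[keep] - scores[keep].max())  # softmax over kept keys
        w /= w.sum()
        return w @ V[keep]                             # (d,) attended value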
☆ Conversational Assistants to support Heart Failure Patients: comparing a Neurosymbolic Architecture with ChatGPT
Anuja Tayal, Devika Salunke, Barbara Di Eugenio, Paula Allen-Meares, Eulalia Puig Abril, Olga Garcia, Carolyn Dickens, Andrew Boyd
Conversational assistants are becoming increasingly popular, including in
healthcare, partly because of the availability and capabilities of Large
Language Models. There is a need for controlled, probing evaluations with real
stakeholders which can highlight advantages and disadvantages of more
traditional architectures and those based on generative AI. We present a
within-group user study to compare two versions of a conversational assistant
that allows heart failure patients to ask about the salt content in food. One
version of the system was developed in-house with a neurosymbolic architecture;
the other is based on ChatGPT. The evaluation shows that the in-house system is
more accurate, completes more tasks and is less verbose than the one based on
ChatGPT; on the other hand, the one based on ChatGPT makes fewer speech errors
and requires fewer clarifications to complete the task. Patients show no
preference for one over the other.
☆ Multilingual Performance Biases of Large Language Models in Education
Large language models (LLMs) are increasingly being adopted in educational
settings. These applications extend beyond English, even though current LLMs
remain primarily English-centric. In this work, we examine whether their use in
non-English educational settings is warranted. We evaluate the performance of
popular LLMs on four educational tasks: identifying student misconceptions,
providing targeted feedback, interactive tutoring, and grading translations in
six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to
English. We find that performance on these tasks roughly tracks the amount of
each language represented in the training data, with lower-resource languages
performing worse. Although the models perform reasonably well in most
languages, the frequent performance drops relative to English are significant.
Thus, we recommend that practitioners first verify that the LLM
works well in the target language for their educational task before deployment.
☆ Safety in Large Reasoning Models: A Survey
Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks
like mathematics and coding, leveraging their advanced reasoning capabilities.
Nevertheless, as these capabilities progress, significant concerns regarding
their vulnerabilities and safety have arisen, which can pose challenges to
their deployment and application in real-world settings. This paper presents a
comprehensive survey of safety in LRMs, meticulously exploring and summarizing
the newly emerged safety risks, attacks, and defense strategies. By organizing
these
elements into a detailed taxonomy, this work aims to offer a clear and
structured understanding of the current safety landscape of LRMs, facilitating
future research and development to enhance the security and reliability of
these powerful models.
☆ Ensemble Bayesian Inference: Leveraging Small Language Models to Achieve LLM-level Accuracy in Profile Matching Tasks
This study explores the potential of small language model (SLM) ensembles to
achieve accuracy comparable to proprietary large language models (LLMs). We
propose Ensemble Bayesian Inference (EBI), a novel approach that applies
Bayesian estimation to combine judgments from multiple SLMs, allowing them to
exceed the performance limitations of individual models. Our experiments on
diverse tasks (aptitude assessments and consumer profile analysis in both
Japanese and English) demonstrate EBI's effectiveness. Notably, we analyze
cases where incorporating models with negative Lift values into ensembles
improves overall performance, and we examine the method's efficacy across
different languages. These findings suggest new possibilities for constructing
high-performance AI systems with limited computational resources and for
effectively utilizing models with individually lower performance. Building on
existing research on LLM performance evaluation, ensemble methods, and
open-source LLM utilization, we discuss the novelty and significance of our
approach.
comment: 13 pages, 2 figures
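The Bayesian combination step described above can be illustrated with a
naive-Bayes style update over per-model calibration accuracies; this is our
own sketch of the idea, not the authors' exact EBI formulation:

    import numpy as np

    def ebi_combine(votes, accuracies, n_classes):
        # votes: predicted class id from each SLM; accuracies: each model's
        # calibration accuracy p = P(vote == c | truth == c). Errors are
        # assumed, for this sketch, to be uniform over the remaining classes.
        log_post = np.zeros(n_classes)  # uniform prior over classes
        for v, p in zip(votes, accuracies):
            like = np.full(n_classes, (1.0 - p) / (n_classes - 1))
            like[v] = p
            log_post += np.log(like)
        post = np.exp(log_post - log_post.max())
        return post / post.sum()  # posterior over classes

    # e.g. three SLMs voting 1, 1, 0 with accuracies 0.7, 0.6, 0.55
    print(ebi_combine([1, 1, 0], [0.7, 0.6, 0.55], n_classes=2))

Note how a weak model still shifts the posterior slightly, which is consistent
with the paper's observation that low-performing members can help an ensemble.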
☆ Energy Considerations of Large Language Model Inference and Efficiency Optimizations
As large language models (LLMs) scale in size and adoption, their
computational and environmental costs continue to rise. Prior benchmarking
efforts have primarily focused on latency reduction in idealized settings,
often overlooking the diverse real-world inference workloads that shape energy
use. In this work, we systematically analyze the energy implications of common
inference efficiency optimizations across diverse Natural Language Processing
(NLP) and generative Artificial Intelligence (AI) workloads, including
conversational AI and code generation. We introduce a modeling approach that
approximates real-world LLM workflows through a binning strategy for
input-output token distributions and batch size variations. Our empirical
analysis spans software frameworks, decoding strategies, GPU architectures,
online and offline serving settings, and model parallelism configurations. We
show that the effectiveness of inference optimizations is highly sensitive to
workload geometry, software stack, and hardware accelerators, demonstrating
that naive energy estimates based on FLOPs or theoretical GPU utilization
significantly underestimate real-world energy consumption. Our findings reveal
that the proper application of relevant inference efficiency optimizations can
reduce total energy use by up to 73% from unoptimized baselines. These insights
provide a foundation for sustainable LLM deployment and inform energy-efficient
design strategies for future AI infrastructure.
comment: 16 pages
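The binning strategy for workload modeling can be sketched as follows; the bin
edges and the per-bin energy table are illustrative assumptions, not values
from the paper:

    import numpy as np

    def workload_bins(in_lens, out_lens, edges=(128, 512, 2048, 8192)):
        # Group requests into (input-length, output-length) bins so that
        # energy can be measured once per bin and weighted by bin frequency.
        i_bin = np.digitize(in_lens, edges)
        o_bin = np.digitize(out_lens, edges)
        counts = {}
        for b in zip(i_bin, o_bin):
            counts[b] = counts.get(b, 0) + 1
        total = len(in_lens)
        return {b: c / total for b, c in counts.items()}  # bin -> share

    # total energy estimate: sum over bins of
    #   share[bin] * measured_joules_per_request[bin] * n_requests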
☆ Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction
This study addresses the critical challenge of hallucination mitigation in
Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks
through a Split Conformal Prediction (SCP) framework. While LVLMs excel in
multi-modal reasoning, their outputs often exhibit hallucinated content with
high confidence, posing risks in safety-critical applications. We propose a
model-agnostic uncertainty quantification method that integrates dynamic
threshold calibration and cross-modal consistency verification. By partitioning
data into calibration and test sets, the framework computes nonconformity
scores to construct prediction sets with statistical guarantees under
user-defined risk levels ($\alpha$). Key innovations include: (1) rigorous
control of marginal coverage to ensure empirical error rates remain
strictly below $\alpha$; (2) dynamic adjustment of prediction set sizes
inversely with $\alpha$, filtering low-confidence outputs; (3) elimination of
prior distribution assumptions and retraining requirements. Evaluations on
benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces
theoretical guarantees across all $\alpha$ values. The framework achieves
stable performance across varying calibration-to-test split ratios,
underscoring its robustness for real-world deployment in healthcare, autonomous
systems, and other safety-sensitive domains. This work bridges the gap between
theoretical reliability and practical applicability in multi-modal AI systems,
offering a scalable solution for hallucination detection and uncertainty-aware
decision-making.
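For classification-style VQA outputs, the split conformal recipe described
above reduces to a few lines; the nonconformity score 1 - p(true answer) is
one standard choice, not necessarily the paper's exact score:

    import numpy as np

    def split_conformal_sets(cal_scores, test_probs, alpha=0.1):
        # cal_scores: nonconformity scores on held-out calibration data,
        # here 1 - p_model(true answer) per calibration example.
        # test_probs: (m, n_classes) model probabilities on test inputs.
        n = len(cal_scores)
        # finite-sample corrected quantile gives the (1 - alpha) threshold
        level = np.ceil((n + 1) * (1 - alpha)) / n
        q = np.quantile(cal_scores, min(level, 1.0), method="higher")
        # keep every answer whose nonconformity falls below the threshold
        return [np.where(1.0 - p <= q)[0] for p in test_probs]

Sets shrink as alpha grows, matching the abstract's point that prediction set
size adjusts inversely with the user-defined risk level.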
☆ Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics
Assisting LLMs with code generation has improved their performance on
mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is
generally restricted to execution correctness, lacking a rigorous evaluation of
their generated programs. In this work, we bridge this gap by conducting an
in-depth analysis of code-assisted LLMs' generated programs in response to math
reasoning tasks. Our evaluation focuses on the extent to which LLMs ground
their programs to math rules, and how that affects their end performance. For
this purpose, we assess the generations of five different LLMs on two
different math datasets, both manually and automatically. Our results reveal
that the distribution of grounding depends on LLMs' capabilities and the
difficulty of math problems. Furthermore, mathematical grounding is more
effective for closed-source models, while open-source models fail to employ
math rules in their solutions correctly. On MATH500, the percentage of grounded
programs drops to half, and ungrounded generations double, relative to the
grade-school problems of ASDiv. Our work highlights the need for
in-depth evaluation beyond execution accuracy metrics, toward a better
understanding of code-assisted LLMs' capabilities and limits in the math
domain.
☆ Towards a comprehensive taxonomy of online abusive language informed by machine learning
The proliferation of abusive language in online communications has posed
significant risks to the health and wellbeing of individuals and communities.
The growing concern regarding online abuse and its consequences necessitates
methods for identifying and mitigating harmful content and facilitating
continuous monitoring, moderation, and early intervention. This paper presents
a taxonomy for distinguishing key characteristics of abusive language within
online text. Our approach uses a systematic method for taxonomy development,
integrating classification systems of 18 existing multi-label datasets to
capture key characteristics relevant to online abusive language classification.
The resulting taxonomy is hierarchical and faceted, comprising 5 categories and
17 dimensions. It classifies various facets of online abuse, including context,
target, intensity, directness, and theme of abuse. This shared understanding
can lead to more cohesive efforts, facilitate knowledge exchange, and
accelerate progress in the field of online abuse detection and mitigation among
researchers, policy makers, online platform owners, and other stakeholders.
☆ RAGAT-Mind: A Multi-Granular Modeling Approach for Rumor Detection Based on MindSpore
As false information continues to proliferate across social media platforms,
effective rumor detection has emerged as a pressing challenge in natural
language processing. This paper proposes RAGAT-Mind, a multi-granular modeling
approach for Chinese rumor detection, built upon the MindSpore deep learning
framework. The model integrates TextCNN for local semantic extraction,
bidirectional GRU for sequential context learning, Multi-Head Self-Attention
for global dependency focusing, and Bidirectional Graph Convolutional Networks
(BiGCN) for structural representation of word co-occurrence graphs. Experiments
on the Weibo1-Rumor dataset demonstrate that RAGAT-Mind achieves superior
classification performance, attaining 99.2% accuracy and a macro-F1 score of
0.9919. The results validate the effectiveness of combining hierarchical
linguistic features with graph-based semantic structures. Furthermore, the
model exhibits strong generalization and interpretability, highlighting its
practical value for real-world rumor detection applications.
☆ DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training
Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li
Although large language models (LLMs) have recently achieved remarkable
performance on various complex reasoning benchmarks, the academic community
still lacks an in-depth understanding of base model training processes and data
quality. To address this, we construct a large-scale, difficulty-graded
reasoning dataset containing approximately 3.34 million unique queries of
varying difficulty levels and about 40 million distilled responses generated by
multiple models over several passes. Leveraging pass rate and Coefficient of
Variation (CV), we precisely select the most valuable training data to enhance
reasoning capability. Notably, we observe a training pattern shift, indicating
that reasoning-focused training of base models requires higher learning rates
to be effective. Using this carefully selected data, we
significantly improve the reasoning capabilities of the base model, achieving a
pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark. This
result surpasses most current distilled models and closely approaches
state-of-the-art performance. We provide detailed descriptions of our data
processing, difficulty assessment, and training methodology, and have publicly
released all datasets and methods to promote rapid progress in open-source
long-reasoning LLMs. The dataset is available at:
https://huggingface.co/datasets/a-m-team/AM-DeepSeek-Distilled-40M
☆ When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki
The ability to acquire latent semantics is one of the key properties that
determines the performance of language models. One convenient approach to
invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at
the beginning of texts in the pre-training data, making it easier for the model
to access latent semantics before observing the entire text. Previous studies
have reported that this technique actually improves the performance of trained
models in downstream tasks; however, this improvement has been observed only in
specific downstream tasks, without consistent enhancement in average next-token
prediction loss. To understand this phenomenon, we closely investigate how
prepending metadata during pre-training affects model performance by examining
its behavior using artificial data. Interestingly, we found that this approach
produces both positive and negative effects on the downstream tasks. We
demonstrate that the effectiveness of the approach depends on whether latent
semantics can be inferred from the downstream task's prompt. Specifically,
through investigations using data generated by probabilistic context-free
grammars, we show that training with metadata helps improve the model's performance
when the given context is long enough to infer the latent semantics. In
contrast, the technique negatively impacts performance when the context lacks
the necessary information to make an accurate posterior inference.
☆ HalluLens: LLM Hallucination Benchmark
Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, Pascale Fung
Large language models (LLMs) often generate responses that deviate from user
input or training data, a phenomenon known as "hallucination." These
hallucinations undermine user trust and hinder the adoption of generative AI
systems. Addressing hallucinations is essential for the advancement of LLMs.
This paper introduces a comprehensive hallucination benchmark, incorporating
both new extrinsic and existing intrinsic evaluation tasks, built upon a clear
taxonomy of hallucination. A major challenge in benchmarking hallucinations is
the lack of a unified framework due to inconsistent definitions and
categorizations. We disentangle LLM hallucination from "factuality," proposing
a clear taxonomy that distinguishes between extrinsic and intrinsic
hallucinations, to promote consistency and facilitate research. Extrinsic
hallucinations, where the generated content is not consistent with the training
data, are increasingly important as LLMs evolve. Our benchmark includes dynamic
test set generation to mitigate data leakage and ensure robustness against it.
We also analyze existing benchmarks, highlighting their limitations
and saturation. The work aims to: (1) establish a clear taxonomy of
hallucinations, (2) introduce new extrinsic hallucination tasks, with data that
can be dynamically regenerated to prevent saturation by leakage, (3) provide a
comprehensive analysis of existing benchmarks, distinguishing them from
factuality evaluations.
comment: 42 pages
☆ Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation
Watermarking has emerged as a critical technique for combating misinformation
and protecting intellectual property in large language models (LLMs). A recent
discovery, termed watermark radioactivity, reveals that watermarks embedded in
teacher models can be inherited by student models through knowledge
distillation. On the positive side, this inheritance allows for the detection
of unauthorized knowledge distillation by identifying watermark traces in
student models. However, the robustness of watermarks against scrubbing attacks
and their unforgeability in the face of spoofing attacks under unauthorized
knowledge distillation remain largely unexplored. Existing watermark attack
methods either assume access to model internals or fail to simultaneously
support both scrubbing and spoofing attacks. In this work, we propose
Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified
framework that enables bidirectional attacks under unauthorized knowledge
distillation. Our approach employs contrastive decoding to extract corrupted or
amplified watermark texts by comparing outputs from the student model and
weakly watermarked references, followed by bidirectional distillation to train
new student models capable of watermark removal and watermark forgery,
respectively. Extensive experiments show that CDG-KD effectively performs
attacks while preserving the general performance of the distilled model. Our
findings underscore the critical need for developing watermarking schemes that
are robust and unforgeable.
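The contrastive-decoding step at the heart of CDG-KD can be sketched as
follows; the parameterization and the name of the scale factor beta are our
own illustrative assumptions, not the authors' released code:

    import numpy as np

    def contrastive_next_token(student_logits, ref_logits, beta):
        # beta > 0 amplifies what the student has and the weakly watermarked
        # reference lacks (toward spoofing); beta < 0 pushes the student
        # toward the reference (toward scrubbing).
        logits = student_logits + beta * (student_logits - ref_logits)
        z = np.exp(logits - logits.max())
        return z / z.sum()  # next-token sampling distribution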
☆ HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models VLDB
The significant computational demands of pretrained language models (PLMs),
which often require dedicated hardware, present a substantial challenge in
serving them efficiently, especially in multi-tenant environments. To address
this, we introduce HMI, a Hierarchical knowledge management-based Multi-tenant
Inference system, designed to manage tenants with distinct PLMs
resource-efficiently. Our approach is three-fold: Firstly, we categorize PLM
knowledge into general, domain-specific, and task-specific. Leveraging insights
on knowledge acquisition across different model layers, we construct
hierarchical PLMs (hPLMs) by extracting and storing knowledge at different
levels, significantly reducing GPU memory usage per tenant. Secondly, we
establish hierarchical knowledge management for hPLMs generated by various
tenants in HMI. We manage domain-specific knowledge with acceptable storage
increases by constructing and updating domain-specific knowledge trees based on
frequency. We manage task-specific knowledge within limited GPU memory through
parameter swapping. Finally, we propose system optimizations to enhance
resource utilization and inference throughput. These include fine-grained
pipelining via hierarchical knowledge prefetching to overlap CPU and I/O
operations with GPU computations, and optimizing parallel implementations with
batched matrix multiplications. Our experimental results demonstrate that the
proposed HMI can efficiently serve up to 10,000 hPLMs (hBERTs and hGPTs) on a
single GPU, with only a negligible compromise in accuracy.
comment: Accepted by VLDBJ 2025
☆ Creating Targeted, Interpretable Topic Models with LLM-Generated Text Augmentation
Unsupervised machine learning techniques, such as topic modeling and
clustering, are often used to identify latent patterns in unstructured text
data in fields such as political science and sociology. These methods overcome
common concerns about reproducibility and costliness involved in the
labor-intensive process of human qualitative analysis. However, two major
limitations of topic models are their interpretability and their practicality
for answering targeted, domain-specific social science research questions. In
this work, we investigate opportunities for using LLM-generated text
augmentation to improve the usefulness of topic modeling output. We use a
political science case study to evaluate our results in a domain-specific
application, and find that topic modeling using GPT-4 augmentations creates
highly interpretable categories that can be used to investigate domain-specific
research questions with minimal human guidance.
comment: Presented at IC2S2 2024 in Philadelphia, USA
☆ PicPersona-TOD : A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona NAACL 2025
Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests
through natural language interactions, yet existing systems often produce
generic, monotonic responses that lack individuality and fail to adapt to
users' personal attributes. To address this, we introduce PicPersona-TOD, a
novel dataset that incorporates user images as part of the persona, enabling
personalized responses tailored to user-specific factors such as age or
emotional context. This is facilitated by first impressions, dialogue
policy-guided prompting, and the use of external knowledge to reduce
hallucinations. Human evaluations confirm that our dataset enhances user
experience, with personalized responses contributing to a more engaging
interaction. Additionally, we introduce a new NLG model, Pictor, which not only
personalizes responses but also demonstrates robust performance across unseen
domains (https://github.com/JihyunLee1/PicPersona).
comment: Accepted in NAACL 2025 main
☆ LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams
Long-context understanding poses significant challenges in natural language
processing, particularly for real-world dialogues characterized by speech-based
elements, high redundancy, and uneven information density. Although large
language models (LLMs) achieve impressive results on existing benchmarks, these
datasets fail to reflect the complexities of such texts, limiting their
applicability to practical scenarios. To bridge this gap, we construct the
first spoken long-text dataset, derived from live streams, designed to reflect
the redundancy-rich and conversational nature of real-world scenarios. We
construct tasks in three categories: retrieval-dependent, reasoning-dependent,
and hybrid. We then evaluate both popular LLMs and specialized methods to
assess their ability to understand long contexts in these tasks. Our results
show that current methods exhibit strong task-specific preferences and perform
poorly on highly redundant inputs, with no single method consistently
outperforming others. We propose a new baseline that better handles redundancy
in spoken text and achieves strong performance across tasks. Our findings
highlight key limitations of current methods and suggest future directions for
improving long-context understanding. Finally, our benchmark fills a gap in
evaluating long-context spoken language understanding and provides a practical
foundation for developing real-world e-commerce systems. The code and benchmark
are available at https://github.com/Yarayx/livelongbench.
☆ TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
Soccer is a globally popular sporting event, typically characterized by long
matches and distinctive highlight moments. Recent advances in Multimodal Large
Language Models (MLLMs) offer promising capabilities in temporal grounding and
video understanding, yet soccer commentary generation requires precise temporal
localization and semantically rich descriptions over long-form video. Existing
soccer MLLMs often rely on temporal priors for caption generation and therefore
cannot process soccer videos end-to-end, while traditional two-step approaches
are complex, fail to capture the global context, and yield suboptimal
performance. To solve the
above issues, we present TimeSoccer, the first end-to-end soccer MLLM for
Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos.
TimeSoccer jointly predicts timestamps and generates captions in a single pass,
enabling global context modeling across 45-minute matches. To support long
video understanding of soccer matches, we introduce MoFA-Select, a
training-free, motion-aware frame compression module that adaptively selects
representative frames via a coarse-to-fine strategy, and incorporates
complementary training paradigms to strengthen the model's ability to handle
long temporal sequences. Extensive experiments demonstrate that our TimeSoccer
achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end
form, generating high-quality commentary with accurate temporal alignment and
strong semantic relevance.
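A coarse-to-fine, motion-aware frame selection of the kind MoFA-Select performs
might look like the following sketch; the frame-difference motion score and the
two-pass structure are our illustrative assumptions, not the released module:

    import numpy as np

    def select_frames(frames, budget, coarse_stride=8):
        # frames: (T, H, W, C) array; budget: number of frames to keep.
        f = frames.astype(np.float32)
        motion = np.abs(np.diff(f, axis=0)).mean(axis=(1, 2, 3))  # (T-1,)
        motion = np.concatenate([[motion[0]], motion])  # pad to length T
        coarse = np.arange(0, len(frames), coarse_stride)  # coarse candidates
        # fine pass: keep the highest-motion frames among the candidates
        keep = coarse[np.argsort(motion[coarse])[-budget:]]
        return np.sort(keep)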
☆ PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare
Fine-tuning of Large Language Models (LLMs) has become the default practice
for improving model performance on a given task. However, performance
improvement comes at the cost of training on vast amounts of annotated data
which could be sensitive, leading to significant data privacy concerns. In
particular, the healthcare domain is one of the most sensitive domains exposed
to data privacy issues. In this paper, we present PatientDx, a framework of
model merging that allows the design of effective LLMs for health-predictive
tasks without requiring fine-tuning or adaptation on patient data. Our
proposal is based on recently proposed techniques known as merging of LLMs and
aims to optimize a building block merging strategy. PatientDx uses a pivotal
model adapted to numerical reasoning and tunes hyperparameters on examples
based on a performance metric but without training of the LLM on these data.
Experiments using the mortality tasks of the MIMIC-IV dataset show improvements
up to 7% in terms of AUROC when compared to initial models. Additionally, we
confirm that, compared to fine-tuned models, our proposal is less prone to data
leakage without hurting performance. Finally, we qualitatively show the
capabilities of our proposal through a case study. Our best model is publicly
available at https://huggingface.co/Jgmorenof/mistral_merged_0_4.
☆ M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction
Chengguang Gan, Sunbowen Lee, Zhixi Cai, Yanbin Wei, Lei Zheng, Yunhao Liang, Shiwen Ni, Tatsunori Mori
Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection
of information extraction and model interpretability. MRE aims to leverage the
mutual understanding between tasks of different granularities, enhancing the
performance of both coarse-grained and fine-grained tasks through joint
modeling. While MRE has been explored and validated in the textual domain, its
applicability to visual and multimodal domains remains unexplored. In this
work, we extend MRE to the multimodal information extraction domain for the
first time. Specifically, we introduce a new task: Multimodal Mutual
Reinforcement Effect (M-MRE), and construct a corresponding dataset to support
this task. To address the challenges posed by M-MRE, we further propose a
Prompt Format Adapter (PFA) that is fully compatible with various Large
Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can
also be observed in the M-MRE task, a multimodal text-image understanding
scenario. This provides strong evidence that MRE facilitates mutual gains
across three interrelated tasks, confirming its generalizability beyond the
textual domain.
☆ Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection
In the digital era, social media has become a major conduit for information
dissemination, yet it also facilitates the rapid spread of misinformation.
Traditional misinformation detection methods primarily focus on surface-level
features, overlooking the crucial roles of human empathy in the propagation
process. To address this gap, we propose the Dual-Aspect Empathy Framework
(DAE), which integrates cognitive and emotional empathy to analyze
misinformation from both the creator and reader perspectives. By examining
creators' cognitive strategies and emotional appeals, as well as simulating
readers' cognitive judgments and emotional responses using Large Language
Models (LLMs), DAE offers a more comprehensive and human-centric approach to
misinformation detection. Moreover, we further introduce an empathy-aware
filtering mechanism to enhance response authenticity and diversity.
Experimental results on benchmark datasets demonstrate that DAE outperforms
existing methods, providing a novel paradigm for multimodal misinformation
detection.
☆ FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, Jey Han Lau
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic
robustness Evaluation), a task-agnostic framework for assessing model
robustness through systematic minimal variations of test data. FLUKE introduces
controlled variations across linguistic levels - from orthography to dialect
and style varieties - and leverages large language models (LLMs) with human
validation to generate modifications. We demonstrate FLUKE's utility by
evaluating both fine-tuned models and LLMs across four diverse NLP tasks, and
reveal that (1) the impact of linguistic variations is highly task-dependent,
with some tests being critical for certain tasks but irrelevant for others; (2)
while LLMs have better overall robustness compared to fine-tuned models, they
still exhibit significant brittleness to certain linguistic variations; (3) all
models show substantial vulnerability to negation modifications across most
tasks. These findings highlight the importance of systematic robustness testing
for understanding model behaviors.
☆ CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality ICLR 2025
Watermarking technology is a method used to trace the usage of content
generated by large language models. Sentence-level watermarking aids in
preserving the semantic integrity within individual sentences while maintaining
greater robustness. However, many existing sentence-level watermarking
techniques depend on arbitrary segmentation or generation processes to embed
watermarks, which can limit the availability of appropriate sentences. This
limitation, in turn, compromises the quality of the generated response. To
address the challenge of balancing high text quality with robust watermark
detection, we propose CoheMark, an advanced sentence-level watermarking
technique that exploits the cohesive relationships between sentences for better
logical fluency. The core methodology of CoheMark involves selecting sentences
through trained fuzzy c-means clustering and applying specific next sentence
selection criteria. Experimental evaluations demonstrate that CoheMark achieves
strong watermark strength while exerting minimal impact on text quality.
comment: Published at the 1st workshop on GenAI Watermarking, collocated with
ICLR 2025
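The fuzzy c-means machinery behind CoheMark can be illustrated as follows:
given trained cluster centers, each candidate next sentence receives soft
memberships, and a sentence whose dominant cluster lies in a key-selected
"allowed" set is emitted. The selection rule here is our simplified assumption
of how such a criterion could work, not the paper's exact criterion:

    import numpy as np

    def fcm_membership(x, centers, m=2.0):
        # Standard fuzzy c-means membership of embedding x to each center:
        # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); memberships sum to 1.
        d = np.linalg.norm(centers - x, axis=1) + 1e-12
        return 1.0 / np.sum((d[:, None] / d[None, :]) ** (2.0 / (m - 1.0)),
                            axis=1)

    def pick_sentence(cand_embs, centers, allowed):
        # Among candidate sentence embeddings, return the index of the one
        # with the strongest membership in a watermark-key "allowed" cluster.
        best, best_u = None, -1.0
        for i, x in enumerate(cand_embs):
            u = fcm_membership(x, centers)
            c = int(u.argmax())
            if c in allowed and u[c] > best_u:
                best, best_u = i, u[c]
        return best  # None if no candidate lands in an allowed cluster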
☆ Evaluating and Mitigating Bias in AI-Based Medical Text Generation
Artificial intelligence (AI) systems, particularly those based on deep
learning models, have increasingly achieved expert-level performance in medical
applications. However, there is growing concern that such AI systems may
reflect and amplify human bias, and reduce the quality of their performance in
historically under-served populations. The fairness issue has attracted
considerable research interest in the medical imaging classification field, yet
it remains understudied in the text generation domain. In this study, we
investigate the fairness problem in text generation within the medical field
and observe significant performance discrepancies across different races,
sexes, and age groups, including intersectional groups, various model scales,
and different evaluation metrics. To mitigate this fairness issue, we propose
an algorithm that selectively optimizes those underperformed groups to reduce
bias. The selection rules take into account not only word-level accuracy but
also pathology accuracy relative to the target reference, while ensuring that the
entire process remains fully differentiable for effective model training. Our
evaluations across multiple backbones, datasets, and modalities demonstrate
that our proposed algorithm enhances fairness in text generation without
compromising overall performance. Specifically, the disparities among various
groups across different metrics were diminished by more than 30% with our
algorithm, while the relative change in text generation accuracy was typically
within 2%. By reducing the bias generated by deep learning models, our proposed
approach can potentially alleviate concerns about the fairness and reliability
of diagnostic text generation in the medical domain.
Our code is publicly available to facilitate further research at
https://github.com/iriscxy/GenFair.
comment: 12 pages, 8 figures, published in Nature Computational Science
☆ JurisCTC: Enhancing Legal Judgment Prediction via Cross-Domain Transfer and Contrastive Learning IJCNN
In recent years, Unsupervised Domain Adaptation (UDA) has gained significant
attention in the field of Natural Language Processing (NLP) owing to its
ability to enhance model generalization across diverse domains. However, its
application for knowledge transfer between distinct legal domains remains
largely unexplored. To address the challenges posed by lengthy and complex
legal texts and the limited availability of large-scale annotated datasets, we
propose JurisCTC, a novel model designed to improve the accuracy of Legal
Judgment Prediction (LJP) tasks. Unlike existing approaches, JurisCTC
facilitates effective knowledge transfer across various legal domains and
employs contrastive learning to distinguish samples from different domains.
Specifically, for the LJP task, we enable knowledge transfer between civil and
criminal law domains. Compared to other models and specific large language
models (LLMs), JurisCTC demonstrates notable advancements, achieving peak
accuracies of 76.59% and 78.83%, respectively.
comment: Accepted in International Joint Conference on Neural Networks (IJCNN)
2025
☆ Low-Resource Neural Machine Translation Using Recurrent Neural Networks and Transfer Learning: A Case Study on English-to-Igbo
In this study, we develop Neural Machine Translation (NMT) and
Transformer-based transfer learning models for English-to-Igbo translation - a
low-resource African language spoken by over 40 million people across Nigeria
and West Africa. Our models are trained on a curated and benchmarked dataset
compiled from Bible corpora, local news, Wikipedia articles, and Common Crawl,
all verified by native language experts. We leverage Recurrent Neural Network
(RNN) architectures, including Long Short-Term Memory (LSTM) and Gated
Recurrent Units (GRU), enhanced with attention mechanisms to improve
translation accuracy. To further enhance performance, we apply transfer
learning using MarianNMT pre-trained models within the SimpleTransformers
framework. Our RNN-based system achieves competitive results, closely matching
existing English-Igbo benchmarks. With transfer learning, we observe a
performance gain of +4.83 BLEU points, reaching an estimated translation
accuracy of 70%. These findings highlight the effectiveness of combining RNNs
with transfer learning to address the performance gap in low-resource language
translation tasks.
comment: 25 pages, 14 combined figures (19 total), includes horizontal
layouts. Submitted to arXiv for open access
☆ Crisp: Cognitive Restructuring of Negative Thoughts through Multi-turn Supportive Dialogues
Jinfeng Zhou, Yuxuan Chen, Jianing Yin, Yongkang Huang, Yihan Shi, Xikun Zhang, Libiao Peng, Rongsheng Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang
Cognitive Restructuring (CR) is a psychotherapeutic process aimed at
identifying and restructuring an individual's negative thoughts, arising from
mental health challenges, into more helpful and positive ones via multi-turn
dialogues. Clinician shortages and stigma motivate the development of human-LLM
interactive psychotherapy for CR. Yet, existing efforts implement CR via simple
text rewriting, fixed-pattern dialogues, or a one-shot CR workflow, failing to
align with the psychotherapeutic process for effective CR. To address this gap,
we propose CRDial, a novel framework for CR, which creates multi-turn dialogues
with specifically designed identification and restructuring stages of negative
thoughts, integrates sentence-level supportive conversation strategies, and
adopts a multi-channel loop mechanism to enable iterative CR. With CRDial, we
distill Crisp, a large-scale and high-quality bilingual dialogue dataset, from
LLM. We then train Crispers, Crisp-based conversational LLMs for CR, at 7B and
14B scales. Extensive human studies show the superiority of Crispers in
pointwise, pairwise, and intervention evaluations.
☆ Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?
LLMs are increasingly explored for bundle generation, thanks to their
reasoning capabilities and knowledge. However, deploying large-scale LLMs
introduces significant efficiency challenges, primarily high computational
costs during fine-tuning and inference due to their massive parameterization.
Knowledge distillation (KD) offers a promising solution, transferring expertise
from large teacher models to compact student models. This study systematically
investigates knowledge distillation approaches for bundle generation, aiming to
minimize computational demands while preserving performance. We explore three
critical research questions: (1) how does the format of KD impact bundle
generation performance? (2) to what extent does the quantity of distilled
knowledge influence performance? and (3) how do different ways of utilizing the
distilled knowledge affect performance? We propose a comprehensive KD framework
that (i) progressively extracts knowledge (patterns, rules, deep thoughts);
(ii) captures varying quantities of distilled knowledge through different
strategies; and (iii) exploits complementary LLM adaptation techniques
(in-context learning, supervised fine-tuning, combination) to leverage
distilled knowledge in small student models for domain-specific adaptation and
enhanced efficiency. Extensive experiments provide valuable insights into how
knowledge format, quantity, and utilization methodologies collectively shape
LLM-based bundle generation performance, exhibiting KD's significant potential
for more efficient yet effective LLM-based bundle generation.
☆ A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
Yangxinyu Xie, Bowen Jiang, Tanwi Mallick, Joshua David Bergerson, John K. Hutchison, Duane R. Verner, Jordan Branham, M. Ross Alexander, Robert B. Ross, Yan Feng, Leslie-Anne Levy, Weijie Su, Camillo J. Taylor
Large language models (LLMs) are a transformational capability at the
frontier of artificial intelligence and machine learning that can support
decision-makers in addressing pressing societal challenges such as extreme
natural hazard events. As generalized models, LLMs often struggle to provide
context-specific information, particularly in areas requiring specialized
knowledge. In this work we propose a retrieval-augmented generation (RAG)-based
multi-agent LLM system to support analysis and decision-making in the context
of natural hazards and extreme weather events. As a proof of concept, we
present WildfireGPT, a specialized system focused on wildfire hazards. The
architecture employs a user-centered, multi-agent design to deliver tailored
risk insights across diverse stakeholder groups. By integrating natural hazard
and extreme weather projection data, observational datasets, and scientific
literature through a RAG framework, the system ensures both the accuracy and
contextual relevance of the information it provides. Evaluation across ten
expert-led case studies demonstrates that WildfireGPT significantly outperforms
existing LLM-based solutions for decision support.
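At its core, the RAG loop grounds each answer in retrieved passages before
generation; here is a minimal sketch of generic cosine-similarity retrieval
and prompt assembly, not WildfireGPT's multi-agent architecture:

    import numpy as np

    def retrieve(query_vec, doc_vecs, docs, k=3):
        # Cosine-similarity retrieval over an embedded document store.
        sims = doc_vecs @ query_vec / (
            np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
            + 1e-12)
        top = np.argsort(sims)[-k:][::-1]
        return [docs[i] for i in top]

    def build_prompt(question, passages):
        # Constrain the LLM to the retrieved hazard-projection context.
        context = "\n\n".join(passages)
        return ("Answer using only the context below.\n\n"
                f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")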
☆ Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Despite the rapid growth of machine learning research, corresponding code
implementations are often unavailable, making it slow and labor-intensive for
researchers to reproduce results and build upon prior work. In the meantime,
recent Large Language Models (LLMs) excel at understanding scientific documents
and generating high-quality code. Inspired by this, we introduce PaperCoder, a
multi-agent LLM framework that transforms machine learning papers into
functional code repositories. PaperCoder operates in three stages: planning,
where it constructs a high-level roadmap, designs the system architecture with
diagrams, identifies file dependencies, and generates configuration files;
analysis, which focuses on interpreting implementation-specific details; and
generation, where modular, dependency-aware code is produced. Moreover, each
phase is instantiated through a set of specialized agents designed to
collaborate effectively across the pipeline. We then evaluate PaperCoder on
generating code implementations from machine learning papers based on both
model-based and human evaluations, specifically from the original paper
authors, with author-released repositories as ground truth if available. Our
results demonstrate the effectiveness of PaperCoder in creating high-quality,
faithful implementations. Furthermore, it consistently shows strengths in the
recently released PaperBench benchmark, surpassing strong baselines by
substantial margins.
♻ ☆ jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, Han Xiao
Contrastive Language-Image Pretraining (CLIP) has been widely used for
crossmodal information retrieval and multimodal understanding tasks. However,
CLIP models are mainly optimized for crossmodal vision-language tasks and
underperform on text-only tasks. Moreover, these models are often
trained on English datasets and therefore lack multilingual understanding.
Additionally, from a visual understanding perspective, previous CLIP-based
models exhibit insufficient understanding of visually rich documents. In this
work, we propose jina-clip-v2, a contrastive vision-language model trained on
text pairs, triplets and image-text pairs via a multi-task and multi-stage
contrastive learning paradigm in order to support both text-only and crossmodal
tasks. We employ a multilingual text encoder and expand the training dataset to
include multilingual texts from 29 non-English languages, including Hindi,
Chinese, German, French, and others, as well as images of visually rich
documents. We evaluate the model's performance and show that jina-clip-v2
achieves notable improvements over state-of-the-art CLIP-based models in
zero-shot text-only retrieval, semantic textual similarity, and crossmodal
retrieval tasks in both English and multilingual settings. jina-clip-v2 also
provides flexibility in embedding dimensionality, enabling users to select
the granularity of the representations. jina-clip-v2 is publicly available at
https://huggingface.co/jinaai/jina-clip-v2.
comment: 30 pages, 1-10 main paper, 10-12 refs, 12-30 benchmarks
♻ ☆ CallNavi: A Challenge and Empirical Study on LLM Function Calling and Routing
Yewei Song, Xunzhu Tang, Cedric Lothritz, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Andrey Boytsov, Ulrick Ble, Anne Goujon
API-driven chatbot systems are increasingly integral to software engineering
applications, yet their effectiveness hinges on accurately generating and
executing API calls. This is particularly challenging in scenarios requiring
multi-step interactions with complex parameterization and nested API
dependencies. Addressing these challenges, this work contributes to the
evaluation and assessment of AI-based software development through three key
advancements: (1) the introduction of a novel dataset specifically designed for
benchmarking API function selection, parameter generation, and nested API
execution; (2) an empirical evaluation of state-of-the-art language models,
analyzing their performance across varying task complexities in API function
generation and parameter accuracy; and (3) a hybrid approach to API routing,
combining general-purpose large language models for API selection with
fine-tuned models and prompt engineering for parameter generation. These
innovations significantly improve API execution in chatbot systems, offering
practical methodologies for enhancing software design, testing, and operational
workflows in real-world software engineering contexts.
♻ ☆ Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse ICLR 2025
LLMs are an integral component of retrieval-augmented generation (RAG)
systems. While many studies focus on evaluating the overall quality of
end-to-end RAG systems, there is a gap in understanding the appropriateness of
LLMs for the RAG task. To address this, we introduce Trust-Score, a holistic
metric that evaluates the trustworthiness of LLMs within the RAG framework. Our
results show that various prompting methods, such as in-context learning, fail
to effectively adapt LLMs to the RAG task as measured by Trust-Score.
Consequently, we propose Trust-Align, a method to align LLMs for improved
Trust-Score performance. 26 out of 27 models aligned using Trust-Align
substantially outperform competitive baselines on ASQA, QAMPARI, and ELI5.
Specifically, in LLaMA-3-8b, Trust-Align outperforms FRONT on ASQA (up 12.56),
QAMPARI (up 36.04), and ELI5 (up 17.69). Trust-Align also significantly
enhances models' ability to correctly refuse and provide quality citations. We
also demonstrate the effectiveness of Trust-Align across different open-weight
models, including the LLaMA series (1b to 8b), Qwen-2.5 series (0.5b to 7b),
and Phi3.5 (3.8b). We release our code at
https://github.com/declare-lab/trust-align.
comment: Published at ICLR 2025 (Oral)
♻ ☆ Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen
Vision language models (VLMs) demonstrate strong capabilities in jointly
processing visual and textual data. However, they often incur substantial
computational overhead due to redundant visual information, particularly in
long-form video scenarios. Existing approaches predominantly focus on either
vision token pruning, which may overlook spatio-temporal dependencies, or
keyframe selection, which identifies informative frames but discards others,
thus disrupting contextual continuity. In this work, we propose KVTP
(Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the
drawbacks of token pruning and keyframe selection. By adaptively assigning
pruning rates based on frame relevance to the query, KVTP effectively retains
essential contextual information while significantly reducing redundant
computation. To thoroughly evaluate the long-form video understanding
capacities of VLMs, we curated and reorganized subsets from VideoMME,
EgoSchema, and NextQA into a unified benchmark named SparseKV-QA that
highlights real-world scenarios with sparse but crucial events. Our experiments
with VLMs of various scales show that KVTP can reduce token usage by 80%
without compromising spatiotemporal and contextual consistency, significantly
cutting computation while maintaining performance. These results
demonstrate our approach's effectiveness in efficient long-video processing,
facilitating more scalable VLM deployment.
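The adaptive allocation idea, in contrast to all-or-nothing keyframe
selection, can be sketched as a relevance-proportional token budget per frame;
this is our illustration, and the relevance scores (e.g. query-frame
similarity) are an assumption:

    import numpy as np

    def per_frame_token_budget(relevance, tokens_per_frame, total_budget):
        # relevance: (T,) non-negative query-relevance score per frame.
        # Allocate the global token budget proportionally, keeping at least
        # one token per frame so contextual continuity is preserved.
        # Note: rounding can slightly over- or undershoot total_budget.
        w = relevance / relevance.sum()
        keep = np.maximum(1, np.round(w * total_budget)).astype(int)
        return np.minimum(keep, tokens_per_frame)

    # then, within frame t, retain the keep[t] highest-scoring vision tokens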
♻ ☆ ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for
empowering large language model (LLM) applications. Compared with the
supervised training process of LLMs, the RLHF training process is much more
sophisticated, requiring a diverse range of computation workloads with
intricate dependencies between multiple LLM instances. Therefore, simply
adopting the fixed parallelization strategies from supervised training for LLMs
can be insufficient for RLHF and result in low training efficiency. To overcome
this limitation, we propose a novel technique named parameter ReaLlocation,
which dynamically adapts the parallelization strategies for different workloads
during training by redistributing LLM parameters across the training cluster.
Building upon this idea, we introduce ReaL, a pioneering system for efficient
RLHF training. ReaL introduces the concept of an execution plan, which defines
a fine-grained resource allocation and parallelization strategy particularly
designed for RLHF training. Based on this concept, ReaL employs a tailored
search algorithm with a lightweight run-time estimator to automatically
discover an efficient execution plan for an instance of RLHF experiment.
Subsequently, the runtime engine deploys the selected plan by effectively
parallelizing computations and redistributing parameters. We evaluate ReaL on
the LLaMA models with up to 70 billion parameters and 128 GPUs. The
experimental results demonstrate that ReaL achieves speedups of up to
$3.58\times$ compared to baseline methods. Furthermore, the execution plans
generated by ReaL exhibit an average of $81\%$ performance improvement over
heuristic approaches based on Megatron-LM in the long-context scenario. The
source code of ReaL is publicly available at
https://github.com/openpsi-project/ReaLHF .
comment: 11 pages (20 pages with references and the appendix), 17 figures.
Accepted by MLSys 25
♻ ☆ Not All Data Are Unlearned Equally
Machine unlearning is concerned with the task of removing knowledge learned
from particular data points from a trained model. In the context of large
language models (LLMs), unlearning has recently received increased attention,
particularly for removing knowledge about named entities from models for
privacy purposes. While various approaches have been proposed to address the
unlearning problem, most existing approaches treat all data points to be
unlearned equally, i.e., unlearning that Montreal is a city in Canada is
treated exactly the same as unlearning the phone number of the first author of
this paper. In this work, we show that this "all data is equal" assumption does
not hold for LLM unlearning. We study how the success of unlearning depends on
the frequency of the knowledge we want to unlearn in the pre-training data of a
model and find that frequency strongly affects unlearning, i.e., more frequent
knowledge is harder to unlearn. Additionally, we uncover a misalignment between
probability and generation-based evaluations of unlearning and show that this
problem worsens as models become larger. Overall, our experiments highlight the
need for better evaluation practices and novel methods for LLM unlearning that
take the training data of models into account.
♻ ☆ Context-Aware Neural Gradient Mapping for Fine-Grained Instruction Processing
The integration of contextual embeddings into the optimization processes of
large language models is an advancement in natural language processing. The
Context-Aware Neural Gradient Mapping framework introduces a dynamic gradient
adjustment mechanism, incorporating contextual embeddings directly into the
optimization process. This approach facilitates real-time parameter
adjustments, enhancing task-specific generalization even in the presence of
sparse or noisy data inputs. The mathematical foundation of this framework
relies on gradient descent modifications, where contextual embeddings are
derived from a supplementary neural network trained to map input features to
optimal adaptation gradients. By employing differential geometry principles,
high-dimensional input dependencies are encoded into low-dimensional gradient
manifolds, enabling efficient adaptation without necessitating the retraining
of the entire model. Empirical evaluations demonstrate that the proposed
framework consistently outperforms baseline models across various metrics,
including accuracy, robustness to noise, and computational efficiency. The
integration of context-specific embeddings allows for a more complex
understanding of language, thereby improving the model's ability to handle
diverse linguistic phenomena. Furthermore, the computational efficiency
achieved through this method demonstrates its scalability for large-scale
language models operating under diverse constraints.
comment: arXiv admin note: This paper has been withdrawn by arXiv due to
disputed and unverifiable authorship
♻ ☆ Probabilistic Subspace Manifolds for Contextual Inference in Large Language Models
Christopher Nightingale, Dominic Lavington, Jonathan Thistlethwaite, Sebastian Penhaligon, Thomas Belinski, David Boldo
Representing token embeddings as probability distributions over learned
manifolds allows for more flexible contextual inference, reducing
representational rigidity while enhancing semantic granularity. Comparative
evaluations demonstrate that probabilistic embeddings improve neighborhood
consistency and decrease redundancy, ensuring that token relationships remain
more structurally coherent across fine-tuning iterations. The integration of
probabilistic subspaces within attention mechanisms facilitates more adaptive
contextual weighting, enabling models to capture latent dependencies that would
otherwise be obscured in conventional embeddings. Experimental results
highlight increased robustness against adversarial modifications, with
probabilistic embeddings preserving contextual integrity even under
perturbation-based evaluation scenarios. Performance assessments indicate that
probabilistic representations achieve greater adaptability in domain-specific
applications, mitigating the need for extensive retraining when shifting across
linguistic domains. Computational trade-offs remain within operationally
feasible limits, with marginal increases in inference latency balanced against
the benefits of enhanced representation stability and contextual
expressiveness. The capacity to encode structured uncertainty provides
advantages in generative modeling tasks, particularly where maintaining
coherence across extended sequences requires a representation framework capable
of handling ambiguous or context-dependent linguistic constructs.
comment: arXiv admin note: This paper has been withdrawn by arXiv due to
disputed and unverifiable authorship
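Token embeddings as probability distributions are most commonly realized as
per-token diagonal Gaussians sampled via reparameterization; the abstract does
not specify its construction, so the sketch below is only one plausible
reading, with all names ours.

    import torch
    import torch.nn as nn

    class GaussianEmbedding(nn.Module):
        """Each token is a diagonal Gaussian; lookups draw one reparameterized sample."""
        def __init__(self, vocab_size: int, dim: int):
            super().__init__()
            self.mu = nn.Embedding(vocab_size, dim)
            self.log_sigma = nn.Embedding(vocab_size, dim)
            nn.init.constant_(self.log_sigma.weight, -3.0)   # start near-deterministic

        def forward(self, token_ids):
            mu = self.mu(token_ids)
            sigma = self.log_sigma(token_ids).exp()
            return mu + sigma * torch.randn_like(mu)         # stochastic embedding

    emb = GaussianEmbedding(vocab_size=50_000, dim=256)
    x = emb(torch.tensor([[1, 42, 7]]))                      # shape (1, 3, 256)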
♻ ☆ Transferable text data distillation by trajectory matching
In the realm of large language models (LLMs), growing model sizes bring
higher training costs, creating an urgent need to minimize the amount of
training data. Compared with data selection methods, data distillation aims to
synthesize a small number of data samples that achieve the training effect of
the full dataset, and offers better flexibility. Despite its successes in
computer vision, the discreteness of text data has hitherto stymied its
exploration in natural language processing (NLP). In this work, we propose a
method that learns pseudo prompt data based on trajectory matching and maps
each learned vector to its nearest-neighbor token ID to achieve
cross-architecture transfer. During the distillation process, we introduce a
regularization loss to improve the robustness of our distilled data. To the
best of our knowledge, this is the first data distillation work suitable for
text generation tasks such as instruction tuning. Evaluations on two
benchmarks, the ARC-Easy and MMLU instruction-tuning datasets, establish the
superiority of our distillation approach over the SOTA data selection method
LESS. Furthermore, our method demonstrates good transferability across LLM
architectures (i.e., from OPT to Llama).
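The nearest-neighbor ID step can be sketched concisely: each learned
pseudo-prompt vector is snapped to the most similar vocabulary embedding,
yielding hard token IDs that any architecture can consume. The
trajectory-matching optimization itself is omitted, and the function names
below are ours, not the authors'.

    import torch
    import torch.nn.functional as F

    def nearest_token_ids(pseudo_emb, vocab_emb):
        """pseudo_emb: (seq, dim) learned vectors; vocab_emb: (V, dim) embedding table."""
        p = F.normalize(pseudo_emb, dim=-1)
        v = F.normalize(vocab_emb, dim=-1)
        return (p @ v.T).argmax(dim=-1)   # (seq,) discrete IDs usable by any model

    ids = nearest_token_ids(torch.randn(16, 512), torch.randn(32_000, 512))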
♻ ☆ Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book? ICLR 2025
Extremely low-resource (XLR) languages lack substantial corpora for training
NLP models, motivating the use of all available resources such as dictionaries
and grammar books. Machine Translation from One Book (Tanzer et al., 2024)
suggests that prompting long-context LLMs with one grammar book enables
English-Kalamang translation, an XLR language unseen by LLMs - a noteworthy
case of linguistics helping an NLP task. We investigate the source of this
translation ability, finding almost all improvements stem from the book's
parallel examples rather than its grammatical explanations. We find similar
results for Nepali and Guarani, low-resource languages seen in pretraining, and we achieve
performance comparable to an LLM with a grammar book by simply fine-tuning an
encoder-decoder translation model. We then investigate where grammar books help
by testing two linguistic tasks, grammaticality judgment and gloss prediction,
and we explore what kind of grammatical knowledge helps by introducing a
typological feature prompt that achieves leading results on these more relevant
tasks. We thus emphasise the importance of task-appropriate data for XLR
languages: parallel examples for translation, and grammatical data for
linguistic tasks. As we find no evidence that long-context LLMs can make
effective use of grammatical explanations for XLR translation, we conclude that
data collection for multilingual XLR tasks such as translation is best focused
on parallel data rather than linguistic description.
comment: Accepted at ICLR 2025 (Spotlight)
♻ ☆ OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure TACL
Autoregressive language models demonstrate excellent performance in various
scenarios. However, their inference efficiency is limited by the
one-token-per-step generation mode, which has become a pressing problem as
models grow increasingly large. Speculative decoding employs a "draft and then
verify" mechanism that allows multiple tokens to be generated in one step,
realizing lossless acceleration. Existing methods mainly adopt fixed heuristic
draft structures, which fail to adapt to different situations to maximize the
acceptance length during verification. To alleviate this dilemma, we propose
OPT-Tree, an algorithm to construct adaptive and scalable draft trees. It
searches for the tree structure that maximizes the mathematical expectation of
the acceptance length in each decoding step. Experimental results reveal that
OPT-Tree outperforms existing draft structures and achieves a speed-up ratio of
up to 3.2 compared with autoregressive decoding.
If the draft model is powerful enough and the node budget is sufficient, it can
generate more than ten tokens in a single step. Our code is available at
https://github.com/Jikai0Wang/OPT-Tree.
comment: Published in TACL
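The tree search can be pictured as follows: since the expected acceptance
length of a draft tree equals the sum of its nodes' cumulative path
probabilities, greedily keeping the highest-probability paths under a node
budget is a natural construction. The sketch below is our simplification of
that idea, not the authors' exact algorithm.

    import heapq

    def greedy_draft_tree(root_probs, expand, budget):
        """root_probs: {token: prob} from the draft model at the root.
        expand(path) -> {token: prob} continuations for a path (tuple of tokens)."""
        heap = [(-p, (tok,)) for tok, p in root_probs.items()]
        heapq.heapify(heap)
        tree, expected_len = [], 0.0
        while heap and len(tree) < budget:
            neg_p, path = heapq.heappop(heap)      # most probable remaining path
            tree.append(path)
            expected_len += -neg_p                 # node contributes its path prob
            for tok, p in expand(path).items():
                heapq.heappush(heap, (neg_p * p, path + (tok,)))
        return tree, expected_len

    tree, e = greedy_draft_tree({"a": 0.6, "b": 0.3},
                                lambda path: {"x": 0.5, "y": 0.4}, budget=4)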
♻ ☆ SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection SemEval2025
Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Seid Muhie Yimam, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine De Kock, Tadesse Destaw Belay, Ibrahim Said Ahmad, Nirmal Surange, Daniela Teodorescu, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino Ali, Vladimir Araujo, Abinew Ali Ayele, Oana Ignat, Alexander Panchenko, Yi Zhou, Saif M. Mohammad
We present our shared task on text-based emotion detection, covering more
than 30 languages from seven distinct language families. These languages are
predominantly low-resource and are spoken across various continents. The data
instances are multi-labeled with six emotional classes, with additional
datasets in 11 languages annotated for emotion intensity. Participants were
asked to predict labels in three tracks: (a) multilabel emotion detection, (b)
emotion intensity score detection, and (c) cross-lingual emotion detection.
The task attracted over 700 participants. We received final submissions from
more than 200 teams and 93 system description papers. We report baseline
results, along with findings on the best-performing systems, the most common
approaches, and the most effective methods across different tracks and
languages. The datasets for this task are publicly available at
https://brighter-dataset.github.io
comment: SemEval2025 Task11 (Task Description Paper). arXiv admin note: text
overlap with arXiv:2502.11926
♻ ☆ Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Multimodal language analysis is a rapidly evolving field that leverages
multiple modalities to enhance the understanding of high-level semantics
underlying human conversational utterances. Despite its significance, little
research has investigated the capability of multimodal large language models
(MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce
MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA
comprises over 61K multimodal utterances drawn from both staged and real-world
scenarios, covering six core dimensions of multimodal semantics: intent,
emotion, dialogue act, sentiment, speaking style, and communication behavior.
We evaluate eight mainstream branches of LLMs and MLLMs using three methods:
zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive
experiments reveal that even fine-tuned models achieve only about 60-70%
accuracy, underscoring the limitations of current MLLMs in understanding
complex human language. We believe that MMLA will serve as a solid foundation
for exploring the potential of large language models in multimodal language
analysis and provide valuable resources to advance this field. The datasets and
code are open-sourced at https://github.com/thuiar/MMLA.
comment: 23 pages, 5 figures
♻ ☆ From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
Jiliang Ni, Jiachen Pu, Zhongyi Yang, Kun Zhou, Hui Wang, Xiaoliang Xiao, Dakui Wang, Xin Li, Jingfeng Luo, Conggang Hu
In recent years, Large Language Models (LLMs) have significantly advanced
artificial intelligence by optimizing traditional Natural Language Processing
(NLP) pipelines, improving performance and generalization. This has spurred
their integration into various systems. Many NLP systems, including ours,
employ a "one-stage" pipeline directly incorporating LLMs. While effective,
this approach incurs substantial costs and latency due to the need for large
model parameters to achieve satisfactory outcomes. This paper introduces a
three-stage, cost-efficient, end-to-end LLM deployment pipeline, comprising
prototyping, knowledge transfer, and model compression, to tackle the
cost-performance dilemma in LLM-based frameworks. Our approach yields a
super-tiny model optimized for cost and performance in online systems,
simplifying the system architecture. First, by transforming complex tasks into
a function-call-based, LLM-driven pipeline, we construct a prototype system
with optimal performance that serves as a teacher model and produces
high-quality data. The
second stage combines techniques like rejection fine-tuning, reinforcement
learning, and knowledge distillation to transfer knowledge to a smaller 0.5B
student model, delivering effective performance at minimal cost. The final
stage applies quantization and pruning to compress the model further, to 0.4B
parameters, achieving ultra-low latency and cost. The framework's modular
design and
cross-domain capabilities suggest potential applicability in other NLP areas.
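The knowledge-transfer stage relies in part on knowledge distillation; a
standard token-level distillation loss with temperature-softened teacher
targets is sketched below as one plausible ingredient. The paper's full recipe
also includes rejection fine-tuning and reinforcement learning, which are not
shown.

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, T: float = 2.0):
        """KL divergence between temperature-softened teacher and student."""
        s = F.log_softmax(student_logits / T, dim=-1)
        t = F.softmax(teacher_logits / T, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * T * T   # T^2 keeps grad scale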
♻ ☆ Synthetic Lyrics Detection Across Languages and Genres NAACL 2025
In recent years, the use of large language models (LLMs) to generate music
content, particularly lyrics, has gained in popularity. These advances provide
valuable tools for artists and enhance their creative processes, but they also
raise concerns about copyright violations, consumer satisfaction, and content
spamming. Previous research has explored content detection in various domains.
However, no prior work has focused on lyrics, the textual modality of music. To address
this gap, we curated a diverse dataset of real and synthetic lyrics from
multiple languages, music genres, and artists. The generation pipeline was
validated using both humans and automated methods. We performed a thorough
evaluation of existing synthetic text detection approaches on lyrics, a
previously unexplored data type. We also investigated methods to adapt the
best-performing features to lyrics through unsupervised domain adaptation.
Guided by both musical and industry constraints, we examined how well these
approaches generalize across languages, scale with data availability, handle
multilingual content, and perform on novel genres in few-shot
settings. Our findings show promising results that could inform policy
decisions around AI-generated music and enhance transparency for users.
comment: Published in the TrustNLP Workshop at NAACL 2025
♻ ☆ Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Large models, as predicted by scaling-law forecasts, have made
groundbreaking progress in many fields, particularly in natural language
generation tasks, where they have approached or even surpassed human levels.
However, the unprecedented scale of their parameters brings significant
computational and storage costs. These large models require substantial
computational resources and GPU memory to operate. When adapting large models
to specific downstream tasks, their massive parameter scale poses a significant
challenge in fine-tuning on hardware platforms with limited computational power
and GPU memory. To address this issue, Parameter-Efficient Fine-Tuning (PEFT)
offers a practical solution by efficiently adjusting the parameters of large
pre-trained models to suit various downstream tasks. Specifically, PEFT adjusts
the parameters of pre-trained large models to adapt to specific tasks or
domains, minimizing the introduction of additional parameters and the
computational resources required. This review mainly introduces the preliminary
knowledge of PEFT, the core ideas and principles of various PEFT algorithms,
the applications of PEFT, and potential future research directions. We believe
this review will help interested readers quickly grasp the PEFT methodology,
thereby accelerating its development and innovation.
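As a concrete instance of the methods such a survey covers, LoRA freezes the
pre-trained weight and learns a low-rank update, so only r x (d_in + d_out)
parameters are trained per layer. A minimal sketch:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wrap a frozen nn.Linear with a trainable low-rank update B @ A."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)        # pre-trained weights stay frozen
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(768, 768))    # ~12K trainable params vs ~590K frozen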
♻ ☆ Automatically Evaluating the Paper Reviewing Capability of Large Language Models
Hyungyu Shin, Jingyu Tang, Yoonjoo Lee, Nayoung Kim, Hyunseung Lim, Ji Yong Cho, Hwajung Hong, Moontae Lee, Juho Kim
Peer review is essential for scientific progress, but it faces challenges
such as reviewer shortages and growing workloads. Although Large Language
Models (LLMs) show potential for providing assistance, research has reported
significant limitations in the reviews they generate. While the insights are
valuable, conducting the analysis is challenging due to the considerable time
and effort required, especially given the rapid pace of LLM developments. To
address the challenge, we developed an automatic evaluation pipeline to assess
the LLMs' paper review capability by comparing them with expert-generated
reviews. By constructing a dataset of 676 OpenReview papers, we examined the
agreement between LLMs and experts on the strengths and weaknesses they
identify. The results showed that LLMs lack balanced perspectives,
significantly overlook novelty assessment when criticizing, and produce poor
acceptance decisions. Our automated pipeline enables a scalable evaluation of
LLMs' paper review capability over time.
♻ ☆ Multilingual State Space Models for Structured Question Answering in Indic Languages NAACL
The diversity and complexity of Indic languages present unique challenges for
natural language processing (NLP) tasks, particularly in the domain of question
answering (QA). To address these challenges, this paper explores the
application of State Space Models (SSMs) to build efficient and contextually
aware QA
systems tailored for Indic languages. SSMs are particularly suited for this
task due to their ability to model long-term and short-term dependencies in
sequential data, making them well-equipped to handle the rich morphology,
complex syntax, and contextual intricacies characteristic of Indian languages.
We evaluated multiple SSM architectures across diverse datasets representing
various Indic languages and conducted a comparative analysis of their
performance. Our results demonstrate that these models effectively capture
linguistic subtleties, leading to significant improvements in question
interpretation, context alignment, and answer generation. This work represents
the first application of SSMs to question answering tasks in Indic languages,
establishing a foundational benchmark for future research in this domain. We
propose enhancements to existing SSM frameworks, optimizing their applicability
to low-resource settings and multilingual scenarios prevalent in Indic
languages.
comment: Accepted at NAACL
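At their core, SSMs model sequences with a linear recurrence, h_t = A h_{t-1}
+ B x_t and y_t = C h_t; production architectures such as S4 and Mamba add
structured parameterizations and gating. The toy sketch below shows only this
core mechanism and is not the paper's model.

    import torch

    def ssm_scan(x, A, B, C):
        """x: (seq, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
        h = torch.zeros(A.shape[0])
        ys = []
        for x_t in x:                  # recurrence is linear in sequence length
            h = A @ h + B @ x_t        # state carries short- and long-range context
            ys.append(C @ h)
        return torch.stack(ys)

    y = ssm_scan(torch.randn(10, 4), torch.eye(8) * 0.9,
                 torch.randn(8, 4), torch.randn(2, 8))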
♻ ☆ Efficient Pretraining Length Scaling
Recent advances in large language models have demonstrated the effectiveness
of length scaling during post-training, yet its potential in pre-training
remains underexplored. We present the Parallel Hidden Decoding Transformer
(PHD-Transformer), a novel framework that enables efficient length scaling
during pre-training while maintaining inference efficiency. PHD-Transformer
achieves this through an innovative KV cache management strategy that
distinguishes between original tokens and hidden decoding tokens. By retaining
only the KV cache of original tokens for long-range dependencies while
immediately discarding hidden decoding tokens after use, our approach maintains
the same KV cache size as the vanilla transformer while enabling effective
length scaling. To further enhance performance, we introduce two optimized
variants: PHD-SWA employs sliding window attention to preserve local
dependencies, while PHD-CSWA implements chunk-wise sliding window attention to
eliminate linear growth in pre-filling time. Extensive experiments demonstrate
consistent improvements across multiple benchmarks.
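A hedged sketch of the cache policy as described: keys and values of original
tokens are retained for long-range attention, while those of hidden decoding
tokens are used once and discarded, so the cache never exceeds
vanilla-transformer size. The interface below is our invention; the abstract
gives no implementation details.

    import torch

    class PHDCache:
        """Cache keys/values of original tokens only; hidden tokens attend then vanish."""
        def __init__(self):
            self.keys, self.values = [], []

        def attend(self, q, k, v, is_hidden: bool):
            K = torch.stack(self.keys + [k])
            V = torch.stack(self.values + [v])
            w = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
            if not is_hidden:                  # hidden decoding tokens are never cached
                self.keys.append(k)
                self.values.append(v)
            return w @ V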
♻ ☆ LaMsS: When Large Language Models Meet Self-Skepticism ICLR 2025
Hallucination is a major challenge for large language models (LLMs),
preventing their further application in some fields. Human-style skeptical
thinking could help LLMs with self-cognition and self-reflection and alleviate
their hallucinations. Inspired by this consideration, we propose a novel
approach called LaMsS, which combines the semantic understanding capability of
LLMs with self-skepticism. By introducing a series of skepticism tokens and
augmenting them into the vocabulary, we conduct both pretraining and
finetuning, which allows the LLM to decode each normal token followed by a
skeptical token representing its skepticism level. By calculating the response
skepticism given a query, one can define a self-aware LLM that is only willing
to answer when its skepticism level is lower than a threshold. By examining the
accuracy, AUC, and AP of willingly answered questions, we demonstrate that
LaMsS achieves better performance than baselines on both multiple-choice and
open-domain question-answering benchmarks, and can generalize to multi-task and
out-of-domain settings. Our study sheds light on self-skepticism modeling for
future artificial intelligence. Project code and model checkpoints can be found
at
https://anonymous.4open.science/r/SM-1E76.
comment: 11 pages, 6 figures, Published at ICLR 2025 Workshop on Scaling
Self-Improving Foundation Models
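The decode-time behavior can be sketched as follows, under our assumptions
about token layout and aggregation (the abstract fixes neither): the model
emits normal and skepticism tokens in alternation, and the response is
returned only if mean skepticism stays below a threshold.

    SKEPTIC_LEVELS = ["<sk_0>", "<sk_1>", "<sk_2>", "<sk_3>"]   # low -> high skepticism

    def answer_if_confident(interleaved_tokens, threshold=1.5):
        """interleaved_tokens: [tok, sk, tok, sk, ...] as the model decodes them."""
        levels = [SKEPTIC_LEVELS.index(t) for t in interleaved_tokens[1::2]]
        mean_skepticism = sum(levels) / max(len(levels), 1)
        if mean_skepticism < threshold:
            return "".join(interleaved_tokens[0::2])   # strip skepticism tokens
        return None                                    # abstain rather than hallucinate

    print(answer_if_confident(["Par", "<sk_0>", "is", "<sk_1>"]))   # -> "Paris"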
♻ ☆ Looking beyond the next token
The structure of causal language model training assumes that each token can
be accurately predicted from the previous context. This contrasts with humans'
natural writing and reasoning process, where goals are typically known before
the exact argument or phrasing. While this mismatch has been well studied in
the literature, the working assumption has been that architectural changes are
needed to address this mismatch. We argue that rearranging and processing the
training data sequences can allow models to more accurately imitate the true
data-generating process, and does not require any other changes to the
architecture or training infrastructure. We demonstrate that this technique,
Trelawney, and the inference algorithms derived from it allow us to improve
performance on several key benchmarks that span planning, algorithmic
reasoning, and story generation tasks. Finally, our method naturally enables
the generation of long-term goals at no additional cost. We investigate how
using the model's goal-generation capability can further improve planning and
reasoning. Additionally, we believe Trelawney could potentially open doors to
new capabilities beyond the current language modeling paradigm.
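One way to picture the data rearrangement: copy a future span of the sequence
forward, delimited by special tokens, so the model learns to state a goal
before producing the text that reaches it. The delimiters and offsets below
are illustrative; Trelawney's exact scheme may differ.

    def insert_goal(tokens, goal_start, goal_len, pos, bog="<goal>", eog="</goal>"):
        """Copy tokens[goal_start:goal_start+goal_len] to index pos as a goal span."""
        goal = [bog] + tokens[goal_start:goal_start + goal_len] + [eog]
        return tokens[:pos] + goal + tokens[pos:]

    seq = ["the", "knight", "rode", "north", "and", "found", "the", "grail"]
    print(insert_goal(seq, goal_start=6, goal_len=2, pos=2))
    # ['the', 'knight', '<goal>', 'the', 'grail', '</goal>', 'rode', ...]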
♻ ☆ Shared Global and Local Geometry of Language Model Embeddings
Researchers have recently suggested that models share common representations.
In our work, we find that token embeddings of language models exhibit common
geometric structure. First, we find "global" similarities: token embeddings
often share similar relative orientations. Next, we characterize local geometry
in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a
simple measure for the intrinsic dimension of each token embedding. Our
intrinsic dimension demonstrates that token embeddings lie on a lower
dimensional manifold. We qualitatively show that tokens with lower intrinsic
dimensions often have semantically coherent clusters, while those with higher
intrinsic dimensions do not. Both characterizations allow us to find
similarities in the local geometry of token embeddings. Perhaps most
surprisingly, we find that alignment in token embeddings persists through the
hidden states of language models, allowing us to develop an application for
interpretability. Namely, we introduce Emb2Emb, a simple method to transfer
steering vectors from one language model to another, despite the two models
having different dimensions.
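A transfer in the spirit of Emb2Emb can be sketched by fitting a linear map
between the two models' token-embedding spaces over the shared vocabulary and
pushing a steering vector through it; the authors' exact procedure may differ,
and the least-squares choice here is ours.

    import torch

    def fit_linear_map(emb_a, emb_b):
        """Least-squares W with emb_a @ W ~= emb_b; the dims may differ."""
        return torch.linalg.lstsq(emb_a, emb_b).solution      # (d_a, d_b)

    emb_a, emb_b = torch.randn(32_000, 512), torch.randn(32_000, 768)
    W = fit_linear_map(emb_a, emb_b)
    steer_b = torch.randn(512) @ W    # model-A steering vector mapped into model B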
♻ ☆ TALES: Text Adventure Learning Environment Suite
Reasoning is an essential skill to enable Large Language Models (LLMs) to
interact with the world. As tasks become more complex, they demand increasingly
sophisticated and diverse reasoning capabilities for sequential
decision-making, requiring structured reasoning over the context history to
determine the next best action. We introduce TALES, a diverse collection of
synthetic and human-written text-adventure games designed to challenge and
evaluate diverse reasoning capabilities. We present results over a range of
LLMs, both open- and closed-weight, with a qualitative analysis of the
top-performing models. Despite an impressive showing on synthetic games, even the
top LLM-driven agents fail to achieve 15% on games designed for human
enjoyment. Code and visualization of the experiments can be found at
https://microsoft.github.io/tale-suite.
♻ ☆ Selective Attention Improves Transformer ICLR 2025
Unneeded elements in the attention's context degrade performance. We
introduce Selective Attention, a simple parameter-free change to the standard
attention mechanism which reduces attention to unneeded elements. Selective
attention consistently improves language modeling and downstream task
performance in a variety of model sizes and context lengths. For example,
transformers trained with the language modeling objective on C4 with selective
attention perform language modeling equivalently to standard transformers with
~2X more heads and parameters in their attention modules. Selective attention
also allows decreasing the size of the attention's context buffer, leading to
meaningful reductions in the memory and compute requirements during inference.
For example, transformers trained on C4 with context sizes of 512, 1,024, and
2,048 need 16X, 25X, and 47X less memory for their attention module,
respectively, when equipped with selective attention than those without it, at
the same validation perplexity.
comment: ICLR 2025
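Our hedged reading of the mechanism: one existing head's attention logits
double as selection scores, which are accumulated causally and subtracted from
future attention logits, so tokens marked unneeded receive less attention
while no new parameters are introduced. A sketch under those assumptions:

    import torch

    def selective_attention_logits(logits):
        """logits: (seq, seq) causal attention logits of the selection head."""
        S = torch.relu(logits)                               # selection strength
        S = S.masked_fill(torch.eye(S.shape[0], dtype=torch.bool), 0.0)  # no self-masking
        F = S.cumsum(dim=0).roll(1, dims=0)                  # mass from earlier tokens
        F[0] = 0.0
        return logits - F                                    # then softmax as usual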
♻ ☆ GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning
Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, Bo Zheng
Geometry problem-solving (GPS), a challenging task requiring both visual
comprehension and symbolic reasoning, effectively measures the reasoning
capabilities of multimodal large language models (MLLMs). Humans exhibit strong
reasoning ability in this task through accurate identification and adaptive
application of geometric principles within visual contexts. However, existing
benchmarks fail to jointly assess both dimensions of this human-like geometric
reasoning mechanism in MLLMs, leaving a critical gap in assessing their
ability to tackle GPS. To this end, we introduce GeoSense, the first
comprehensive bilingual benchmark designed to systematically evaluate the
geometric reasoning abilities of MLLMs through the lens of geometric
principles. GeoSense features a five-level hierarchical framework of geometric
principles spanning plane and solid geometry, an intricately annotated dataset
of 1,789 problems, and an innovative evaluation strategy. Through extensive
experiments on GeoSense with various open-source and closed-source MLLMs, we
observe that Gemini-2.0-pro-flash performs best, achieving an overall score of
65.3. Our in-depth analysis reveals that the identification and application
of geometric principles remain a bottleneck for leading MLLMs, jointly
hindering their reasoning abilities. These findings underscore GeoSense's
potential to guide future advancements in MLLMs' geometric reasoning
capabilities, paving the way for more robust and human-like reasoning in
artificial intelligence.
comment: 10 pages, 8 figures
♻ ☆ Cognitive Memory in Large Language Models
This paper examines memory mechanisms in Large Language Models (LLMs),
emphasizing their importance for context-rich responses, reduced
hallucinations, and improved efficiency. It categorizes memory into sensory,
short-term, and long-term, with sensory memory corresponding to input prompts,
short-term memory processing immediate context, and long-term memory
implemented via external databases or structures. The text-based memory section
covers acquisition (selection and summarization), management (updating,
accessing, storing, and resolving conflicts), and utilization (full-text
search, SQL queries, semantic search). The KV cache-based memory section
discusses selection methods (regularity-based summarization, score-based
approaches, special token embeddings) and compression techniques (low-rank
compression, KV merging, multimodal compression), along with management
strategies like offloading and shared attention mechanisms. Parameter-based
memory methods (LoRA, TTT, MoE) transform memories into model parameters to
enhance efficiency, while hidden-state-based memory approaches (chunk
mechanisms, recurrent transformers, Mamba model) improve long-text processing
by combining RNN hidden states with current methods. Overall, the paper offers
a comprehensive analysis of LLM memory mechanisms, highlighting their
significance and future research directions.
comment: 37 pages, 9 figures
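Among the surveyed score-based KV-cache selection methods, a simple
"heavy hitter" style policy keeps the entries that have received the most
attention mass and evicts the rest; the sketch below is illustrative rather
than any single surveyed method.

    import torch

    def evict_kv(keys, values, attn_history, keep: int):
        """attn_history: (seq,) total attention mass each cached token has received."""
        idx = attn_history.topk(keep).indices.sort().values   # keep order of survivors
        return keys[idx], values[idx]

    k, v = torch.randn(100, 64), torch.randn(100, 64)
    k2, v2 = evict_kv(k, v, torch.rand(100), keep=32)         # cache shrinks 100 -> 32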