Computation and Language 84
☆ The LLM Wears Prada: Analysing Gender Bias and Stereotypes through Online Shopping Data
With the wide and cross-domain adoption of Large Language Models, it becomes
crucial to assess the extent to which the statistical correlations in training
data, which underlie their impressive performance, hide subtle and potentially
troubling biases. Gender bias in LLMs has been widely investigated from the
perspectives of occupations, hobbies, and emotions typically associated with a
specific gender. In this study, we introduce a novel perspective. We
investigate whether LLMs can predict an individual's gender based solely on
online shopping histories and whether these predictions are influenced by
gender biases and stereotypes. Using a dataset of historical online purchases
from users in the United States, we evaluate the ability of six LLMs to
classify gender, and we then analyze their reasoning and product-gender
co-occurrences. Results indicate that while models can infer gender with
moderate accuracy, their decisions are often rooted in stereotypical
associations between product categories and gender. Furthermore, explicit
instructions to avoid bias reduce the certainty of model predictions, but do
not eliminate stereotypical patterns. Our findings highlight the persistent
nature of gender biases in LLMs and emphasize the need for robust
bias-mitigation strategies.
☆ OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, Boris Ginsburg
Since the advent of reasoning-based large language models, many have found
great success in distilling reasoning capabilities into student models. Such
techniques have significantly bridged the gap between reasoning and standard
LLMs on coding tasks. Despite this, much of the progress on distilling
reasoning models remains locked behind proprietary datasets or lacks details on
data curation, filtering and subsequent training. To address this, we construct
a superior supervised fine-tuning (SFT) dataset that we use to achieve
state-of-the-art coding capability results in models of various sizes. Our
distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on
CodeContests, surpassing alternatives trained with reinforcement learning. We
then perform analysis on the data sources used to construct our dataset, the
impact of code execution filtering, and the importance of instruction/solution
diversity. We observe that execution filtering negatively affected benchmark
accuracy, leading us to prioritize instruction diversity over solution
correctness. Finally, we also analyze the token efficiency and reasoning
patterns utilized by these models. We will open-source these datasets and
distilled models to the community.
comment: Work in progress
☆ Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection
Souradip Chakraborty, Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, Jindong Gu, Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Hamid Palangi, Tomas Pfister
While AI agents have shown remarkable performance at various tasks, they
still struggle with complex multi-modal applications, structured generation and
strategic planning. Improvement via standard fine-tuning is often impractical,
as solving agentic tasks usually relies on black-box API access without control
over model parameters. Inference-time methods such as Best-of-N (BON) sampling
offer a simple yet effective alternative to improve performance. However, BON
lacks an iterative feedback integration mechanism. Hence, we propose Iterative
Agent Decoding (IAD) which combines iterative refinement with dynamic candidate
evaluation and selection guided by a verifier. IAD differs in how feedback is
designed and integrated, specifically optimized to extract maximal signal from
reward scores. We conduct a detailed comparison of baselines across key metrics
on Sketch2Code, Text2SQL, and Webshop, where IAD consistently outperforms
baselines, achieving 3--6% absolute gains on Sketch2Code and Text2SQL (with and
without LLM judges) and 8--10% gains on Webshop across multiple metrics. To
better understand the source of IAD's gains, we perform controlled experiments
to disentangle the effect of adaptive feedback from stochastic sampling, and
find that IAD's improvements are primarily driven by verifier-guided
refinement, not merely sampling diversity. We also show that both IAD and BON
exhibit inference-time scaling with increased compute when guided by an optimal
verifier. Our analysis highlights the critical role of verifier quality in
effective inference-time optimization and examines the impact of noisy and
sparse rewards on scaling behavior. Together, these findings offer key insights
into the trade-offs and principles of effective inference-time optimization.
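As an illustration of the contrast drawn above, the following Python sketch places a plain Best-of-N loop next to a verifier-guided iterative loop in the spirit of IAD; generate, verify, and make_feedback are hypothetical stand-ins for an agent call, a verifier, and a feedback-construction step, not the authors' implementation.

    # Hedged sketch: Best-of-N sampling vs. verifier-guided iterative refinement.
    # generate/verify/make_feedback are hypothetical callables, not the paper's code.

    def best_of_n(task, n, generate, verify):
        candidates = [generate(task) for _ in range(n)]
        return max(candidates, key=verify)                 # no feedback between samples

    def iterative_agent_decoding(task, steps, generate, verify, make_feedback):
        best, best_score, feedback = None, float("-inf"), None
        for _ in range(steps):
            candidate = generate(task, feedback=feedback)  # condition on prior feedback
            score = verify(candidate)                      # verifier scores the candidate
            if score > best_score:
                best, best_score = candidate, score
            feedback = make_feedback(candidate, score)     # distill reward signal into feedback
        return best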
☆ A thorough benchmark of automatic text classification: From traditional approaches to large language models
Automatic text classification (ATC) has experienced remarkable advancements
in the past decade, best exemplified by recent small and large language models
(SLMs and LLMs), both built on Transformer architectures. Despite recent
effectiveness improvements, a comprehensive cost-benefit analysis investigating
whether the effectiveness gains of these recent approaches compensate for their
much higher costs when compared to more traditional text classification
approaches such as SVMs and Logistic Regression is still missing in the
literature. In this context, this work's main contributions are twofold: (i) we
provide a scientifically sound comparative analysis of the cost-benefit of
twelve traditional and recent ATC solutions including five open LLMs, and (ii)
a large benchmark comprising 22 datasets, including sentiment analysis and
topic classification, with their (train-validation-test) partitions based on
folded cross-validation procedures, along with documentation, and code. The
release of code, data, and documentation enables the community to replicate
experiments and advance the field in a more scientifically sound manner. Our
comparative experimental results indicate that LLMs outperform traditional
approaches (by up to 26%; 7.1% on average) and SLMs (by up to 4.9%; 1.9% on average) in
terms of effectiveness. However, LLMs incur significantly higher computational
costs due to fine-tuning, being, on average, 590x and 8.5x slower than
traditional methods and SLMs, respectively. Results suggest the following
recommendations: (1) LLMs for applications that require the best possible
effectiveness and can afford the costs; (2) traditional methods such as
Logistic Regression and SVM for resource-limited applications or those that
cannot afford the cost of tuning large LLMs; and (3) SLMs like RoBERTa for
near-optimal effectiveness-efficiency trade-off.
comment: 7 pages, 2 figures, 3 tables
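For concreteness, a minimal sketch of the kind of traditional baseline included in such a comparison (TF-IDF features with Logistic Regression, evaluated via cross-validation) might look as follows; the toy corpus is a placeholder, not data from the released benchmark.

    # Minimal traditional ATC baseline: TF-IDF + Logistic Regression, cross-validated.
    # The four example documents and labels are placeholders for illustration only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    texts = ["great movie", "terrible plot", "loved it", "waste of time"]
    labels = [1, 0, 1, 0]

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=1),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(clf, texts, labels, cv=2, scoring="f1_macro")
    print("macro-F1 per fold:", scores)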
☆ Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure
Despite their impressive capabilities, LLMs exhibit a basic generalization
failure known as the Reversal Curse, where they struggle to learn reversible
factual associations. Understanding why this occurs could help identify
weaknesses in current models and advance their generalization and robustness.
In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation
of the long-standing binding problem in cognitive science, neuroscience and AI.
Specifically, we identify two primary causes of the Reversal Curse stemming
from transformers' limitations in conceptual binding: the inconsistency and
entanglements of concept representations. We perform a series of experiments
that support these conjectures. Our exploration leads to a model design based
on JEPA (Joint-Embedding Predictive Architecture) that for the first time
breaks the Reversal Curse without side-stepping it with specialized data
augmentation or non-causal masking; moreover, generalization can be
further improved by incorporating special memory layers that support
disentangled concept representations. We demonstrate that the skill of reversal
unlocks a new kind of memory integration that enables models to solve
large-scale arithmetic reasoning problems via parametric forward-chaining,
outperforming frontier LLMs based on non-parametric memory and prolonged
explicit reasoning.
comment: Code and data:
https://github.com/OSU-NLP-Group/reversal-curse-binding
☆ Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation
The advent of Large Language Models (LLMs) has significantly reshaped the
landscape of machine translation (MT), particularly for low-resource languages
and domains that lack sufficient parallel corpora, linguistic tools, and
computational infrastructure. This survey presents a comprehensive overview of
recent progress in leveraging LLMs for MT. We analyze techniques such as
few-shot prompting, cross-lingual transfer, and parameter-efficient fine-tuning
that enable effective adaptation to under-resourced settings. The paper also
explores synthetic data generation strategies using LLMs, including
back-translation and lexical augmentation. Additionally, we compare LLM-based
translation with traditional encoder-decoder models across diverse language
pairs, highlighting the strengths and limitations of each. We discuss
persistent challenges such as hallucinations, evaluation inconsistencies, and
inherited biases while also evaluating emerging LLM-driven metrics for
translation quality. This survey offers practical insights and outlines future
directions for building robust, inclusive, and scalable MT systems in the era
of large-scale generative models.
☆ FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs
As a pioneering vision-language model, CLIP (Contrastive Language-Image
Pre-training) has achieved significant success across various domains and a
wide range of downstream vision-language tasks. However, the text encoders in
popular CLIP models are limited to processing only 77 text tokens, which
constrains their ability to effectively handle longer, detail-rich captions.
Additionally, CLIP models often struggle to effectively capture detailed visual
and textual information, which hampers their performance on tasks that require
fine-grained analysis. To address these limitations, we present a novel
approach, \textbf{FineLIP}, that extends the capabilities of CLIP. FineLIP
enhances cross-modal text-image mapping by incorporating \textbf{Fine}-grained
alignment with \textbf{L}onger text input within the CL\textbf{IP}-style
framework. FineLIP first extends the positional embeddings to handle longer
text, followed by the dynamic aggregation of local image and text tokens. The
aggregated results are then used to enforce fine-grained token-to-token
cross-modal alignment. We validate our model on datasets with long, detailed
captions across two tasks: zero-shot cross-modal retrieval and text-to-image
generation. Quantitative and qualitative experimental results demonstrate the
effectiveness of FineLIP, outperforming existing state-of-the-art approaches.
Furthermore, comprehensive ablation studies validate the benefits of key design
elements within FineLIP.
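The abstract states that FineLIP first extends CLIP's 77-token positional embeddings; one common way to do this (whether FineLIP uses this exact mechanism is an assumption, not stated above) is to interpolate the pretrained embedding table to the longer length.

    # Hedged sketch: extending a 77-position embedding table to a longer limit by
    # linear interpolation. Interpolation is an assumption made here for illustration.
    import torch
    import torch.nn.functional as F

    def extend_positional_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
        # pos_emb: (old_len, dim) -> (new_len, dim), interpolated along the position axis
        pe = pos_emb.T.unsqueeze(0)                                   # (1, dim, old_len)
        pe = F.interpolate(pe, size=new_len, mode="linear", align_corners=False)
        return pe.squeeze(0).T                                        # (new_len, dim)

    longer = extend_positional_embeddings(torch.randn(77, 512), 248)
    print(longer.shape)  # torch.Size([248, 512])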
☆ Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning
Large Language Models (LLMs) are playing an expanding role in physics
research by enhancing reasoning, symbolic manipulation, and numerical
computation. However, ensuring the reliability and interpretability of their
outputs remains a significant challenge. In our framework, we conceptualize the
collaboration between AI and human scientists as a dynamic interplay among
three modules: the reasoning module, the interpretation module, and the
AI-scientist interaction module. Recognizing that effective physics reasoning
demands rigorous logical consistency, quantitative precision, and deep
integration with established theoretical models, we introduce the
interpretation module to improve the understanding of AI-generated outputs,
which has not previously been explored in the literature. This module comprises
multiple specialized agents, including summarizers, model builders, UI
builders, and testers, which collaboratively structure LLM outputs within a
physically grounded framework, by constructing a more interpretable science
model. A case study demonstrates that our approach enhances transparency,
facilitates validation, and strengthens AI-augmented reasoning in scientific
discovery.
☆ STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie
This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset
specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built
on three core principles -- diversity, deliberative reasoning, and rigorous
filtering -- STAR-1 aims to address the critical needs for safety alignment in
LRMs. Specifically, we begin by integrating existing open-source safety
datasets from diverse sources. Then, we curate safety policies to generate
policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based
safety scoring system to select training examples aligned with best practices.
Experimental results show that fine-tuning LRMs with STAR-1 leads to an average
40% improvement in safety performance across four benchmarks, while only
incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability
measured across five reasoning tasks. Extensive ablation studies further
validate the importance of our design principles in constructing STAR-1 and
analyze its efficacy across both LRMs and traditional LLMs. Our project page is
https://ucsc-vlaa.github.io/STAR-1.
☆ Graphically Speaking: Unmasking Abuse in Social Media with Conversation Insights
Detecting abusive language in social media conversations poses significant
challenges, as identifying abusiveness often depends on the conversational
context, characterized by the content and topology of preceding comments.
Traditional Abusive Language Detection (ALD) models often overlook this
context, which can lead to unreliable performance metrics. Recent Natural
Language Processing (NLP) methods that integrate conversational context often
depend on limited and simplified representations, and report inconsistent
results. In this paper, we propose a novel approach that utilizes graph neural
networks (GNNs) to model social media conversations as graphs, where nodes
represent comments, and edges capture reply structures. We systematically
investigate various graph representations and context windows to identify the
optimal configuration for ALD. Our GNN model outperforms both context-agnostic
baselines and linear context-aware methods, achieving significant improvements
in F1 scores. These findings demonstrate the critical role of structured
conversational context and establish GNNs as a robust framework for advancing
context-aware abusive language detection.
☆ Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
The rapid development of Large Multimodal Models (LMMs) for 2D images and
videos has spurred efforts to adapt these models for interpreting 3D scenes.
However, the absence of large-scale 3D vision-language datasets has posed a
significant obstacle. To address this issue, typical approaches focus on
injecting 3D awareness into 2D LMMs by designing 3D input-level scene
representations. This work provides a new perspective. We introduce
reconstructive visual instruction tuning with 3D-awareness (Ross3D), which
integrates 3D-aware visual supervision into the training procedure.
Specifically, it incorporates cross-view and global-view reconstruction. The
former requires reconstructing masked views by aggregating overlapping
information from other views. The latter aims to aggregate information from all
available views to recover Bird's-Eye-View images, contributing to a
comprehensive overview of the entire scene. Empirically, Ross3D achieves
state-of-the-art performance across various 3D scene understanding benchmarks.
More importantly, our semi-supervised experiments demonstrate significant
potential in leveraging large amounts of unlabeled 3D vision-only data.
☆ CoRAG: Collaborative Retrieval-Augmented Generation NAACL 2024
Retrieval-Augmented Generation (RAG) models excel in knowledge-intensive
tasks, especially under few-shot learning constraints. We introduce CoRAG, a
framework extending RAG to collaborative settings, where clients jointly train
a shared model using a collaborative passage store. To evaluate CoRAG, we
introduce CRAB, a benchmark for collaborative homogeneous open-domain question
answering. Our experiments demonstrate that CoRAG consistently outperforms both
parametric collaborative learning methods and locally trained RAG models in
low-resource scenarios. Further analysis reveals the critical importance of
relevant passages within the shared store, the surprising benefits of
incorporating irrelevant passages, and the potential for hard negatives to
negatively impact performance. This introduces a novel consideration in
collaborative RAG: the trade-off between leveraging a collectively enriched
knowledge base and the potential risk of incorporating detrimental passages
from other clients. Our findings underscore the viability of CoRAG, while also
highlighting key design challenges and promising avenues for future research.
comment: NAACL 2024
☆ TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables
Humans continuously make new discoveries, and understanding the temporal sequence
of events leading to these breakthroughs is essential for advancing science and
society. This ability to reason over time allows us to identify future steps
and understand the effects of financial and political decisions on our lives.
However, large language models (LLMs) are typically trained on static datasets,
limiting their ability to perform effective temporal reasoning. To assess the
temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES
dataset, which comprises 3,971 questions derived from over 14,000 tables,
spanning 1,238 entities across multiple time periods. We introduce a
template-based question-generation pipeline that harnesses LLMs to refine both
templates and questions. Additionally, we establish baseline results using
state-of-the-art LLMs to create a benchmark. We also introduce novel modeling
strategies centered around task decomposition, enhancing LLM performance.
comment: 19 pages, 21 tables, 1 figure
☆ Cross-Lingual Consistency: A Novel Inference Framework for Advancing Reasoning in Large Language Models
Chain-of-thought (CoT) has emerged as a critical mechanism for enhancing
reasoning capabilities in large language models (LLMs), with self-consistency
demonstrating notable promise in boosting performance. However, inherent
linguistic biases in multilingual training corpora frequently cause semantic
drift and logical inconsistencies, especially in sub-10B parameter LLMs
handling complex inference tasks. To overcome these constraints, we propose the
Cross-Lingual Consistency (CLC) framework, an innovative inference paradigm
that integrates multilingual reasoning paths through majority voting to elevate
LLMs' reasoning capabilities. Empirical evaluations on the CMATH dataset reveal
CLC's superiority over the conventional self-consistency method, delivering
9.5%, 6.5%, and 6.0% absolute accuracy gains for DeepSeek-Math-7B-Instruct,
Qwen2.5-Math-7B-Instruct, and Gemma2-9B-Instruct, respectively. Expanding CLC's
linguistic scope to 11 diverse languages yields two synergistic benefits: 1)
neutralizing linguistic biases in multilingual training corpora through
multilingual ensemble voting, 2) escaping monolingual reasoning traps by
exploring the broader multilingual solution space. These dual benefits
empirically enable more globally optimal reasoning paths compared to
monolingual self-consistency baselines, as evidenced by the 4.1%-18.5% accuracy
gains using Gemma2-9B-Instruct on the MGSM dataset.
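A minimal sketch of the voting scheme described above might look like the following; translate and solve_with_cot are hypothetical wrappers around an LLM and are not part of the paper.

    # Hedged sketch of Cross-Lingual Consistency: sample reasoning paths in several
    # languages and majority-vote over the extracted final answers.
    from collections import Counter

    def cross_lingual_consistency(question, languages, translate, solve_with_cot):
        answers = []
        for lang in languages:
            localized = translate(question, target_lang=lang)  # pose the problem in each language
            answers.append(solve_with_cot(localized))          # final answer parsed from the CoT
        return Counter(answers).most_common(1)[0][0]           # majority vote across languages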
☆ PaperBench: Evaluating AI's Ability to Replicate AI Research
Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
We introduce PaperBench, a benchmark evaluating the ability of AI agents to
replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024
Spotlight and Oral papers from scratch, including understanding paper
contributions, developing a codebase, and successfully executing experiments.
For objective evaluation, we develop rubrics that hierarchically decompose each
replication task into smaller sub-tasks with clear grading criteria. In total,
PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed
with the author(s) of each ICML paper for accuracy and realism. To enable
scalable evaluation, we also develop an LLM-based judge to automatically grade
replication attempts against rubrics, and assess our judge's performance by
creating a separate benchmark for judges. We evaluate several frontier models
on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet
(New) with open-source scaffolding, achieves an average replication score of
21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench,
finding that models do not yet outperform the human baseline. We
open-source our code (https://github.com/openai/preparedness) to
facilitate future research in understanding the AI engineering capabilities of
AI agents.
comment: 30 pages, 14 figures
☆ LRAGE: Legal Retrieval Augmented Generation Evaluation Tool
Recently, building retrieval-augmented generation (RAG) systems to enhance
the capability of large language models (LLMs) has become a common practice.
Especially in the legal domain, previous judicial decisions play a significant
role under the doctrine of stare decisis which emphasizes the importance of
making decisions based on (retrieved) prior documents. However, the overall
performance of a RAG system depends on many components: (1) retrieval corpora,
(2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation
metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of
RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces
to facilitate seamless experiments and investigate how changes in the
aforementioned five components affect the overall accuracy. We validated LRAGE
using multilingual legal benchmarks including Korean (KBL), English (LegalBench),
and Chinese (LawBench) by demonstrating how the overall accuracy changes when
varying the five components mentioned above. The source code is available at
https://github.com/hoorangyee/LRAGE.
comment: 12 pages
☆ YourBench: Easy Custom Evaluation Sets for Everyone
Evaluating large language models (LLMs) effectively remains a critical
bottleneck, as traditional static benchmarks suffer from saturation and
contamination, while human evaluations are costly and slow. This hinders timely
or domain-specific assessment, crucial for real-world applications. We
introduce YourBench, a novel, open-source framework that addresses these
limitations by enabling dynamic, automated generation of reliable, up-to-date,
and domain-tailored benchmarks cheaply and without manual annotation, directly
from user-provided documents. We demonstrate its efficacy by replicating 7
diverse MMLU subsets using minimal source text, achieving this for under 15 USD
in total inference costs while perfectly preserving the relative model
performance rankings (Spearman Rho = 1) observed on the original benchmark. To
ensure that YourBench generates data grounded in provided input instead of
relying on posterior parametric knowledge in models, we also introduce
Tempora-0325, a novel dataset of over 7K diverse documents, published
exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models
from 7 major families across varying scales (3-671B parameters) to validate the
quality of generated evaluations through rigorous algorithmic checks (e.g.,
citation grounding) and human assessments. We release the YourBench library,
the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all
evaluation and inference traces to facilitate reproducible research and empower
the community to generate bespoke benchmarks on demand, fostering more relevant
and trustworthy LLM evaluation.
☆ Efficient Constant-Space Multi-Vector Retrieval ECIR 2025
Multi-vector retrieval methods, exemplified by the ColBERT architecture, have
shown substantial promise for retrieval by providing strong trade-offs in terms
of retrieval latency and effectiveness. However, they come at a high cost in
terms of storage since a (potentially compressed) vector needs to be stored for
every token in the input collection. To overcome this issue, we propose
encoding documents to a fixed number of vectors, which are no longer
necessarily tied to the input tokens. Beyond reducing the storage costs, our
approach has the advantage that document representations have a fixed size
on disk, allowing for better OS paging management. Through experiments using
the MSMARCO passage corpus and BEIR with the ColBERT-v2 architecture, a
representative multi-vector ranking model architecture, we find that passages
can be effectively encoded into a fixed number of vectors while retaining most
of the original effectiveness.
comment: ECIR 2025
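As a rough illustration of scoring with a fixed number of document vectors, the sketch below applies ColBERT-style MaxSim late interaction where every document contributes exactly k vectors; the embeddings and the value of k are placeholders, not the paper's configuration.

    # Hedged sketch: late-interaction (MaxSim) scoring when each document is encoded
    # into a fixed number k of vectors rather than one vector per token.
    import numpy as np

    def maxsim_score(query_vecs, doc_vecs):
        # query_vecs: (num_query_tokens, dim); doc_vecs: (k, dim), k fixed per document
        sims = query_vecs @ doc_vecs.T           # cosine similarities (unit-normalized inputs)
        return float(sims.max(axis=1).sum())     # each query token matches its best doc vector

    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 128));  q /= np.linalg.norm(q, axis=1, keepdims=True)
    d = rng.normal(size=(16, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)   # k = 16
    print(maxsim_score(q, d))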
☆ Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training
Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Shujian Huang
Large language models (LLMs) exhibit remarkable multilingual capabilities
despite the extreme language imbalance in the pre-training data. In this paper,
we closely examine the reasons behind this phenomenon, focusing on the
pre-training corpus. We find that the existence of code-switching, alternating
between different languages within a context, is key to multilingual
capabilities. We conduct an analysis to investigate code-switching in the
pre-training corpus, examining its presence and categorizing it into four types
within two quadrants. We then assess its impact on multilingual performance.
These types of code-switching data are unbalanced in proportions and
demonstrate different effects on facilitating language transfer. To better
explore the power of code-switching for language alignment during pre-training,
we investigate the strategy of synthetic code-switching. We continuously scale
up the synthetic code-switching data and observe remarkable improvements in
both benchmarks and representation space. Extensive experiments indicate that
incorporating synthetic code-switching data enables better language alignment
and generalizes well to high, medium, and low-resource languages with
pre-training corpora of varying qualities.
☆ OpenThaiGPT 1.6 and R1: Thai-Centric Open Source and Reasoning Large Language Models
We present OpenThaiGPT 1.6 and R1 (OTG-1.6 and OTG-R1), Thai-centric Large
Language Models (LLMs) developed through distinct methodologies to enhance
generalization and reasoning capabilities. OTG-1.6 employs Task Arithmetic
model merging for broad generalization, while OTG-R1 integrates multi-stage
training with the Less-Is-More Reasoning Hypothesis (LIMO) for advanced
reasoning. Benchmark evaluations demonstrate superior performance across Thai
language tasks, achieving competitive results against larger-scale open-source
Thai LLMs. This paper details the proposed models, training processes,
benchmarks, and results, highlighting improvements over previous models and
establishing new performance standards for Thai-centric LLMs.
☆ Style over Substance: Distilled Language Models Reason Via Stylistic Replication
Specialized reasoning language models (RLMs) have demonstrated that scaling
test-time computation through detailed reasoning traces significantly enhances
performance. Although these traces effectively facilitate knowledge
distillation into smaller, instruction-tuned models, the precise nature of
transferred reasoning remains unclear. In this study, we investigate to what
extent distilled models internalize replicated stylistic patterns during
reasoning. To this end, we systematically analyze reasoning traces, identifying
structural and lexical patterns that characterize successful reasoning. We then
introduce two new datasets -- a dataset of emergent reasoning traces and a
synthetic dataset explicitly constructed to replicate these stylistic patterns
-- to precisely examine their influence on distilled models' reasoning
capabilities. We find that models trained on the synthetic traces achieve
comparable performance, indicating that distilled reasoning abilities rely
significantly on surface-level patterns. Surprisingly, we observe an increase
in performance even when the synthetic traces are altered to lead to the wrong
answer. Our findings highlight how stylistic patterns can be leveraged to
efficiently enhance LM reasoning across diverse model families.
☆ InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation
In-context learning (ICL) is critical for large language models (LLMs), but
its effectiveness is constrained by finite context windows, particularly in
ultra-long contexts. To overcome this, we introduce InfiniteICL, a framework
that parallels context and parameters in LLMs with short- and long-term memory
in human cognitive systems, focusing on transforming temporary context
knowledge into permanent parameter updates. This approach significantly reduces
memory usage, maintains robust performance across varying input lengths, and
theoretically enables infinite context integration through the principles of
context knowledge elicitation, selection, and consolidation. Evaluations
demonstrate that our method reduces context length by 90% while achieving 103%
of the average performance of full-context prompting across fact recall, grounded
reasoning, and skill acquisition tasks. When conducting sequential multi-turn
transformations on complex, real-world contexts (with length up to 2M tokens),
our approach surpasses full-context prompting while using only 0.4% of the
original contexts. These findings highlight InfiniteICL's potential to enhance
the scalability and efficiency of LLMs by breaking the limitations of
conventional context window sizes.
☆ ToM-RL: Reinforcement Learning Unlocks Theory of Mind in Small LLMs
Recent advancements in rule-based reinforcement learning (RL), applied during
the post-training phase of large language models (LLMs), have significantly
enhanced their capabilities in structured reasoning tasks such as mathematics
and logical inference. However, the effectiveness of RL in social reasoning,
particularly in Theory of Mind (ToM), the ability to infer others' mental
states, remains largely unexplored. In this study, we demonstrate that RL
methods effectively unlock ToM reasoning capabilities even in small-scale LLMs
(0.5B to 7B parameters). Using a modest dataset comprising 3200 questions
across diverse scenarios, our RL-trained 7B model achieves 84.50\% accuracy on
the Hi-ToM benchmark, surpassing models like GPT-4o and DeepSeek-v3 despite
significantly fewer parameters. While smaller models ($\leq$3B parameters)
suffer from reasoning collapse, larger models (7B parameters) maintain stable
performance through consistent belief tracking. Additionally, our RL-based
models demonstrate robust generalization to higher-order, out-of-distribution
ToM problems, novel textual presentations, and previously unseen datasets.
These findings highlight RL's potential to enhance social cognitive reasoning,
bridging the gap between structured problem-solving and nuanced social
inference in LLMs.
☆ Study of scaling laws in language families
This article investigates scaling laws within language families using data
from over six thousand languages and analyzing emergent patterns observed in
Zipf-like classification graphs. Both macroscopic (based on number of languages
by family) and microscopic (based on numbers of speakers by language within a
family) aspects of these classifications are examined. Particularly noteworthy
is the discovery of a distinct division among the fourteen largest contemporary
language families, excluding Afro-Asiatic and Nilo-Saharan languages. These
families are found to be distributed across three language family quadruplets,
each characterized by significantly different exponents in the Zipf graphs.
This finding sheds light on the underlying structure and organization of major
language families, revealing intriguing insights into the nature of linguistic
diversity and distribution.
comment: 10 pages, 4 figures
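As a worked example of the Zipf-style analysis mentioned above, the exponent of a rank-size distribution can be estimated with a log-log fit; the counts below are placeholders, not data from the article.

    # Hedged sketch: estimate a Zipf exponent from a ranked list of sizes
    # (e.g. speakers per language within a family). Placeholder numbers only.
    import numpy as np

    sizes = np.array([1000, 480, 300, 210, 160, 120, 95, 80])
    ranks = np.arange(1, len(sizes) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(sizes), 1)
    print("estimated Zipf exponent:", -slope)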
☆ Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish
Large Language Models (LLMs) have become an increasingly important tool in
research and society at large. While LLMs are regularly used all over the world
by experts and lay-people alike, they are predominantly developed with
English-speaking users in mind, performing well in English and other
wide-spread languages while less-resourced languages such as Luxembourgish are
seen as a lower priority. This lack of attention is also reflected in the
sparsity of available evaluation tools and datasets. In this study, we
investigate the viability of language proficiency exams as such evaluation
tools for the Luxembourgish language. We find that large models such as
ChatGPT, Claude and DeepSeek-R1 typically achieve high scores, while smaller
models show weak performance. We also find that performance on such language
exams can be used to predict performance on other NLP tasks.
comment: 18 pages, 2 figures, 11 tables
☆ Horizon Scans can be accelerated using novel information retrieval and artificial intelligence tools
Introduction: Horizon scanning in healthcare assesses early signals of
innovation, crucial for timely adoption. Current horizon scanning faces
challenges in efficient information retrieval and analysis, especially from
unstructured sources like news, presenting a need for innovative tools.
Methodology: The study introduces SCANAR and AIDOC, open-source Python-based
tools designed to improve horizon scanning. SCANAR automates the retrieval and
processing of news articles, offering functionalities such as de-duplication
and unsupervised relevancy ranking. AIDOC aids filtration by leveraging AI to
reorder textual data based on relevancy, employing neural networks for semantic
similarity, and subsequently prioritizing likely relevant entries for human
review. Results: Twelve internal datasets from horizon scans and four external
benchmarking datasets were used. SCANAR improved retrieval efficiency by
automating processes previously dependent on manual labour. AIDOC displayed
work-saving potential, achieving around 62% reduction in manual review efforts
at 95% recall. Comparative analysis with benchmarking data showed AIDOC's
performance was similar to existing systematic review automation tools, though
performance varied depending on dataset characteristics. A smaller case-study
on our news datasets shows the potential of ensembling large language models
within the active-learning process for faster detection of relevant articles
across news datasets. Conclusion: The validation indicates that SCANAR and
AIDOC show potential to enhance horizon scanning efficiency by streamlining
data retrieval and prioritisation. These tools may alleviate methodological
limitations and allow broader, swifter horizon scans. Further studies are
suggested to optimize these models and to design new workflows and validation
processes that integrate large language models.
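A minimal sketch of the AI-assisted reordering step described above, under the assumption that relevancy is approximated by embedding similarity to already-reviewed relevant items; the model name and example texts are illustrative only, not AIDOC's actual configuration.

    # Hedged sketch: rank candidate records by semantic similarity to seed-relevant
    # articles so that likely relevant entries are surfaced for human review first.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")                  # illustrative model choice
    seed_relevant = ["New wearable sensor approved for cardiac monitoring"]
    candidates = ["Novel implant trial begins", "Local sports results",
                  "AI triage tool piloted in emergency departments"]

    seed_vec = model.encode(seed_relevant, normalize_embeddings=True).mean(axis=0)
    cand_vecs = model.encode(candidates, normalize_embeddings=True)
    order = np.argsort(-(cand_vecs @ seed_vec))                      # most similar first
    print([candidates[i] for i in order])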
☆ Representation Bending for Large Language Model Safety
Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi
Large Language Models (LLMs) have emerged as powerful tools, but their
inherent safety risks - ranging from harmful content generation to broader
societal harms - pose significant challenges. These risks can be amplified by
recent adversarial attacks, fine-tuning vulnerabilities, and the increasing
deployment of LLMs in high-stakes environments. Existing safety-enhancing
techniques, such as fine-tuning with human feedback or adversarial training,
are still vulnerable as they address specific threats and often fail to
generalize across unseen attacks, or require manual system-level defenses. This
paper introduces RepBend, a novel approach that fundamentally disrupts the
representations underlying harmful behaviors in LLMs, offering a scalable
solution to enhance (potentially inherent) safety. RepBend brings the idea of
activation steering - simple vector arithmetic for steering a model's behavior
during inference - to loss-based fine-tuning. Through extensive evaluation,
RepBend achieves state-of-the-art performance, outperforming prior methods such
as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success
rates across diverse jailbreak benchmarks, all with negligible reduction in
model usability and general capabilities.
☆ Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation
Pretraining data curation is a cornerstone in Large Language Model (LLM)
development, leading to growing research on quality filtering of large web
corpora. From statistical quality flags to LLM-based labeling systems, datasets
are divided into categories, frequently reducing to a binary: those passing the
filters deemed as valuable examples, others discarded as useless or
detrimental. However, a more detailed understanding of the contribution of
different kinds of texts to model performance is still largely lacking. In this
article, we present the first study utilizing registers (also known as genres)
- a widely used standard in corpus linguistics to model linguistic variation -
to curate pretraining datasets and investigate the effect of register on the
performance of LLMs. We perform comparative studies by training models with
register-classified data and evaluating them using standard benchmarks, and
show that the register of pretraining data substantially affects model
performance. We uncover surprising relationships between the pretraining
material and the resulting models: using the News register results in subpar
performance, and on the contrary, including the Opinion class, covering texts
such as reviews and opinion blogs, is highly beneficial. While a model trained
on the entire unfiltered dataset outperforms those trained on datasets limited
to a single register, combining well-performing registers like
How-to-Instructions, Informational Description, and Opinion leads to major
improvements. Furthermore, analysis of individual benchmark results reveals key
differences in the strengths and drawbacks of specific register classes as
pretraining data. These findings show that register is an important explainer
of model variation and can facilitate more deliberate future data selection
practices.
☆ From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time
The best performing transformer-based language models use subword
tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these
approaches often overlook linguistic principles, such as morphological
segmentation, which we believe is fundamental for understanding
language-specific word structure. In this study, we leverage an annotated
Danish morphological dataset to train a semisupervised model for morphological
segmentation, enabling the development of tokenizers optimized for Danish
morphology. We evaluate four distinct tokenizers, including two custom
morphological tokenizers, by analyzing their performance in morphologically
segmenting Danish words. Additionally, we train two generative transformer
models, \textit{CerebrasGPT-111M} and \textit{LLaMA-3.2 1B}, using these
tokenizers and evaluate their downstream performance. Our findings reveal that
our custom-developed tokenizers substantially enhance morphological
segmentation, achieving an F1 score of 58.84, compared to 39.28 achieved by a
Danish BPE tokenizer. In downstream tasks, models trained with our
morphological tokenizers outperform those using BPE tokenizers across different
evaluation metrics. These results highlight that incorporating Danish
morphological segmentation strategies into tokenizers leads to improved
performance in generative transformer models on the Danish language.
☆ Context-Aware Toxicity Detection in Multiplayer Games: Integrating Domain-Adaptive Pretraining and Match Metadata
The detrimental effects of toxicity in competitive online video games are
widely acknowledged, prompting publishers to monitor player chat conversations.
This is challenging due to the context-dependent nature of toxicity, often
spread across multiple messages or informed by non-textual interactions.
Traditional toxicity detectors focus on isolated messages, missing the broader
context needed for accurate moderation. This is especially problematic in video
games, where interactions involve specialized slang, abbreviations, and typos,
making it difficult for standard models to detect toxicity, especially given
its rarity. We adapted a RoBERTa LLM to support moderation tailored to video
games, integrating both textual and non-textual context. By enhancing
pretrained embeddings with metadata and addressing the unique slang and
language quirks through domain adaptive pretraining, our method better captures
the nuances of player interactions. Using two gaming datasets - from Defense of
the Ancients 2 (DOTA 2) and Call of Duty®: Modern Warfare® III (MWIII) - we
demonstrate which sources of context
(metadata, prior interactions...) are most useful, how to best leverage them to
boost performance, and the conditions conducive to doing so. This work
underscores the importance of context-aware and domain-specific approaches for
proactive moderation.
☆ Redefining technology for indigenous languages
In this paper, we offer an overview of indigenous languages, identifying the
causes of their devaluation and the need for legislation on language rights. We
review the technologies used to revitalize these languages, finding that when
they come from outside, they often have the opposite effect to what they seek;
however, when developed from within communities, they become powerful
instruments of expression. We propose that the inclusion of Indigenous
knowledge in large language models (LLMs) will enrich the technological
landscape, but must be done in a participatory environment that encourages the
exchange of knowledge.
comment: in Spanish language
☆ Chain of Correction for Full-text Speech Recognition with Large Language Models
Full-text error correction with Large Language Models (LLMs) for Automatic
Speech Recognition (ASR) has gained increased attention due to its potential to
correct errors across long contexts and address a broader spectrum of error
types, including punctuation restoration and inverse text normalization.
Nevertheless, many challenges persist, including issues related to stability,
controllability, completeness, and fluency. To mitigate these challenges, this
paper proposes the Chain of Correction (CoC) for full-text error correction
with LLMs, which corrects errors segment by segment using pre-recognized text
as guidance within a regular multi-turn chat format. The CoC also uses
pre-recognized full text for context, allowing the model to better grasp global
semantics and maintain a comprehensive overview of the entire content.
Utilizing the open-sourced full-text error correction dataset ChFT, we
fine-tune a pre-trained LLM to evaluate the performance of the CoC framework.
Experimental results demonstrate that the CoC effectively corrects errors in
full-text ASR outputs, significantly outperforming baseline and benchmark
systems. We further analyze how to set the correction threshold to balance
under-correction and over-rephrasing, extrapolate the CoC model on extremely
long ASR outputs, and investigate whether other types of information can be
employed to guide the error correction process.
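To make the segment-by-segment format concrete, here is a hedged sketch of how correction could proceed in a regular multi-turn chat with the pre-recognized full text as context; chat is a hypothetical LLM chat interface, not the paper's code.

    # Hedged sketch of Chain of Correction: correct ASR output segment by segment
    # inside one multi-turn chat, with the pre-recognized full text as global context.

    def chain_of_correction(segments, full_text, chat):
        messages = [{"role": "system",
                     "content": "Correct ASR errors. Pre-recognized full text:\n" + full_text}]
        corrected = []
        for seg in segments:                                   # one chat turn per segment
            messages.append({"role": "user", "content": "Correct this segment: " + seg})
            fixed = chat(messages)                             # hypothetical LLM call
            messages.append({"role": "assistant", "content": fixed})
            corrected.append(fixed)
        return " ".join(corrected)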
☆ PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation
Zhengwei Tao, Zhi Jin, Bincheng Li, Xiaoying Bai, Haiyan Zhao, Chengfeng Dou, Xiancai Chen, Jia Li, Linyu Li, Chongyang Tao
Predicting future events stands as one of the ultimate aspirations of
artificial intelligence. Recent advances in large language model (LLM)-based
systems have shown remarkable potential in forecasting future events, thereby
garnering significant interest in the research community. Currently, several
benchmarks have been established to evaluate the forecasting capabilities by
formalizing the event prediction as a retrieval-augmented generation (RAG) and
reasoning task. In these benchmarks, each prediction question is answered with
relevant retrieved news articles. However, because there is no consideration of
whether the questions can be supported by valid or sufficient supporting
rationales, some of the questions in these benchmarks may be inherently
noninferable. To address this issue, we introduce a new benchmark, PROPHET,
which comprises inferable forecasting questions paired with relevant news for
retrieval. To ensure the inferability of the benchmark, we propose Causal
Intervened Likelihood (CIL), a statistical measure that assesses inferability
through causal inference. In constructing this benchmark, we first collected
recent trend forecasting questions and then filtered the data using CIL,
resulting in an inferable benchmark for event prediction. Through extensive
experiments, we first demonstrate the validity of CIL and conduct in-depth
investigations into event prediction with its aid. Subsequently, we
evaluate several representative prediction systems on PROPHET, drawing valuable
insights for future directions.
☆ CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models
Language models often struggle with cross-mode knowledge retrieval -- the
ability to access knowledge learned in one format (mode) when queried in
another. We demonstrate that models trained on multiple data sources (e.g.,
Wikipedia and TinyStories) exhibit significantly reduced accuracy when
retrieving knowledge in a format different from its original training mode.
This paper quantitatively investigates this phenomenon through a controlled
study of random token sequence memorization across different modes. We first
explore dataset rewriting as a solution, revealing that effective cross-mode
retrieval requires prohibitively extensive rewriting efforts that follow a
sigmoid-like relationship. As an alternative, we propose CASCADE, a novel
pretraining algorithm that uses cascading datasets with varying sequence
lengths to capture knowledge at different scales. Our experiments demonstrate
that CASCADE outperforms dataset rewriting approaches, even when compressed
into a single model with a unified loss function. This work provides both
qualitative evidence of cross-mode retrieval limitations and a practical
solution to enhance language models' ability to access knowledge independently
of its presentational format.
☆ Refining Interactions: Enhancing Anisotropy in Graph Neural Networks with Language Semantics ICME 2025
The integration of Large Language Models (LLMs) with Graph Neural Networks
(GNNs) has recently been explored to enhance the capabilities of Text-Attributed
Graphs (TAGs). Most existing methods feed textual descriptions of the graph
structure or neighbouring nodes' text directly into LLMs. However, these
approaches often cause LLMs to treat structural information simply as general
contextual text, thus limiting their effectiveness in graph-related tasks. In
this paper, we introduce LanSAGNN (Language Semantic Anisotropic Graph Neural
Network), a framework that extends the concept of anisotropic GNNs to the
natural language level. This model leverages LLMs to extract tailor-made
semantic information for node pairs, effectively capturing the unique
interactions within node relationships. In addition, we propose an efficient
dual-layer LLM fine-tuning architecture to better align LLMs' outputs with
graph tasks. Experimental results demonstrate that LanSAGNN significantly
enhances existing LLM-based methods without increasing complexity while also
exhibiting strong robustness against interference.
comment: Accepted by ICME 2025
☆ FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations
In an era where AI-driven hiring is transforming recruitment practices,
concerns about fairness and bias have become increasingly important. To explore
these issues, we introduce a benchmark, FAIRE (Fairness Assessment In Resume
Evaluation), to test for racial and gender bias in large language models (LLMs)
used to evaluate resumes across different industries. We use two methods -
direct scoring and ranking - to measure how model performance changes when resumes are
slightly altered to reflect different racial or gender identities. Our findings
reveal that while every model exhibits some degree of bias, the magnitude and
direction vary considerably. This benchmark provides a clear way to examine
these differences and offers valuable insights into the fairness of AI-based
hiring tools. It highlights the urgent need for strategies to reduce bias in
AI-driven recruitment. Our benchmark code and dataset are open-sourced at our
repository:
https://github.com/athenawen/FAIRE-Fairness-Assessment-In-Resume-Evaluation.git.
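As an illustration of the direct-scoring probe described above, one might compare model scores for the same resume under counterfactual identity cues; score_resume is a hypothetical LLM-based scorer and the names are illustrative placeholders, not the benchmark's data.

    # Hedged sketch of a direct-scoring bias probe: score identical resumes that
    # differ only in an identity cue and compare the group means.
    def bias_gap(resume_template, score_resume, names_a, names_b):
        def mean_score(names):
            return sum(score_resume(resume_template.format(name=n)) for n in names) / len(names)
        return mean_score(names_a) - mean_score(names_b)

    # Example usage (placeholder names chosen only to vary the perceived gender):
    # gap = bias_gap("Name: {name}\nExperience: 5 years of data analysis", score_resume,
    #                ["Emily", "Sarah"], ["James", "Michael"])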
☆ Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval WWW2025
Ming Pang, Chunyuan Yuan, Xiaoyu He, Zheng Fang, Donghao Xie, Fanyi Qu, Xue Jiang, Changping Peng, Zhangang Lin, Zheng Luo, Jingping Shao
Traditional sparse and dense retrieval methods struggle to leverage general
world knowledge and often fail to capture the nuanced features of queries and
products. With the advent of large language models (LLMs), industrial search
systems have started to employ LLMs to generate identifiers for product
retrieval. Commonly used identifiers include (1) static/semantic IDs and (2)
product term sets. The first approach requires creating a product ID system
from scratch, missing out on the world knowledge embedded within LLMs. While
the second approach leverages this general knowledge, the significant
difference in word distribution between queries and products means that
product-based identifiers often do not align well with user search queries,
leading to missed product recalls. Furthermore, when queries contain numerous
attributes, these algorithms generate a large number of identifiers, making it
difficult to assess their quality, which results in low overall recall
efficiency.
To address these challenges, this paper introduces a novel e-commerce
retrieval paradigm: the Generative Retrieval and Alignment Model (GRAM). GRAM
employs joint training on text information from both queries and products to
generate shared text identifier codes, effectively bridging the gap between
queries and products. This approach not only enhances the connection between
queries and products but also improves inference efficiency. The model uses a
co-alignment strategy to generate codes optimized for maximizing retrieval
efficiency. Additionally, it introduces a query-product scoring mechanism to
compare product values across different codes, further boosting retrieval
efficiency. Extensive offline and online A/B testing demonstrates that GRAM
significantly outperforms traditional models and the latest generative
retrieval models, confirming its effectiveness and practicality.
comment: Accepted by WWW2025
☆ ToolACE-R: Tool Learning with Adaptive Self-Refinement
Xingshan Zeng, Weiwen Liu, Xu Huang, Zezhong Wang, Lingzhi Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Ruiming Tang, Qun Liu
Tool learning, which allows Large Language Models (LLMs) to leverage external
tools for solving complex user tasks, has emerged as a promising avenue for
extending model capabilities. However, current approaches primarily focus on
data synthesis for fine-tuning LLMs to invoke tools effectively, largely
ignoring how to fully stimulate the potential of the model. In this paper, we
propose ToolACE-R, a novel method that introduces adaptive self-refinement for
tool invocations. Our approach features a model-aware iterative training
procedure that progressively incorporates more training samples based on the
model's evolving capabilities. Additionally, it allows LLMs to iteratively
refine their tool calls, optimizing performance without requiring external
feedback. To further enhance computational efficiency, we integrate an adaptive
mechanism when scaling the inference time, enabling the model to autonomously
determine when to stop the refinement process. We conduct extensive experiments
across several benchmark datasets, showing that ToolACE-R achieves competitive
performance compared to advanced API-based models, even without any refinement.
Furthermore, its performance can be further improved efficiently through
adaptive self-refinement. Our results demonstrate the effectiveness of the
proposed method, which is compatible with base models of various sizes,
offering a promising direction for more efficient tool learning.
☆ An Illusion of Progress? Assessing the Current State of Web Agents
As digitalization and cloud technologies evolve, the web is becoming
increasingly important in modern society. Autonomous web agents based on
large language models (LLMs) hold great potential for work automation. It is
therefore important to accurately measure and monitor the progression of their
capabilities. In this work, we conduct a comprehensive and rigorous assessment
of the current state of web agents. Our results depict a very different picture
of the competency of current agents, suggesting over-optimism in previously
reported results. This gap can be attributed to shortcomings in existing
benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark
consisting of 300 diverse and realistic tasks spanning 136 websites. It enables
us to evaluate web agents under a setting that approximates how real users use
these agents. To facilitate more scalable evaluation and development, we also
develop a novel LLM-as-a-Judge automatic evaluation method and show that it can
achieve around 85% agreement with human judgment, substantially higher than
existing methods. Finally, we present the first comprehensive comparative
analysis of current web agents, highlighting both their strengths and
limitations to inspire future research.
comment: 22 pages, 16 figures, 4 tables
☆ LITE: LLM-Impelled efficient Taxonomy Evaluation
This paper presents LITE, an LLM-based evaluation method designed for
efficient and flexible assessment of taxonomy quality. To address challenges in
large-scale taxonomy evaluation, such as efficiency, fairness, and consistency,
LITE adopts a top-down hierarchical evaluation strategy, breaking down the
taxonomy into manageable substructures and ensuring result reliability through
cross-validation and standardized input formats. LITE also introduces a penalty
mechanism to handle extreme cases and provides both quantitative performance
analysis and qualitative insights by integrating evaluation metrics closely
aligned with task objectives. Experimental results show that LITE demonstrates
high reliability in complex evaluation tasks, effectively identifying semantic
errors, logical contradictions, and structural flaws in taxonomies, while
offering directions for improvement. Code is available at
https://github.com/Zhang-l-i-n/TAXONOMY_DETECT.
☆ Tasks and Roles in Legal AI: Data Curation, Annotation, and Verification
The application of AI tools to the legal field feels natural: large legal
document collections could be used with specialized AI to improve workflow
efficiency for lawyers and ameliorate the "justice gap" for underserved
clients. However, legal documents differ from the web-based text that underlies
most AI systems. The challenges of legal AI are both specific to the legal
domain, and confounded with the expectation of AI's high performance in
high-stakes settings. We identify three areas of special relevance to
practitioners: data curation, data annotation, and output verification. First,
it is difficult to obtain usable legal texts. Legal collections are
inconsistent, analog, and scattered for reasons technical, economic, and
jurisdictional. AI tools can assist document curation efforts, but the lack of
existing data also limits AI performance. Second, legal data annotation
typically requires significant expertise to identify complex phenomena such as
modes of judicial reasoning or controlling precedents. We describe case studies
of AI systems that have been developed to improve the efficiency of human
annotation in legal contexts and identify areas of underperformance. Finally,
AI-supported work in the law is valuable only if results are verifiable and
trustworthy. We describe both the abilities of AI systems to support evaluation
of their outputs, as well as new approaches to systematic evaluation of
computational systems in complex domains. We call on both legal and AI
practitioners to collaborate across disciplines and to release open access
materials to support the development of novel, high-performing, and reliable AI
tools for legal applications.
☆ GTR: Graph-Table-RAG for Cross-Table Question Answering
Beyond pure text, a substantial amount of knowledge is stored in tables. In
real-world scenarios, user questions often require retrieving answers that are
distributed across multiple tables. GraphRAG has recently attracted much
attention for enhancing LLMs' reasoning capabilities by organizing external
knowledge to address ad-hoc and complex questions, exemplifying a promising
direction for cross-table question answering. In this paper, to address the
current gap in available data, we first introduce a multi-table benchmark,
MultiTableQA, comprising 60k tables and 25k user queries collected from
real-world sources. Then, we propose the first Graph-Table-RAG framework,
namely GTR, which reorganizes table corpora into a heterogeneous graph, employs
a hierarchical coarse-to-fine retrieval process to extract the most relevant
tables, and integrates graph-aware prompting for downstream LLMs' tabular
reasoning. Extensive experiments show that GTR exhibits superior cross-table
question-answering performance while maintaining high deployment efficiency,
demonstrating its real-world practical applicability.
comment: 20 pages, 7 figures
☆ Breaking BERT: Gradient Attack on Twitter Sentiment Analysis for Targeted Misclassification
Social media platforms like Twitter have increasingly relied on Natural
Language Processing (NLP) techniques to analyze and understand the sentiments
expressed in user-generated content. One such state-of-the-art NLP model is
Bidirectional Encoder Representations from Transformers (BERT), which has been
widely adopted for sentiment analysis. However, BERT is susceptible to
adversarial attacks. This paper scrutinizes the inherent vulnerabilities of
such models in Twitter sentiment analysis and formulates a framework for
constructing targeted adversarial texts capable of deceiving these models
while maintaining stealth. In contrast to conventional methodologies such as
Importance Reweighting, the core idea of this framework resides in its
reliance on gradients to prioritize the importance of individual words within
the text. It uses a white-box approach to attain fine-grained sensitivity,
pinpointing words that exert maximal influence on the classification outcome. This paper is
organized into three interdependent phases. It starts with fine-tuning a
pre-trained BERT model on Twitter data. It then analyzes gradients of the model
to rank words by their importance, and iteratively replaces those words with
feasible candidates until an acceptable solution is found. Finally, it
evaluates the effectiveness of the adversarial text against the custom-trained
sentiment classification model. This assessment gauges the capacity of the
adversarial text to subvert classification without raising any alarm.
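To make the gradient-ranking phase concrete, the sketch below shows the standard pattern of taking gradients with respect to a BERT classifier's input embeddings and ranking tokens by gradient norm. This is a generic illustration rather than the paper's released code; in the actual attack the model would first be fine-tuned on Twitter data, and the highest-ranked words would then be iteratively replaced with candidate substitutions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: a sentiment classifier fine-tuned from BERT; the base checkpoint is
# loaded here only to keep the sketch self-contained.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def rank_tokens_by_gradient(text, target_label):
    enc = tokenizer(text, return_tensors="pt")
    # Embed tokens manually so gradients can be taken w.r.t. the embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"],
                labels=torch.tensor([target_label]))
    out.loss.backward()
    # L2 norm of the gradient at each position serves as the importance score.
    scores = embeds.grad.norm(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])

print(rank_tokens_by_gradient("the service was slow but the food was great", target_label=1)[:5])
```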
☆ Foundations and Evaluations in NLP
This memoir explores two fundamental aspects of Natural Language Processing
(NLP): the creation of linguistic resources and the evaluation of NLP system
performance. Over the past decade, my work has focused on developing a
morpheme-based annotation scheme for the Korean language that captures
linguistic properties from morphology to semantics. This approach has achieved
state-of-the-art results in various NLP tasks, including part-of-speech
tagging, dependency parsing, and named entity recognition. Additionally, this
work provides a comprehensive analysis of segmentation granularity and its
critical impact on NLP system performance. In parallel with linguistic resource
development, I have proposed a novel evaluation framework, the jp-algorithm,
which introduces an alignment-based method to address challenges in
preprocessing tasks like tokenization and sentence boundary detection (SBD).
Traditional evaluation methods assume identical tokenization and sentence
lengths between gold standards and system outputs, limiting their applicability
to real-world data. The jp-algorithm overcomes these limitations, enabling
robust end-to-end evaluations across a variety of NLP tasks. It enhances
accuracy and flexibility by incorporating linear-time alignment while
preserving the complexity of traditional evaluation metrics. This memoir
provides key insights into the processing of morphologically rich languages,
such as Korean, while offering a generalizable framework for evaluating diverse
end-to-end NLP systems. My contributions lay the foundation for future
developments, with broader implications for multilingual resource development
and system evaluation.
☆ Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design NAACL 2025
Mixture-of-Experts (MoE) has successfully scaled up models while maintaining
nearly constant computing costs. By employing a gating network to route input
tokens, it selectively activates a subset of expert networks to process the
corresponding token embeddings. However, in practice, MoE efficiency is
challenging to achieve for two key reasons: imbalanced expert activation,
which leads to substantial idle time and insufficient capacity utilization
during model or expert parallelism; and massive communication overhead,
induced by numerous expert routing combinations in expert parallelism at the
system level. Previous works typically formulate this as a load-imbalance
issue, characterized by the gating network favoring certain experts over
others, or attribute it to static execution that fails to adapt to the dynamic
expert workload at runtime. In this paper, we examine it from a new
perspective: a higher-order view of MoE routing policies in terms of expert
collaboration and specialization, where some experts tend to activate broadly
alongside many others (collaborative), while others are more likely to
activate only with a specific subset of experts (specialized). Our experiments reveal that most experts tend
to be overly collaborative, leading to increased communication overhead from
repeatedly sending tokens to different accelerators. To this end, we propose a
novel collaboration-constrained routing (C2R) strategy to encourage more
specialized expert groups, as well as to improve expert utilization, and
present an efficient implementation of MoE that further leverages expert
specialization. We achieve an average performance improvement of 0.51% and
0.33% on LLaMA-MoE and Qwen-MoE respectively across ten downstream NLP
benchmarks, and reduce the all2all communication costs between GPUs, bringing
an extra 20%-30% total running time savings on top of the existing SoTA, i.e.
MegaBlocks.
comment: NAACL 2025
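The higher-order routing statistic described above can be pictured with a small sketch: given top-k routing decisions, count how often pairs of experts are activated together. Overly collaborative experts show uniformly high off-diagonal counts, which is what a collaboration-constrained router would suppress. The router logits and sizes below are toy placeholders, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, top_k = 1000, 8, 2
logits = rng.normal(size=(num_tokens, num_experts))   # toy gating scores

# Top-2 routing: each token activates its two highest-scoring experts.
topk = np.argsort(-logits, axis=1)[:, :top_k]

# Co-activation counts: how often expert i is selected together with expert j.
coact = np.zeros((num_experts, num_experts))
for a, b in topk:
    coact[a, b] += 1
    coact[b, a] += 1

# Experts with uniformly high off-diagonal mass are "collaborative"; a
# collaboration-constrained strategy would restrict routing to smaller expert groups.
print(np.round(coact / num_tokens, 3))
```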
☆ On Data Synthesis and Post-training for Visual Abstract Reasoning
This paper is a pioneering work attempting to address abstract visual
reasoning (AVR) problems for large vision-language models (VLMs). We make a
common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific
AVR problems, surpassing both open-sourced (e.g., Qwen-2-VL-72B) and
closed-sourced powerful VLMs (e.g., GPT-4o) by a significant margin. This is a
great breakthrough, since almost all previous VLMs fail or show nearly random
performance on representative AVR benchmarks. Our key to success is an
innovative data synthesis and post-training process, designed to gradually
reduce task difficulty and guide the model to learn step by step. Our 7B model
is also shown to behave well on AVR without sacrificing common multimodal
comprehension abilities. We hope our paper could serve as an early effort in
this area and would inspire further research in abstract visual reasoning.
☆ Adaptive Rectification Sampling for Test-Time Compute Scaling
The newly released OpenAI-o1 and DeepSeek-R1 have demonstrated that test-time
scaling can significantly improve model performance, especially in complex
tasks such as logical reasoning. Common test-time scaling methods involve
generating more chains of thought (CoTs) or longer CoTs with self-correction.
However, while self-correction can improve performance, it may lead to
significant token waste and reduce readability of the CoT if the reasoning
steps are already correct. To demonstrate that large language models (LLMs) can
rectify errors at a more fine-grained level, we propose Adaptive Rectification
Sampling (AR-Sampling), which guides LLMs to self-correct at the appropriate
step. AR-Sampling leverages a process-supervised reward model (PRM) as a
verifier, together with constructed trigger sentences, to guide the model in
adaptive step-level rethinking. Experiments on GSM8K and MATH500 indicate that
our approach enables models to rethink at a more fine-grained level, improving
the accuracy of solutions while generating a reasonable number of additional
tokens.
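The sketch below illustrates the adaptive step-level rethinking loop in the spirit of AR-Sampling: a verifier scores each reasoning step, and a trigger sentence asks the model to redo only the steps the verifier flags. The generate_step and prm_score functions are hypothetical stand-ins for the LLM and the PRM, not the paper's code.

```python
TRIGGER = "Wait, let me re-check this step."  # illustrative trigger sentence

def solve(question, generate_step, prm_score, max_steps=8, threshold=0.5, max_retries=2):
    steps = []
    for _ in range(max_steps):
        step = generate_step(question, steps)
        retries = 0
        # Rethink only when the verifier flags the step as unreliable.
        while prm_score(question, steps, step) < threshold and retries < max_retries:
            step = generate_step(question, steps + [TRIGGER])
            retries += 1
        steps.append(step)
        if step.strip().startswith("Answer:"):
            break
    return steps

# Toy stand-ins so the sketch runs end to end.
def generate_step(q, ctx):
    done = len([s for s in ctx if s != TRIGGER]) >= 2
    return "Answer: 42" if done else f"step {len(ctx) + 1}"

def prm_score(q, ctx, step):
    return 0.9  # pretend the verifier approves every step

print(solve("toy question", generate_step, prm_score))
```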
☆ Biomedical Question Answering via Multi-Level Summarization on a Local Knowledge Graph
In Question Answering (QA), Retrieval Augmented Generation (RAG) has
revolutionized performance in various domains. However, how to effectively
capture multi-document relationships, particularly critical for biomedical
tasks, remains an open question. In this work, we propose a novel method that
utilizes propositional claims to construct a local knowledge graph from
retrieved documents. Summaries are then derived via layerwise summarization
from the knowledge graph to contextualize a small language model to perform QA.
We achieved comparable or superior performance with our method over RAG
baselines on several biomedical QA benchmarks. We also evaluated each
individual step of our methodology over a targeted set of metrics,
demonstrating its effectiveness.
☆ ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning
We present ThinkPrune, a simple yet effective method for pruning the thinking
length of long-thinking LLMs, which have been found to often produce
inefficient and redundant thinking processes. Existing preliminary explorations
of reducing thinking length primarily focus on forcing the thinking process to
early exit, rather than adapting the LLM to optimize and consolidate the
thinking process, and therefore the length-performance tradeoff observed so far
is sub-optimal. To fill this gap, ThinkPrune offers a simple solution that
continuously trains the long-thinking LLMs via reinforcement learning (RL) with
an added token limit, beyond which any unfinished thoughts and answers will be
discarded, resulting in a zero reward. To further preserve model performance,
we introduce an iterative length pruning approach, where multiple rounds of RL
are conducted, each with an increasingly more stringent token limit. We
observed that ThinkPrune results in a remarkable performance-length tradeoff --
on the AIME24 dataset, the reasoning length of DeepSeek-R1-Distill-Qwen-1.5B
can be reduced by half with only a 2% drop in performance. We also observed that
after pruning, the LLMs can bypass unnecessary steps while keeping the core
reasoning process complete. Code is available at
https://github.com/UCSB-NLP-Chang/ThinkPrune.
comment: 15 pages, 7 figures
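A minimal sketch of the reward described above: any rollout whose answer is not completed within the current token budget receives zero reward, and the budget is tightened over successive RL rounds. The budgets and the reward interface are illustrative; the released training code is more involved.

```python
def thinkprune_reward(tokens_used, answer_finished, answer_correct, token_limit):
    # Unfinished thoughts beyond the cap are discarded and earn no reward.
    if tokens_used > token_limit or not answer_finished:
        return 0.0
    return 1.0 if answer_correct else 0.0

# Iterative length pruning: each RL round applies a stricter limit.
for limit in (4000, 3000, 2000):
    r = thinkprune_reward(tokens_used=2500, answer_finished=True,
                          answer_correct=True, token_limit=limit)
    print(f"token limit {limit}: reward {r}")
```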
☆ Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing
While the inconsistency of LLMs is not a novel topic, prior research has
predominantly addressed two types of generative inconsistencies: i) Randomness
Inconsistency: running the same LLM over multiple trials yields varying
responses; ii) Paraphrase Inconsistency: paraphrased prompts result in
different responses from the same LLM. Randomness Inconsistency arises from the
inherent randomness due to stochastic sampling in generative models, while
Paraphrase Inconsistency is a consequence of the language modeling objectives,
where paraphrased prompts alter the distribution of vocabulary logits. This
research discovers Prompt-Reverse Inconsistency (PRIN), a new form of LLM
self-inconsistency: given a question and a couple of LLM-generated answer
candidates, the LLM often has conflicting responses when prompted "Which are
correct answers?" and "Which are incorrect answers?". PRIN poses a big concern
as it undermines the credibility of LLM-as-a-judge, and suggests a challenge
for LLMs to adhere to basic logical rules. We conduct a series of experiments
to investigate PRIN, examining the extent of PRIN across different LLMs,
methods to mitigate it, potential applications, and its relationship with
Randomness Inconsistency and Paraphrase Inconsistency. As the first study to
explore PRIN, our findings offer valuable insights into the inner workings of
LLMs and contribute to advancing trustworthy AI.
comment: 9 pages
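For a single question, the inconsistency can be checked mechanically: the candidate sets returned for the two opposite prompts should partition the answer candidates, so any overlap or omission signals PRIN. The sketch below assumes the two sets have already been obtained from an LLM; the example verdicts are hypothetical.

```python
def prin_inconsistent(candidates, correct_set, incorrect_set):
    overlap = correct_set & incorrect_set                       # labelled both ways
    missing = set(candidates) - (correct_set | incorrect_set)   # labelled neither way
    return bool(overlap or missing), overlap, missing

candidates = ["A", "B", "C", "D"]
# Example verdicts an LLM might return for "Which are correct?" / "Which are incorrect?".
correct = {"A", "B"}
incorrect = {"B", "D"}
print(prin_inconsistent(candidates, correct, incorrect))
# (True, {'B'}, {'C'}) -> B is judged both correct and incorrect, C is never classified
```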
☆ Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
We present a comprehensive framework for enhancing Retrieval-Augmented
Generation (RAG) systems through dynamic retrieval strategies and reinforcement
fine-tuning. This approach significantly improves large language models on
knowledge-intensive tasks, including open-domain question answering and complex
reasoning. Our framework integrates two complementary techniques:
Policy-Optimized Retrieval-Augmented Generation (PORAG), which optimizes the use
of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS),
which dynamically determines retrieval timing and content based on contextual
needs. Together, these techniques enhance both the utilization and relevance of
retrieved content, improving factual accuracy and response quality. Designed as
a lightweight solution compatible with any Transformer-based LLM without
requiring additional training, our framework excels in knowledge-intensive
tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a
novel method to selectively compress key-value caches by token importance,
mitigating memory bottlenecks in long-context applications. The framework also
incorporates test-time scaling techniques to dynamically balance reasoning
depth and computational resources, alongside optimized decoding strategies for
faster inference. Experiments on benchmark datasets show that our framework
reduces hallucinations, strengthens domain-specific reasoning, and achieves
significant efficiency and scalability gains over traditional RAG systems. This
integrated approach advances the development of robust, efficient, and scalable
RAG systems across diverse applications.
♻ ☆ Code Generation and Algorithmic Problem Solving Using Llama 3.1 405B
Code generation by Llama 3.1 models, such as Meta's Llama 3.1 405B,
represents a significant advancement in the field of artificial intelligence,
particularly in natural language processing and programming automation. This
paper explores the capabilities and applications of Llama-driven code
generation, highlighting its ability to translate natural language prompts into
executable code across multiple programming languages. Key features include
contextual awareness, multi-language support, and enhanced debugging and
optimization functionalities. By examining these aspects, we illustrate how
Llama can serve as a versatile tool for developers of all skill levels,
improving productivity and efficiency in software development. The potential
implications for education, industry, and the future of coding practices are
also discussed, underscoring the transformative impact of AI in programming.
Experimentation shows that while Llama 3.1 405B performs well with simple
algorithmic and data structure based problems, it still struggles with problems
on Quantum Computing, Bioinformatics, and Artificial Intelligence.
comment: updated version
♻ ☆ DEPT: Decoupled Embeddings for Pre-training Language Models
Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, William F. Shen, Xinchi Qiu, Dongqi Cai, Yan Gao, Nicholas D. Lane
Language Model pre-training uses broad data mixtures to enhance performance
across domains and languages. However, training on such heterogeneous text
corpora requires extensive and expensive efforts. Since these data sources vary
significantly in lexical, syntactic, and semantic aspects, they cause negative
interference or the ``curse of multilinguality''. To address these challenges
we propose a communication-efficient pre-training framework, DEPT. Our method
decouples embeddings from the transformer body while simultaneously training
the latter on multiple data sources without requiring a shared vocabulary. DEPT
can: (1) train robustly and effectively under significant data heterogeneity,
(2) minimize token embedding parameters to only what the data source vocabulary
requires, while cutting communication costs in direct proportion to both the
communication frequency and the reduction in parameters, (3) enhance
transformer body plasticity and generalization, improving both average
perplexity (up to 20%) and downstream task performance, and (4) enable training
with custom optimized vocabularies per data source. We demonstrate DEPT's
potential via the first vocabulary-agnostic federated pre-training of
billion-scale models, reducing communication costs by orders of magnitude and
embedding memory by 4-5x.
♻ ☆ Multilingual European Language Models: Benchmarking Approaches and Challenges
The breakthrough of generative large language models (LLMs) that can solve
different tasks through chat interaction has led to a significant increase in
the use of general benchmarks to assess the quality or performance of these
models beyond individual applications. There is also a need for better methods
to evaluate and compare models, given the ever-increasing number of newly
published models. However, most of the established benchmarks revolve around
the English language. This paper analyses the benefits and limitations of
current evaluation datasets, focusing on multilingual European benchmarks. We
analyse seven multilingual benchmarks and identify four major challenges.
Furthermore, we discuss potential solutions to enhance translation quality and
mitigate cultural biases, including human-in-the-loop verification and
iterative translation ranking. Our analysis highlights the need for culturally
aware and rigorously validated benchmarks to assess the reasoning and
question-answering capabilities of multilingual LLMs accurately.
♻ ☆ Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions ICASSP 2025
Lingwei Meng, Shujie Hu, Jiawen Kang, Zhaoqing Li, Yuejiao Wang, Wenxuan Wu, Xixin Wu, Xunying Liu, Helen Meng
Recent advancements in large language models (LLMs) have revolutionized
various domains, bringing significant progress and new opportunities. Despite
progress in speech-related tasks, LLMs have not been sufficiently explored in
multi-talker scenarios. In this work, we present a pioneering effort to
investigate the capability of LLMs in transcribing speech in multi-talker
environments, following versatile instructions related to multi-talker
automatic speech recognition (ASR), target talker ASR, and ASR based on
specific talker attributes such as sex, occurrence order, language, and keyword
spoken. Our approach utilizes WavLM and Whisper encoders to extract
multi-faceted speech representations that are sensitive to speaker
characteristics and semantic context. These representations are then fed into
an LLM fine-tuned using LoRA, enabling speech comprehension and transcription
capabilities. Comprehensive experiments reveal the promising
performance of our proposed system, MT-LLM, in cocktail party scenarios,
highlighting the potential of LLM to handle speech-related tasks based on user
instructions in such complex settings. The code, model, and samples are
available at https://github.com/cuhealthybrains/MT-LLM.
comment: Accepted to IEEE ICASSP 2025. Update code link
♻ ☆ Finding Transformer Circuits with Edge Pruning NeurIPS 2024
The path to interpreting a language model often proceeds via analysis of
circuits -- sparse computational subgraphs of the model that capture specific
aspects of its behavior. Recent work has automated the task of discovering
circuits. Yet, these methods have practical limitations, as they rely either on
inefficient search algorithms or inaccurate approximations. In this paper, we
frame automated circuit discovery as an optimization problem and propose *Edge
Pruning* as an effective and scalable solution. Edge Pruning leverages
gradient-based pruning techniques, but instead of removing neurons or
components, it prunes the \emph{edges} between components. Our method finds
circuits in GPT-2 that use less than half the number of edges compared to
circuits found by previous methods while being equally faithful to the full
model predictions on standard circuit-finding tasks. Edge Pruning is efficient
even with as many as 100K examples, outperforming previous methods in speed and
producing substantially better circuits. It also perfectly recovers the
ground-truth circuits in two models compiled with Tracr. Thanks to its
efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale
that prior methods operate on. We use this setting for a case study comparing
the mechanisms behind instruction prompting and in-context learning. We find
two circuits with more than 99.96% sparsity that match the performance of the
full model and reveal that the mechanisms in the two settings overlap
substantially. Our case study shows that Edge Pruning is a practical and
scalable tool for interpretability and sheds light on behaviors that only
emerge in large models.
comment: NeurIPS 2024 (Spotlight), code available at
https://github.com/princeton-nlp/Edge-Pruning
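The core idea, gating edges between components rather than removing components, can be pictured with a toy two-component model: each edge carries a learnable gate, and a sparsity penalty pushes gates toward zero so that only the edges forming the circuit survive. The architecture and penalty below are illustrative, not the authors' implementation.

```python
import torch

class TwoComponentCircuit(torch.nn.Module):
    """Two linear 'components' stand in for attention heads / MLP blocks."""
    def __init__(self, d=16):
        super().__init__()
        self.comp_a = torch.nn.Linear(d, d)
        self.comp_b = torch.nn.Linear(d, d)
        # One learnable logit per edge: (a -> b) and (residual input -> b).
        self.edge_logits = torch.nn.Parameter(torch.zeros(2))

    def forward(self, x):
        gates = torch.sigmoid(self.edge_logits)          # soft edge masks in [0, 1]
        a_out = self.comp_a(x)
        # Component b reads a gated mixture of a's output and the residual input.
        b_in = gates[0] * a_out + gates[1] * x
        return self.comp_b(b_in), gates

model = TwoComponentCircuit()
x = torch.randn(4, 16)
out, gates = model(x)
loss = out.pow(2).mean() + 0.1 * gates.sum()              # task loss + edge-sparsity penalty
loss.backward()
print(gates.detach())
```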
♻ ☆ Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? A Case Study on Vaccine Hesitancy
Abe Bohan Hou, Hongru Du, Yichen Wang, Jingyu Zhang, Zixiao Wang, Paul Pu Liang, Daniel Khashabi, Lauren Gardner, Tianxing He
Can we simulate a sandbox society with generative agents to model human
behavior, thereby reducing the over-reliance on real human trials for assessing
public policies? In this work, we investigate the feasibility of simulating
health-related decision-making, using vaccine hesitancy, defined as the delay
in acceptance or refusal of vaccines despite the availability of vaccination
services (MacDonald, 2015), as a case study. To this end, we introduce the
VacSim framework with 100 generative agents powered by Large Language Models
(LLMs). VacSim simulates vaccine policy outcomes with the following steps: 1)
instantiate a population of agents with demographics based on census data; 2)
connect the agents via a social network and model vaccine attitudes as a
function of social dynamics and disease-related information; 3) design and
evaluate various public health interventions aimed at mitigating vaccine
hesitancy. To align with real-world results, we also introduce simulation
warmup and attitude modulation to adjust agents' attitudes. We propose a series
of evaluations to assess the reliability of various LLM simulations.
Experiments indicate that models like Llama and Qwen can simulate aspects of
human behavior but also highlight real-world alignment challenges, such as
inconsistent responses with demographic profiles. This early exploration of
LLM-driven simulations is not meant to serve as definitive policy guidance;
instead, it serves as a call for action to examine social simulation for policy
development.
♻ ☆ Non-Determinism of "Deterministic" LLM Settings
Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, Breck Baldwin
LLM (large language model) practitioners commonly notice that outputs can
vary for the same inputs under settings expected to be deterministic. Yet the
questions of how pervasive this is, and with what impact on results, have not
to our knowledge been systematically investigated. We investigate
non-determinism in five LLMs configured to be deterministic when applied to
eight common tasks across 10 runs, in both zero-shot and few-shot settings.
We see accuracy variations of up to 15% across naturally occurring runs, with
a gap between best and worst possible performance of up to 70%. In fact,
none of the LLMs consistently delivers repeatable accuracy across all tasks,
much less identical output strings. Sharing preliminary results with insiders
has revealed that non-determinism is perhaps essential to the efficient use of
compute resources via co-mingled data in input buffers, so this issue is not
going away anytime soon. To better quantify our observations, we introduce
metrics focused on quantifying determinism, TARr@N for the total agreement rate
at N runs over raw output, and TARa@N for total agreement rate of parsed-out
answers. Our code and data are publicly available at
https://github.com/breckbaldwin/llm-stability.
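Under the natural reading that a set of runs "totally agrees" only when all N outputs coincide, the two metrics can be sketched as below: TARr@N compares raw output strings, TARa@N compares parsed-out answers. The example runs and the answer parser are hypothetical; consult the released code for the exact definitions.

```python
def tar_at_n(outputs_per_item, normalize=lambda s: s):
    """Fraction of items whose N runs all agree after applying `normalize`."""
    agree = sum(len({normalize(o) for o in runs}) == 1 for runs in outputs_per_item)
    return agree / len(outputs_per_item)

runs = [
    ["The answer is 4.", "The answer is 4.", "Answer: 4"],   # raw strings differ
    ["Paris", "Paris", "Paris"],
]
parse = lambda s: s.split()[-1].strip(".")                    # toy answer extractor
print("TARr@3:", tar_at_n(runs))                              # 0.5: raw outputs disagree on item 1
print("TARa@3:", tar_at_n(runs, normalize=parse))             # 1.0: parsed answers agree
```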
♻ ☆ Prior Lessons of Incremental Dialogue and Robot Action Management for the Age of Language Models
Efforts towards endowing robots with the ability to speak have benefited from
recent advancements in natural language processing, in particular large
language models. However, current language models are not fully incremental, as
their processing is inherently monotonic and thus lack the ability to revise
their interpretations or output in light of newer observations. This
monotonicity has important implications for the development of dialogue systems
for human--robot interaction. In this paper, we review the literature on
interactive systems that operate incrementally (i.e., at the word level or
below it). We motivate the need for incremental systems and survey incremental
modeling of important aspects of dialogue, such as speech recognition and
language generation. Our primary focus is on the part of the system that makes
decisions, known as the dialogue manager. We find that there is very little
research on incremental dialogue management, offer some requirements for
practical incremental dialogue management, and discuss the implications of
incremental dialogue for embodied, robotic platforms in the age of large
language models.
comment: 20 pages
♻ ☆ Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Jimin Huang, Mengxi Xiao, Dong Li, Zihao Jiang, Yuzhe Yang, Yifei Zhang, Lingfei Qian, Yan Wang, Xueqing Peng, Yang Ren, Ruoyu Xiang, Zhengyu Chen, Xiao Zhang, Yueru He, Weiguang Han, Shunian Chen, Lihang Shen, Daniel Kim, Yangyang Yu, Yupeng Cao, Zhiyang Deng, Haohang Li, Duanyu Feng, Yongfu Dai, VijayaSai Somasundaram, Peng Lu, Guojun Xiong, Zhiwei Liu, Zheheng Luo, Zhiyuan Yao, Ruey-Ling Weng, Meikang Qiu, Kaleb E Smith, Honghai Yu, Yanzhao Lai, Min Peng, Jian-Yun Nie, Jordan W. Suchow, Xiao-Yang Liu, Benyou Wang, Alejandro Lopez-Lira, Qianqian Xie, Sophia Ananiadou, Junichi Tsujii
Financial LLMs hold promise for advancing financial tasks and domain-specific
applications. However, they are limited by scarce corpora, weak multimodal
capabilities, and narrow evaluations, making them less suited for real-world
application. To address this, we introduce \textit{Open-FinLLMs}, the first
open-source multimodal financial LLMs designed to handle diverse tasks across
text, tabular, time-series, and chart data, excelling in zero-shot, few-shot,
and fine-tuning settings. The suite includes FinLLaMA, pre-trained on a
comprehensive 52-billion-token corpus; FinLLaMA-Instruct, fine-tuned with 573K
financial instructions; and FinLLaVA, enhanced with 1.43M multimodal tuning
pairs for strong cross-modal reasoning. We comprehensively evaluate
Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in
zero-shot, few-shot, and supervised fine-tuning settings, introducing two new
multimodal evaluation datasets. Our results show that Open-FinLLMs outperforms
advanced financial and general LLMs such as GPT-4, across financial NLP,
decision-making, and multi-modal tasks, highlighting their potential to tackle
real-world challenges. To foster innovation and collaboration across academia
and industry, we release all codes
(https://anonymous.4open.science/r/PIXIU2-0D70/B1D7/LICENSE) and models under
OSI-approved licenses.
comment: 33 pages, 13 figures
♻ ☆ Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
Low-resource machine translation (MT) presents a diversity of community needs
and application challenges that remain poorly understood. To complement surveys
and focus groups, which tend to rely on small samples of respondents, we
propose an observational study on actual usage patterns of tetun$.$org, a
specialized MT service for the Tetun language, which is the lingua franca in
Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that
challenge assumptions based on existing corpora. We find that users, many of
them students on mobile devices, typically translate text from a high-resource
language into Tetun across diverse domains including science, healthcare, and
daily life. This contrasts sharply with available Tetun corpora, which are
dominated by news articles covering government and social issues. Our results
suggest that MT systems for institutionalized minority languages like Tetun
should prioritize accuracy on domains relevant to educational contexts, in the
high-resource to low-resource direction. More broadly, this study demonstrates
how observational analysis can inform low-resource language technology
development, by grounding research in practical community needs.
comment: to be published in LoResMT 2025
♻ ☆ TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
The detection of telecom fraud faces significant challenges due to the lack
of high-quality multimodal training data that integrates audio signals with
reasoning-oriented textual analysis. To address this gap, we present
TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset
specifically designed for automated telecom fraud analysis. Our dataset is
constructed through three strategies: (1) Privacy-preserved text-truth sample
generation using automatic speech recognition (ASR)-transcribed call
recordings (with anonymized original audio), ensuring real-world consistency
through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via
large language model (LLM)-based self-instruction sampling on authentic ASR
outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that
simulates emerging fraud tactics through predefined communication scenarios and
fraud typologies. The generated dataset contains 28,511 rigorously processed
speech-text pairs, complete with detailed annotations for fraud reasoning. The
dataset is divided into three tasks: scenario classification, fraud detection,
and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a
standardized evaluation benchmark comprising proportionally sampled instances
from the dataset, to facilitate systematic testing of model performance on
telecom fraud detection tasks. We also contribute a production-optimized
supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while
open-sourcing the data processing framework to enable community-driven dataset
expansion. This work establishes a foundational framework for multimodal
anti-fraud research while addressing critical challenges in data privacy and
scenario diversity. The project will be released at
https://github.com/JimmyMa99/TeleAntiFraud.
♻ ☆ Interpretable Steering of Large Language Models with Feature Guided Activation Additions
Effective and reliable control over large language model (LLM) behavior is a
significant challenge. While activation steering methods, which add steering
vectors to a model's hidden states, are a promising approach, existing
techniques often lack precision and interpretability in how they influence
model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel
activation steering method that leverages insights from Contrastive Activation
Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating
in the latent space of a Sparse Autoencoder (SAE) and employing optimization
techniques to select desired SAE features, FGAA constructs precise steering
vectors that provide better steering effects while maintaining coherence of
steered model outputs. Evaluations on Gemma-2-2B and Gemma-2-9B models across
various steering tasks demonstrate that FGAA outperforms the existing steering
methods CAA, SAE decoder steering, and SAE-TS. Our results also
highlight important trade-offs between steering scale and general model
capabilities that are consistent across all tested steering methods.
comment: 9 maintext pages, 13 appendix pages
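The sketch below illustrates the general shape of feature-guided activation addition: a steering vector is assembled from selected SAE decoder directions and added to the residual stream at one layer. The decoder matrix, feature choices, and scale are random placeholders, and the paper's method additionally optimizes which features to use; this is not the authors' code.

```python
import torch

d_model, n_features = 64, 512
sae_decoder = torch.randn(n_features, d_model)        # placeholder decoder directions
sae_decoder = sae_decoder / sae_decoder.norm(dim=-1, keepdim=True)

selected = {3: 2.0, 41: 1.5, 200: -1.0}                # feature index -> desired weight
steering_vec = sum(w * sae_decoder[i] for i, w in selected.items())

def steer_hidden_states(hidden, scale=4.0):
    """Add the steering vector to every token position of one layer's activations."""
    return hidden + scale * steering_vec

hidden = torch.randn(1, 10, d_model)                   # (batch, seq, d_model)
print(steer_hidden_states(hidden).shape)
```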
♻ ☆ Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
The reproduction of state-of-the-art multimodal LLM pre-training faces
barriers at every stage of the pipeline, including high-quality data filtering,
multimodal data mixture strategies, sequence packing techniques, and training
frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter
Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs
using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic
image resolution and multimodal sequence packing to significantly enhance
pre-training efficiency. The training dataset was carefully curated using both
MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based
filtering methods, substantially improving data quality and training
efficiency. The Open-Qwen2VL pre-training is conducted on academic-level
8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of 1.4T
multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned
Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on
various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista,
indicating the remarkable training efficiency of Open-Qwen2VL. We open-source
all aspects of our work, including compute-efficient and data-efficient
training details, data filtering methods, sequence packing scripts,
pre-training data in WebDataset format, FSDP-based training codebase, and both
base and instruction-tuned model checkpoints. We redefine "fully open" for
multimodal LLMs as the complete release of: 1) the training codebase, 2)
detailed data filtering techniques, and 3) all pre-training and supervised
fine-tuning data used to develop the model.
♻ ☆ Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning
Large language models (LLMs) have shown a remarkable ability to learn and
perform complex tasks through in-context learning (ICL). However, a
comprehensive understanding of its internal mechanisms is still lacking. This
paper explores the role of induction heads in a few-shot ICL setting. We
analyse two state-of-the-art models, Llama-3-8B and InternLM2-20B, on abstract
pattern recognition and NLP tasks. Our results show that even a minimal
ablation of induction heads leads to ICL performance decreases of up to ~32%
for abstract pattern recognition tasks, bringing the performance close to
random. For NLP tasks, this ablation substantially decreases the model's
ability to benefit from examples, bringing few-shot ICL performance close to
that of zero-shot prompts. We further use attention knockout to disable
specific induction patterns, and present fine-grained evidence for the role
that the induction mechanism plays in ICL.
comment: 9 pages, 7 figures; Code link added
♻ ☆ FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning
task requiring intelligent systems to answer natural language queries based on
paired audio-video inputs accurately. However, existing AVQA approaches often
suffer from overfitting to dataset biases, leading to poor robustness.
Moreover, current datasets may not effectively diagnose these methods. To
address these challenges, we first introduce a novel dataset, FortisAVQA,
constructed in two stages: (1) rephrasing questions in the test split of the
public MUSIC-AVQA dataset and (2) introducing distribution shifts across
questions. The first stage expands the test space with greater diversity, while
the second enables a refined robustness evaluation across rare, frequent, and
overall question distributions. Second, we introduce a robust Multimodal
Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle
collaborative debiasing strategy to mitigate bias learning. Experimental
results demonstrate that our architecture achieves state-of-the-art performance
on FortisAVQA, with a notable improvement of 7.81\%. Extensive ablation studies
on both datasets validate the effectiveness of our debiasing components.
Additionally, our evaluation reveals the limited robustness of existing
multimodal QA methods. We also verify the plug-and-play capability of our
strategy by integrating it with various baseline models across both datasets.
Our dataset and code are available at https://github.com/reml-group/fortisavqa.
comment: Under Review
♻ ☆ Medical Spoken Named Entity Recognition NAACL 2025
Khai Le-Duc, David Thulke, Hung-Phong Tran, Long Vo-Dang, Khai-Nguyen Nguyen, Truong-Son Hy, Ralf Schlüter
Spoken Named Entity Recognition (NER) aims to extract named entities from
speech and categorise them into types like person, location, organization, etc.
In this work, we present VietMed-NER - the first spoken NER dataset in the
medical domain. To our knowledge, our Vietnamese real-world dataset is the
largest spoken NER dataset in the world regarding the number of entity types,
featuring 18 distinct types. Furthermore, we present baseline results using
various state-of-the-art pre-trained models: encoder-only and
sequence-to-sequence; and conduct quantitative and qualitative error analysis.
We found that pre-trained multilingual models generally outperform monolingual
models on reference text and ASR output, and that encoders outperform
sequence-to-sequence models in NER tasks. By translating the transcripts, the
dataset can also be utilised for text NER in the medical domain in languages
other than Vietnamese. All code, data and models are publicly available:
https://github.com/leduckhai/MultiMed/tree/master/VietMed-NER.
comment: NAACL 2025, 60 pages
♻ ☆ Linear Representations of Political Perspective Emerge in Large Language Models ICLR 2025
Large language models (LLMs) have demonstrated the ability to generate text
that realistically reflects a range of different subjective human perspectives.
This paper studies how LLMs are seemingly able to reflect more liberal versus
more conservative viewpoints among other political perspectives in American
politics. We show that LLMs possess linear representations of political
perspectives within activation space, wherein more similar perspectives are
represented closer together. To do so, we probe the attention heads across the
layers of three open transformer-based LLMs (Llama-2-7b-chat,
Mistral-7b-instruct, Vicuna-7b). We first prompt models to generate text from
the perspectives of different U.S. lawmakers. We then identify sets of
attention heads whose activations linearly predict those lawmakers' DW-NOMINATE
scores, a widely-used and validated measure of political ideology. We find that
highly predictive heads are primarily located in the middle layers, often
speculated to encode high-level concepts and tasks. Using probes only trained
to predict lawmakers' ideology, we then show that the same probes can predict
measures of news outlets' slant from the activations of models prompted to
simulate text from those news outlets. These linear probes allow us to
visualize, interpret, and monitor ideological stances implicitly adopted by an
LLM as it generates open-ended responses. Finally, we demonstrate that by
applying linear interventions to these attention heads, we can steer the model
outputs toward a more liberal or conservative stance. Overall, our research
suggests that LLMs possess a high-level linear representation of American
political ideology and that by leveraging recent advances in mechanistic
interpretability, we can identify, monitor, and steer the subjective
perspective underlying generated text.
comment: Published as a conference paper at ICLR 2025
https://openreview.net/forum?id=rwqShzb9li
♻ ☆ Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Recent advances in reasoning models have demonstrated significant
improvements in accuracy, particularly for complex tasks such as mathematical
reasoning, by employing detailed and comprehensive reasoning processes.
However, generating these lengthy reasoning sequences is computationally
expensive and time-consuming. To address this inefficiency, we leverage the
inherent parallelizability of certain tasks to accelerate the reasoning
process. Specifically, when multiple parallel reasoning branches exist, we
decode multiple tokens per step using a specialized attention mask, processing
them within a single sequence, avoiding additional memory usage. Experimental
results show that our method achieves over 100% speedup in decoding time while
maintaining the answer quality.
comment: Our code is available in
https://github.com/yuyijiong/parallel-decoding-in-one-sequence
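The kind of specialized attention mask described above can be sketched as follows: every token attends causally to the shared prefix and to earlier tokens of its own branch, but never to the other branch, so independent branches can be decoded inside one sequence. The branch layout and sizes are illustrative, not the repository's implementation.

```python
import numpy as np

def parallel_branch_mask(prefix_len, branch_lens):
    total = prefix_len + sum(branch_lens)
    allowed = np.zeros((total, total), dtype=bool)
    # Shared prefix: ordinary causal attention.
    for i in range(prefix_len):
        allowed[i, : i + 1] = True
    # Branch tokens: causal within their own branch, plus full view of the prefix.
    start = prefix_len
    for blen in branch_lens:
        for j in range(blen):
            pos = start + j
            allowed[pos, :prefix_len] = True
            allowed[pos, start: pos + 1] = True
        start += blen
    return allowed

mask = parallel_branch_mask(prefix_len=3, branch_lens=[2, 2])
print(mask.astype(int))   # branch tokens never attend across branches
```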
♻ ☆ TRA: Better Length Generalisation with Threshold Relative Attention
Transformers struggle with length generalisation, displaying poor performance
even on basic tasks. We test whether these limitations can be explained through
two key failures of the self-attention mechanism. The first is the inability to
fully remove irrelevant information. The second is tied to position: even if
the dot product between a key and a query is highly negative (i.e., an
irrelevant key), learned positional biases may unintentionally up-weight such
information, which is dangerous when distances become out of distribution.
Put together, these two
failure cases lead to compounding generalisation difficulties. We test whether
they can be mitigated through the combination of a) selective sparsity, which
completely removes irrelevant keys from the attention softmax, and b)
contextualised relative distance, where distance is only considered between
the query and the keys that matter. We show how refactoring the attention mechanism
with these two mitigations in place can substantially improve generalisation
capabilities of decoder-only transformers.
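A toy sketch of the two mitigations follows: (a) keys scoring below a threshold are removed from the softmax entirely, and (b) relative distance is re-indexed over the surviving keys only. The shapes, threshold, and linear positional bias are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def threshold_relative_attention(scores, values, threshold=0.0, decay=0.1):
    keep = np.where(scores >= threshold)[0]           # (a) drop irrelevant keys entirely
    if keep.size == 0:
        return np.zeros_like(values[0])
    # (b) distance counted only between the query and the keys that matter.
    rel_dist = np.arange(keep.size)[::-1]             # 0 = closest surviving key
    biased = scores[keep] - decay * rel_dist
    weights = np.exp(biased - biased.max())
    weights /= weights.sum()
    return weights @ values[keep]

scores = np.array([-2.0, 0.5, -0.1, 1.2])             # query-key dot products
values = np.arange(8, dtype=float).reshape(4, 2)
print(threshold_relative_attention(scores, values))
```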
♻ ☆ CrossFormer: Cross-Segment Semantic Fusion for Document Segmentation
Text semantic segmentation involves partitioning a document into multiple
paragraphs with continuous semantics based on the subject matter, contextual
information, and document structure. Traditional approaches have typically
relied on preprocessing documents into segments to address input length
constraints, resulting in the loss of critical semantic information across
segments. To address this, we present CrossFormer, a transformer-based model
featuring a novel cross-segment fusion module that dynamically models latent
semantic dependencies across document segments, substantially elevating
segmentation accuracy. Additionally, CrossFormer can replace rule-based chunk
methods within the Retrieval-Augmented Generation (RAG) system, producing more
semantically coherent chunks that enhance its efficacy. Comprehensive
evaluations confirm CrossFormer's state-of-the-art performance on public text
semantic segmentation datasets, alongside considerable gains on RAG benchmarks.
comment: 10 pages, 4 figures
♻ ☆ Is Your LLM Outdated? A Deep Look at Temporal Generalization NAACL 2025
The rapid advancement of Large Language Models (LLMs) has led to the
development of benchmarks that consider temporal dynamics; however, there
remains a gap in understanding how well these models can generalize across
temporal contexts due to the inherent dynamic nature of language and
information. This paper introduces the concept of temporal generalization in
LLMs, including bias in past and future generalizations. Then we introduce
FreshBench, a new evaluation framework that employs fresh text and event
prediction for assessing LLMs' temporal adaptability, ensuring the evaluation
process is free from data leakage and subjective bias. The experiments show
significant temporal biases and a decline in performance over time. Our
findings reveal that powerful models, while initially superior, tend to decline
more rapidly in future generalization. Additionally, powerful open-source
models demonstrate better long-term adaptability compared to their
closed-source counterparts. Our code is available at
https://github.com/FreedomIntelligence/FreshBench.
comment: NAACL 2025 Oral
♻ ☆ Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
The DeepSeek series has demonstrated outstanding performance in general scene
understanding, question-answering (QA), and text generation tasks, owing to its
efficient training paradigm and strong reasoning capabilities. In this study,
we investigate the dialogue capabilities of the DeepSeek model in robotic
surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and
Detailed Description. The Single Phrase QA tasks further include sub-tasks such
as surgical instrument recognition, action understanding, and spatial position
analysis. We conduct extensive evaluations using publicly available datasets,
including EndoVis18 and CholecT50, along with their corresponding dialogue
data. Our comprehensive evaluation results indicate that, when provided with
specific prompts, DeepSeek-V3 performs well in surgical instrument and tissue
recognition tasks. However, DeepSeek-V3 exhibits significant limitations in
spatial position analysis and struggles to understand surgical actions
accurately. Additionally, our findings reveal that, under general prompts,
DeepSeek-V3 lacks the ability to effectively analyze global surgical concepts
and fails to provide detailed insights into surgical scenarios. Based on our
observations, we argue that the DeepSeek-V3 is not ready for vision-language
tasks in surgical contexts without fine-tuning on surgery-specific datasets.
comment: Technical Report
♻ ☆ Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering CVPR 2025
Multimodal LLMs (MLLMs) are the natural extension of large language models to
handle multimodal inputs, combining text and image data. They have recently
garnered attention due to their capability to address complex tasks involving
both modalities. However, their effectiveness is limited to the knowledge
acquired during training, which restricts their practical utility. In this
work, we introduce a novel method to enhance the adaptability of MLLMs by
integrating external knowledge sources. Our proposed model, Reflective LLaVA
(ReflectiVA), utilizes reflective tokens to dynamically determine the need for
external knowledge and predict the relevance of information retrieved from an
external database. Tokens are trained following a two-stage two-model training
recipe. This ultimately enables the MLLM to manage external knowledge while
preserving fluency and performance on tasks where external knowledge is not
needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for
knowledge-based visual question answering, highlighting its superior
performance compared to existing methods. Source code and trained models are
publicly available at https://aimagelab.github.io/ReflectiVA.
comment: CVPR 2025
♻ ☆ Optimizing Social Media Annotation of HPV Vaccine Skepticism and Misinformation Using Large Language Models: An Experimental Evaluation of In-Context Learning and Fine-Tuning Stance Detection Across Multiple Models
Luhang Sun, Varsha Pendyala, Yun-Shiuan Chuang, Shanglin Yang, Jonathan Feldman, Andrew Zhao, Munmun De Choudhury, Sijia Yang, Dhavan Shah
This paper leverages large-language models (LLMs) to experimentally determine
optimal strategies for scaling up social media content annotation for stance
detection on HPV vaccine-related tweets. We examine both conventional
fine-tuning and emergent in-context learning methods, systematically varying
strategies of prompt engineering across widely used LLMs and their variants
(e.g., GPT4, Mistral, and Llama3, etc.). Specifically, we varied prompt
template design, shot sampling methods, and shot quantity to detect stance on
HPV vaccination. Our findings reveal that 1) in general, in-context learning
outperforms fine-tuning in stance detection for HPV vaccine social media
content; 2) increasing shot quantity does not necessarily enhance performance
across models; and 3) different LLMs and their variants present differing
sensitivity to in-context learning conditions. We uncovered that the optimal
in-context learning configuration for stance detection on HPV vaccine tweets
involves six stratified shots paired with detailed contextual prompts. This
study highlights the potential and provides an applicable approach for applying
LLMs to research on social media stance and skepticism detection.
♻ ☆ FAN: Fourier Analysis Networks
Yihong Dong, Ge Li, Yongding Tao, Xue Jiang, Kechi Zhang, Jia Li, Jinliang Deng, Jing Su, Jun Zhang, Jingjing Xu
Despite the remarkable successes of general-purpose neural networks, such as
MLPs and Transformers, we find that they exhibit notable shortcomings in
modeling and reasoning about periodic phenomena, achieving only marginal
performance within the training domain and failing to generalize effectively to
out-of-domain (OOD) scenarios. Periodicity is ubiquitous throughout nature and
science. Therefore, neural networks should be equipped with the essential
ability to model and handle periodicity. In this work, we propose FAN, a novel
general-purpose neural network that offers broad applicability similar to MLP
while effectively addressing periodicity modeling challenges. Periodicity is
naturally integrated into FAN's structure and computational processes by
introducing the Fourier Principle. Unlike existing Fourier-based networks,
which possess particular periodicity modeling abilities but are typically
designed for specific tasks, our approach maintains the general-purpose
modeling capability. Therefore, FAN can seamlessly replace MLP in various model
architectures with fewer parameters and FLOPs. Through extensive experiments,
we demonstrate the superiority of FAN in periodicity modeling tasks and the
effectiveness and generalizability of FAN across a range of real-world tasks,
e.g., symbolic formula representation, time series forecasting, language
modeling, and image recognition.
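The sketch below shows a Fourier-style layer in the spirit of FAN: part of the output is produced by learned sin/cos features, giving the layer an explicit periodic basis, while the rest follows an ordinary nonlinear path. The split ratio and parameterisation are illustrative; the paper's exact layer definition may differ.

```python
import torch
import torch.nn.functional as F

class FANLayerSketch(torch.nn.Module):
    def __init__(self, d_in, d_out, periodic_ratio=0.25):
        super().__init__()
        d_p = int(d_out * periodic_ratio)              # width of the periodic part
        self.proj_p = torch.nn.Linear(d_in, d_p, bias=False)
        self.proj_g = torch.nn.Linear(d_in, d_out - 2 * d_p)

    def forward(self, x):
        p = self.proj_p(x)
        # Periodic features (cos/sin of a learned projection) plus a standard path.
        return torch.cat([torch.cos(p), torch.sin(p), F.gelu(self.proj_g(x))], dim=-1)

layer = FANLayerSketch(d_in=32, d_out=64)
print(layer(torch.randn(8, 32)).shape)                 # torch.Size([8, 64])
```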
♻ ☆ APPLS: Evaluating Evaluation Metrics for Plain Language Summarization EMNLP
While there has been significant development of models for Plain Language
Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated
assessment metric, and the suitability of text generation evaluation metrics is
unclear due to the unique transformations involved (e.g., adding background
explanations, removing jargon). To address these questions, our study
introduces a granular meta-evaluation testbed, APPLS, designed to evaluate
metrics for PLS. We identify four PLS criteria from previous work --
informativeness, simplification, coherence, and faithfulness -- and define a
set of perturbations corresponding to these criteria that sensitive metrics
should be able to detect. We apply these perturbations to extractive hypotheses
for two PLS datasets to form our testbed. Using APPLS, we assess performance of
14 metrics, including automated scores, lexical features, and LLM prompt-based
evaluations. Our analysis reveals that while some current metrics show
sensitivity to specific criteria, no single method captures all four criteria
simultaneously. We therefore recommend a suite of automated metrics be used to
capture PLS quality along all relevant criteria. This work contributes the
first meta-evaluation testbed for PLS and a comprehensive evaluation of
existing metrics. APPLS and our evaluation code is available at
https://github.com/LinguisticAnomalies/APPLS.
comment: This paper has been accepted by 2024 EMNLP main. Please cite the
EMNLP version
♻ ☆ Designing Speech Technologies for Australian Aboriginal English: Opportunities, Risks and Participation
In Australia, post-contact language varieties, including creoles and local
varieties of international languages, emerged as a result of forced contact
between Indigenous communities and English speakers. These contact varieties
are widely used, yet are poorly supported by language technologies. This gap
presents barriers to participation in civil and economic society for Indigenous
communities using these varieties, and reproduces minoritisation of
contemporary Indigenous sociolinguistic identities. This paper concerns three
questions regarding this context. First, can speech technologies support
speakers of Australian Aboriginal English, a local indigenised variety of
English? Second, what risks are inherent in such a project? Third, what
technology development practices are appropriate for this context, and how can
researchers integrate meaningful community participation in order to mitigate
risks? We argue that opportunities do exist -- as well as risks -- and
demonstrate this through a case study exploring design practices in a
real-world project aiming to improve speech technologies for Australian
Aboriginal English. We discuss how we integrated culturally appropriate and
participatory processes throughout the project. We call for increased support
for languages used by Indigenous communities, including contact varieties,
which provide practical economic and socio-cultural benefits, provided that
participatory and culturally safe practices are enacted.
♻ ☆ MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Models that rely on subword tokenization have significant drawbacks, such as
sensitivity to character-level noise like spelling errors and inconsistent
compression rates across different languages and scripts. While character- or
byte-level models like ByT5 attempt to address these concerns, they have not
gained widespread adoption -- processing raw byte streams without tokenization
results in significantly longer sequence lengths, making training and inference
inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of
ByT5 that integrates a token deletion mechanism in its encoder to dynamically
shorten the input sequence length. After processing through a fixed number of
encoder layers, a learned delete gate determines which tokens are to be removed
and which are to be retained for subsequent layers. MrT5 effectively "merges"
critical information from deleted tokens into a more compact sequence,
leveraging contextual information from the remaining tokens. In continued
pre-training experiments, we find that MrT5 can achieve significant gains in
inference runtime with minimal effect on performance, as measured by
bits-per-byte. Additionally, with multilingual training, MrT5 adapts to the
orthographic characteristics of each language, learning language-specific
compression rates. Furthermore, MrT5 shows comparable accuracy to ByT5 on
downstream evaluations such as XNLI, TyDi QA, and character-level tasks while
reducing sequence lengths by up to 75%. Our approach presents a solution to the
practical limitations of existing byte-level models.
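To make the delete-gate idea concrete, the sketch below shows a minimal PyTorch
gate that scores byte-level positions after some fixed encoder layer and prunes
low-scoring ones. The module name, threshold, and hard-pruning rule are
illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn as nn

class DeleteGate(nn.Module):
    """Illustrative delete gate: scores each position and prunes low-scoring tokens.

    A simplified sketch of the mechanism described in the MrT5 abstract, not the
    paper's actual implementation."""

    def __init__(self, hidden_size: int, keep_threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)
        self.keep_threshold = keep_threshold

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor):
        # hidden: (batch, seq_len, hidden); mask: (batch, seq_len), 1 = real token
        keep_prob = torch.sigmoid(self.scorer(hidden)).squeeze(-1)   # (batch, seq_len)
        keep = (keep_prob > self.keep_threshold) & mask.bool()       # hard decision
        new_len = int(keep.sum(dim=1).max().item())                  # longest surviving sequence
        pruned = hidden.new_zeros(hidden.size(0), new_len, hidden.size(-1))
        new_mask = mask.new_zeros(hidden.size(0), new_len)
        for b in range(hidden.size(0)):                              # gather kept tokens per example
            kept = hidden[b, keep[b]]
            pruned[b, : kept.size(0)] = kept
            new_mask[b, : kept.size(0)] = 1
        return pruned, new_mask

# Usage: apply after a fixed number of encoder layers, then run the remaining
# layers on the shorter sequence.
gate = DeleteGate(hidden_size=512)
hidden = torch.randn(2, 128, 512)
mask = torch.ones(2, 128)
shorter, shorter_mask = gate(hidden, mask)
```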
♻ ☆ On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy
A significant approach in natural language processing involves large-scale
pre-training of models on general domain data followed by their adaptation to
specific tasks or domains. As models grow in size, full fine-tuning of all
their parameters becomes increasingly impractical. To address this, several
methods for low-rank task adaptation of language models have been proposed,
e.g., LoRA and FLoRA. These methods keep the pre-trained model weights fixed
and incorporate trainable low-rank decomposition matrices into some layers of
the transformer architecture, called adapters. This approach significantly
reduces the number of trainable parameters required for downstream tasks
compared to fully fine-tuning all parameters. In this work, we look at low-rank
adaptation through the lens of data privacy. We show theoretically that the
low-rank adaptation used in LoRA and FLoRA leads to the injection of some
random noise into the batch gradients w.r.t. the adapter parameters. We quantify
the variance of the injected noise and show that the smaller the adaptation
rank, the larger the noise variance. By establishing a Berry-Esseen-type bound
on the total variation distance between the distribution of the injected noise
and a Gaussian distribution with the same variance, we show that the dynamics
of low-rank adaptation are close to those of differentially private fine-tuning
of the adapters. Finally, using the Johnson-Lindenstrauss lemma, we show that
when augmented with gradient scaling, low-rank adaptation comes very close to
performing the DPSGD algorithm with a fixed noise scale to fine-tune the
adapters. Supported by our theoretical findings and corroborated by our
experimental results, we show that low-rank adaptation, besides reducing space
and computational complexity, implicitly provides privacy protection w.r.t. the
fine-tuning data, without incurring the high space complexity of DPSGD.
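For reference, here is a minimal sketch of a rank-r adapter of the kind
analysed here: the pre-trained weight stays frozen while only the low-rank
factors receive gradients, which are exactly the adapter gradients whose
implicit noise the paper quantifies. The class, initialisation, and scaling are
illustrative, not the LoRA/FLoRA reference code.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable rank-r update B @ A.

    Only A and B receive gradients; this is a simplified sketch, not the
    LoRA/FLoRA reference implementation."""

    def __init__(self, in_features: int, out_features: int, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_features, rank))        # trainable, zero-init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                       # frozen path
        update = (x @ self.A.T) @ self.B.T * self.scaling  # low-rank adapter path
        return base + update

layer = LoRALinear(768, 768, rank=4)
x = torch.randn(8, 768)
loss = layer(x).pow(2).mean()
loss.backward()  # gradients flow only to A and B (the adapter parameters)
```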
♻ ☆ An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses
Large Language Models (LLMs) have become prominent for language translation,
including for low-resource languages. There has been limited study assessing
the quality of translations generated by LLMs, including Gemini, GPT and Google
Translate. In this study, we address this limitation by applying semantic and
sentiment analysis to selected LLMs for Indian languages, including Sanskrit,
Telugu and Hindi. We select prominent texts that have been well translated by
experts and use LLMs to generate their translations into English, and then we
provide a comparison with selected expert (human)
translations. Our findings suggest that while LLMs have made significant
progress in translation accuracy, challenges remain in preserving sentiment and
semantic integrity, especially in figurative and philosophical contexts. The
sentiment analysis revealed that GPT-4o and GPT-3.5 are better at preserving
the sentiments for the Bhagavad Gita (Sanskrit-English) translations when
compared to Google Translate. We observed a similar trend for the case of Tamas
(Hindi-English) and Maha P (Telugu-English) translations. GPT-4o performs
similarly to GPT-3.5 in terms of sentiment preservation across the three
languages. We found that LLMs are generally better than Google Translate at
capturing sentiment in translation.
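A hedged sketch of the kind of sentiment and semantic comparison this
evaluation implies, using off-the-shelf models; the model names, example
sentences, and metrics are assumptions for illustration, not the paper's exact
pipeline.
```python
# Illustrative comparison of an LLM translation against an expert translation.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # semantic similarity model (assumed)
sentiment = pipeline("sentiment-analysis")             # default sentiment model (assumed)

expert = "The soul is neither born, nor does it ever die."   # expert (human) translation
candidate = "The self is never born and never dies."          # LLM-generated translation

# Semantic integrity: cosine similarity between sentence embeddings.
sim = util.cos_sim(embedder.encode(expert), embedder.encode(candidate)).item()

# Sentiment preservation: compare predicted polarity of the two translations.
expert_sent = sentiment(expert)[0]
candidate_sent = sentiment(candidate)[0]

print(f"semantic similarity: {sim:.3f}")
print(f"expert sentiment: {expert_sent['label']} ({expert_sent['score']:.2f})")
print(f"candidate sentiment: {candidate_sent['label']} ({candidate_sent['score']:.2f})")
```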
♻ ☆ LR$^2$Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Recent progress in o1-like models has significantly enhanced the reasoning
abilities of Large Language Models (LLMs), empowering them to tackle
increasingly complex tasks through reflection capabilities, such as making
assumptions, backtracking, and self-refinement. However, effectively evaluating
such reflection capabilities remains challenging due to the lack of appropriate
benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark
designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs.
LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems
(CSPs) where reflective reasoning is crucial for deriving solutions that meet
all given constraints. Each type of task focuses on distinct constraint
patterns, such as knowledge-based, logical, and spatial constraints, providing
a comprehensive evaluation of diverse problem-solving scenarios. We conduct an
extensive evaluation of both conventional and o1-like models. Our
experimental results reveal that even the most advanced reasoning-specific
models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in
LR$^2$Bench, achieving average Exact Match scores of only 20.0% and 23.6%,
respectively. These findings underscore the significant room for improvement in
the reflective reasoning capabilities of current LLMs. The leaderboard of our
benchmark is available at https://huggingface.co/spaces/UltraRonin/LR2Bench
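As a rough illustration of this evaluation protocol, the sketch below checks
constraint satisfaction and computes Exact Match over toy CSP items; the item
schema and constraint encoding are assumptions, not the benchmark's actual data
format.
```python
# Hedged sketch of Exact Match scoring for CSP-style benchmark items.
def satisfies(assignment: dict, constraints: list) -> bool:
    """Each constraint is a callable taking the assignment and returning True/False."""
    return all(c(assignment) for c in constraints)

def exact_match(predictions: list, references: list) -> float:
    """Fraction of items where the predicted assignment equals the reference exactly."""
    hits = sum(int(p == r) for p, r in zip(predictions, references))
    return hits / max(len(references), 1)

# Toy example: two variables with a "not equal" constraint.
constraints = [lambda a: a["x"] != a["y"]]
gold = [{"x": 1, "y": 2}]
pred = [{"x": 1, "y": 2}]
assert satisfies(pred[0], constraints)
print(exact_match(pred, gold))  # 1.0
```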
♻ ☆ MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen
Large Language Models (LLMs) have become more prevalent in long-context
applications such as interactive chatbots, document analysis, and agent
workflows, but it is challenging to serve long-context requests with low
latency and high throughput. Speculative decoding (SD) is a widely used
technique to reduce latency losslessly, but the conventional wisdom suggests
that its efficacy is limited to small batch sizes. In MagicDec, we show that,
surprisingly, SD can achieve speedup even in a high-throughput inference regime
for moderate to long sequences. More interestingly, our rigorous analysis shows
that an intelligent drafting strategy can achieve better speedup as batch size
increases. MagicDec first identifies how the bottleneck shifts with increasing
batch size and sequence length, and uses these insights to deploy SD more
effectively for high-throughput inference. We leverage a draft model with a
sparse KV cache to address the KV bottleneck, which scales with both sequence
length and batch size. Additionally, we propose a theoretical model to select
the optimal drafting strategy for maximum speedup. Our work highlights the
broad applicability of speculative decoding in long-context serving, as it can
enhance throughput and reduce latency without compromising accuracy. For
moderate to long sequences, we demonstrate up to 2.51x speedup for Llama3.1-8B
when serving batch sizes ranging from 32 to 256 on various types of hardware
and tasks.
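A simplified sketch of one speculative-decoding step under stated assumptions:
`draft_model` and `target_model` are placeholder callables returning per-position
logits, batch size is 1, and greedy matching stands in for the lossless
rejection-sampling acceptance rule used in practice.
```python
import torch

def speculative_step(prefix: torch.Tensor, draft_model, target_model, k: int = 4) -> torch.Tensor:
    """One speculative-decoding step with greedy verification (batch size 1 assumed).

    Both models are placeholders returning logits of shape (batch, seq_len, vocab)."""
    # 1) Draft k tokens autoregressively with the cheap model
    #    (e.g., a draft model with a sparse KV cache, as in the abstract).
    ctx = prefix
    for _ in range(k):
        nxt = draft_model(ctx)[:, -1, :].argmax(dim=-1, keepdim=True)
        ctx = torch.cat([ctx, nxt], dim=-1)
    drafted = ctx[:, prefix.size(1):]                         # the k drafted tokens

    # 2) Score the whole drafted block with one target-model forward pass.
    target_logits = target_model(ctx)                          # (1, prefix+k, vocab)
    # Target's greedy choice at each drafted position (logits at i-1 predict token i).
    target_choice = target_logits[:, prefix.size(1) - 1:-1, :].argmax(dim=-1)

    # 3) Accept the longest agreeing prefix of drafted tokens, then append the
    #    target's own token at the first disagreement.
    match = (drafted == target_choice)[0]
    n_accept = int(match.long().cumprod(dim=0).sum().item())
    out = torch.cat([prefix, drafted[:, :n_accept]], dim=-1)
    if n_accept < k:
        out = torch.cat([out, target_choice[:, n_accept:n_accept + 1]], dim=-1)
    return out
```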
♻ ☆ Calibrating Expressions of Certainty ICLR
Peiqi Wang, Barbara D. Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William M. Wells, Tina Kapur, Polina Golland
We present a novel approach to calibrating linguistic expressions of
certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single
score to each certainty phrase, we model uncertainty as distributions over the
simplex to capture their semantics more accurately. To accommodate this new
representation of certainty, we generalize existing measures of miscalibration
and introduce a novel post-hoc calibration method. Leveraging these tools, we
analyze the calibration of both humans (e.g., radiologists) and computational
models (e.g., language models) and provide interpretable suggestions to improve
their calibration.
comment: International Conference on Learning Representations (ICLR), 2025
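To illustrate the distribution-based view of certainty in its binary special
case, the sketch below models two certainty phrases as Beta distributions and
compares each distribution's mean to an empirical frequency; the phrase
parameters and outcomes are invented for illustration, and the gap measure is a
crude stand-in for the paper's generalized miscalibration measures.
```python
# Each certainty phrase is modelled as a Beta distribution over the probability
# of the stated event (binary-outcome special case of a simplex distribution).
from scipy import stats

phrase_dists = {
    "maybe":  stats.beta(a=2.0, b=2.0),   # mass centred near 0.5, fairly diffuse
    "likely": stats.beta(a=6.0, b=2.0),   # mass shifted towards high probability
}

# Observed outcomes (1 = event happened) for statements tagged with each phrase.
observations = {"maybe": [1, 0, 0, 1, 0], "likely": [1, 1, 1, 0, 1]}

for phrase, dist in phrase_dists.items():
    empirical = sum(observations[phrase]) / len(observations[phrase])
    # Crude per-phrase calibration gap: distance between the distribution's mean
    # and the empirical frequency (a full treatment would compare distributions).
    gap = abs(dist.mean() - empirical)
    print(f"{phrase:7s} mean={dist.mean():.2f} empirical={empirical:.2f} gap={gap:.2f}")
```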