Computation and Language 117
☆ Tokenisation is NP-Complete
In this work, we prove the NP-completeness of two variants of tokenisation,
defined as the problem of compressing a dataset to at most $\delta$ symbols by
either finding a vocabulary directly (direct tokenisation), or selecting a
sequence of merge operations (bottom-up tokenisation).
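To make the bottom-up variant concrete, here is a minimal sketch with a toy dataset and a greedy heuristic: it applies a sequence of BPE-style merge operations and reports the resulting number of symbols. The NP-completeness result concerns deciding whether some merge sequence compresses the dataset to at most $\delta$ symbols; the greedy choice below is only an illustration, not an optimal solver.
```python
from collections import Counter

def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with a fused symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def greedy_bottom_up(dataset, num_merges):
    """Greedily apply the most frequent adjacent-pair merge, `num_merges` times."""
    seqs = [list(text) for text in dataset]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        seqs = [apply_merge(s, best) for s in seqs]
    return merges, sum(len(s) for s in seqs)   # symbols left after compression

merges, size = greedy_bottom_up(["low lower lowest", "new newer newest"], 5)
print(merges, size)   # whether SOME merge sequence reaches <= delta symbols is the NP-complete question
```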
☆ LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
This paper introduces LongBench v2, a benchmark designed to assess the
ability of LLMs to handle long-context problems requiring deep understanding
and reasoning across real-world multitasks. LongBench v2 consists of 503
challenging multiple-choice questions, with contexts ranging from 8k to 2M
words, across six major task categories: single-document QA, multi-document QA,
long in-context learning, long-dialogue history understanding, code repository
understanding, and long structured data understanding. To ensure breadth
and practicality, we collect data from nearly 100 highly educated
individuals with diverse professional backgrounds. We employ both automated and
manual review processes to maintain high quality and difficulty, resulting in
human experts achieving only 53.7% accuracy under a 15-minute time constraint.
Our evaluation reveals that the best-performing model, when answering the
questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model,
which includes longer reasoning, achieves 57.7%, surpassing the human baseline
by 4%. These results highlight the importance of enhanced reasoning ability and
scaling inference-time compute to tackle the long-context challenges in
LongBench v2. The project is available at https://longbench2.github.io.
comment: 25 pages, 13 figures
☆ MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark
Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, Furu Wei
Multiple-choice question (MCQ) datasets like Massive Multitask Language
Understanding (MMLU) are widely used to evaluate the commonsense,
understanding, and problem-solving abilities of large language models (LLMs).
However, the open-source nature of these benchmarks and the broad sources of
training data for LLMs have inevitably led to benchmark contamination,
resulting in unreliable evaluation results. To alleviate this issue, we propose
a contamination-free and more challenging MCQ benchmark called MMLU-CF. This
benchmark reassesses LLMs' understanding of world knowledge by averting both
unintentional and malicious data leakage. To avoid unintentional data leakage,
we source data from a broader domain and design three decontamination rules. To
prevent malicious data leakage, we divide the benchmark into validation and
test sets with similar difficulty and subject distributions. The test set
remains closed-source to ensure reliable results, while the validation set is
publicly available to promote transparency and facilitate independent
verification. Our evaluation of mainstream LLMs reveals that the powerful
GPT-4o achieves a 5-shot score of merely 73.4% and a 0-shot score of 71.9% on
the test set, which indicates the effectiveness of our approach in creating a
more rigorous and contamination-free evaluation standard. The GitHub repository
is available at https://github.com/microsoft/MMLU-CF and the dataset at
https://huggingface.co/datasets/microsoft/MMLU-CF.
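The abstract does not spell out the three decontamination rules, but a generic check of the kind used to avert unintentional leakage can be sketched as follows; the n-gram size and overlap threshold are hypothetical choices for illustration only.
```python
def ngrams(text, n):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(question, corpus_docs, n=4, threshold=0.1):
    """Flag a question whose word n-grams overlap too heavily with any corpus document."""
    q = ngrams(question, n)
    if not q:
        return False
    return any(len(q & ngrams(doc, n)) / len(q) >= threshold for doc in corpus_docs)

print(is_contaminated("What is the capital of France and when was it founded",
                      ["the capital of France is Paris"]))   # True: one 4-gram overlaps
```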
☆ Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings
Natural Language Processing and Generation systems have recently shown the
potential to complement and streamline the costly and time-consuming job of
professional fact-checkers. In this work, we lift several constraints of
current state-of-the-art pipelines for automated fact-checking based on the
Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under
more realistic scenarios, RAG-based methods for the generation of verdicts -
i.e., short texts discussing the veracity of a claim - evaluating them on
stylistically complex claims and heterogeneous, yet reliable, knowledge bases.
Our findings show a complex landscape, where, for example, LLM-based retrievers
outperform other retrieval techniques, though they still struggle with
heterogeneous knowledge bases; larger models excel in verdict faithfulness,
while smaller models provide better context adherence, with human evaluations
favouring zero-shot and one-shot approaches for informativeness, and fine-tuned
models for emotional alignment.
comment: Code and data at https://github.com/drusso98/face-the-facts
☆ LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
We present LlamaFusion, a framework for empowering pretrained text-only large
language models (LLMs) with multimodal generative capabilities, enabling them
to understand and generate both text and images in arbitrary sequences.
LlamaFusion leverages the weights of existing Llama-3 models to process text
autoregressively, while introducing additional, parallel transformer modules
for processing images with diffusion. During training, the data from each
modality is routed to its dedicated modules: modality-specific feedforward
layers, query-key-value projections, and normalization layers process each
modality independently, while the shared self-attention layers allow
interactions across text and image features. By freezing the text-specific
modules and only training the image-specific modules, LlamaFusion preserves the
language capabilities of text-only LLMs while developing strong visual
understanding and generation abilities. Our experiments demonstrate that,
compared to methods that pretrain multimodal generative models from scratch,
LlamaFusion improves image understanding by 20% and image generation by 3.6%
using only 50% of the FLOPs, while maintaining Llama-3's language capabilities.
We also demonstrate that this framework can equip existing vision-language
models with multimodal generation ability. Overall, this framework not only
leverages existing computational investments in text-only LLMs but also enables
the parallel development of language and vision capabilities, presenting a
promising direction for efficient multimodal model development.
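A minimal sketch of the routing idea described above, with illustrative module names (not the released LlamaFusion code): shared self-attention over the whole sequence, with modality-specific feed-forward and normalisation applied per token. The paper additionally makes the query-key-value projections modality-specific and freezes the text-side modules during training; both are omitted here for brevity.
```python
import torch
import torch.nn as nn

class ModalityRoutedBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_text, self.norm_image = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn_text = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                      nn.Linear(4 * d_model, d_model))
        self.ffn_image = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                       nn.Linear(4 * d_model, d_model))

    def forward(self, x, is_image):
        attn_out, _ = self.attn(x, x, x)          # shared attention: text and image tokens interact
        x = x + attn_out
        text, image = self.norm_text(x), self.norm_image(x)
        # Route each token through the feed-forward branch of its own modality.
        return torch.where(is_image.unsqueeze(-1),
                           image + self.ffn_image(image),
                           text + self.ffn_text(text))

block = ModalityRoutedBlock()
x = torch.randn(1, 10, 256)                              # 6 text tokens followed by 4 image tokens
is_image = torch.tensor([[False] * 6 + [True] * 4])
print(block(x, is_image).shape)                          # torch.Size([1, 10, 256])
```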
☆ Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying
Studies have underscored how, despite the recent breakthroughs and swift
advances in AI research, even state-of-the-art Large Language Models (LLMs)
continue to struggle when performing logical and mathematical reasoning. The
results seem to suggest that LLMs still work as (highly advanced) data pattern
identifiers, scoring poorly when attempting to generalise and solve reasoning
problems the models have never previously seen or that are not close to samples
presented in their training data. To address this compelling concern, this
paper makes use of the notion of critical questions from the literature on
argumentation theory, focusing in particular on Toulmin's model of
argumentation. We show that employing these critical questions can improve the
reasoning capabilities of LLMs. By probing the rationale behind the models'
reasoning process, the LLM can assess whether some logical mistake is occurring
and correct it before providing the final reply to the user prompt. The
underlying idea is drawn from the gold standard of any valid argumentative
procedure: the conclusion is valid if it is entailed by accepted premises. Or,
to paraphrase this Aristotelian principle in a real-world approximation
characterised by incomplete information and presumptive logic, the conclusion
is valid unless proved otherwise. This approach successfully steers the models'
output through a reasoning pipeline, resulting in better performance than
the baseline and its Chain-of-Thought (CoT) implementation. To support this, an
extensive evaluation of the proposed approach on the MT-Bench Reasoning and
Math tasks across a range of LLMs is provided.
☆ Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo
Text-to-video models have made remarkable advancements through optimization
on high-quality text-video pairs, where the textual prompts play a pivotal role
in determining the quality of output videos. However, achieving the desired output
often entails multiple revisions and iterative inference to refine
user-provided prompts. Current automatic methods for refining prompts encounter
challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unawareness
when applied to text-to-video diffusion models. To address these problems, we
introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video,
which excels in crafting Video-Centric, Labor-Free and Preference-Aligned
prompts tailored to a specific video diffusion model. Our approach involves a
meticulously crafted two-stage optimization and alignment system. Initially, we
conduct a reward-guided prompt evolution pipeline to automatically create an
optimal prompt pool and leverage it for supervised fine-tuning (SFT) of the
LLM. Then, multi-dimensional rewards are employed to generate pairwise data for
the SFT model, followed by the direct preference optimization (DPO) algorithm
to further facilitate preference alignment. Through extensive experimentation
and comparative analyses, we validate the effectiveness of Prompt-A-Video
across diverse generation models, highlighting its potential to push the
boundaries of video generation.
☆ Language Models as Continuous Self-Evolving Data Engineers
Large Language Models (LLMs) have demonstrated remarkable capabilities on
various tasks, but their further evolution is limited by the lack of
high-quality training data. In addition, traditional training approaches rely
too heavily on expert-labeled data, setting an upper limit on the performance of
LLMs. To address this issue, we propose LANCE, a novel paradigm that enables
LLMs to train themselves by autonomously generating, cleaning, reviewing, and
annotating data with preference information. Our approach demonstrates that
LLMs can serve as continuous self-evolving data engineers, significantly
reducing the time and cost of the post-training data construction process.
Through iterative fine-tuning on different variants of Qwen2, we validate
the effectiveness of LANCE across various tasks, showing that it can
continuously improve model performance and maintain high-quality data
generation. Across eight benchmark dimensions, LANCE resulted in an average
score enhancement of 3.36 for Qwen2-7B and 2.70 for Qwen2-7B-Instruct. This
training paradigm with autonomous data construction not only reduces the
reliance on human experts or external models but also ensures that the data
aligns with human values and preferences, paving the way for the development of
future superintelligent systems that can exceed human capabilities.
☆ Adaptive Pruning for Large Language Models with Structural Importance Awareness
Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han
The recent advancements in large language models (LLMs) have significantly
improved language understanding and generation capabilities. However, it is
difficult to deploy LLMs on resource-constrained edge devices due to their high
computational and storage resource demands. To address this issue, we propose a
novel LLM model pruning method, namely structurally-aware adaptive pruning
(SAAP), to significantly reduce the computational and memory costs while
maintaining model performance. We first define an adaptive importance fusion
metric to evaluate the importance of all coupled structures in LLMs by
considering their homoscedastic uncertainty. Then, we rank the importance of
all modules to determine the specific layers that should be pruned to meet
particular performance requirements. Furthermore, we develop a new group
fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we
evaluate the proposed SAAP method on multiple LLMs across two common tasks,
i.e., zero-shot classification and text generation. Experimental results show
that our SAAP method outperforms several state-of-the-art baseline methods,
achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and
LLaMA-13B, respectively. Additionally, SAAP improves the token generation speed by 5%,
showcasing its practical advantages in resource-constrained scenarios.
comment: 12 pages, 6 figures, 12 tables
☆ Outcome-Refining Process Supervision for Code Generation
Large Language Models have demonstrated remarkable capabilities in code
generation, yet they often struggle with complex programming tasks that require
deep algorithmic reasoning. While process supervision through learned reward
models shows promise in guiding reasoning steps, it requires expensive training
data and suffers from unreliable evaluation. We propose Outcome-Refining
Process Supervision, a novel paradigm that treats outcome refinement itself as
the process to be supervised. Our framework leverages concrete execution
signals to ground the supervision of reasoning steps, while using
tree-structured exploration to maintain multiple solution trajectories
simultaneously. Experiments demonstrate that our approach enables even smaller
models to achieve high success accuracy and performance metrics on competitive
programming tasks and creates more reliable verification than traditional reward
models, without requiring the training of PRMs. Our approach achieves significant
improvements across 5 models and 3 datasets: an average of 26.9% increase in
correctness and 42.2% in efficiency. The results suggest that providing
structured reasoning space with concrete verification signals is crucial for
solving complex programming tasks. We open-source all our code and data at:
https://github.com/zhuohaoyu/ORPS
comment: 18 pages, 5 figures, Code: https://github.com/zhuohaoyu/ORPS
☆ Qwen2.5 Technical Report
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu
In this report, we introduce Qwen2.5, a comprehensive series of large
language models (LLMs) designed to meet diverse needs. Compared to previous
iterations, Qwen2.5 has been significantly improved during both the
pre-training and post-training stages. In terms of pre-training, we have scaled
the high-quality pre-training datasets from the previous 7 trillion tokens to
18 trillion tokens. This provides a strong foundation for common sense, expert
knowledge, and reasoning capabilities. In terms of post-training, we implement
intricate supervised finetuning with over 1 million samples, as well as
multistage reinforcement learning. Post-training techniques enhance alignment with
human preferences, and notably improve long text generation, structured data
analysis, and instruction following. To handle diverse and varied use cases effectively,
we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base
and instruction-tuned models, with quantized versions available. In addition,
for hosted solutions, the proprietary models currently include two
mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both
available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier
performance on a wide range of benchmarks evaluating language understanding,
reasoning, mathematics, coding, human preference alignment, etc. Specifically,
the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and
proprietary models and demonstrates competitive performance to the
state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5
times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness
while performing competitively against GPT-4o-mini and GPT-4o respectively.
Additionally, as the foundation, Qwen2.5 models have been instrumental in
training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and
multimodal models.
☆ Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture
Large language models (LLMs) demonstrate an impressive ability to utilise
information within the context of their input sequences to appropriately
respond to data unseen by the LLM during its training procedure. This ability
is known as in-context learning (ICL). Humans and non-human animals demonstrate
similar abilities; however, their neural architectures differ substantially from
those of LLMs. Despite this, a critical component within LLMs, the attention mechanism,
resembles modern associative memory models, widely used in and influenced by
the computational neuroscience community to model biological memory systems.
Using this connection, we introduce an associative memory model capable of
performing ICL. We use this as inspiration for a novel residual stream
architecture which allows information to directly flow between attention heads.
We test this architecture during training within a two-layer Transformer and
show its ICL abilities manifest more quickly than without this modification. We
then apply our architecture in small language models with 8 million parameters,
focusing on attention head values, with results also indicating improved ICL
performance at this larger and more naturalistic scale.
comment: 18 pages, 6 figures, 3 tables
☆ Review-Then-Refine: A Dynamic Framework for Multi-Hop Question Answering with Temporal Adaptability
Retrieval-augmented generation (RAG) frameworks have emerged as a promising
solution to multi-hop question answering (QA) tasks, since they enable large
language models (LLMs) to incorporate external knowledge and mitigate their
inherent knowledge deficiencies. Despite this progress, existing RAG
frameworks, which usually follow the retrieve-then-read paradigm, often
struggle with multi-hop QA involving temporal information, since they have
difficulty retrieving and synthesizing accurate time-related information. To address the
challenge, this paper proposes a novel framework called review-then-refine,
which aims to enhance LLM performance in multi-hop QA scenarios with temporal
information. Our approach begins with a review phase, where decomposed
sub-queries are dynamically rewritten with temporal information, allowing for a
subsequent adaptive retrieval and reasoning process. In addition, we implement an
adaptive retrieval mechanism to minimize unnecessary retrievals, thus reducing
the potential for hallucinations. In the subsequent refine phase, the LLM
synthesizes the retrieved information from each sub-query along with its
internal knowledge to formulate a coherent answer. Extensive experimental
results across multiple datasets demonstrate the effectiveness of our proposed
framework, highlighting its potential to significantly improve multi-hop QA
capabilities in LLMs.
comment: 20 pages, 2 figures
☆ A Cross-Domain Study of the Use of Persuasion Techniques in Online Disinformation
Disinformation, irrespective of domain or language, aims to deceive or
manipulate public opinion, typically through employing advanced persuasion
techniques. Qualitative and quantitative research on the weaponisation of
persuasion techniques in disinformation has been mostly topic-specific (e.g.,
COVID-19) with limited cross-domain studies, resulting in a lack of
comprehensive understanding of these strategies. This study employs a
state-of-the-art persuasion technique classifier to conduct a large-scale,
multi-domain analysis of the role of 16 persuasion techniques in disinformation
narratives. It shows how different persuasion techniques are employed
disproportionately in different disinformation domains. We also include a
detailed case study on climate change disinformation, highlighting how
linguistic, psychological, and cultural factors shape the adaptation of
persuasion strategies to fit unique thematic contexts.
☆ AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
In this paper, we introduce AceMath, a suite of frontier math models that
excel in solving complex math problems, along with highly effective reward
models capable of evaluating generated solutions and reliably identifying the
correct ones. To develop the instruction-tuned math models, we propose a
supervised fine-tuning (SFT) process that first achieves competitive
performance across general domains, followed by targeted fine-tuning for the
math domain using a carefully curated set of prompts and synthetically
generated responses. The resulting model, AceMath-72B-Instruct, greatly
outperforms Qwen2.5-Math-72B-Instruct, GPT-4o and Claude-3.5 Sonnet. To develop
a math-specialized reward model, we first construct AceMath-RewardBench,
comprehensive and robust benchmark for evaluating math reward models across
diverse problems and difficulty levels. After that, we present a systematic
approach to build our math reward models. The resulting model, AceMath-72B-RM,
consistently outperforms state-of-the-art reward models. Furthermore, when
combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest
average rm@8 score across the math reasoning benchmarks. We will release model
weights, training data, and evaluation benchmarks at:
https://research.nvidia.com/labs/adlr/acemath
☆ Till the Layers Collapse: Compressing a Deep Neural Network through the Lenses of Batch Normalization Layers AAAI 2025
Today, deep neural networks are widely used since they can handle a variety
of complex tasks. Their generality makes them very powerful tools in modern
technology. However, deep neural networks are often overparameterized. The
usage of these large models consumes a lot of computation resources. In this
paper, we introduce a method called Till the Layers Collapse (TLC), which
compresses deep neural networks through the
lenses of batch normalization layers. By reducing the depth of these networks,
our method decreases deep neural networks' computational requirements and
overall latency. We validate our method on popular models such as Swin-T,
MobileNet-V2, and RoBERTa, across both image classification and natural
language processing (NLP) tasks.
comment: Accepted at AAAI 2025
☆ ConfliBERT: A Language Model for Political Conflict
Patrick T. Brandt, Sultan Alsarra, Vito J. D'Orazio, Dagmar Heintze, Latifur Khan, Shreyas Meher, Javier Osorio, Marcus Sianan
Conflict scholars have used rule-based approaches to extract information
about political violence from news reports and texts. Recent Natural Language
Processing developments move beyond rigid rule-based approaches. We review our
recent ConfliBERT language model (Hu et al. 2022) to process political and
violence related texts. The model can be used to extract actor and action
classifications from texts about political conflict. When fine-tuned, results
show that ConfliBERT has superior performance in accuracy, precision and recall
over other large language models (LLMs) like Google's Gemma 2 (9B), Meta's Llama
3.1 (7B), and Alibaba's Qwen 2.5 (14B) within its relevant domains. It is also
hundreds of times faster than these more generalist LLMs. These results are
illustrated using texts from the BBC, re3d, and the Global Terrorism Dataset
(GTD).
comment: 30 pages, 4 figures, 5 tables
☆ LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
Building safe Large Language Models (LLMs) across multiple languages is
essential in ensuring both safe access and linguistic diversity. To this end,
we introduce M-ALERT, a multilingual benchmark that evaluates the safety of
LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT
includes 15k high-quality prompts per language, totaling 75k, following the
detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs
highlight the importance of language-specific safety analysis, revealing that
models often exhibit significant inconsistencies in safety across languages and
categories. For instance, Llama3.2 shows high unsafety in the category
crime_tax for Italian but remains safe in other languages. Similar differences
can be observed across all models. In contrast, certain categories, such as
substance_cannabis and crime_propaganda, consistently trigger unsafe responses
across models and languages. These findings underscore the need for robust
multilingual safety practices in LLMs to ensure safe and responsible usage
across diverse user communities.
☆ Large Language Models and Code Security: A Systematic Literature Review
Large Language Models (LLMs) have emerged as powerful tools for automating
various programming tasks, including security-related ones, such as detecting
and fixing vulnerabilities. Despite their promising capabilities, when required
to produce or modify pre-existing code, LLMs could introduce vulnerabilities
unknown to the programmer. When analyzing code, they could miss clear
vulnerabilities or signal nonexistent ones. In this Systematic Literature
Review (SLR), we aim to investigate both the security benefits and potential
drawbacks of using LLMs for a variety of code-related tasks. In particular,
first we focus on the types of vulnerabilities that could be introduced by
LLMs, when used for producing code. Second, we analyze the capabilities of LLMs
to detect and fix vulnerabilities, in any given code, and how the prompting
strategy of choice impacts their performance in these two tasks. Last, we
provide an in-depth analysis on how data poisoning attacks on LLMs can impact
performance in the aforementioned tasks.
☆ Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts COLING 2025
Large Language Models (LLMs) have been used to generate texts in response to
different writing tasks: reports, essays, storytelling. However, language
models do not have a meta-representation of the text writing process, nor
inherent communication learning needs, comparable to those of young human
students. This paper introduces a fine-grained linguistic and textual analysis
of multilingual Small Language Models' (SLMs) writing. With our method,
Chain-of-MetaWriting, SLMs can imitate some steps of the human writing process,
such as planning and evaluation. We mainly focused on short story and essay
writing tasks in French for schoolchildren and undergraduate students
respectively. Our results show that SLMs encounter difficulties in assisting
young students on sensitive topics such as violence in the schoolyard, and they
sometimes use words too complex for the target audience. In particular, the
output is quite different from human-produced texts in terms of text
cohesion and coherence, regarding temporal connectors, topic progression, and
reference.
comment: Accepted at WRAICOGS 2025 (Writing Aids at the Crossroads of AI,
Cognitive Science, and NLP) co-located with COLING 2025
☆ Movie2Story: A framework for understanding videos and telling stories in the form of novel text
Multimodal video-to-text models have made considerable progress, primarily in
generating brief descriptions of video content. However, there is still a
deficiency in generating rich long-form text descriptions that integrate both
video and audio. In this paper, we introduce a framework called M2S, designed
to generate novel-length text by combining audio, video, and character
recognition. M2S includes modules for video long-form text description and
comprehension, audio-based analysis of emotion, speech rate, and character
alignment, and visual-based character recognition alignment. By integrating
multimodal information using the large language model GPT-4o, M2S stands out in
the field of multimodal text generation. We demonstrate the effectiveness and
accuracy of M2S through comparative experiments and human evaluation.
Additionally, the model framework has good scalability and significant
potential for future research.
☆ Knowledge Injection via Prompt Distillation
In many practical applications, large language models (LLMs) need to
incorporate new knowledge not present in their pre-training data. The primary
methods for this are fine-tuning and retrieval-augmented generation (RAG).
Although RAG has emerged as the industry standard for knowledge injection,
fine-tuning has not yet achieved comparable success. In this paper, we propose
a new fine-tuning technique for learning new knowledge and show that it can
reach the performance of RAG. The proposed method is based on the
self-distillation approach, which we call prompt distillation. First, we
generate question-answer pairs about the new knowledge. Then, we fine-tune a
student model on the question-answer pairs to imitate the output distributions
of a teacher model, which additionally receives the new knowledge in its
prompt. The student model is identical to the teacher, except it is equipped
with a LoRA adapter. This training procedure facilitates distilling the new
knowledge from the teacher's prompt into the student's weights.
comment: Preprint
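A minimal sketch of the distillation objective described above, assuming precomputed logits and illustrative tensor shapes: the teacher sees the new knowledge in its prompt, the student does not, and the student is trained to match the teacher's output distributions over the answer tokens only.
```python
import torch
import torch.nn.functional as F

def prompt_distillation_loss(teacher_logits, student_logits,
                             teacher_answer_slice, student_answer_slice):
    """KL(teacher || student) averaged over the answer-token positions.

    teacher_logits: [T_len, vocab] for [knowledge + question + answer]
    student_logits: [S_len, vocab] for [question + answer] (student = same model + LoRA)
    The answer sits at different offsets in the two sequences, so each gets its
    own slice; both slices must cover the same number of positions.
    """
    t = teacher_logits[teacher_answer_slice]
    s = student_logits[student_answer_slice]
    return F.kl_div(F.log_softmax(s, dim=-1), F.log_softmax(t, dim=-1),
                    log_target=True, reduction="batchmean")

vocab = 100
teacher = torch.randn(30, vocab)   # e.g. knowledge (15) + question (5) + answer (10)
student = torch.randn(15, vocab)   # question (5) + answer (10), no knowledge in prompt
print(prompt_distillation_loss(teacher, student, slice(20, 30), slice(5, 15)))
```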
☆ Understanding the Dark Side of LLMs' Intrinsic Self-Correction
Intrinsic self-correction was proposed to improve LLMs' responses via
feedback prompts solely based on their inherent capability. However, recent
works show that LLMs' intrinsic self-correction fails without oracle labels as
feedback prompts. In this paper, we aim to interpret LLMs' intrinsic
self-correction for different tasks, especially for those failure cases. By
including one simple task and three complex tasks with state-of-the-art (SOTA)
LLMs like ChatGPT families (o1, 4o, 3.5-turbo) and Llama families (2-7B, 3-8B,
and 3.1-8B), we design three interpretation methods to reveal the dark side of
LLMs' intrinsic self-correction. We identify that intrinsic self-correction can (1)
cause LLMs to waver on both intermediate and final answers and lead to prompt bias
on simple factual questions; and (2) introduce human-like cognitive biases on complex
tasks. In light of our findings, we also provide two simple yet effective
strategies for alleviation: question repeating and supervised fine-tuning with
a few samples. We open-source our work at https://x-isc.info/.
☆ RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
Supervised fine-tuning (SFT) plays a crucial role in adapting large language
models (LLMs) to specific domains or tasks. However, as demonstrated by
empirical experiments, the collected data inevitably contains noise in
practical applications, which poses significant challenges to model performance
on downstream tasks. Therefore, there is an urgent need for a noise-robust SFT
framework to enhance model capabilities in downstream tasks. To address this
challenge, we introduce a robust SFT framework (RobustFT) that performs noise
detection and relabeling on downstream task data. For noise identification, our
approach employs a multi-expert collaborative system with inference-enhanced
models to achieve superior noise detection. In the denoising phase, we utilize
a context-enhanced strategy, which incorporates the most relevant and confident
knowledge followed by careful assessment to generate reliable annotations.
Additionally, we introduce an effective data selection mechanism based on
response entropy, ensuring only high-quality samples are retained for
fine-tuning. Extensive experiments conducted on multiple LLMs across five
datasets demonstrate RobustFT's exceptional performance in noisy scenarios.
☆ Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation
Large language models (LLMs) are susceptible to generating hallucinated
information, despite the integration of retrieval-augmented generation (RAG).
Parallel context extension (PCE) is a line of research attempting to
effectively integrate parallel (unordered) contexts, but it still suffers
from hallucinations when adapted to RAG scenarios. In this paper, we propose
DePaC (Dehallucinating Parallel Context Extension), which alleviates the
hallucination problem with context-aware negative training and
information-calibrated aggregation. DePaC is designed to alleviate two types of
in-context hallucination: fact fabrication (i.e., LLMs present claims that are
not supported by the contexts) and fact omission (i.e., LLMs fail to present
claims that can be supported by the contexts). Specifically, (1) for fact
fabrication, we apply the context-aware negative training that fine-tunes the
LLMs with negative supervisions, thus explicitly guiding the LLMs to refuse to
answer when contexts are not related to questions; (2) for fact omission, we
propose the information-calibrated aggregation which prioritizes context
windows with higher information increment from their contexts. The experimental
results on nine RAG tasks demonstrate that DePaC significantly alleviates the
two types of hallucination and consistently achieves better performances on
these tasks.
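A hedged sketch of the information-calibrated aggregation idea, using toy next-token distributions (in practice they would come from one forward pass per parallel context window): the window whose context most reduces the model's uncertainty, i.e. yields the largest information increment over the no-context distribution, is preferred.
```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def pick_window(no_context_dist, per_window_dists):
    """Return the index and distribution of the window with the largest entropy reduction."""
    base = entropy(no_context_dist)
    increments = [base - entropy(d) for d in per_window_dists]
    best = int(np.argmax(increments))
    return best, per_window_dists[best]

no_ctx = [0.25, 0.25, 0.25, 0.25]        # model is uninformed without any context
windows = [[0.3, 0.3, 0.2, 0.2],         # weakly relevant parallel window
           [0.9, 0.05, 0.03, 0.02]]      # highly informative parallel window
idx, dist = pick_window(no_ctx, windows)
print(idx, dist)                         # 1 -- the informative window drives the next token
```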
☆ Why language models collapse when trained on recursively generated text
Language models (LMs) have been widely used to generate text on the Internet.
The generated text is often collected into the training corpus of the next
generations of LMs. Previous work has experimentally found that LMs collapse
when trained on recursively generated text. This paper contributes to existing
knowledge in two ways. First, we present a theoretical proof of LM collapse; the
proof reveals the cause of LM collapse and shows that all auto-regressive LMs
will inevitably collapse. Second, we present a new finding: the performance of LMs
gradually declines when trained on recursively generated text until they
perform no better than a randomly initialized LM. The trained LMs produce large
amounts of repetitive text and perform poorly across a wide range of natural
language tasks. The above proof and new findings deepen our understanding of LM
collapse and offer valuable insights that may inspire new training techniques
to mitigate this threat.
comment: 28 pages, 9 figures
☆ Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering
Recent advances in machine learning, particularly Large Language Models
(LLMs) such as BERT and GPT, provide rich contextual embeddings that improve
text representation. However, current document clustering approaches often
ignore the deeper relationships between named entities (NEs) and the potential
of LLM embeddings. This paper proposes a novel approach that integrates Named
Entity Recognition (NER) and LLM embeddings within a graph-based framework for
document clustering. The method builds a graph with nodes representing
documents and edges weighted by named entity similarity, optimized using a
graph-convolutional network (GCN). This ensures a more effective grouping of
semantically related documents. Experimental results indicate that our approach
outperforms conventional co-occurrence-based methods in clustering, notably for
documents rich in named entities.
comment: 11 pages, 4 figures
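A minimal sketch of the graph construction described above, assuming named entities and LLM embeddings are precomputed: documents are nodes, edge weights are named-entity Jaccard similarity, and one symmetric-normalised GCN-style propagation step smooths the embeddings over the graph before clustering.
```python
import numpy as np

def entity_jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def gcn_propagate(doc_entities, embeddings):
    """One D^-1/2 (A + I) D^-1/2 X propagation step over the entity-similarity graph."""
    n = len(doc_entities)
    A = np.array([[entity_jaccard(doc_entities[i], doc_entities[j]) for j in range(n)]
                  for i in range(n)])
    A_hat = A + np.eye(n)                                # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt @ embeddings

doc_entities = [{"Paris", "France"}, {"France", "EU"}, {"Tokyo", "Japan"}]
X = np.random.rand(3, 4)                                 # stand-in for LLM document embeddings
print(gcn_propagate(doc_entities, X).round(2))           # smoothed features, ready for clustering
```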
☆ Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling
Despite their outstanding capabilities, large language models (LLMs) are
prone to hallucination and producing factually incorrect information. This
challenge has spurred efforts in attributed text generation, which prompts LLMs
to generate content with supporting evidence. In this paper, we propose a novel
framework, called Think&Cite, and formulate attributed text generation as a
multi-step reasoning problem integrated with search. Specifically, we propose
Self-Guided Monte Carlo Tree Search (SG-MCTS), which capitalizes on the
self-reflection capability of LLMs to reflect on the intermediate states of
MCTS for guiding the tree expansion process. To provide reliable and
comprehensive feedback, we introduce Progress Reward Models to measure the
progress of tree search from the root to the current state from two aspects,
i.e., generation and attribution progress. We conduct extensive experiments on
three datasets and the results show that our approach significantly outperforms
baseline approaches.
☆ DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis
Recently developed large language models (LLMs) have presented promising new
avenues to address data scarcity in low-resource scenarios. In few-shot
aspect-based sentiment analysis (ABSA), previous efforts have explored data
augmentation techniques, which prompt LLMs to generate new samples by modifying
existing ones. However, these methods fail to produce adequately diverse data,
impairing their effectiveness. Besides, some studies apply in-context learning
for ABSA by using specific instructions and a few selected examples as prompts.
Though promising, LLMs often yield labels that deviate from task requirements.
To overcome these limitations, we propose DS$^2$-ABSA, a dual-stream data
synthesis framework targeted for few-shot ABSA. It leverages LLMs to synthesize
data from two complementary perspectives: key-point-driven and
instance-driven, which effectively generate diverse and high-quality
ABSA samples in low-resource settings. Furthermore, a label refinement
module is integrated to improve the synthetic labels. Extensive experiments
demonstrate that DS$^2$-ABSA significantly outperforms previous few-shot ABSA
solutions and other LLM-oriented data generation methods.
☆ A Survey of RWKV
The Receptance Weighted Key Value (RWKV) model offers a novel alternative to
the Transformer architecture, merging the benefits of recurrent and
attention-based systems. Unlike conventional Transformers, which depend heavily
on self-attention, RWKV adeptly captures long-range dependencies with minimal
computational demands. By utilizing a recurrent framework, RWKV addresses some
computational inefficiencies found in Transformers, particularly in tasks with
long sequences. RWKV has recently drawn considerable attention for its robust
performance across multiple domains. Despite its growing popularity, no
systematic review of the RWKV model exists. This paper seeks to fill this gap
as the first comprehensive review of the RWKV architecture, its core
principles, and its varied applications, such as natural language generation,
natural language understanding, and computer vision. We assess how RWKV
compares to traditional Transformer models, highlighting its capability to
manage long sequences efficiently and lower computational costs. Furthermore,
we explore the challenges RWKV encounters and propose potential directions for
future research and advancement. We consistently maintain the related
open-source materials at: https://github.com/MLGroupJLU/RWKV-Survey.
comment: 18 pages
☆ Mapping and Influencing the Political Ideology of Large Language Models using Synthetic Personas
Pietro Bernardelle, Leon Fröhling, Stefano Civelli, Riccardo Lunardi, Kevin Roiter, Gianluca Demartini
The analysis of political biases in large language models (LLMs) has
primarily examined these systems as single entities with fixed viewpoints.
While various methods exist for measuring such biases, the impact of
persona-based prompting on LLMs' political orientation remains unexplored. In
this work we leverage PersonaHub, a collection of synthetic persona
descriptions, to map the political distribution of persona-based prompted LLMs
using the Political Compass Test (PCT). We then examine whether these initial
compass distributions can be manipulated through explicit ideological prompting
towards diametrically opposed political orientations: right-authoritarian and
left-libertarian. Our experiments reveal that synthetic personas predominantly
cluster in the left-libertarian quadrant, with models demonstrating varying
degrees of responsiveness when prompted with explicit ideological descriptors.
While all models demonstrate significant shifts towards right-authoritarian
positions, they exhibit more limited shifts towards left-libertarian positions,
suggesting an asymmetric response to ideological manipulation that may reflect
inherent biases in model training.
comment: 4 pages, 2 figures, 2 tables
☆ DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs
Efficient KV cache management in LLMs is crucial for long-context tasks like
RAG and summarization. Existing KV cache compression methods enforce a fixed
pattern, neglecting task-specific characteristics and reducing the retention of
essential information. However, we observe distinct activation patterns across
layers in various tasks, highlighting the need for adaptive strategies tailored
to each task's unique demands. Based on this insight, we propose DynamicKV, a
method that dynamically optimizes token retention by adjusting the number of
tokens retained at each layer to adapt to the specific task. DynamicKV
establishes global and per-layer maximum KV cache budgets, temporarily
retaining the maximum budget for the current layer, and periodically updating
the KV cache sizes of all preceding layers during inference. Our method retains
only 1.7% of the KV cache size while achieving ~85% of the Full KV cache
performance on LongBench. Notably, even under extreme compression (0.9%),
DynamicKV surpasses state-of-the-art (SOTA) methods by 11% in the
Needle-in-a-Haystack test using Mistral-7B-Instruct-v0.2. The code will be
released.
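A hedged sketch in the spirit of the method (not the authors' implementation): a global KV budget is split across layers in proportion to how much importance each layer assigns to the cached context, and only the top-scoring positions in each layer's cache are retained.
```python
import torch

def allocate_and_prune(attn_scores_per_layer, global_budget):
    """attn_scores_per_layer: list of [seq_len] importance scores for each layer's cached tokens."""
    totals = torch.stack([s.sum() for s in attn_scores_per_layer])
    shares = totals / totals.sum()                       # task-dependent per-layer shares
    budgets = (shares * global_budget).round().clamp(min=1).long()
    kept = []
    for scores, k in zip(attn_scores_per_layer, budgets):
        k = min(int(k), scores.numel())
        kept.append(torch.topk(scores, k).indices.sort().values)   # cache positions to keep
    return kept

layers = [torch.rand(100) for _ in range(4)]             # toy importance scores for 4 layers
for i, idx in enumerate(allocate_and_prune(layers, global_budget=40)):
    print(f"layer {i}: keep {len(idx)} of 100 cached tokens")
```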
☆ Progressive Multimodal Reasoning via Active Retrieval
Multi-step multimodal reasoning tasks pose significant challenges for
multimodal large language models (MLLMs), and finding effective ways to enhance
their performance in such scenarios remains an unresolved issue. In this paper,
we propose AR-MCTS, a universal framework designed to progressively improve the
reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo
Tree Search (MCTS). Our approach begins with the development of a unified
retrieval module that retrieves key supporting insights for solving complex
reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in
automated multimodal reasoning verification, we employ the MCTS algorithm
combined with an active retrieval mechanism, which enables the automatic
generation of step-wise annotations. This strategy dynamically retrieves key
insights for each reasoning step, moving beyond traditional beam search
sampling to improve the diversity and reliability of the reasoning space.
Additionally, we introduce a process reward model that aligns progressively to
support the automatic verification of multimodal reasoning tasks. Experimental
results across three complex multimodal reasoning benchmarks confirm the
effectiveness of the AR-MCTS framework in enhancing the performance of various
multimodal models. Further analysis demonstrates that AR-MCTS can optimize
sampling diversity and accuracy, yielding reliable multimodal reasoning.
comment: Work in progress
☆ Mention Attention for Pronoun Translation ACL
Most pronouns are referring expressions: computers need to resolve what the
pronouns refer to, and pronoun usage diverges across languages. Thus, dealing
with these divergences and translating pronouns is a challenge in machine
translation. Mentions are the referring candidates of pronouns
and have closer relations with pronouns than general tokens. We assume
that extracting additional mention features can help pronoun translation.
Therefore, we introduce an additional mention attention module in the decoder
to pay extra attention to source mentions but not non-mention tokens. Our
mention attention module not only extracts features from source mentions, but
also considers target-side context which benefits pronoun translation. In
addition, we also introduce two mention classifiers to train models to
recognize mentions, whose outputs guide the mention attention. We conduct
experiments on the WMT17 English-German translation task, and evaluate our
models on general translation and pronoun translation, using BLEU, APT, and
contrastive evaluation metrics. Our proposed model outperforms the baseline
Transformer model in terms of APT and BLEU scores, which confirms our hypothesis
that we can improve pronoun translation by paying additional attention to
source mentions, and shows that the introduced additional modules have no
negative effect on the general translation quality.
comment: camera-ready version of the paper accepted by JCRAI-23 conference, in
ACL format
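A minimal sketch of a mention-attention step, with illustrative names and shapes: standard scaled dot-product attention over encoder states, but with non-mention source positions masked out so the decoder attends to mention tokens only; the mention mask would come from the mention classifiers described in the abstract.
```python
import torch
import torch.nn.functional as F

def mention_attention(decoder_state, encoder_states, mention_mask):
    """decoder_state: [d]; encoder_states: [src_len, d]; mention_mask: [src_len] bool."""
    scores = encoder_states @ decoder_state / encoder_states.size(-1) ** 0.5
    scores = scores.masked_fill(~mention_mask, float("-inf"))   # ignore non-mention tokens
    weights = F.softmax(scores, dim=-1)
    return weights @ encoder_states                             # context vector over mentions only

enc = torch.randn(6, 16)
dec = torch.randn(16)
mask = torch.tensor([True, False, False, True, False, True])   # positions tagged as mentions
print(mention_attention(dec, enc, mask).shape)                  # torch.Size([16])
```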
☆ ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis
Large language models (LLMs) have shown remarkable effectiveness across
various domains, with data augmentation methods utilizing GPT for synthetic
data generation becoming prevalent. However, the quality and utility of
augmented data remain questionable, and current methods lack clear metrics for
evaluating data characteristics. To address these challenges, we propose
ResoFilter, a novel method that integrates models, data, and tasks to refine
datasets. ResoFilter leverages the fine-tuning process to obtain Data-Parameter
features for data selection, offering improved interpretability by representing
data characteristics through model weights. Our experiments demonstrate that
ResoFilter achieves comparable results to full-scale fine-tuning using only
half the data in mathematical tasks and exhibits strong generalization across
different models and domains. This method provides valuable insights for
constructing synthetic datasets and evaluating high-quality data, offering a
promising solution for enhancing data augmentation techniques and improving
training dataset quality for LLMs. For reproducibility, we will release our
code and data upon acceptance.
comment: under review
☆ Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning
When using agent-task datasets to enhance agent capabilities for Large
Language Models (LLMs), current methodologies often treat all tokens within a
sample equally. However, we argue that tokens serving different roles -
specifically, reasoning tokens versus boilerplate tokens (e.g., those governing
output format) - differ significantly in importance and learning complexity,
necessitating their disentanglement and distinct treatment. To address this, we
propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token
discrimination. SHAD classifies tokens by exploiting predictability differences
observed after shuffling input-output combinations across samples: boilerplate
tokens, due to their repetitive nature among samples, maintain predictability,
whereas reasoning tokens do not. Using SHAD, we propose the
Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes
reasoning tokens during fine-tuning, yielding notable performance gains over
common Supervised Fine-Tuning (SFT).
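A hedged sketch of the SHAD and RFT idea with toy numbers: per-token losses from a model fine-tuned on shuffled input-output pairs separate boilerplate tokens (still predictable, low loss) from reasoning tokens (unpredictable, high loss), and the latter are up-weighted in the fine-tuning loss. The threshold and weighting below are illustrative, not the paper's values.
```python
import torch

def reasoning_weights(shuffled_model_losses, threshold=1.0, boost=2.0):
    """High per-token loss under the shuffled model => likely a reasoning token."""
    is_reasoning = shuffled_model_losses > threshold
    return torch.where(is_reasoning, torch.tensor(boost), torch.tensor(1.0))

def weighted_sft_loss(per_token_ce, weights):
    return (per_token_ce * weights).sum() / weights.sum()

shuffled_losses = torch.tensor([0.1, 0.2, 3.5, 4.1, 0.1])   # toy: boilerplate stays predictable
per_token_ce = torch.tensor([0.5, 0.4, 2.0, 2.5, 0.3])      # real SFT cross-entropy per token
w = reasoning_weights(shuffled_losses)
print(w.tolist(), weighted_sft_loss(per_token_ce, w).item())
```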
☆ ALKAFI-LLAMA3: Fine-Tuning LLMs for Precise Legal Understanding in Palestine
Large Language Models (LLMs) have demonstrated remarkable potential in
diverse domains, yet their application in the legal sector, particularly in
low-resource contexts, remains limited. This study addresses the challenges of
adapting LLMs to the Palestinian legal domain, where political instability,
fragmented legal frameworks, and limited AI resources hinder effective
machine-learning applications. We present a fine-tuned model based on a
quantized version of Llama-3.2-1B-Instruct, trained on a synthetic data set
derived from Palestinian legal texts. Using smaller-scale models and
strategically generated question-answer pairs, we achieve a cost-effective,
locally sustainable solution that provides accurate and contextually relevant
legal guidance. Our experiments demonstrate promising performance on various
query types, ranging from yes/no questions and narrative explanations to
complex legal differentiations, while highlighting areas for improvement, such
as handling calculation-based inquiries and structured list formatting. This
work provides a pathway for the deployment of AI-driven legal assistance tools
tailored to the needs of resource-constrained environments.
☆ PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children
Left-behind children (LBCs), numbering over 66 million in China, face severe
mental health challenges due to parental migration for work. Early screening
and identification of at-risk LBCs is crucial, yet challenging due to the
severe shortage of mental health professionals, especially in rural areas.
While the House-Tree-Person (HTP) test shows higher child participation rates,
its requirement for expert interpretation limits its application in
resource-scarce regions. To address this challenge, we propose PsyDraw, a
multi-agent system based on Multimodal Large Language Models that assists
mental health professionals in analyzing HTP drawings. The system employs
specialized agents for feature extraction and psychological interpretation,
operating in two stages: comprehensive feature analysis and professional report
generation. Evaluation of HTP drawings from 290 primary school students reveals
that 71.03% of the analyses achieved High Consistency with professional
evaluations, 26.21% Moderate Consistency and only 2.41% Low Consistency. The
system identified 31.03% of cases requiring professional attention,
demonstrating its effectiveness as a preliminary screening tool. Currently
deployed in pilot schools, PsyDraw shows promise in supporting mental health
professionals, particularly in resource-limited areas, while maintaining high
professional standards in psychological assessment.
comment: preprint
☆ Query pipeline optimization for cancer patient question answering systems
Retrieval-augmented generation (RAG) mitigates hallucination in Large
Language Models (LLMs) by using query pipelines to retrieve relevant external
information and grounding responses in retrieved knowledge. However, query
pipeline optimization for cancer patient question-answering (CPQA) systems
requires separately optimizing multiple components with domain-specific
considerations. We propose a novel three-aspect optimization approach for the
RAG query pipeline in CPQA systems, utilizing public biomedical databases like
PubMed and PubMed Central. Our optimization includes: (1) document retrieval,
utilizing a comparative analysis of NCBI resources and introducing Hybrid
Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval,
identifying optimal pairings of dense retrievers and rerankers; and (3)
semantic representation, introducing Semantic Enhanced Overlap Segmentation
(SEOS) for improved contextual understanding. On a custom-developed dataset
tailored for cancer-related inquiries, our optimized RAG approach improved the
answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and
about 3% over a naive RAG setup. This study highlights the importance of
domain-specific query optimization in realizing the full potential of RAG and
provides a robust framework for building more accurate and reliable CPQA
systems, advancing the development of RAG-based biomedical systems.
☆ On Verbalized Confidence Scores for LLMs
The rise of large language models (LLMs) and their tight integration into our
daily life make it essential to dedicate efforts towards their trustworthiness.
Uncertainty quantification for LLMs can establish more human trust into their
responses, but also allows LLM agents to make more informed decisions based on
each other's uncertainty. To estimate the uncertainty in a response, internal
token logits, task-specific proxy models, or sampling of multiple responses are
commonly used. This work focuses on asking the LLM itself to verbalize its
uncertainty with a confidence score as part of its output tokens, which is a
promising way for prompt- and model-agnostic uncertainty quantification with
low overhead. Using an extensive benchmark, we assess the reliability of
verbalized confidence scores with respect to different datasets, models, and
prompt methods. Our results reveal that the reliability of these scores
strongly depends on how the model is asked, but also that it is possible to
extract well-calibrated confidence scores with certain prompt methods. We argue
that verbalized confidence scores can become a simple but effective and
versatile uncertainty quantification method in the future. Our code is
available at https://github.com/danielyxyang/llm-verbalized-uq .
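An illustrative sketch of the elicitation-and-parsing step; the exact prompt wording and confidence scale are choices that, per the paper, strongly affect reliability.
```python
import re

PROMPT = ("Answer the question, then on a new line write "
          "'Confidence: <number between 0 and 100>'.\n\nQuestion: {question}")

def parse_confidence(generation: str):
    """Extract the verbalized confidence from the model's output, if present."""
    match = re.search(r"Confidence:\s*(\d{1,3})", generation)
    return None if match is None else min(int(match.group(1)), 100) / 100.0

sample_output = "The capital of Australia is Canberra.\nConfidence: 92"
print(PROMPT.format(question="What is the capital of Australia?"))
print(parse_confidence(sample_output))   # 0.92
```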
☆ How to Synthesize Text Data without Model Collapse?
Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
Model collapse in synthetic data indicates that iterative training on
self-generated data leads to a gradual decline in performance. With the
proliferation of AI models, synthetic data will fundamentally reshape the web
data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend
of synthetic and human-produced data. In this paper, we focus on two questions:
what is the impact of synthetic data on language model training, and how to
synthesize data without model collapse? We first pre-train language models
across different proportions of synthetic data, revealing a negative
correlation between the proportion of synthetic data and model performance. We
further conduct statistical analysis on synthetic data to uncover a
distributional shift phenomenon and an over-concentration of n-gram features.
Inspired by the above findings, we propose token editing on human-produced data
to obtain semi-synthetic data. As a proof of concept, we theoretically
demonstrate that token-level editing can prevent model collapse, as the test
error is constrained by a finite upper bound. We conduct extensive experiments
on pre-training from scratch, continual pre-training, and supervised
fine-tuning. The results validate our theoretical proof that token-level
editing improves data quality and enhances model performance.
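A hedged sketch of token-level editing on human-produced data (thresholds and the resampling rule are illustrative, not the paper's exact procedure): positions where a prior model is already extremely confident are resampled, keeping most tokens intact and yielding semi-synthetic data.
```python
import torch

def token_edit(token_ids, token_probs, resample_dist, threshold=0.99):
    """token_ids: [T] gold tokens; token_probs: [T] model prob of each gold token;
    resample_dist: [T, vocab] model distribution at each position."""
    edited = token_ids.clone()
    for t in range(token_ids.numel()):
        if token_probs[t] > threshold:                   # over-concentrated position
            edited[t] = torch.multinomial(resample_dist[t], 1).item()
    return edited

vocab = 50
ids = torch.randint(vocab, (8,))
probs = torch.tensor([0.3, 0.999, 0.5, 0.2, 0.995, 0.4, 0.1, 0.6])   # toy confidences
dist = torch.softmax(torch.randn(8, vocab), dim=-1)
print(ids.tolist(), token_edit(ids, probs, dist).tolist())
```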
☆ Each Fake News is Fake in its Own Way: An Attribution Multi-Granularity Benchmark for Multimodal Fake News Detection
Social platforms, while facilitating access to information, have also become
saturated with a plethora of fake news, resulting in negative consequences.
Automatic multimodal fake news detection is a worthwhile pursuit. Existing
multimodal fake news datasets only provide binary labels of real or fake.
However, real news is alike, while each fake news is fake in its own way. These
datasets fail to reflect the mixed nature of various types of multimodal fake
news. To bridge the gap, we construct an attributing multi-granularity
multimodal fake news detection dataset, revealing the inherent fake
patterns. Furthermore, we propose a multi-granularity clue alignment model
to achieve multimodal fake news detection and attribution. Experimental results
demonstrate that the proposed dataset is challenging, and its attribution setting
opens up new avenues for future research.
☆ LLMs as mediators: Can they diagnose conflicts accurately?
Prior research indicates that to be able to mediate conflict, observers of
disagreements between parties must be able to reliably distinguish the sources
of their disagreement as stemming from differences in beliefs about what is
true (causality) vs. differences in what they value (morality). In this paper,
we test if OpenAI's Large Language Models GPT 3.5 and GPT 4 can perform this
task and whether one or the other type of disagreement proves particularly
challenging for LLMs to diagnose. We replicate Study 1 in Koçak et al.
(2003), which employs a vignette design, with OpenAI's GPT 3.5 and GPT 4. We
find that both LLMs have similar semantic understanding of the distinction
between causal and moral codes as humans and can reliably distinguish between
them. When asked to diagnose the source of disagreement in a conversation, both
LLMs, compared to humans, exhibit a tendency to overestimate the extent of
causal disagreement and underestimate the extent of moral disagreement in the
moral misalignment condition. This tendency is especially pronounced for GPT 4
when using a proximate scale that relies on concrete language specific to an
issue. GPT 3.5 does not perform as well as GPT 4 or humans when using either the
proximate or the distal scale. The study provides a first test of the potential
for using LLMs to mediate conflict by diagnosing the root of disagreements in
causal and evaluative codes.
comment: 27 pages, 2 appendices, 21 tables (incl appendices)
☆ Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT
This study investigates the internal representations of verb-particle
combinations within transformer-based large language models (LLMs),
specifically examining how these models capture lexical and syntactic nuances
at different neural network layers. Employing the BERT architecture, we analyse
the representational efficacy of its layers for various verb-particle
constructions such as 'agree on', 'come back', and 'give up'. Our methodology
includes a detailed dataset preparation from the British National Corpus,
followed by extensive model training and output analysis through techniques
like multi-dimensional scaling (MDS) and generalized discrimination value (GDV)
calculations. Results show that BERT's middle layers most effectively capture
syntactic structures, with significant variability in representational accuracy
across different verb categories. These findings challenge the conventional
uniformity assumed in neural network processing of linguistic elements and
suggest a complex interplay between network architecture and linguistic
representation. Our research contributes to a better understanding of how deep
learning models comprehend and process language, offering insights into the
potential and limitations of current neural approaches to linguistic analysis.
This study not only advances our knowledge in computational linguistics but
also prompts further research into optimizing neural architectures for enhanced
linguistic precision.
☆ Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models COLING 2025
Multimodal large language models (MLLMs) combine visual and textual data for
tasks such as image captioning and visual question answering. Proper
uncertainty calibration is crucial, yet challenging, for reliable use in areas
like healthcare and autonomous driving. This paper investigates representative
MLLMs, focusing on their calibration across various scenarios, including before
and after visual fine-tuning, as well as before and after multimodal training
of the base LLMs. We observed miscalibration in their performance, and at the
same time, no significant differences in calibration across these scenarios. We
also highlight how uncertainty differs between text and images and how their
integration affects overall uncertainty. To better understand MLLMs'
miscalibration and their ability to self-assess uncertainty, we construct the
IDK (I don't know) dataset, which is key to evaluating how they handle
unknowns. Our findings reveal that MLLMs tend to give answers rather than admit
uncertainty, but this self-assessment improves with proper prompt adjustments.
Finally, to calibrate MLLMs and enhance model reliability, we propose
techniques such as temperature scaling and iterative prompt optimization. Our
results provide insights into improving MLLMs for effective and responsible
deployment in multimodal applications. Code and IDK dataset:
\href{https://github.com/hfutml/Calibration-MLLM}{https://github.com/hfutml/Calibration-MLLM}.
comment: Accepted to COLING 2025
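To make the temperature-scaling technique mentioned in the calibration paper above concrete, here is a minimal, generic sketch (our illustration, not the authors' code): a single temperature is fitted on held-out validation logits and applied at test time.

```python
# Minimal temperature-scaling sketch: fit one temperature on validation data.
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, lr=0.01, steps=200):
    """logits: (N, C) held-out logits (detached); labels: (N,) gold class indices."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log-temperature so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Usage: divide test-time logits by the fitted temperature before softmax.
# calibrated_probs = (test_logits / T).softmax(dim=-1)
```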
☆ Length Controlled Generation for Black-box LLMs
Large language models (LLMs) have demonstrated impressive instruction
following capabilities, while still struggling to accurately manage the length
of the generated text, which is a fundamental requirement in many real-world
applications. Existing length control methods involve fine-tuning the
parameters of LLMs, which is inefficient and suboptimal for practical use. In
this paper, we propose a novel iterative sampling framework for text length
control, integrating the Metropolis-Hastings algorithm with an importance
sampling acceleration strategy. This framework efficiently and reliably
regulates LLMs to generate length-constrained text without modifying the
underlying parameters, thereby preserving the original capabilities of LLMs.
Experimental results demonstrate that our framework achieves almost 100\%
success rates of length control on Llama3.1 for tasks such as length-controlled
abstractive summarization and length-constrained instruction following, with
minimal additional computational overhead. This also highlights the significant
potential of our method for precise length control across a broader range of
applications, without compromising the versatility of LLMs.
comment: Preprint
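The Metropolis-Hastings-style length control described above can be pictured with the generic sketch below. This is an illustration under our own assumptions, not the paper's exact algorithm: `generate` stands for any black-box LLM call, the energy is a simple word-count deviation, the independent resampling is treated as an approximately symmetric proposal, and the paper's importance-sampling acceleration is omitted.

```python
# Hedged sketch of Metropolis-style length control for a black-box LLM:
# the target distribution rewards closeness to a desired word count.
import math, random

def length_energy(text, target_len, scale=5.0):
    return abs(len(text.split()) - target_len) / scale

def mh_length_control(generate, prompt, target_len, iters=20, seed=0):
    """generate(prompt) -> str is any black-box LLM call (assumed interface)."""
    random.seed(seed)
    current = generate(prompt)
    for _ in range(iters):
        proposal = generate(prompt)   # independent resample, treated as ~symmetric
        delta = length_energy(current, target_len) - length_energy(proposal, target_len)
        if math.log(random.random() + 1e-12) < delta:   # accept with prob min(1, exp(delta))
            current = proposal
        if len(current.split()) == target_len:
            break
    return current
```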
☆ TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation
In this paper, we propose Text-based Open Molecule Generation Benchmark
(TOMG-Bench), the first benchmark to evaluate the open-domain molecule
generation capability of LLMs. TOMG-Bench encompasses a dataset of three major
tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and
customized molecule generation (MolCustom). Each task further contains three
subtasks, with each subtask comprising 5,000 test samples. Given the inherent
complexity of open molecule generation, we have also developed an automated
evaluation system that helps measure both the quality and the accuracy of the
generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the
current limitations and potential areas for improvement in text-guided molecule
discovery. Furthermore, with the assistance of OpenMolIns, a specialized
instruction tuning dataset proposed for solving challenges raised by
TOMG-Bench, Llama3.1-8B could outperform all the open-source general LLMs, even
surpassing GPT-3.5-turbo by 46.5\% on TOMG-Bench. Our codes and datasets are
available through https://github.com/phenixace/TOMG-Bench.
comment: A benchmark for text-based open molecule generation
☆ Learning to Generate Research Idea with Dynamic Control
The rapid advancements in large language models (LLMs) have demonstrated
their potential to accelerate scientific discovery, particularly in automating
the process of research ideation. LLM-based systems have shown promise in
generating hypotheses and research ideas. However, current approaches
predominantly rely on prompting-based pre-trained models, limiting their
ability to optimize generated content effectively. Moreover, they also lack the
capability to handle the complex interdependence and inherent constraints among
novelty, feasibility, and effectiveness, which remain challenging due to the
trade-offs among these dimensions, such as the innovation-feasibility conflict.
To address these limitations, we propose, for the first time, fine-tuning LLMs
to be better idea proposers and introduce a novel
framework that employs a two-stage approach combining Supervised Fine-Tuning
(SFT) and controllable Reinforcement Learning (RL). In the SFT stage, the model
learns foundational patterns from pairs of research papers and follow-up ideas.
In the RL stage, multi-dimensional reward modeling, guided by fine-grained
feedback, evaluates and optimizes the generated ideas across key metrics.
Dimensional controllers enable dynamic adjustment of generation, while a
sentence-level decoder ensures context-aware emphasis during inference. Our
framework provides a balanced approach to research ideation, achieving
high-quality outcomes by dynamically navigating the trade-offs among novelty,
feasibility, and effectiveness.
☆ How good is GPT at writing political speeches for the White House?
Using large language models (LLMs), computers are able to generate a written
text in response to a user request. As this pervasive technology can be
applied in numerous contexts, this study analyses the written style of one LLM
called GPT by comparing its generated speeches with those of the recent US
presidents. To achieve this objective, the State of the Union (SOTU) addresses
written by presidents from Reagan to Biden are contrasted with those produced by
both the GPT-3.5 and GPT-4o versions. Compared to US presidents, GPT tends to overuse the lemma
"we" and produce shorter messages with, on average, longer sentences. Moreover,
GPT adopts an optimistic tone, drawing more often on political (e.g.,
president, Congress), symbolic (e.g., freedom), and abstract terms (e.g.,
freedom). Even when imposing an author's style on GPT, the resulting speech
remains distinct from addresses written by the target author. Finally, the two
GPT versions present distinct characteristics, but both appear overall
dissimilar to true presidential messages.
☆ HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model
Vision-language models (VLMs) have shown impressive abilities in text and
image understanding. However, existing metrics for evaluating the text
generated by VLMs focus exclusively on overall quality, leading to two
limitations: 1) it is challenging to identify which aspects of the text need
improvement from the overall score; 2) metrics may overlook specific evaluation
criteria when predicting an overall score. To address these limitations, we
propose HarmonicEval, a reference-free evaluation metric that aggregates
criterion-wise scores to produce the overall score in a bottom-up manner.
Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE)
dataset, which comprises 18,000 expert human judgments across four
vision-language tasks. Our experiments demonstrate that HarmonicEval achieves
higher correlations with human judgments than conventional metrics while
providing numerical scores for each criterion.
☆ KARRIEREWEGE: A Large Scale Career Path Prediction Dataset COLING
Accurate career path prediction can support many stakeholders, like job
seekers, recruiters, HR, and project managers. However, publicly available data
and tools for career path prediction are scarce. In this work, we introduce
KARRIEREWEGE, a comprehensive, publicly available dataset containing over 500k
career paths, significantly surpassing the size of previously available
datasets. We link the dataset to the ESCO taxonomy to offer a valuable resource
for predicting career trajectories. To tackle the problem of free-text inputs
typically found in resumes, we enhance it by synthesizing job titles and
descriptions resulting in KARRIEREWEGE+. This allows for accurate predictions
from unstructured data, closely aligning with real-world application
challenges. We benchmark existing state-of-the-art (SOTA) models on our dataset
and a prior benchmark and observe improved performance and robustness,
particularly for free-text use cases, due to the synthesized data.
comment: Accepted at COLING Industry Track
☆ LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining AAAI2025
Visual Information Extraction (VIE) plays a crucial role in the comprehension
of semi-structured documents, and several pre-trained models have been
developed to enhance performance. However, most of these works are monolingual
(usually English). Due to the extremely unbalanced quantity and quality of
pre-training corpora between English and other languages, few works can extend
to non-English scenarios. In this paper, we conduct systematic experiments to
show that vision and layout modality hold invariance among images with
different languages. If decoupling language bias from document images, a
vision-layout-based model can achieve impressive cross-lingual generalization.
Accordingly, we present a simple but effective multilingual training paradigm
LDP (Language Decoupled Pre-training) for better utilization of monolingual
pre-training data. Our proposed model LDM (Language Decoupled Model) is first
pre-trained on the language-independent data, where the language knowledge is
decoupled by a diffusion model, and then the LDM is fine-tuned on the
downstream languages. Extensive experiments show that the LDM outperforms all
SOTA multilingual pre-trained models while also maintaining competitiveness on
downstream monolingual/English benchmarks.
comment: Accepted by AAAI2025
☆ Beyond Guilt: Legal Judgment Prediction with Trichotomous Reasoning
In legal practice, judges apply the trichotomous dogmatics of criminal law,
sequentially assessing the elements of the offense, unlawfulness, and
culpability to determine whether an individual's conduct constitutes a crime.
Although current legal large language models (LLMs) show promising accuracy in
judgment prediction, they lack trichotomous reasoning capabilities due to the
absence of an appropriate benchmark dataset, preventing them from predicting
innocent outcomes. As a result, every input is automatically assigned a charge,
limiting their practical utility in legal contexts. To bridge this gap, we
introduce LJPIV, the first benchmark dataset for Legal Judgment Prediction with
Innocent Verdicts. Adhering to the trichotomous dogmatics, we extend three
widely-used legal datasets through LLM-based augmentation and manual
verification. Our experiments with state-of-the-art legal LLMs and novel
strategies that integrate trichotomous reasoning into zero-shot prompting and
fine-tuning reveal: (1) current legal LLMs have significant room for
improvement, with even the best models achieving an F1 score of less than 0.3
on LJPIV; and (2) our strategies notably enhance both in-domain and
cross-domain judgment prediction accuracy, especially for cases resulting in an
innocent verdict.
☆ Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues AAAI 2025
Recent advancements in proactive dialogues have garnered significant
attention, particularly for more complex objectives (e.g. emotion support and
persuasion). Unlike traditional task-oriented dialogues, proactive dialogues
demand advanced policy planning and adaptability, requiring rich scenarios and
comprehensive policy repositories to develop such systems. However, existing
approaches tend to rely on Large Language Models (LLMs) for user simulation and
online learning, leading to biases that diverge from realistic scenarios and
result in suboptimal efficiency. Moreover, these methods depend on manually
defined, context-independent, coarse-grained policies, which not only incur
high expert costs but also raise concerns regarding their completeness. In our
work, we highlight the potential for automatically discovering policies
directly from raw, real-world dialogue records. To this end, we introduce a
novel dialogue policy planning framework, LDPP. It fully automates the process
from mining policies in dialogue records to learning policy planning.
Specifically, we employ a variant of the Variational Autoencoder to discover
fine-grained policies represented as latent vectors. After automatically
annotating the data with these latent policy labels, we propose an Offline
Hierarchical Reinforcement Learning (RL) algorithm in the latent space to
develop effective policy planning capabilities. Our experiments demonstrate
that LDPP outperforms existing methods on two proactive scenarios, even
surpassing ChatGPT with only a 1.8-billion-parameter LLM.
comment: 24 pages, 5 figures, AAAI 2025
☆ CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation
With the adoption of retrieval-augmented generation (RAG), large language
models (LLMs) are expected to ground their generation to the retrieved
contexts. Yet, this is hindered by the position bias of LLMs, which fail to
attend evenly to all contexts. Previous work has addressed this by synthesizing
contexts with perturbed positions of the gold segment, creating a
position-diversified training set. We extend this intuition to propose consistency
regularization with augmentation and distillation. First, we augment each
training instance with its position perturbation to encourage consistent
predictions, regardless of ordering. We also distill behaviors of this pair,
although it can be counterproductive in certain RAG scenarios where the given
order from the retriever is crucial for generation quality. We thus propose
CORD, balancing COnsistency and Rank Distillation. CORD adaptively samples
noise-controlled perturbations from an interpolation space, ensuring both
consistency and respect for the rank prior. Empirical results show this balance
enables CORD to outperform consistently in diverse RAG benchmarks.
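A rough sketch of how consistency regularization and rank distillation might be balanced, to accompany the CORD abstract above. Everything here is an assumption for illustration: the `model(query, contexts, answer_ids)` interface, the prefix-shuffling perturbation, and the interpolation coefficient `lam` are placeholders rather than the paper's formulation.

```python
# Hedged sketch: sample a noise level lam, perturb the top of the retrieved
# ranking, and trade off a consistency term against the rank-respecting NLL.
import random
import torch.nn.functional as F

def cord_style_loss(model, query, contexts, answer_ids, alpha_max=1.0):
    lam = random.uniform(0.0, alpha_max)        # noise level from an interpolation space
    k = max(1, int(lam * len(contexts)))
    perturbed = contexts[:]                     # shuffle only a prefix of the ranking
    prefix = perturbed[:k]
    random.shuffle(prefix)
    perturbed[:k] = prefix

    logits_orig = model(query, contexts, answer_ids)    # (T, V) answer logits, original order (assumed API)
    logits_pert = model(query, perturbed, answer_ids)   # (T, V) answer logits, perturbed order

    p, q = logits_orig.log_softmax(-1), logits_pert.log_softmax(-1)
    consistency = 0.5 * (F.kl_div(q, p, log_target=True, reduction="batchmean")
                         + F.kl_div(p, q, log_target=True, reduction="batchmean"))
    nll = F.cross_entropy(logits_orig, answer_ids)      # respects the retriever's rank prior
    return nll + lam * consistency
```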
☆ Sliding Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models
Large Language Models (LLMs) have shown exciting performance in listwise
passage ranking. Due to the limited input length, existing methods often adopt
the sliding window strategy. Such a strategy, though effective, is inefficient
as it involves repetitive and serialized processing, which usually re-evaluates
relevant passages multiple times. As a result, it incurs redundant API costs,
which are proportional to the number of inference tokens. The development of
long-context LLMs enables the full ranking of all passages within a single
inference, avoiding redundant API costs. In this paper, we conduct a
comprehensive study of long-context LLMs for ranking tasks in terms of
efficiency and effectiveness. Surprisingly, our experiments reveal that full
ranking with long-context LLMs can deliver superior performance in the
supervised fine-tuning setting with a huge efficiency improvement. Furthermore,
we identify two limitations of fine-tuning the full ranking model based on
existing methods: (1) sliding window strategy fails to produce a full ranking
list as a training label, and (2) the language modeling loss cannot emphasize
top-ranked passage IDs in the label. To alleviate these issues, we propose a
new complete listwise label construction approach and a novel importance-aware
learning objective for full ranking. Experiments show the superior performance
of our method over baselines. Our codes are available at
\url{https://github.com/8421BCD/fullrank}.
comment: 14 pages
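One way to picture an importance-aware objective for full listwise ranking, as in the paper above, is to weight per-token losses by the rank of the passage each token belongs to. The sketch below is our own illustration; the 1/log2(rank+1) weighting and the input shapes are assumptions, not the authors' exact objective.

```python
# Hedged sketch: emphasize tokens that emit higher-ranked passage IDs.
import math
import torch
import torch.nn.functional as F

def importance_aware_loss(logits, label_ids, rank_positions):
    """logits: (T, V); label_ids: (T,) gold ranking tokens;
    rank_positions: (T,) 1-based rank of the passage each token belongs to."""
    per_token = F.cross_entropy(logits, label_ids, reduction="none")   # (T,)
    weights = torch.tensor([1.0 / math.log2(r + 1) for r in rank_positions.tolist()])
    return (weights * per_token).sum() / weights.sum()
```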
☆ CitaLaw: Enhancing LLM with Citations in Legal Domain
In this paper, we propose CitaLaw, the first benchmark designed to evaluate
LLMs' ability to produce legally sound responses with appropriate citations.
CitaLaw features a diverse set of legal questions for both laypersons and
practitioners, paired with a comprehensive corpus of law articles and precedent
cases as a reference pool. This framework enables LLM-based systems to retrieve
supporting citations from the reference corpus and align these citations with
the corresponding sentences in their responses. Moreover, we introduce
syllogism-inspired evaluation methods to assess the legal alignment between
retrieved references and LLM-generated responses, as well as their consistency
with user questions. Extensive experiments on 2 open-domain and 7
legal-specific LLMs demonstrate that integrating legal references substantially
enhances response quality. Furthermore, our proposed syllogism-based evaluation
method exhibits strong agreement with human judgments.
☆ ClusterTalk: Corpus Exploration Framework using Multi-Dimensional Exploratory Search
Exploratory search of large text corpora is essential in domains like
biomedical research, where large amounts of research literature are
continuously generated. This paper presents ClusterTalk (the demo video and
source code are available at https://github.com/achouhan93/ClusterTalk), a
framework for corpus exploration using multi-dimensional exploratory search.
Our system integrates document clustering with faceted search, allowing users
to interactively refine their exploration and ask corpus and document-level
queries. Compared to traditional one-dimensional search approaches like keyword
search or clustering, this system improves the discoverability of information
by encouraging a deeper interaction with the corpus. We demonstrate the
functionality of the ClusterTalk framework on four million PubMed abstracts
covering a four-year time frame.
comment: 5 pages, 1 figure
☆ Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models AAAI 2025
Knowledge distillation (KD) has become a prevalent technique for compressing
large language models (LLMs). Existing KD methods are constrained by the need
for identical tokenizers (i.e., vocabularies) between teacher and student
models, limiting their versatility in handling LLMs of different architecture
families. In this paper, we introduce the Multi-Level Optimal Transport
(MultiLevelOT), a novel approach that advances the optimal transport for
universal cross-tokenizer knowledge distillation. Our method aligns the logit
distributions of the teacher and the student at both token and sequence levels
using diverse cost matrices, eliminating the need for dimensional or
token-by-token correspondence. At the token level, MultiLevelOT integrates both
global and local information by jointly optimizing all tokens within a sequence
to enhance robustness. At the sequence level, we efficiently capture complex
distribution structures of logits via the Sinkhorn distance, which approximates
the Wasserstein distance for divergence measures. Extensive experiments on
tasks such as extractive QA, generative QA, and summarization demonstrate that
the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under
various settings. Our approach is robust to different student and teacher
models across model families, architectures, and parameter sizes.
comment: Accepted by AAAI 2025
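For readers unfamiliar with the Sinkhorn distance used at the sequence level in MultiLevelOT, here is a minimal, generic entropic-OT sketch with uniform marginals; the cost matrix and hyperparameters are illustrative and the snippet is not the authors' implementation.

```python
# Minimal Sinkhorn sketch: entropic optimal transport under uniform marginals.
import torch

def sinkhorn(cost, eps=0.1, iters=50):
    """cost: (n, m) cost matrix; returns the entropic OT cost."""
    n, m = cost.shape
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)          # Gibbs kernel
    u = torch.ones(n)
    for _ in range(iters):              # alternating marginal scaling
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = torch.diag(u) @ K @ torch.diag(v)
    return (plan * cost).sum()

# e.g. cost[i, j] = |teacher_score_i - student_score_j| over sorted token-level
# scores, which sidesteps any token-by-token or vocabulary correspondence.
```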
☆ Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment NeurIPS 2024
We study the problem of aligning large language models (LLMs) with human
preference data. Contrastive preference optimization has shown promising
results in aligning LLMs with available preference data by optimizing the
implicit reward associated with the policy. However, the contrastive objective
focuses mainly on the relative values of implicit rewards associated with two
responses while ignoring their actual values, resulting in suboptimal alignment
with human preferences. To address this limitation, we propose calibrated
direct preference optimization (Cal-DPO), a simple yet effective algorithm. We
show that substantial improvement in alignment with the given preferences can
be achieved simply by calibrating the implicit reward to ensure that the
learned implicit rewards are comparable in scale to the ground-truth rewards.
We demonstrate the theoretical advantages of Cal-DPO over existing approaches.
The results of our experiments on a variety of standard benchmarks show that
Cal-DPO remarkably improves off-the-shelf methods.
comment: Accepted by NeurIPS 2024 Main
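To illustrate the calibration idea in Cal-DPO, the sketch below combines the standard DPO contrastive term with a squared-error term that pulls the implicit rewards toward target values. The calibration term, its weight `lam`, and the targets are our own placeholders; the paper's exact objective is not reproduced here.

```python
# Hedged sketch: DPO contrastive loss plus an illustrative reward-calibration term.
import torch.nn.functional as F

def cal_dpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                       beta=0.1, target_w=1.0, target_l=-1.0, lam=1.0):
    r_w = beta * (logp_w - ref_logp_w)      # implicit reward, chosen response
    r_l = beta * (logp_l - ref_logp_l)      # implicit reward, rejected response
    dpo = -F.logsigmoid(r_w - r_l).mean()   # standard contrastive DPO term
    calibration = ((r_w - target_w) ** 2 + (r_l - target_l) ** 2).mean()
    return dpo + lam * calibration
```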
☆ PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization
The emergence of Retrieval-augmented generation (RAG) has alleviated the
issues of outdated and hallucinatory content in the generation of large
language models (LLMs), yet it still reveals numerous limitations. When a
general-purpose LLM serves as the RAG generator, it often suffers from
inadequate response informativeness, response robustness, and citation quality.
Past approaches to tackle these limitations, either by incorporating additional
steps beyond generating responses or optimizing the generator through
supervised fine-tuning (SFT), still failed to align with the RAG requirement
thoroughly. Consequently, optimizing the RAG generator from multiple preference
perspectives while maintaining its end-to-end LLM form remains a challenge. To
bridge this gap, we propose Multiple Perspective Preference Alignment for
Retrieval-Augmented Generation (PA-RAG), a method for optimizing the generator
of RAG systems to align with RAG requirements comprehensively. Specifically, we
construct high-quality instruction fine-tuning data and multi-perspective
preference data by sampling responses of varied quality from the generator
across scenarios with different prompt-document quality. Subsequently, we optimize the
generator using SFT and Direct Preference Optimization (DPO). Extensive
experiments conducted on four question-answer datasets across three LLMs
demonstrate that PA-RAG can significantly enhance the performance of RAG
generators. Our code and datasets are available at
https://github.com/wujwyi/PA-RAG.
☆ Do Large Language Models Defend Inferentialist Semantics?: On the Logical Expressivism and Anti-Representationalism of LLMs
The philosophy of language, which has historically been developed through an
anthropocentric lens, is now being forced to move towards post-anthropocentrism
due to the advent of large language models (LLMs) like ChatGPT (OpenAI) and
Claude (Anthropic), which are considered to possess linguistic abilities comparable to
those of humans. Traditionally, LLMs have been explained through distributional
semantics as their foundational semantics. However, recent research is
exploring alternative foundational semantics beyond distributional semantics.
This paper proposes Robert Brandom's inferentialist semantics as a suitable
foundational semantics for LLMs, specifically focusing on the issue of
linguistic representationalism within this post-anthropocentric trend. Here, we
show that the anti-representationalism and logical expressivism of inferential
semantics, as well as quasi-compositionality, are useful in interpreting the
characteristics and behaviors of LLMs. Further, we propose a \emph{consensus
theory of truths} for LLMs. This paper argues that the characteristics of LLMs
challenge mainstream assumptions in philosophy of language, such as semantic
externalism and compositionality. We believe the argument in this paper leads
to a re-evaluation of anti-representationalist views of language,
potentially leading to new developments in the philosophy of language.
☆ GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
Saumya Saxena, Blake Buchanan, Chris Paxton, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
In Embodied Question Answering (EQA), agents must explore and develop a
semantic understanding of an unseen environment in order to answer a situated
question with confidence. This remains a challenging problem in robotics, due
to the difficulties in obtaining useful semantic representations, updating
these representations online, and leveraging prior world knowledge for
efficient exploration and planning. Aiming to address these limitations, we
propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic
scene graphs (3DSGs) and task relevant images as multi-modal memory for
grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen
environments. We employ a hierarchical planning approach that exploits the
hierarchical nature of 3DSGs for structured planning and semantic-guided
exploration. Through experiments in simulation on the HM-EQA dataset and in the
real world in home and office environments, we demonstrate that our method
outperforms key baselines by completing EQA tasks with higher success rates and
fewer planning steps.
comment: Project website: https://saumyasaxena.github.io/grapheqa
☆ MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong
Despite the rapidly growing demand for multimodal retrieval, progress in this
field remains severely constrained by a lack of training data. In this paper,
we introduce MegaPairs, a novel data synthesis method that leverages vision
language models (VLMs) and open-domain images, together with a massive
synthetic dataset generated from this method. Our empirical analysis shows that
MegaPairs generates high-quality data, enabling the multimodal retriever to
significantly outperform the baseline model trained on 70$\times$ more data
from existing datasets. Moreover, since MegaPairs solely relies on general
image corpora and open-source VLMs, it can be easily scaled up, enabling
continuous improvements in retrieval performance. In this stage, we produced
more than 26 million training instances and trained several models of varying
sizes using this data. These new models achieve state-of-the-art zero-shot
performance across 4 popular composed image retrieval (CIR) benchmarks and the
highest overall performance on the 36 datasets provided by MMEB. They also
demonstrate notable performance improvements with additional downstream
fine-tuning. Our produced dataset, well-trained models, and data synthesis
pipeline will be made publicly available to facilitate the future development
of this field.
☆ Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs
Koshiro Saito, Sakae Mizuki, Masanari Ohi, Taishi Nakamura, Taihei Shiotani, Koki Maeda, Youmi Ma, Kakeru Hattori, Kazuki Fujii, Takumi Okamoto, Shigeki Ishida, Hiroya Takamura, Rio Yokota, Naoaki Okazaki
Why do we build local large language models (LLMs)? What should a local LLM
learn from the target language? Which abilities can be transferred from other
languages? Do language-specific scaling laws exist? To explore these research
questions, we evaluated 35 Japanese, English, and multilingual LLMs on 19
evaluation benchmarks for Japanese and English, taking Japanese as a local
language. Adopting an observational approach, we analyzed correlations of
benchmark scores, and conducted principal component analysis (PCA) on the
scores to derive \textit{ability factors} of local LLMs. We found that training
on English text can improve the scores of academic subjects in Japanese
(JMMLU). In addition, it is unnecessary to specifically train on Japanese text
to enhance abilities for solving Japanese code generation, arithmetic
reasoning, commonsense, and reading comprehension tasks. In contrast, training
on Japanese text could improve question-answering tasks about Japanese
knowledge and English-Japanese translation, which indicates that abilities for
solving these two tasks can be regarded as \textit{Japanese abilities} for
LLMs. Furthermore, we confirmed that the Japanese abilities scale with the
computational budget for Japanese text.
comment: Preprint. Under review
☆ Agent-SafetyBench: Evaluating the Safety of LLM Agents
As large language models (LLMs) are increasingly deployed as agents, their
integration into interactive environments and tool use introduce new safety
challenges beyond those associated with the models themselves. However, the
absence of comprehensive benchmarks for evaluating agent safety presents a
significant barrier to effective assessment and further improvement. In this
paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to
evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349
interaction environments and 2,000 test cases, evaluating 8 categories of
safety risks and covering 10 common failure modes frequently encountered in
unsafe interactions. Our evaluation of 16 popular LLM agents reveals a
concerning result: none of the agents achieves a safety score above 60%. This
highlights significant safety challenges in LLM agents and underscores the
considerable need for improvement. Through quantitative analysis, we identify
critical failure modes and summarize two fundamental safety defects in current
LLM agents: lack of robustness and lack of risk awareness. Furthermore, our
findings suggest that reliance on defense prompts alone is insufficient to
address these safety issues, emphasizing the need for more advanced and robust
strategies. We release Agent-SafetyBench at
\url{https://github.com/thu-coai/Agent-SafetyBench} to facilitate further
research and innovation in agent safety evaluation and improvement.
comment: 23 pages, 9 figures
☆ From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research
Unstructured text data annotation and analysis are fundamental to management
research, often relying on human annotators through crowdsourcing platforms.
While Large Language Models (LLMs) promise to provide a cost-effective and
efficient alternative to human annotation, a systematic workflow is lacking
for evaluating when LLMs are suitable and how to proceed with LLM-based text
annotation in a reproducible manner. This paper addresses this methodological
gap by introducing the ``SILICON" (\textbf{S}ystematic \textbf{I}nference with
\textbf{L}LMs for \textbf{I}nformation \textbf{C}lassificati\textbf{o}n and
\textbf{N}otation) workflow. The workflow integrates established principles of
human annotation with systematic prompt optimization and model selection,
addressing challenges such as developing robust annotation guidelines,
establishing high-quality human baselines, optimizing prompts, and ensuring
reproducibility across LLMs. We validate the SILICON workflow through seven
case studies covering common management research tasks, including business
proposal evaluation, dialog intent and breakdown analysis, and review attribute
detection. Our findings highlight the importance of validating annotation
guideline agreement, the superiority of expert-developed human baselines over
crowdsourced ones, the iterative nature of prompt optimization, and the
necessity of testing multiple LLMs. Notably, we propose a regression-based
methodology to empirically compare LLM outputs across prompts and models. Our
workflow advances management research by establishing reproducible processes
for LLM-based annotation that maintain scientific rigor. We provide practical
guidance for researchers to navigate the evolving landscape of generative AI
tools effectively while maintaining transparency and reproducibility.
☆ Are Longer Prompts Always Better? Prompt Selection in Large Language Models for Recommendation Systems
In large language models (LLM)-based recommendation systems (LLM-RSs),
accurately predicting user preferences by leveraging the general knowledge of
LLMs is possible without requiring extensive training data. By converting
recommendation tasks into natural language inputs called prompts, LLM-RSs can
efficiently solve issues that have been difficult to address due to data
scarcity but are crucial in applications such as cold-start and cross-domain
problems. However, when applying this in practice, selecting the prompt that
matches tasks and data is essential. Although numerous prompts have been
proposed in LLM-RSs and representing the target user in prompts significantly
impacts recommendation accuracy, there are still no clear guidelines for
selecting specific prompts.
In this paper, we categorize and analyze prompts from previous research to
establish practical prompt selection guidelines. Through 450 experiments with
90 prompts and five real-world datasets, we examined the relationship between
prompts and dataset characteristics in recommendation accuracy. We found that
no single prompt consistently outperforms others; thus, selecting prompts on
the basis of dataset characteristics is crucial. Here, we propose a prompt
selection method that achieves higher accuracy with minimal validation data.
Because increasing the number of prompts to explore raises costs, we also
introduce a cost-efficient strategy using high-performance and cost-efficient
LLMs, significantly reducing exploration costs while maintaining high
prediction accuracy. Our work offers valuable insights into prompt selection,
advancing accurate and efficient LLM-RSs.
comment: 15 pages
☆ ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study
Recent advances in language modeling demonstrate the need for high-quality
domain-specific training data, especially for tasks that require specialized
knowledge. General-purpose models, while versatile, often lack the depth needed
for expert-level tasks because of limited domain-specific information. Domain
adaptation training can enhance these models, but it demands substantial,
high-quality data. To address this, we propose ORBIT, a cost-efficient
methodology for curating massive, high-quality domain-specific datasets from
noisy web sources, tailored for training specialist large language models.
Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu
dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning
\textsc{LLaMA-3-8B} on a 1B-token astronomy subset improved performance on the
MMLU astronomy benchmark from 69\% to 76\% and achieved top results on
AstroBench, an astronomy-specific benchmark. Moreover, our model (Orbit-LLaMA)
outperformed \textsc{LLaMA-3-8B-base}, with GPT-4o evaluations preferring it in
73\% of cases across 1000 astronomy-specific questions. Additionally, we
validated ORBIT's generalizability by applying it to law and medicine,
achieving a significant improvement of data quality compared to an unfiltered
baseline. We open-source the ORBIT methodology, including the curated datasets,
the codebase, and the resulting model at
\href{https://github.com/ModeEric/ORBIT-Llama}{https://github.com/ModeEric/ORBIT-Llama}.
☆ All-in-One Tuning and Structural Pruning for Domain-Specific LLMs
Lei Lu, Zhepeng Wang, Ruexue Bao, Mengbing Wang, Fangyi Li, Yawen Wu, Weiwen Jiang, Jie Xu, Yanzhi Wang, Shangqian Gao
Existing pruning techniques for large language models (LLMs) targeting
domain-specific applications typically follow a two-stage process: pruning the
pretrained general-purpose LLMs and then fine-tuning the pruned LLMs on
specific domains. However, the pruning decisions, derived from the pretrained
weights, remain unchanged during fine-tuning, even if the weights have been
updated. Therefore, such a combination of the pruning decisions and the
finetuned weights may be suboptimal, leading to non-negligible performance
degradation. To address these limitations, we propose ATP: All-in-One Tuning
and Structural Pruning, a unified one-stage structural pruning and fine-tuning
approach that dynamically identifies the current optimal substructure
throughout the fine-tuning phase via a trainable pruning decision generator.
Moreover, given the limited available data for domain-specific applications,
Low-Rank Adaptation (LoRA) becomes a common technique to fine-tune the LLMs. In
ATP, we introduce LoRA-aware forward and sparsity regularization to ensure that
the substructures corresponding to the learned pruning decisions can be
directly removed after the ATP process. ATP outperforms the state-of-the-art
two-stage pruning methods on tasks in the legal and healthcare domains. More
specifically, ATP recovers up to 88% and 91% performance of the dense model
when pruning 40% parameters of LLaMA2-7B and LLaMA3-8B models, respectively.
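A simplified picture of a LoRA-aware forward pass with a trainable pruning gate, in the spirit of ATP: the same per-channel gate scales both the frozen base projection and the LoRA update, so channels pushed to zero can be removed outright after training. The sigmoid gate parametrization and the mean-based sparsity regularizer are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch: a linear layer whose frozen weight and LoRA update share one
# trainable per-channel gate, plus a simple sparsity regularizer on the gate.
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)              # frozen pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.gate_logits = nn.Parameter(torch.zeros(out_features))   # relaxed pruning decision

    def forward(self, x):
        gate = torch.sigmoid(self.gate_logits)                       # soft per-channel keep probability
        out = x @ self.weight.t() + x @ self.lora_a.t() @ self.lora_b.t()
        return out * gate                                            # mask base + LoRA jointly

    def sparsity_loss(self):
        return torch.sigmoid(self.gate_logits).mean()                # push gates toward zero
```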
♻ ☆ CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
Large Language Models (LLMs) have revolutionized code generation but require
significant resources and often over-generalize, limiting their task-specific
efficiency. Fine-tuning smaller, open-source LLMs provides a cost-effective
alternative. However, standard supervised approaches rely only on correct
examples, missing valuable insights from failures. We introduce CodeLutra, a
framework that leverages both correct and incorrect code attempts. Instead of
using only correct solutions, CodeLutra applies iterative preference-based
refinement, comparing successful and failed outputs to better approximate
desired results. This approach narrows the performance gap with
state-of-the-art larger models without requiring massive datasets or auxiliary
models. For instance, on a challenging data science coding task, using only 500
samples improved Llama-3-8B's accuracy from 28.2% to 48.6%, approaching GPT-4's
level. By learning from both successes and mistakes, CodeLutra provides a
scalable and efficient path to high-quality code generation, making smaller
open-source models more competitive with leading closed-source alternatives.
comment: 16 pages, 7 figures
♻ ☆ URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base COLING 2025
Aditya Khan, Mason Shipton, David Anugraha, Kaiyao Duan, Phuong H. Hoang, Eric Khiu, A. Seza Doğruöz, En-Shiun Annie Lee
URIEL is a knowledge base offering geographical, phylogenetic, and
typological vector representations for 7970 languages. It includes distance
measures between these vectors for 4005 languages, which are accessible via the
lang2vec tool. Despite being frequently cited, URIEL is limited in terms of
linguistic inclusion and overall usability. To tackle these challenges, we
introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses
these limitations. In addition to expanding typological feature coverage for
2898 languages, URIEL+ improves the user experience with robust, customizable
distance calculations to better suit the needs of users. These upgrades also
offer competitive performance on downstream tasks and provide distances that
better align with linguistic distance studies.
comment: Accepted to COLING 2025
♻ ☆ Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
Language models (LMs), like other neural networks, often favor shortcut
heuristics based on surface-level patterns. Although LMs behave like n-gram
models early in training, they must eventually learn hierarchical syntactic
representations to correctly apply grammatical rules out-of-distribution (OOD).
In this work, we use case studies of English grammar to explore how complex,
diverse training data drives models to generalize OOD. We construct a framework
that unifies our understanding of random variation with training dynamics, rule
selection with memorization, and data diversity with complexity. We show that
these factors are nuanced, and that intermediate levels of diversity and
complexity lead to inconsistent behavior across random seeds and to unstable
training dynamics. Our findings emphasize the critical role of training data in
shaping generalization patterns and illuminate how competing model strategies
lead to inconsistent generalization outcomes across random seeds. Code is
available at https://github.com/sunnytqin/concept_comp.git.
♻ ☆ Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai
This paper introduces Typhoon 2, a series of text and multimodal large
language models optimized for the Thai language. The series includes models for
text, vision, and audio. Typhoon2-Text builds on state-of-the-art open models,
such as Llama 3 and Qwen2, and we perform continual pre-training on a mixture
of English and Thai data. We employ post-training techniques to enhance Thai
language performance while preserving the base models' original capabilities.
We release text models across a range of sizes, from 1 to 70 billion
parameters, available in both base and instruction-tuned variants. To guardrail
text generation, we release Typhoon2-Safety, a classifier enhanced for Thai
cultures and language. Typhoon2-Vision improves Thai document understanding
while retaining general visual capabilities, such as image captioning.
Typhoon2-Audio introduces an end-to-end speech-to-speech model architecture
capable of processing audio, speech, and text inputs and generating both text
and speech outputs.
comment: technical report, 55 pages
♻ ☆ LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings
Zero-shot graph machine learning, especially with graph neural networks
(GNNs), has garnered significant interest due to the challenge of scarce
labeled data. While methods like self-supervised learning and graph prompt
learning have been extensively explored, they often rely on fine-tuning with
task-specific labels, limiting their effectiveness in zero-shot scenarios.
Inspired by the zero-shot capabilities of instruction-fine-tuned large language
models (LLMs), we introduce a novel framework named Token Embedding-Aligned
Graph Language Model (TEA-GLM) that leverages LLMs as cross-dataset and
cross-task zero-shot learners for graph machine learning. Concretely, we
pretrain a GNN, aligning its representations with token embeddings of an LLM.
We then train a linear projector that transforms the GNN's representations into
a fixed number of graph token embeddings without tuning the LLM. A unified
instruction is designed for various graph tasks at different levels, such as
node classification (node-level) and link prediction (edge-level). These design
choices collectively enhance our method's effectiveness in zero-shot learning,
setting it apart from existing methods. Experiments show that our graph token
embeddings help the LLM predictor achieve state-of-the-art performance on
unseen datasets and tasks compared to other methods using LLMs as predictors.
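The projector described in TEA-GLM can be pictured as a single linear map from a pooled GNN representation to a fixed number of "graph token" embeddings in the LLM's input space. The sketch below is illustrative; the dimensions, mean pooling, and module layout are assumptions rather than the authors' code.

```python
# Hedged sketch: project frozen GNN outputs to a fixed number of graph tokens.
import torch
import torch.nn as nn

class GraphTokenProjector(nn.Module):
    def __init__(self, gnn_dim=256, llm_dim=4096, num_graph_tokens=8):
        super().__init__()
        self.proj = nn.Linear(gnn_dim, num_graph_tokens * llm_dim)   # one map for all tokens
        self.num_graph_tokens = num_graph_tokens
        self.llm_dim = llm_dim

    def forward(self, node_reprs):
        """node_reprs: (num_nodes, gnn_dim) frozen GNN outputs."""
        graph_repr = node_reprs.mean(dim=0)                          # pool to one graph vector
        tokens = self.proj(graph_repr)                               # (num_graph_tokens * llm_dim,)
        return tokens.view(self.num_graph_tokens, self.llm_dim)      # prepend to the LLM prompt
```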
♻ ☆ Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts AAAI 2025
Large Language Models (LLMs) possess vast amounts of knowledge within their
parameters, prompting research into methods for locating and editing this
knowledge. Previous work has largely focused on locating entity-related (often
single-token) facts in smaller models. However, several key questions remain
unanswered: (1) How can we effectively locate query-relevant neurons in
decoder-only LLMs, such as Llama and Mistral? (2) How can we address the
challenge of long-form (or free-form) text generation? (3) Are there localized
knowledge regions in LLMs? In this study, we introduce Query-Relevant Neuron
Cluster Attribution (QRNCA), a novel architecture-agnostic framework capable of
identifying query-relevant neurons in LLMs. QRNCA allows for the examination of
long-form answers beyond triplet facts by employing the proxy task of
multi-choice question answering. To evaluate the effectiveness of our detected
neurons, we build two multi-choice QA datasets spanning diverse domains and
languages. Empirical evaluations demonstrate that our method outperforms
baseline methods significantly. Further, analysis of neuron distributions
reveals the presence of visible localized regions, particularly within
different domains. Finally, we show potential applications of our detected
neurons in knowledge editing and neuron-based prediction.
comment: AAAI 2025 Main Track
♻ ☆ SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment
When different groups' values differ, one approach to model alignment is to
steer models at inference time towards each group's preferences. However,
techniques like in-context learning only consider similarity when drawing
few-shot examples and not cross-group differences in values. We propose SPICA,
a framework that accounts for group-level differences during in-context example
retrieval. SPICA introduces three designs: scenario banks, group-informed
retrieval metrics, and in-context alignment prompts. From an evaluation of
SPICA on an alignment task collecting inputs from four demographic groups ($n =
544$), our metrics retrieve in-context examples that more closely match
observed preferences, with the best prompt configuration using multiple
contrastive responses to demonstrate examples. In an end-to-end evaluation ($n
= 120$), we observe that SPICA is higher rated than similarity-based retrieval,
with groups seeing up to a +0.16 point improvement on a 5 point scale.
Additionally, gains from SPICA were more uniform, with all groups benefiting
from alignment rather than only some. Finally, we find that while a
group-agnostic approach can align to aggregated values, it is not most suited
for divergent groups.
♻ ☆ Knowledge Tagging with Large Language Model based Multi-Agent System AAAI 2025
Knowledge tagging for questions is vital in modern intelligent educational
applications, including learning progress diagnosis, practice question
recommendations, and course content organization. Traditionally, these
annotations have been performed by pedagogical experts, as the task demands not
only a deep semantic understanding of question stems and knowledge definitions
but also a strong ability to link problem-solving logic with relevant knowledge
concepts. With the advent of advanced natural language processing (NLP)
algorithms, such as pre-trained language models and large language models
(LLMs), pioneering studies have explored automating the knowledge tagging
process using various machine learning models. In this paper, we investigate
the use of a multi-agent system to address the limitations of previous
algorithms, particularly in handling complex cases involving intricate
knowledge definitions and strict numerical constraints. By demonstrating its
superior performance on the publicly available math question knowledge tagging
dataset, MathKnowCT, we highlight the significant potential of an LLM-based
multi-agent system in overcoming the challenges that previous methods have
encountered. Finally, through an in-depth discussion of the implications of
automating knowledge tagging, we underscore the promising results of deploying
LLM-based algorithms in educational contexts.
comment: Accepted by AAAI 2025 (AAAI/IAAI 2025 Innovative Application Award)
♻ ☆ Beyond Dataset Creation: Critical View of Annotation Variation and Bias Probing of a Dataset for Online Radical Content Detection COLING 2025
The proliferation of radical content on online platforms poses significant
risks, including inciting violence and spreading extremist ideologies. Despite
ongoing research, existing datasets and models often fail to address the
complexities of multilingual and diverse data. To bridge this gap, we introduce
a publicly available multilingual dataset annotated with radicalization levels,
calls for action, and named entities in English, French, and Arabic. This
dataset is pseudonymized to protect individual privacy while preserving
contextual information. Beyond presenting our freely available dataset, we
analyze the annotation process, highlighting biases and disagreements among
annotators and their implications for model performance. Additionally, we use
synthetic data to investigate the influence of socio-demographic traits on
annotation patterns and model predictions. Our work offers a comprehensive
examination of the challenges and opportunities in building robust datasets for
radical content detection, emphasizing the importance of fairness and
transparency in model development.
comment: Accepted to COLING 2025
♻ ☆ LLM-SEM: A Sentiment-Based Student Engagement Metric Using LLMS for E-Learning Platforms
Current methods for analyzing student engagement in e-learning platforms,
including automated systems, often struggle with challenges such as handling
fuzzy sentiment in text comments and relying on limited metadata. Traditional
approaches, such as surveys and questionnaires, also face issues like small
sample sizes and scalability. In this paper, we introduce LLM-SEM (Language
Model-Based Student Engagement Metric), a novel approach that leverages video
metadata and sentiment analysis of student comments to measure engagement. By
utilizing recent Large Language Models (LLMs), we generate high-quality
sentiment predictions to mitigate text fuzziness and normalize key features
such as views and likes. Our holistic method combines comprehensive metadata
with sentiment polarity scores to gauge engagement at both the course and
lesson levels. Extensive experiments were conducted to evaluate various LLM
models, demonstrating the effectiveness of LLM-SEM in providing a scalable and
accurate measure of student engagement. We fine-tuned TXLM-RoBERTa using
human-annotated sentiment datasets to enhance prediction accuracy and utilized
Llama 3B and Gemma 9B from Ollama.
♻ ☆ G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
Evaluation metric of visual captioning is important yet not thoroughly
explored. Traditional metrics like BLEU, METEOR, CIDEr, and ROUGE often miss
semantic depth, while trained metrics such as CLIP-Score, PAC-S, and Polos are
limited in zero-shot scenarios. Advanced Language Model-based metrics also
struggle with aligning to nuanced human preferences. To address these issues,
we introduce G-VEval, a novel metric inspired by G-Eval and powered by the new
GPT-4o. G-VEval uses chain-of-thought reasoning in large multimodal models and
supports three modes: reference-free, reference-only, and combined,
accommodating both video and image inputs. We also propose MSVD-Eval, a new
dataset for video captioning evaluation, to establish a more transparent and
consistent framework for both human experts and evaluation metrics. It is
designed to address the lack of clear criteria in existing datasets by
introducing distinct dimensions of Accuracy, Completeness, Conciseness, and
Relevance (ACCR). Extensive results show that G-VEval outperforms existing
methods in correlation with human annotations, as measured by Kendall tau-b and
Kendall tau-c. This provides a flexible solution for diverse captioning tasks
and suggests a straightforward yet effective approach for large language models
to understand video content, paving the way for advancements in automated
captioning. Codes are available at https://github.com/ztangaj/gveval
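The correlation measures reported above (Kendall tau-b and tau-c) are straightforward to compute; a small example with hypothetical scores, using SciPy's kendalltau:

```python
# Compute Kendall tau-b and tau-c between metric scores and human ratings.
from scipy.stats import kendalltau

metric_scores = [0.81, 0.64, 0.92, 0.55, 0.73]   # hypothetical metric outputs
human_scores  = [4, 3, 5, 2, 4]                   # hypothetical human ratings

tau_b, p_b = kendalltau(metric_scores, human_scores, variant="b")
tau_c, p_c = kendalltau(metric_scores, human_scores, variant="c")
print(f"Kendall tau-b = {tau_b:.3f}, tau-c = {tau_c:.3f}")
```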
♻ ☆ To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models EMNLP 2024
Polysemy and synonymy are two crucial interrelated facets of lexical
ambiguity. While both phenomena are widely documented in lexical resources and
have been studied extensively in NLP, leading to dedicated systems, they are
often considered independently in practical problems. While many tasks
dealing with polysemy (e.g. Word Sense Disambiguation or Induction) highlight
the role of a word's senses, the study of synonymy is rooted in the study of
concepts, i.e. meanings shared across the lexicon. In this paper, we introduce
Concept Induction, the unsupervised task of learning a soft clustering among
words that defines a set of concepts directly from data. This task generalizes
Word Sense Induction. We propose a bi-level approach to Concept Induction that
leverages both a local lemma-centric view and a global cross-lexicon view to
induce concepts. We evaluate the obtained clustering on SemCor's annotated data
and obtain good performance (BCubed F1 above 0.60). We find that the local and
the global levels are mutually beneficial to induce concepts and also senses in
our setting. Finally, we create static embeddings representing our induced
concepts and use them on the Word-in-Context task, obtaining competitive
performance with the State-of-the-Art.
comment: Published in EMNLP 2024 main conference proceedings
♻ ☆ Benchmarking Large Language Models for Math Reasoning Tasks
The use of Large Language Models (LLMs) in mathematical reasoning has become
a cornerstone of related research, demonstrating the intelligence of these
models and enabling potential practical applications through their advanced
performance, such as in educational settings. Despite the variety of datasets
and in-context learning algorithms designed to improve the ability of LLMs to
automate mathematical problem solving, the lack of comprehensive benchmarking
across different datasets makes it complicated to select an appropriate model
for specific tasks. In this project, we present a benchmark that fairly
compares seven state-of-the-art in-context learning algorithms for mathematical
problem solving across five widely used mathematical datasets on four powerful
foundation models. Furthermore, we explore the trade-off between efficiency and
performance, highlighting the practical applications of LLMs for mathematical
reasoning. Our results indicate that larger foundation models like GPT-4o and
LLaMA 3-70B can solve mathematical reasoning tasks independently of the
concrete prompting strategy, whereas for smaller models the in-context learning
approach significantly influences performance. Moreover, the optimal prompt depends
on the chosen foundation model. We open-source our benchmark code to support
the integration of additional models in future research.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis AAAI 2025
Prosody contains rich information beyond the literal meaning of words, which
is crucial for the intelligibility of speech. Current models still fall short
in phrasing and intonation; they not only miss or misplace breaks when
synthesizing long sentences with complex structures but also produce unnatural
intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis
(TTS) model with a flow-matching (FM) backbone that aims to enhance the
phrasing and intonation aspects of prosody. ProsodyFM introduces two key
components: a Phrase Break Encoder to capture initial phrase break locations,
followed by a Duration Predictor for the flexible adjustment of break
durations; and a Terminal Intonation Encoder which learns a bank of intonation
shape tokens combined with a novel Pitch Processor for more robust modeling of
human-perceived intonation change. ProsodyFM is trained with no explicit
prosodic labels and yet can uncover a broad spectrum of break durations and
intonation patterns. Experimental results demonstrate that ProsodyFM can
effectively improve the phrasing and intonation aspects of prosody, thereby
enhancing the overall intelligibility compared to four state-of-the-art (SOTA)
models. Out-of-distribution experiments show that this prosody improvement
further endows ProsodyFM with superior generalizability to unseen complex
sentences and speakers. Our case study intuitively illustrates the powerful and
fine-grained controllability of ProsodyFM over phrasing and intonation.
comment: Accepted by AAAI 2025
♻ ☆ Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets
The rise of online platforms exacerbated the spread of hate speech, demanding
scalable and effective detection. However, the accuracy of hate speech
detection systems heavily relies on human-labeled data, which is inherently
susceptible to biases. While previous work has examined the issue, the
interplay between the characteristics of the annotator and those of the target
of the hate is still unexplored. We fill this gap by leveraging an extensive
dataset with rich socio-demographic information of both annotators and targets,
uncovering how human biases manifest in relation to the target's attributes.
Our analysis surfaces the presence of widespread biases, which we
quantitatively describe and characterize based on their intensity and
prevalence, revealing marked differences. Furthermore, we compare human biases
with those exhibited by persona-based LLMs. Our findings indicate that while
persona-based LLMs do exhibit biases, these differ significantly from those of
human annotators. Overall, our work offers new and nuanced results on human
biases in hate speech annotations, as well as fresh insights into the design of
AI-driven hate speech detection systems.
♻ ☆ From Bench to Bedside: A Review of Clinical Trials in Drug Discovery and Development
Tianyang Wang, Ming Liu, Benji Peng, Xinyuan Song, Charles Zhang, Xintian Sun, Qian Niu, Junyu Liu, Silin Chen, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Yunze Wang, Yichao Zhang, Cheng Fei, Lawrence KQ Yan
Clinical trials are an indispensable part of the drug development process,
bridging the gap between basic research and clinical application. During the
development of new drugs, clinical trials are used not only to evaluate the
safety and efficacy of the drug but also to explore its dosage, treatment
regimens, and potential side effects. This review discusses the various stages
of clinical trials, including Phase I (safety assessment), Phase II
(preliminary efficacy evaluation), Phase III (large-scale validation), and
Phase IV (post-marketing surveillance), highlighting the characteristics of
each phase and their interrelationships. Additionally, the paper addresses the
major challenges encountered in clinical trials, such as ethical issues,
subject recruitment difficulties, diversity and representativeness concerns,
and proposes strategies for overcoming these challenges. With the advancement
of technology, innovative technologies such as artificial intelligence, big
data, and digitalization are gradually transforming clinical trial design and
implementation, improving trial efficiency and data quality. The article also
looks forward to the future of clinical trials, particularly the impact of
emerging therapies such as gene therapy and immunotherapy on trial design, as
well as the importance of regulatory reforms and global collaboration. In
conclusion, the core role of clinical trials in drug development will continue
to drive the progress of innovative drug development and clinical treatment.
comment: 11 pages
♻ ☆ ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models NeurIPS 2024
Large language models (LLMs) exhibit hallucinations in long-form
question-answering tasks across various domains and wide applications. Current
hallucination detection and mitigation datasets are limited in domain and
size, and they are hard to scale due to prohibitive labor costs and the
insufficient reliability of existing hallucination annotators. To facilitate the scalable
oversight of LLM hallucinations, this paper introduces an iterative
self-training framework that simultaneously and progressively scales up the
hallucination annotation dataset and improves the accuracy of the hallucination
annotator. Based on the Expectation Maximization (EM) algorithm, in each
iteration, the framework first applies a hallucination annotation pipeline to
annotate a scaled dataset and then trains a more accurate hallucination
annotator on the dataset. This new hallucination annotator is adopted in the
hallucination annotation pipeline used for the next iteration. Extensive
experimental results demonstrate that the final hallucination
annotator, with only 7B parameters, surpasses the performance of GPT-4 and
obtains new state-of-the-art hallucination detection results on HaluEval and
HalluQA by zero-shot inference. Such an annotator can not only evaluate the
hallucination levels of various LLMs on a large-scale dataset but also help
to mitigate hallucination in LLM generations, with the Natural Language
Inference (NLI) metric increasing from 25% to 37% on HaluEval.
comment: Accepted by NeurIPS 2024. Dataset, code, and model are released at
https://github.com/open-compass/ANAH
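A minimal sketch of the iterative annotate-then-retrain loop described in the abstract above; `annotate` and `train_annotator` are hypothetical stand-ins for the paper's hallucination-annotation pipeline and annotator training, and the EM-style scaling schedule is an assumption for illustration only.

```python
# Hypothetical sketch of the iterative self-training loop: each round annotates
# a larger slice of the corpus, then retrains a (hopefully better) annotator.

def annotate(annotator, documents):
    """Apply the current annotator to label each document for hallucinations."""
    return [(doc, annotator(doc)) for doc in documents]

def train_annotator(labelled_data):
    """Fit a new annotator on the (document, label) pairs; stub implementation."""
    positive = {doc for doc, label in labelled_data if label}
    return lambda doc: doc in positive  # trivially memorises the labels

def iterative_self_training(seed_annotator, corpus, rounds=3, batch=100):
    annotator, dataset = seed_annotator, []
    for r in range(rounds):
        # E-step analogue: annotate a progressively larger slice of the corpus.
        new_batch = corpus[len(dataset):len(dataset) + batch * (r + 1)]
        dataset += annotate(annotator, new_batch)
        # M-step analogue: retrain a more accurate annotator on everything so far.
        annotator = train_annotator(dataset)
    return annotator, dataset
```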
♻ ☆ BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment
Large language models (LLMs), with their powerful generative capabilities and
vast knowledge, empower various tasks in everyday life. However, these
abilities are primarily concentrated in high-resource languages, leaving
low-resource languages with weaker generative capabilities and relatively
limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore
crucial for serving over 100 linguistic communities worldwide. An intuitive
approach to enhance the multilingual capabilities would be to construct
instruction data for various languages, but constructing instruction data for
over 100 languages is prohibitively costly. In this paper, we introduce BayLing
2, which efficiently transfers generative capabilities and knowledge from
high-resource languages to low-resource languages through language alignment.
To achieve this, we constructed a dataset of 3.2 million instructions,
comprising high-resource language instructions (Chinese and English) and
cross-lingual instructions for 100+ languages and performed instruction tuning
based on the dataset to facilitate the capability transfer between languages.
Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B,
and BayLing-2-8B, and conducted a comprehensive evaluation of BayLing. For
multilingual translation across 100+ languages, BayLing shows superior
performance compared to open-source models of similar scale. For multilingual
knowledge and understanding benchmarks, BayLing achieves significant
improvements across over 20 low-resource languages, demonstrating its
capability of effective knowledge transfer from high-resource to low-resource
languages. Furthermore, results on English benchmarks indicate that BayLing
maintains high performance in high-resource languages while enhancing
performance in low-resource languages. Demo, homepage, code, and models of
BayLing are available.
comment: BayLing 2's online demo: http://nlp.ict.ac.cn/bayling/demo. BayLing
2's code and models: https://github.com/ictnlp/BayLing
♻ ☆ Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different
ontologies and resolves their conceptual heterogeneity by aligning related
entities. OM systems currently have two prevailing design paradigms:
conventional knowledge-based expert systems and newer machine learning-based
predictive systems. While large language models (LLMs) and LLM agents have
revolutionised data engineering and have been applied creatively in many
domains, their potential for OM remains underexplored. This study introduces a
novel agent-powered LLM-based design paradigm for OM systems. With
consideration of several specific challenges in leveraging LLM agents for OM,
we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
consisting of two Siamese agents for retrieval and matching, with a set of
simple OM tools. Our framework is implemented in a proof-of-concept system.
Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks
over state-of-the-art OM systems show that our system can achieve results very
close to the long-standing best performance on simple OM tasks and can
significantly improve the performance on complex and few-shot OM tasks.
comment: 19 pages, 13 figures, 4 tables
♻ ☆ An $\mathbf{L^*}$ Algorithm for Deterministic Weighted Regular Languages
Extracting finite state automata (FSAs) from black-box models offers a
powerful approach to gaining interpretable insights into complex model
behaviors. To support this pursuit, we present a weighted variant of Angluin's
(1987) $\mathbf{L^*}$ algorithm for learning FSAs. We stay faithful to the
original algorithm, devising a way to exactly learn deterministic weighted FSAs
whose weights support division. Furthermore, we formulate the learning process
in a manner that highlights the connection with FSA minimization, showing how
$\mathbf{L^*}$ directly learns a minimal automaton for the target language.
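A tiny illustration of the role division plays when learning deterministic weighted automata: two observation-table rows can be merged only if one is a scalar multiple of the other, which can be tested by element-wise division. The function name, tolerance, and real-valued weights are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: proportionality test for observation-table rows, assuming real-valued
# weights where division is defined (the paper's requirement on the weights).

def rows_equivalent(row_a, row_b, tol=1e-9):
    """Return True if row_b is a scalar multiple of row_a (same equivalence class)."""
    ratio = None
    for a, b in zip(row_a, row_b):
        if abs(a) < tol and abs(b) < tol:
            continue                      # both entries are (numerically) zero
        if abs(a) < tol or abs(b) < tol:
            return False                  # zero patterns differ
        if ratio is None:
            ratio = b / a                 # division fixes the candidate scalar
        elif abs(b / a - ratio) > tol:
            return False
    return True

# Example: rows collected for two prefixes over the same suffix set.
print(rows_equivalent([1.0, 2.0, 0.0], [0.5, 1.0, 0.0]))  # True, scalar 0.5
print(rows_equivalent([1.0, 2.0, 0.0], [0.5, 1.5, 0.0]))  # False
```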
♻ ☆ MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
In a real-world RAG system, the current query often involves spoken ellipses
and ambiguous references from dialogue contexts, necessitating query rewriting
to better describe the user's information needs. However, traditional
context-based rewriting provides minimal enhancement to downstream generation
tasks due to the lengthy process from query rewriting to response generation. Some researchers
try to utilize reinforcement learning with generation feedback to assist the
rewriter, but these sparse rewards provide little guidance in most cases,
leading to unstable training and generation results. We find that the user's
needs are also reflected in the gold document, the retrieved documents, and the ground truth.
Therefore, by feeding back these multi-aspect dense rewards to query rewriting,
more stable and satisfactory responses can be achieved. In this paper, we
propose a novel query rewriting method MaFeRw, which improves RAG performance
by integrating multi-aspect feedback from both the retrieval process and
generated results. Specifically, we first use manual data to train a T5 model
for the rewriter initialization. Next, we design three metrics as reinforcement
learning feedback: the similarity between the rewritten query and the gold
document, the ranking metrics, and ROUGE between the generation and the ground
truth. Inspired by RLAIF, we train three kinds of reward models for the above
metrics to achieve more efficient training. Finally, we combine the scores of
these reward models as feedback, and use the PPO algorithm to explore the optimal
query rewriting strategy. Experimental results on two conversational RAG
datasets demonstrate that MaFeRw achieves superior generation metrics and more
stable training compared to baselines.
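A schematic of how the three feedback signals named above might be combined into a single scalar reward before a PPO update; the weights, the reward-model interfaces, and the trailing PPO call are hypothetical stand-ins, not the paper's implementation.

```python
# Hypothetical reward aggregation for a rewritten query, combining the three
# signals described above: similarity to the gold document, a ranking metric
# over the retrieved documents, and ROUGE between the generation and the truth.

def combined_reward(rewrite, gold_doc, retrieved, generation, reference,
                    reward_models, weights=(0.4, 0.3, 0.3)):
    sim_rm, rank_rm, rouge_rm = reward_models          # three learned reward models
    r_sim = sim_rm(rewrite, gold_doc)                  # query-to-gold-document similarity
    r_rank = rank_rm(rewrite, retrieved)               # ranking quality of retrieval
    r_rouge = rouge_rm(generation, reference)          # overlap with the ground truth
    w1, w2, w3 = weights
    return w1 * r_sim + w2 * r_rank + w3 * r_rouge     # scalar reward fed to PPO

# The scalar would then drive a standard PPO step on the T5 rewriter, e.g.
#   reward = combined_reward(...); ppo_trainer.step(query_tokens, rewrite_tokens, reward)
```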
♻ ☆ LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni
There is an increasing trend towards evaluating NLP models with LLMs instead
of human judgments, raising questions about the validity of these evaluations,
as well as their reproducibility in the case of proprietary models. We provide
JUDGE-BENCH, an extensible collection of 20 NLP datasets with human annotations
covering a broad range of evaluated properties and types of data, and
comprehensively evaluate 11 current LLMs, covering both open-weight and
proprietary models, for their ability to replicate the annotations. Our
evaluations show substantial variance across models and datasets. Models are
reliable evaluators on some tasks, but overall display substantial variability
depending on the property being evaluated, the expertise level of the human
judges, and whether the language is human or model-generated. We conclude that
LLMs should be carefully validated against human judgments before being used as
evaluators.
♻ ☆ TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs
Specializing large language models (LLMs) for local deployment in
domain-specific use cases is necessary for strong performance while meeting
latency and privacy constraints. However, conventional task-specific adaptation
approaches do not show simultaneous memory saving and inference speedup at
deployment time. Practical compression techniques like quantization and pruning
require dedicated hardware or kernel support to achieve measured inference
speedup. We develop TrimLLM based on the layer-wise specialization phenomenon
we empirically observed and verified on contemporary LLMs. TrimLLM reduces the
depth of LLMs via progressive layer dropping. We show it retains LLMs' capacity
in specific domains and achieves inference speedup irrespective of hardware and
deep learning frameworks. We evaluated TrimLLM on LLMs of various sizes for
inference; models adapted on medical, legal, and financial datasets all
demonstrate $2.1-5.7\times$ inference speedup on consumer GPUs and up to
$3.1\times$ speedup on A100 when compared to state-of-the-art model compression
algorithms, with no loss in accuracy at 50$\sim$60\% model compression ratio.
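A minimal sketch of greedy progressive layer dropping against a held-out domain calibration metric; the layer-list interface, scoring function, and tolerance are assumptions for illustration, not the released TrimLLM algorithm.

```python
# Sketch: iteratively drop the transformer layer whose removal hurts a
# domain-specific calibration metric the least, stopping once accuracy degrades.

def progressive_layer_dropping(layers, evaluate, max_drop_ratio=0.5, tol=0.005):
    """`layers` is a list of layer modules; `evaluate` scores a layer stack."""
    kept = list(layers)
    baseline = evaluate(kept)
    while len(kept) > len(layers) * (1 - max_drop_ratio):
        # Try removing each remaining layer and keep the least harmful removal.
        candidates = [(evaluate(kept[:i] + kept[i + 1:]), i) for i in range(len(kept))]
        best_score, best_i = max(candidates)
        if best_score < baseline - tol:
            break                      # further dropping would cost accuracy
        kept.pop(best_i)               # shrink the model depth by one layer
    return kept
```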
♻ ☆ RAZOR: Sharpening Knowledge by Cutting Bias with Unsupervised Text Rewriting AAAI'25
Despite the widespread use of LLMs due to their superior performance in
various tasks, their high computational costs often lead potential users to opt
for the pretraining-finetuning pipeline. However, biases prevalent in manually
constructed datasets can introduce spurious correlations between tokens and
labels, creating so-called shortcuts and hindering the generalizability of
fine-tuned models. Existing debiasing methods often rely on prior knowledge of
specific dataset biases, which is challenging to acquire a priori. We propose
RAZOR (Rewriting And Zero-bias Optimization Refinement), a novel, unsupervised,
and data-focused debiasing approach based on text rewriting for shortcut
mitigation. RAZOR leverages LLMs to iteratively rewrite potentially biased text
segments by replacing them with heuristically selected alternatives in a
shortcut space defined by token statistics and positional information. This
process aims to align surface-level text features more closely with diverse
label distributions, thereby promoting the learning of genuine linguistic
patterns. Compared with unsupervised SoTA models, RAZOR improves the F1 score by
3.5% on the FEVER dataset and by 6.5% on the MNLI and SNLI datasets.
Additionally, RAZOR effectively mitigates specific known biases, reducing
bias-related terms by a factor of two without requiring prior bias information, a result
that is on par with SoTA models that leverage prior information. Our work
prioritizes data manipulation over architectural modifications, emphasizing the
pivotal role of data quality in enhancing model performance and fairness. This
research contributes to developing more robust evaluation benchmarks for
debiasing methods by incorporating metrics for bias reduction and overall model
efficacy.
comment: Shuo and Bardh contributed equally. Accepted to AAAI'25, Paper #17117
♻ ☆ When Every Token Counts: Optimal Segmentation for Low-Resource Language Models COLING 2025
Traditional greedy tokenization methods have been a critical step in Natural
Language Processing (NLP), influencing how text is converted into tokens and
directly impacting model performance. While subword tokenizers like Byte-Pair
Encoding (BPE) are widely used, questions remain about their optimality across
model scales and languages. In this work, we demonstrate through extensive
experiments that an optimal BPE configuration significantly reduces token count
compared to greedy segmentation, yielding improvements in token-saving
percentages and performance benefits, particularly for smaller models. We
evaluate tokenization performance across various intrinsic and extrinsic tasks,
including generation and classification. Our findings suggest that
compression-optimized tokenization strategies could provide substantial
advantages for multilingual and low-resource language applications,
highlighting a promising direction for further research and inclusive NLP.
comment: LoResLM @ COLING 2025
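One way to realise token-count-minimising segmentation against a fixed subword vocabulary is a shortest-path dynamic program over the character string, as sketched below; this illustrates the contrast with greedy merging and is not necessarily the exact procedure used in the paper.

```python
# Minimal-token segmentation of a string given a fixed subword vocabulary,
# via dynamic programming (fewest tokens = shortest path over character positions).

def optimal_segment(text, vocab, max_piece_len=16):
    n = len(text)
    best = [None] * (n + 1)      # best[i] = (token_count, segmentation) for text[:i]
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_piece_len), i):
            piece = text[j:i]
            if best[j] is not None and piece in vocab:
                cand = (best[j][0] + 1, best[j][1] + [piece])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    return best[n]               # None if the string cannot be covered

vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
print(optimal_segment("unbelievable", vocab))  # (3, ['un', 'believ', 'able'])
```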
♻ ☆ Deep CLAS: Deep Contextual Listen, Attend and Spell
Contextual-LAS (CLAS) has been shown effective in improving Automatic Speech
Recognition (ASR) of rare words. It relies on phrase-level contextual modeling
and attention-based relevance scoring without an explicit contextual constraint,
which leads to insufficient use of contextual information. In this work, we
propose deep CLAS to make better use of contextual information. We introduce a bias
loss that forces the model to focus on contextual information. The query of the bias attention
is also enriched to improve the accuracy of the bias attention score. To get
fine-grained contextual information, we replace phrase-level encoding with
character-level encoding and encode contextual information with conformer
rather than LSTM. Moreover, we directly use the bias attention score to correct
the output probability distribution of the model. Experiments are conducted on the public
AISHELL-1 and AISHELL-NER datasets. On AISHELL-1, compared to CLAS baselines, deep CLAS
obtains a 65.78% relative recall increase and a 53.49% relative F1-score increase in the
named entity recognition scenario.
comment: Submitted to JUSTC
♻ ☆ Towards an optimised evaluation of teachers' discourse: The case of engaging messages
Evaluating teachers' skills is crucial for enhancing education quality and
student outcomes. Teacher discourse, significantly influencing student
performance, is a key component. However, coding this discourse can be
laborious. This study addresses this issue by introducing a new methodology for
optimising the assessment of teacher discourse. The research consisted of two
studies, both within the framework of engaging messages used by secondary
education teachers. The first study involved training two large language models
on real-world examples from audio-recorded lessons over two academic years to
identify and classify the engaging messages from the lessons' transcripts. This
resulted in sensitivities of 84.31% and 91.11%, and specificities of 97.69% and
86.36% in identification and classification, respectively. The second study
applied these models to transcripts of audio-recorded lessons from a third
academic year to examine the frequency and distribution of message types by
educational level and moment of the academic year. Results showed teachers
predominantly use messages emphasising engagement benefits, linked to improved
outcomes, while one-third highlighted non-engagement disadvantages, associated
with increased anxiety. The use of engaging messages declined in Grade 12 and
towards the academic year's end. These findings suggest potential interventions
to optimise engaging message use, enhancing teaching quality and student
outcomes.
♻ ☆ Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
Low-resource machine translation (MT) presents a diversity of community needs
and application challenges that remain poorly understood. To complement surveys
and focus groups, which tend to rely on small samples of respondents, we
propose an observational study on actual usage patterns of a specialized MT
service for the Tetun language, which is the lingua franca in Timor-Leste. Our
analysis of 100,000 translation requests reveals patterns that challenge
assumptions based on existing corpora. We find that users, many of them
students on mobile devices, typically translate text from a high-resource
language into Tetun across diverse domains including science, healthcare, and
daily life. This contrasts sharply with available Tetun corpora, which are
dominated by news articles covering government and social issues. Our results
suggest that MT systems for minority languages like Tetun should prioritize
accuracy on domains relevant to educational contexts, in the high-resource to
low-resource direction. More broadly, this study demonstrates how observational
analysis can inform low-resource language technology development, by grounding
research in practical community needs.
♻ ☆ Piece of Table: A Divide-and-Conquer Approach for Selecting Sub-Tables in Table Question Answering
Applying language models (LMs) to tables is challenging due to the inherent
structural differences between two-dimensional tables and one-dimensional text
for which the LMs were originally designed. Furthermore, when applying
linearized tables to LMs, the maximum token lengths often imposed in
self-attention calculations make it difficult to comprehensively understand the
context spread across large tables. To address these challenges, we present
PieTa (Piece of Table), a new framework for sub-table-based question answering
(QA). PieTa operates through an iterative process of dividing tables into
smaller windows, using LMs to select relevant cells within each window, and
merging these cells into a sub-table. This multi-resolution approach captures
dependencies across multiple rows and columns while avoiding the limitations
caused by long context inputs. Instantiated as a simple iterative sub-table
union algorithm, PieTa demonstrates improved performance over previous
sub-table-based QA approaches.
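A rough sketch of the divide-select-merge loop described above, assuming a hypothetical `select_cells(window, question)` call in place of the LM-based cell selector.

```python
# Hypothetical divide-and-conquer sub-table selection: split the table into
# fixed-size windows, keep only the LM-selected rows per window, merge, and
# repeat until the table fits in a single window.

def select_subtable(table, question, select_cells, window_rows=4):
    current = table                                  # list of rows (lists of cells)
    while len(current) > window_rows:
        merged = []
        for start in range(0, len(current), window_rows):
            window = current[start:start + window_rows]
            merged.extend(select_cells(window, question))   # rows kept by the LM
        if len(merged) >= len(current):              # no progress; stop to avoid looping
            break
        current = merged
    return current                                   # final sub-table passed to QA
```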
♻ ☆ Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions EMNLP 2024
Large language models are susceptible to jailbreak attacks, which can result
in the generation of harmful content. While prior defenses mitigate these risks
by perturbing or inspecting inputs, they ignore competing objectives, the
underlying cause of alignment failures. In this paper, we propose
Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive
decoding to address the root causes of jailbreak issues. We first define the
Competitive Index to quantify alignment failures and utilize feedback from
self-evaluation to compute post-alignment logits. Then, AED adaptively combines
the post-alignment logits with the original logits to obtain harmless and
helpful distributions. Consequently, our method enhances safety alignment while
maintaining helpfulness. We conduct experiments across five models and four
common jailbreaks, with the results validating the effectiveness of our
approach. Code is available at https://github.com/GIGABaozi/AED.git.
comment: Accepted by EMNLP 2024, 15 pages, 5 figures
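A schematic of decoding-time logit mixing in the spirit of the defense above: the original logits are adaptively blended with post-alignment logits, with the mixing weight driven by a competition score. The linear weighting scheme and function names here are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

# Illustrative token-level blending of original and post-alignment logits.
# `competitive_index` in [0, 1] is assumed to measure how strongly the prompt
# pulls the model away from its alignment (1 = severe alignment failure).

def aed_like_logits(original_logits, post_alignment_logits, competitive_index):
    alpha = float(np.clip(competitive_index, 0.0, 1.0))
    # The more severe the alignment failure, the more weight on the aligned logits.
    return (1.0 - alpha) * original_logits + alpha * post_alignment_logits

orig = np.array([2.0, 0.5, -1.0])       # toy next-token logits
aligned = np.array([-1.0, 0.5, 2.0])    # toy logits after self-evaluation feedback
print(aed_like_logits(orig, aligned, competitive_index=0.8))
```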
♻ ☆ Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli
Encoder-only transformer models such as BERT offer a great performance-size
tradeoff for retrieval and classification tasks with respect to larger
decoder-only models. Despite being the workhorse of numerous production
pipelines, there have been limited Pareto improvements to BERT since its
release. In this paper, we introduce ModernBERT, bringing modern model
optimizations to encoder-only models and representing a major Pareto
improvement over older encoders. Trained on 2 trillion tokens with a native
8192 sequence length, ModernBERT models exhibit state-of-the-art results on a
large pool of evaluations encompassing diverse classification tasks and both
single and multi-vector retrieval on different domains (including code). In
addition to strong downstream performance, ModernBERT is also the most speed-
and memory-efficient encoder and is designed for inference on common GPUs.
♻ ☆ Improving Retrieval Augmented Language Model with Self-Reasoning AAAI 2025
The Retrieval-Augmented Language Model (RALM) has shown remarkable
performance on knowledge-intensive tasks by incorporating external knowledge
during inference, which mitigates the factual hallucinations inherent in large
language models (LLMs). Despite these advancements, challenges persist in the
implementation of RALMs, particularly concerning their reliability and
traceability. To be specific, the irrelevant document retrieval may result in
unhelpful response generation or even deteriorate the performance of LLMs,
while the lack of proper citations in generated outputs complicates efforts to
verify the trustworthiness of the models. To this end, we propose a novel
self-reasoning framework aimed at improving the reliability and traceability of
RALMs, whose core idea is to leverage reasoning trajectories generated by the
LLM itself. The framework involves constructing self-reason trajectories with
three processes: a relevance-aware process, an evidence-aware selective
process, and a trajectory analysis process. We have evaluated our framework
across four public datasets (two short-form QA datasets, one long-form QA
dataset, and one fact verification dataset) to demonstrate the superiority of
our method, which can outperform existing state-of-the-art models and can
achieve comparable performance with GPT-4, while only using 2,000 training
samples.
comment: AAAI 2025 (main conference)
♻ ☆ Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
Existing Large Vision-Language Models (LVLMs) excel at matching concepts
across multi-modal inputs but struggle with compositional concepts and
high-level relationships between entities. This paper introduces Progressive
multi-granular Vision-Language alignments (PromViL), a novel framework to
enhance LVLMs' ability to perform grounded compositional visual reasoning
tasks. Our approach constructs a hierarchical structure of multi-modal
alignments, ranging from simple to complex concepts. By progressively aligning
textual descriptions with corresponding visual regions, our model learns to
leverage contextual information from lower levels to inform higher-level
reasoning. To facilitate this learning process, we introduce a data generation
process that creates a novel dataset derived from Visual Genome, providing a
wide range of nested compositional vision-language pairs. Experimental results
demonstrate that our PromViL framework significantly outperforms baselines on
various visual grounding and compositional question answering tasks. The code
is available at: https://github.com/lqh52/PromViL.
♻ ☆ Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models
Despite significant ongoing efforts in safety alignment, large language
models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks
that can induce harmful behaviors, including through the use of adversarial
suffixes. Building on prior research, we hypothesize that these adversarial
suffixes are not mere bugs but may represent features that can dominate the
LLM's behavior. To evaluate this hypothesis, we conduct several experiments.
First, we demonstrate that benign features can be effectively made to function
as adversarial suffixes, i.e., we develop a feature extraction method to
extract sample-agnostic features from a benign dataset in the form of suffixes
and show that these suffixes may effectively compromise safety alignment.
Second, we show that adversarial suffixes generated from jailbreak attacks may
contain meaningful features, i.e., appending the same suffix to different
prompts results in responses exhibiting specific characteristics. Third, we
show that such benign-yet-safety-compromising features can be easily introduced
through fine-tuning using only benign datasets. As a result, we are able to
completely eliminate GPT's safety alignment in a black-box setting through
fine-tuning with only benign data. Our code and data are available at
\url{https://github.com/suffix-maybe-feature/adver-suffix-maybe-features}.
♻ ☆ VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models COLING 2025
Visual Language Models (VLMs) have rapidly progressed with the recent success
of large language models. However, there have been few attempts to incorporate
efficient linear Recurrent Neural Network (RNN) architectures into VLMs. In
this study, we introduce VisualRWKV, the first application of a linear RNN
model to multimodal learning tasks, leveraging the pre-trained RWKV language
model. We propose a data-dependent recurrence and sandwich prompts to enhance
our modeling capabilities, along with a 2D image scanning mechanism to enrich
the processing of visual sequences. Extensive experiments demonstrate that
VisualRWKV achieves competitive performance compared to Transformer-based
models like LLaVA-1.5 on various benchmarks. Compared to LLaVA-1.5, VisualRWKV
has a speed advantage of 3.98 times and can save 54% of GPU memory when
reaching an inference length of 24K tokens. To facilitate further research and
analysis, we have made the checkpoints and the associated code publicly
accessible at the following GitHub repository: see
https://github.com/howard-hou/VisualRWKV.
comment: Accepted at COLING 2025 main conference
♻ ☆ Fairness in Large Language Models: A Taxonomic Survey
Large Language Models (LLMs) have demonstrated remarkable success across
various domains. However, despite their promising performance in numerous
real-world applications, most of these algorithms lack fairness considerations.
Consequently, they may lead to discriminatory outcomes against certain
communities, particularly marginalized populations, prompting extensive study
in fair LLMs. At the same time, fairness in LLMs, in contrast to fairness in
traditional machine learning, entails distinct backgrounds, taxonomies, and
fulfillment techniques. To this end, this survey presents a comprehensive
overview of recent advances in the existing literature concerning fair LLMs.
Specifically, a brief introduction to LLMs is provided, followed by an analysis
of factors contributing to bias in LLMs. Additionally, the concept of fairness
in LLMs is discussed categorically, summarizing metrics for evaluating bias in
LLMs and existing algorithms for promoting fairness. Furthermore, resources for
evaluating bias in LLMs, including toolkits and datasets, are summarized.
Finally, existing research challenges and open questions are discussed.
♻ ☆ Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation
Large Vision-Language Models (VLMs) have demonstrated remarkable performance
across multimodal tasks by integrating vision encoders with large language
models (LLMs). However, these models remain vulnerable to adversarial attacks.
Among such attacks, Universal Adversarial Perturbations (UAPs) are especially
powerful, as a single optimized perturbation can mislead the model across
various input images. In this work, we introduce a novel UAP specifically
designed for VLMs: the Doubly-Universal Adversarial Perturbation (Doubly-UAP),
capable of universally deceiving VLMs across both image and text inputs. To
successfully disrupt the vision encoder's fundamental process, we analyze the
core components of the attention mechanism. After identifying value vectors in
the middle-to-late layers as the most vulnerable, we optimize Doubly-UAP in a
label-free manner with a frozen model. Despite treating the LLM as a black box,
Doubly-UAP achieves high attack success rates on VLMs, consistently
outperforming baseline methods across vision-language tasks. Extensive ablation
studies and analyses further demonstrate the robustness of Doubly-UAP and
provide insights into how it influences internal attention mechanisms.
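A high-level sketch of label-free universal-perturbation optimization against a frozen vision encoder, targeting mid-to-late-layer value vectors as described above; the specific loss (pushing those activations away from their clean values), the Adam update, and the hook-based helper are assumptions about implementation details, not the released code.

```python
import torch

# Sketch: optimise a single image perturbation that disrupts value vectors in
# selected attention layers of a frozen vision encoder (label-free objective).
# `value_vectors(model, images, layers)` is a hypothetical hook-based helper
# returning the value activations of the requested layers.

def optimize_doubly_uap(model, loader, value_vectors, target_layers,
                        epsilon=8 / 255, steps=1000, lr=0.01):
    delta = torch.zeros(3, 224, 224, requires_grad=True)      # universal perturbation
    opt = torch.optim.Adam([delta], lr=lr)
    for _, images in zip(range(steps), loader):
        clean_v = [v.detach() for v in value_vectors(model, images, target_layers)]
        adv_v = value_vectors(model, images + delta, target_layers)
        # Maximise deviation of value vectors from their clean counterparts.
        loss = -sum(torch.nn.functional.mse_loss(a, c) for a, c in zip(adv_v, clean_v))
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)                    # keep perturbation bounded
    return delta.detach()
```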
♻ ☆ Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
Reward modeling is crucial for aligning large language models (LLMs) with
human preferences, especially in reinforcement learning from human feedback
(RLHF). However, current reward models mainly produce scalar scores and
struggle to incorporate critiques in a natural language format. We hypothesize
that predicting both critiques and the scalar reward would improve reward
modeling ability. Motivated by this, we propose Critic-RM, a framework that
improves reward models using self-generated critiques without extra
supervision. Critic-RM employs a two-stage process: generating and filtering
high-quality critiques, followed by joint fine-tuning on reward prediction and
critique generation. Experiments across benchmarks show that Critic-RM improves
reward modeling accuracy by 3.7%-7.3% compared to standard reward models and
LLM judges, demonstrating strong performance and data efficiency. Additional
studies further validate the effectiveness of the generated critiques in rectifying
flawed reasoning steps, yielding 2.5%-3.2% gains in reasoning accuracy.
comment: 20 pages
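A compact sketch of the two-stage recipe described above: generate candidate critiques, filter them, then fine-tune jointly on critique generation and reward prediction. The helper functions, model methods, and loss weighting are hypothetical stand-ins.

```python
# Hypothetical two-stage Critic-RM-style training. `generate_critiques`,
# `quality_filter`, and the reward-model methods are illustrative placeholders.

def build_critique_dataset(model, preference_pairs, generate_critiques, quality_filter):
    data = []
    for prompt, chosen, rejected in preference_pairs:
        critiques = generate_critiques(model, prompt, chosen, rejected)   # stage 1a: sample critiques
        data.extend(quality_filter(critiques))                            # stage 1b: keep high-quality ones
    return data

def joint_finetune_loss(reward_model, batch, lam=0.5):
    # Stage 2: joint objective combining critique generation (language-modelling
    # loss) and reward prediction (preference-ranking loss).
    lm_loss = reward_model.critique_lm_loss(batch)          # assumed method
    reward_loss = reward_model.reward_ranking_loss(batch)   # assumed method
    return lam * lm_loss + (1 - lam) * reward_loss
```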
♻ ☆ KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting COLING 2025
Thilini Wijesiriwardene, Ruwan Wickramarachchi, Sreeram Vennam, Vinija Jain, Aman Chadha, Amitava Das, Ponnurangam Kumaraguru, Amit Sheth
Making analogies is fundamental to cognition. Proportional analogies, which
consist of four terms, are often used to assess linguistic and cognitive
abilities. For instance, completing analogies like "Oxygen is to Gas as ___
is to ___" requires identifying the semantic relationship (e.g., "type of")
between the first pair of terms ("Oxygen" and "Gas") and finding a second pair
that shares the same relationship (e.g., "Aluminum" and "Metal"). In this work,
we introduce a 15K Multiple-Choice Question Answering (MCQA) dataset for
proportional analogy completion and evaluate the performance of contemporary
Large Language Models (LLMs) in various knowledge-enhanced prompt settings.
Specifically, we augment prompts with three types of knowledge: exemplar,
structured, and targeted. Our results show that despite extensive training
data, solving proportional analogies remains challenging for current LLMs, with
the best model achieving an accuracy of 55%. Notably, we find that providing
targeted knowledge can better assist models in completing proportional
analogies compared to providing exemplars or collections of structured
knowledge. Our code and data are available at:
https://github.com/Thiliniiw/KnowledgePrompts/
comment: Accepted at COLING 2025
♻ ☆ UOR: Universal Backdoor Attacks on Pre-trained Language Models ACL
Backdoors implanted in pre-trained language models (PLMs) can be transferred
to various downstream tasks, which poses a severe security threat. However,
most existing backdoor attacks against PLMs are untargeted and task-specific.
The few targeted and task-agnostic methods rely on manually pre-defined triggers and
output representations, which prevents the attacks from being more effective and
general. In this paper, we first summarize the requirements that a more
threatening backdoor attack against PLMs should satisfy, and then propose a new
backdoor attack method called UOR, which breaks the bottleneck of the previous
approach by turning manual selection into automatic optimization. Specifically,
we define poisoned supervised contrastive learning which can automatically
learn the more uniform and universal output representations of triggers for
various PLMs. Moreover, we use gradient search to select appropriate trigger
words which can be adaptive to different PLMs and vocabularies. Experiments
show that our method can achieve better attack performance on various text
classification tasks compared to manual methods. Further, we tested our method
on PLMs with different architectures, different usage paradigms, and more
difficult tasks, which demonstrated the universality of our method.
comment: ACL-Findings 2024
♻ ☆ DavIR: Data Selection via Implicit Reward for Large Language Models
Haotian Zhou, Tingkai Liu, Qianli Ma, Yufeng Zhang, Jianbo Yuan, Pengfei Liu, Yang You, Hongxia Yang
We introduce DavIR, a model-based data selection method for post-training
Large Language Models. DavIR generalizes Reducible Holdout Loss to the core-set
selection problem of causal language modeling, and quantifies the learnability
of a given datum with respect to a pre-trained LLM based on the relative reduction
in loss during fine-tuning, a metric we show to be closely related to the
implicit reward model described in Direct Preference Optimization (DPO). We
show that 6% of Alpaca dataset selected with DavIR can steer both the LLaMA and
Gemma model family to produce superior performance compared to the same models
trained on the full 52K dataset. We also show that Alpaca dataset compressed
with DavIR can be combined with GSM8K dataset to effectively balance
open-domain freeform QA and mathematical reasoning capabilities. Finally, we
apply the DavIR objective to DPO and develop a normalized DavIR-DPO objective
which improves the alignment performance of the Zephyr-7B-SFT model by 8% (relative) on
AlpacaEval, compared against training with the vanilla DPO objective.
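A small sketch of scoring each training example by the relative loss reduction it enjoys during fine-tuning and keeping the top fraction; the exact normalization and selection thresholds used in the paper may differ.

```python
# Score = relative reduction in per-example loss between the pretrained model
# and a briefly fine-tuned reference, used to rank examples for selection.

def davir_like_scores(loss_before, loss_after):
    """Both inputs map example id -> loss; higher score = more learnable datum."""
    return {i: (loss_before[i] - loss_after[i]) / max(loss_before[i], 1e-8)
            for i in loss_before}

def select_top_fraction(scores, fraction=0.06):
    k = max(1, int(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]   # ids of kept examples

scores = davir_like_scores({"a": 2.0, "b": 1.0, "c": 3.0},
                           {"a": 1.0, "b": 0.9, "c": 2.9})
print(select_top_fraction(scores, fraction=0.34))  # ['a'] -- largest relative drop
```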
♻ ☆ Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? NeurIPS 2024
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
As artificial intelligence systems grow more powerful, there has been
increasing interest in "AI safety" research to address emerging and future
risks. However, the field of AI safety remains poorly defined and
inconsistently measured, leading to confusion about how researchers can
contribute. This lack of clarity is compounded by the unclear relationship
between AI safety benchmarks and upstream general capabilities (e.g., general
knowledge and reasoning). To address these issues, we conduct a comprehensive
meta-analysis of AI safety benchmarks, empirically analyzing their correlation
with general capabilities across dozens of models and providing a survey of
existing directions in AI safety. Our findings reveal that many safety
benchmarks highly correlate with both upstream model capabilities and training
compute, potentially enabling "safetywashing" -- where capability improvements
are misrepresented as safety advancements. Based on these findings, we propose
an empirical foundation for developing more meaningful safety metrics and
define AI safety in a machine learning research context as a set of clearly
delineated research goals that are empirically separable from generic
capabilities advancements. In doing so, we aim to provide a more rigorous
framework for AI safety research, advancing the science of safety evaluations
and clarifying the path towards measurable progress.
comment: NeurIPS 2024
♻ ☆ SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Caishuang Huang, Wanxu Zhao, Rui Zheng, Huijie Lv, Shihan Dou, Sixian Li, Xiao Wang, Enyu Zhou, Junjie Ye, Yuming Yang, Tao Gui, Qi Zhang, Xuanjing Huang
As the development of large language models (LLMs) rapidly advances, securing
these models effectively without compromising their utility has become a
pivotal area of research. However, current defense strategies against jailbreak
attacks (i.e., efforts to bypass security protocols) often suffer from limited
adaptability, restricted general capability, and high cost. To address these
challenges, we introduce SafeAligner, a methodology implemented at the decoding
stage to fortify defenses against jailbreak attacks. We begin by developing two
specialized models: the Sentinel Model, which is trained to foster safety, and
the Intruder Model, designed to generate riskier responses. SafeAligner
leverages the disparity in security levels between the responses from these
models to differentiate between harmful and beneficial tokens, effectively
guiding the safety alignment by altering the output token distribution of the
target model. Extensive experiments show that SafeAligner can increase the
likelihood of beneficial tokens, while reducing the occurrence of harmful ones,
thereby ensuring secure alignment with minimal loss to generality.
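An illustrative version of decoding-time guidance that shifts the target model's token distribution toward the Sentinel model and away from the Intruder model; the additive logit correction and its strength are assumptions for illustration, not the paper's exact rule.

```python
import numpy as np

# Illustrative SafeAligner-style logit correction: tokens favoured by the safe
# Sentinel model and disfavoured by the risky Intruder model get boosted.

def safety_guided_logits(target_logits, sentinel_logits, intruder_logits, beta=1.0):
    disparity = sentinel_logits - intruder_logits          # per-token safety signal
    return target_logits + beta * disparity                # shift the target distribution

def sample_token(logits):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

target = np.array([1.0, 2.0, 0.0])
sentinel = np.array([0.0, 3.0, -2.0])
intruder = np.array([2.0, 0.0, 1.0])
print(safety_guided_logits(target, sentinel, intruder))    # risky token 0 is suppressed
```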
♻ ☆ Agent Planning with World Knowledge Model NeurIPS 2024
Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Recent endeavors towards directly using large language models (LLMs) as agent
models to execute interactive planning tasks have shown commendable results.
Despite their achievements, however, they still struggle with brainless
trial-and-error in global planning and generating hallucinatory actions in
local planning due to their poor understanding of the ``real'' physical world.
Imitating humans' mental world knowledge model, which provides global prior
knowledge before the task and maintains local dynamic knowledge during the
task, in this paper we introduce a parametric World Knowledge Model (WKM) to
facilitate agent planning. Concretely, we steer the agent model to
self-synthesize knowledge from both expert and sampled trajectories. Then we
develop WKM, providing prior task knowledge to guide the global planning and
dynamic state knowledge to assist the local planning. Experimental results on
three complex real-world simulated datasets with three state-of-the-art
open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our
method can achieve superior performance compared to various strong baselines.
Besides, we analyze to illustrate that our WKM can effectively alleviate the
blind trial-and-error and hallucinatory action issues, providing strong support
for the agent's understanding of the world. Other interesting findings include:
1) our instance-level task knowledge can generalize better to unseen tasks, 2)
weak WKM can guide strong agent model planning, and 3) unified WKM training has
promising potential for further development. The code is available at
https://github.com/zjunlp/WKM.
comment: NeurIPS 2024
♻ ☆ WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models NeurIPS 2024
Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Large language models (LLMs) need knowledge updates to keep pace with ever-growing
world facts and to correct hallucinated responses, motivating methods for
lifelong model editing. Where the updated knowledge resides in memories is a
fundamental question for model editing. In this paper, we find that editing
either long-term memory (direct model parameters) or working memory
(non-parametric knowledge of neural network activations/representations by
retrieval) will result in an impossible triangle -- reliability,
generalization, and locality cannot be realized together in lifelong
editing settings. For long-term memory, directly editing the parameters will
cause conflicts with irrelevant pretrained knowledge or previous edits (poor
reliability and locality). For working memory, retrieval-based activations can
hardly make the model understand the edits and generalize (poor
generalization). Therefore, we propose WISE to bridge the gap between memories.
In WISE, we design a dual parametric memory scheme, which consists of the main
memory for the pretrained knowledge and a side memory for the edited knowledge.
We only edit the knowledge in the side memory and train a router to decide
which memory to go through when given a query. For continual editing, we devise
a knowledge-sharding mechanism where different sets of edits reside in distinct
subspaces of parameters, and are subsequently merged into a shared memory
without conflicts. Extensive experiments show that WISE can outperform previous
model editing methods and overcome the impossible triangle under lifelong model
editing of question answering, hallucination, and out-of-distribution settings
across trending LLM architectures, e.g., GPT, LLaMA, and Mistral. Code is
available at https://github.com/zjunlp/EasyEdit.
comment: NeurIPS 2024
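A toy sketch of the dual-memory routing idea described above: a router decides per query whether to answer from the frozen main memory (the pretrained model) or from the edited side memory; the lexical-overlap routing rule and memory interfaces here are simplified assumptions, not WISE's trained router.

```python
# Toy illustration of routing between a frozen "main memory" (pretrained model)
# and an editable "side memory" holding post-hoc knowledge edits.

class DualMemory:
    def __init__(self, main_model, router_threshold=0.5):
        self.main_model = main_model          # pretrained weights, never edited
        self.side_memory = {}                 # edited knowledge: query -> answer
        self.threshold = router_threshold

    def edit(self, query, new_answer):
        self.side_memory[query] = new_answer  # edits only touch the side memory

    def _router_score(self, query):
        # Stand-in router: lexical overlap with stored edits; a trained router
        # would use model activations instead.
        if not self.side_memory:
            return 0.0
        return max(len(set(query.split()) & set(k.split())) / max(len(k.split()), 1)
                   for k in self.side_memory)

    def answer(self, query):
        if self._router_score(query) >= self.threshold:
            key = max(self.side_memory,
                      key=lambda k: len(set(query.split()) & set(k.split())))
            return self.side_memory[key]      # served from the side memory
        return self.main_model(query)         # fall back to pretrained knowledge

dm = DualMemory(main_model=lambda q: "pretrained answer")
dm.edit("capital of France", "Paris (edited)")
print(dm.answer("what is the capital of France"))   # routed to the side memory
print(dm.answer("tallest mountain"))                 # routed to the main model
```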
♻ ☆ DialSim: A Real-Time Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents
Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, Edward Choi
Recent advancements in Large Language Models (LLMs) have significantly
enhanced the capabilities of conversational agents, making them applicable to
various fields (e.g., education). Despite their progress, the evaluation of the
agents often overlooks the complexities of real-world conversations, such as
real-time interactions, multi-party dialogues, and extended contextual
dependencies. To bridge this gap, we introduce DialSim, a real-time dialogue
simulator. In this simulator, an agent is assigned the role of a character from
popular TV shows, requiring it to respond to spontaneous questions using past
dialogue information and to distinguish between known and unknown information.
Key features of DialSim include assessing the agent's ability to respond within
a reasonable time limit, handling long-term multi-party dialogues, and
evaluating performance under randomized questioning with LongDialQA, a novel,
high-quality question-answering dataset. Our experiments using DialSim reveal
the strengths and weaknesses of the latest conversational agents, offering
valuable insights for future advancements in conversational AI. DialSim is
available at https://dialsim.github.io/.
♻ ☆ Knowledge Circuits in Pretrained Transformers NeurIPS 2024
The remarkable capabilities of modern large language models are rooted in
their vast repositories of knowledge encoded within their parameters, enabling
them to perceive the world and engage in reasoning. The inner workings of how
these models store knowledge have long been a subject of intense interest and
investigation among researchers. To date, most studies have concentrated on
isolated components within these models, such as the multilayer perceptrons and
attention heads. In this paper, we delve into the computation graph of the
language model to uncover the knowledge circuits that are instrumental in
articulating specific knowledge. The experiments, conducted with GPT2 and
TinyLLAMA, have allowed us to observe how certain information heads, relation
heads, and Multilayer Perceptrons collaboratively encode knowledge within the
model. Moreover, we evaluate the impact of current knowledge editing techniques
on these knowledge circuits, providing deeper insights into the functioning and
constraints of these editing methodologies. Finally, we utilize knowledge
circuits to analyze and interpret language model behaviors such as
hallucinations and in-context learning. We believe the knowledge circuits hold
potential for advancing our understanding of Transformers and guiding the
improved design of knowledge editing. Code and data are available in
https://github.com/zjunlp/KnowledgeCircuits.
comment: NeurIPS 2024, 26 pages