Computation and Language 49
♻ ☆ Safe and Responsible Large Language Model: Can We Balance Bias Reduction and Language Understanding in Large Language Models?
Shaina Raza, Oluwanifemi Bamgbose, Shardul Ghuge, Fatemeh Tavakol, Deepak John Reji, Syed Raza Bashir
Large Language Models (LLMs) have significantly advanced various NLP tasks.
However, these models often risk generating unsafe text that perpetuates
biases. Current approaches to produce unbiased outputs from LLMs can reduce
biases but at the expense of knowledge retention. In this research, we address
the question of whether producing safe (unbiased) outputs through LLMs can
retain knowledge and language understanding. In response, we developed the
Safe and Responsible Large Language Model (\textbf{SR}$_{\text{LLM}}$), an
LLM that has been instruction fine-tuned on top of already safe LLMs (e.g.,
Llama2 or related) to diminish biases in generated text. To achieve our goals,
we compiled a specialized dataset designed to train our model in identifying
and correcting biased text. We conduct experiments, both on this custom data
and out-of-distribution test sets, to show the bias reduction and knowledge
retention. The results confirm that \textbf{SR}$_{\text{LLM}}$ outperforms
traditional fine-tuning and prompting methods in both reducing biases and
preserving the integrity of language knowledge. The significance of our
findings lies in demonstrating that instruction fine-tuning can provide a more
robust solution for bias reduction in LLMs. We have made our code and data
available at
\href{https://github.com/shainarazavi/Safe-Responsible-LLM}{Safe-LLM}.
♻ ☆ Large Language Models Assume People are More Rational than We Really are
In order for AI systems to communicate effectively with people, they must
understand how we make decisions. However, people's decisions are not always
rational, so the implicit internal models of human decision-making in Large
Language Models (LLMs) must account for this. Previous empirical evidence seems
to suggest that these implicit models are accurate -- LLMs offer believable
proxies of human behavior, acting as we expect humans would in everyday
interactions. However, by comparing LLM behavior and predictions to a large
dataset of human decisions, we find that this is actually not the case: when
both simulating and predicting people's choices, a suite of cutting-edge LLMs
(GPT-4o & 4-Turbo, Llama-3-8B & 70B, Claude 3 Opus) assume that people are more
rational than we really are. Specifically, these models deviate from human
behavior and align more closely with a classic model of rational choice --
expected value theory. Interestingly, people also tend to assume that other
people are rational when interpreting their behavior. As a consequence, when we
compare the inferences that LLMs and people draw from the decisions of others
using another psychological dataset, we find that these inferences are highly
correlated. Thus, the implicit decision-making models of LLMs appear to be
aligned with the human expectation that other people will act rationally,
rather than with how people actually act.
♻ ☆ Predicting Text Preference Via Structured Comparative Reasoning
Jing Nathan Yan, Tianqi Liu, Justin T Chiu, Jiaming Shen, Zhen Qin, Yue Yu, Yao Zhao, Charu Lakshmanan, Yair Kurzion, Alexander M. Rush, Jialu Liu, Michael Bendersky
Comparative reasoning plays a crucial role in text preference prediction;
however, large language models (LLMs) often demonstrate inconsistencies in
their reasoning. While approaches like Chain-of-Thought improve accuracy in
many other settings, they struggle to consistently distinguish the similarities
and differences of complex texts. We introduce SC, a prompting approach that
predicts text preferences by generating structured intermediate comparisons. SC
begins by proposing aspects of comparison, followed by generating textual
comparisons under each aspect. We select consistent comparisons with a pairwise
consistency comparator that ensures each aspect's comparisons clearly
distinguish differences between texts, significantly reducing hallucination and
improving consistency. Our comprehensive evaluations across various NLP tasks,
including summarization, retrieval, and automatic rating, demonstrate that SC
equips LLMs to achieve state-of-the-art performance in text preference
prediction.
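A rough sketch of the SC pipeline described above, in Python with a hypothetical llm(prompt) completion callable (the prompts and the consistency check are illustrative paraphrases of the abstract, not the authors' exact implementation):

    from typing import Callable

    def sc_preference(llm: Callable[[str], str], text_a: str, text_b: str,
                      query: str, n_aspects: int = 4) -> str:
        # 1) Propose aspects of comparison.
        aspects = llm(
            f"List {n_aspects} aspects for comparing two candidate texts "
            f"answering: {query}. One aspect per line."
        ).splitlines()

        consistent = []
        for aspect in aspects:
            # 2) Generate a textual comparison under each aspect, in both orders.
            cmp_ab = llm(f"Compare A vs B on '{aspect}'.\nA: {text_a}\nB: {text_b}")
            cmp_ba = llm(f"Compare B vs A on '{aspect}'.\nB: {text_b}\nA: {text_a}")
            # 3) Keep only aspects whose comparisons agree under both orders,
            #    a stand-in for the paper's pairwise consistency comparator.
            verdict = llm("Do these two comparisons agree on which text is "
                          f"better? Answer yes or no.\n1: {cmp_ab}\n2: {cmp_ba}")
            if verdict.strip().lower().startswith("yes"):
                consistent.append((aspect, cmp_ab))

        # 4) Aggregate the surviving comparisons into a final preference.
        evidence = "\n".join(f"- {a}: {c}" for a, c in consistent)
        return llm(f"Given these comparisons:\n{evidence}\n"
                   f"Which text is preferred for '{query}', A or B?")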
♻ ☆ Does Writing with Language Models Reduce Content Diversity? ICLR 2024
Large language models (LLMs) have led to a surge in collaborative writing
with model assistance. As different users incorporate suggestions from the same
model, there is a risk of decreased diversity in the produced content,
potentially limiting diverse perspectives in public discourse. In this work, we
measure the impact of co-writing on diversity via a controlled experiment,
where users write argumentative essays in three setups -- using a base LLM
(GPT3), a feedback-tuned LLM (InstructGPT), and writing without model help. We
develop a set of diversity metrics and find that writing with InstructGPT (but
not GPT3) results in a statistically significant reduction in diversity.
Specifically, it increases the similarity between the writings of different
authors and reduces the overall lexical and content diversity. We additionally
find that this effect is mainly attributable to InstructGPT contributing less
diverse text to co-written essays. In contrast, the user-contributed text
remains unaffected by model collaboration. This suggests that the recent
improvement in generation quality from adapting models to human feedback might
come at the cost of more homogeneous and less diverse content.
comment: ICLR 2024
♻ ☆ Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives ACL 2024
Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
Humans use multiple senses to comprehend the environment. Vision and language
are two of the most vital senses since they allow us to easily communicate our
thoughts and perceive the world around us. There has been a lot of interest in
creating video-language understanding systems with human-like senses since a
video-language pair can mimic both our linguistic medium and visual environment
with temporal dynamics. In this survey, we review the key tasks of these
systems and highlight the associated challenges. Based on the challenges, we
summarize their methods from model architecture, model training, and data
perspectives. We also conduct performance comparison among the methods, and
discuss promising directions for future research.
comment: Accepted at ACL 2024 (Findings)
♻ ☆ Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling
Forensic authorship profiling uses linguistic markers to infer
characteristics about an author of a text. This task is paralleled in dialect
classification, where a prediction is made about the linguistic variety of a
text based on the text itself. While there have been significant advances in
recent years in variety classification, forensic linguistics rarely relies on
these approaches due to their lack of transparency, among other reasons. In
this paper we therefore explore the explainability of machine learning
approaches considering the forensic context. We focus on variety classification
as a means of geolinguistic profiling of unknown texts based on social media
data from the German-speaking area. For this, we identify the lexical items
that are the most impactful for the variety classification. We find that the
extracted lexical features are indeed representative of their respective
varieties and note that the trained models also rely on place names for
classifications.
♻ ☆ Are LLMs Rational Investors? A Study on Detecting and Reducing the Financial Bias in LLMs
Yuhang Zhou, Yuchen Ni, Yunhui Gan, Zhangyue Yin, Xiang Liu, Jian Zhang, Sen Liu, Xipeng Qiu, Guangnan Ye, Hongfeng Chai
Large Language Models (LLMs) are increasingly adopted in financial analysis
for interpreting complex market data and trends. However, their use is
challenged by intrinsic biases (e.g., risk-preference bias) and a superficial
understanding of market intricacies, necessitating a thorough assessment of
their financial insight. To address these issues, we introduce Financial Bias
Indicators (FBI), a framework with components like Bias Unveiler, Bias
Detective, Bias Tracker, and Bias Antidote to identify, detect, analyze, and
eliminate irrational biases in LLMs. By combining behavioral finance principles
with bias examination, we evaluate 23 leading LLMs and propose a de-biasing
method based on financial causal knowledge. Results show varying degrees of
financial irrationality among models, influenced by their design and training.
Models trained specifically on financial datasets may exhibit more
irrationality, and even larger financial language models (FinLLMs) can show
more bias than smaller, general models. We utilize four prompt-based methods
incorporating causal debiasing, effectively reducing financial biases in these
models. This work enhances the understanding of LLMs' bias in financial
applications, laying the foundation for developing more reliable and rational
financial analysis tools.
♻ ☆ Rethinking Machine Ethics -- Can LLMs Perform Moral Reasoning through the Lens of Moral Theories?
Making moral judgments is an essential step toward developing ethical AI
systems. Prevalent approaches are mostly implemented in a bottom-up manner,
which uses a large set of annotated data to train models based on crowd-sourced
opinions about morality. These approaches have been criticized for
overgeneralizing the moral stances of a limited group of annotators and lacking
explainability. This work proposes a flexible top-down framework to steer
(Large) Language Models (LMs) to perform moral reasoning with well-established
moral theories from interdisciplinary research. The theory-guided top-down
framework can incorporate various moral theories. Our experiments demonstrate
the effectiveness of the proposed framework on datasets derived from moral
theories. Furthermore, we show the alignment between different moral theories
and existing morality datasets. Our analysis reveals both the potential and
the flaws of existing resources (models and datasets) for developing
explainable moral judgment-making systems.
♻ ☆ Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models UAI 2024
For downstream applications of vision-language pre-trained models, there has
been significant interest in constructing effective prompts. Existing works on
prompt engineering, which either require laborious manual designs or optimize
the prompt tuning as a point estimation problem, may fail to describe diverse
characteristics of categories and limit their applications. We introduce a
Bayesian probabilistic treatment of prompt tuning, where the label-specific
stochastic prompts are generated hierarchically by first sampling a latent
vector from an underlying distribution and then employing a lightweight
generative model. Importantly, we semantically regularize the tuning process by
minimizing the statistical distance between the visual patches and linguistic
prompts, which pushes the stochastic label representations to faithfully
capture diverse visual concepts, instead of overfitting the training
categories. We evaluate the effectiveness of our approach on four tasks:
few-shot image recognition, base-to-new generalization, dataset transfer
learning, and domain shifts. Extensive results over 15 datasets show promising
transferability and generalization performance of our proposed model, both
quantitatively and qualitatively.
comment: Accepted by UAI 2024
♻ ☆ ViANLI: Adversarial Natural Language Inference for Vietnamese
The development of Natural Language Inference (NLI) datasets and models has
been inspired by innovations in annotation design. With the rapid development
of machine learning today, existing models have quickly reached
state-of-the-art results on a variety of tasks related to natural language
processing, including natural language inference. By using a pre-trained model
during the annotation process, it is possible to challenge current NLI models
by having humans produce premise-hypothesis combinations that the machine
model cannot correctly predict. To keep research on natural language inference
for Vietnamese attractive and challenging, in this paper we introduce an
adversarial NLI dataset, named ViANLI, to the NLP research community. This
dataset contains more than 10K premise-hypothesis pairs and is built through a
continuous adjustment process to get the most out of the patterns generated by
the annotators. The ViANLI dataset poses considerable difficulty for many
current SOTA models: the most powerful model reaches only 48.4% accuracy on
the test set. Additionally, the experimental results show that models trained
on our dataset significantly improve results on other Vietnamese NLI datasets.
♻ ☆ BeHonest: Benchmarking Honesty of Large Language Models
Previous works on Large Language Models (LLMs) have mainly focused on
evaluating their helpfulness or harmlessness. However, honesty, another crucial
alignment criterion, has received relatively less attention. Dishonest
behaviors in LLMs, such as spreading misinformation and defrauding users,
erode user trust and cause real-world harm, presenting severe risks that
intensify as these models approach superintelligence levels. Enhancing honesty
in LLMs addresses critical deficiencies and helps uncover latent capabilities
that are not readily expressed. This underscores the urgent need for reliable
methods and benchmarks to effectively ensure and evaluate the honesty of LLMs.
In this paper, we introduce BeHonest, a pioneering benchmark specifically
designed to assess honesty in LLMs comprehensively. BeHonest evaluates three
essential aspects of honesty: awareness of knowledge boundaries, avoidance of
deceit, and consistency in responses. Building on this foundation, we designed
10 scenarios to evaluate and analyze 9 popular LLMs on the market, including
both closed-source and open-source models from different model families with
varied model sizes. Our findings indicate that there is still significant room
for improvement in the honesty of LLMs. We also encourage the AI community to
prioritize honesty alignment in LLMs. Our benchmark and code can be found at:
\url{https://github.com/GAIR-NLP/BeHonest}.
♻ ☆ Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text
Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
Detecting text generated by modern large language models is thought to be
hard, as both LLMs and humans can exhibit a wide range of complex behaviors.
However, we find that a score based on contrasting two closely related language
models is highly accurate at separating human-generated and machine-generated
text. Based on this mechanism, we propose a novel LLM detector that only
requires simple calculations using a pair of pre-trained LLMs. The method,
called Binoculars, achieves state-of-the-art accuracy without any training
data. It is capable of spotting machine text from a range of modern LLMs
without any model-specific modifications. We comprehensively evaluate
Binoculars on a number of text sources and in varied situations. Over a wide
range of document types, Binoculars detects over 90% of generated samples from
ChatGPT (and other LLMs) at a false positive rate of 0.01%, despite not being
trained on any ChatGPT data.
comment: 20 pages, code available at https://github.com/ahans30/Binoculars
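The contrastive score lends itself to a compact implementation. Below is a minimal sketch assuming a HuggingFace observer/performer pair (the Falcon pairing and the exact normalization are assumptions based on common descriptions of the method, not the authors' code):

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Two closely related models: an "observer" and a "performer".
    observer = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
    performer = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")
    tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

    @torch.no_grad()
    def binoculars_score(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        logits_o = observer(ids).logits[0, :-1]   # observer next-token logits
        logits_p = performer(ids).logits[0, :-1]  # performer next-token logits
        targets = ids[0, 1:]

        # Log-perplexity of the text under the observer.
        log_ppl = F.cross_entropy(logits_o, targets).item()

        # Cross-perplexity: how surprising the observer's predictions are
        # when scored against the performer's next-token distribution.
        x_ppl = -(F.softmax(logits_p, dim=-1)
                  * F.log_softmax(logits_o, dim=-1)).sum(-1).mean().item()

        # Lower ratios indicate likely machine-generated text.
        return log_ppl / x_ppl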
♻ ☆ $R^3$-NL2GQL: A Model Coordination and Knowledge Graph Alignment Approach for NL2GQL
Yuhang Zhou, Yu He, Siyu Tian, Yuchen Ni, Zhangyue Yin, Xiang Liu, Chuanjun Ji, Sen Liu, Xipeng Qiu, Guangnan Ye, Hongfeng Chai
While current approaches to converting natural language to SQL (NL2SQL) using
Foundation Models have achieved impressive results, adapting these approaches
for converting natural language to Graph Query Language (NL2GQL) encounters
hurdles due to the distinct nature of GQL compared to SQL, alongside the
diverse forms of GQL. Moving away from traditional rule-based and slot-filling
methodologies, we introduce a novel approach, $R^3$-NL2GQL, integrating both
small and large Foundation Models for ranking, rewriting, and refining tasks.
This method leverages the interpretative strengths of smaller models for
initial ranking and rewriting stages, while capitalizing on the superior
generalization and query generation prowess of larger models for the final
transformation of natural language queries into GQL formats. Addressing the
scarcity of datasets in this emerging field, we have developed a bilingual
dataset, sourced from graph database manuals and selected open-source Knowledge
Graphs (KGs). Our evaluation of this methodology on this dataset demonstrates
its promising efficacy and robustness.
♻ ☆ SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking
Recent studies have shown that maintaining a consistent response style by
human experts and enhancing data quality in training sets can significantly
improve the performance of fine-tuned Large Language Models (LLMs) while
reducing the number of training examples needed. However, the precise
definition of style and the relationship between style, data quality, and LLM
performance remain unclear. This research decomposes response style into
presentation and composition styles and finds that, among training data of
similar quality, those with higher style consistency lead to better LLM
performance. Inspired by this, we introduce Style Consistency-Aware Response
Ranking (SCAR), which automatically prioritizes instruction-response pairs in
the training set based on their response stylistic consistency. By selecting
the most style-consistent examples, ranging from the top 25% to 0.7% of the
full dataset, the fine-tuned LLMs can match or even surpass the performance of
models trained on the entire dataset in coding and open-ended
question-answering benchmarks. Code and data are available at
https://github.com/zhuang-li/SCAR .
comment: 21 pages
♻ ☆ Rethinking LLM Memorization through the Lens of Adversarial Compression
Large language models (LLMs) trained on web-scale datasets raise substantial
concerns regarding permissible data usage. One major question is whether these
models "memorize" all their training data or whether they integrate many data
sources in a way more akin to how a human learns and synthesizes information.
The answer hinges, to a large degree, on how we define memorization. In this
work, we propose the Adversarial Compression Ratio (ACR) as a metric for
assessing memorization in LLMs. A given string from the training data is
considered memorized if it can be elicited by a prompt (much) shorter than the
string itself -- in other words, if these strings can be "compressed" with the
model by computing adversarial prompts of fewer tokens. The ACR overcomes the
limitations of existing notions of memorization by (i) offering an adversarial
view of measuring memorization, especially for monitoring unlearning and
compliance; and (ii) allowing for the flexibility to measure memorization for
arbitrary strings at reasonably low compute cost. Our definition serves as a
practical tool for determining when model owners may be violating terms around
data usage, providing a potential legal tool and a critical lens through which
to address such scenarios.
comment: https://locuslab.github.io/acr-memorization
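The metric itself is simple once an eliciting prompt has been found; here is a sketch under the abstract's definition (the prompt search, e.g. a GCG-style token optimization, is the hard part and is omitted, and the tokenizer choice is illustrative):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # tokenizer choice is illustrative

    def adversarial_compression_ratio(target: str, adversarial_prompt: str) -> float:
        """Token length of the training string divided by the token length of
        the shortest adversarial prompt found to elicit it."""
        n_target = len(tok(target).input_ids)
        n_prompt = len(tok(adversarial_prompt).input_ids)
        return n_target / n_prompt

    # A string counts as memorized when it is compressible, i.e. ACR > 1:
    # the eliciting prompt is shorter than the string it reproduces.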
♻ ☆ $Classi|Q\rangle$: Towards a Translation Framework To Bridge The Classical-Quantum Programming Gap
Matteo Esposito, Maryam Tavassoli Sabzevari, Boshuai Ye, Davide Falessi, Arif Ali Khan, Davide Taibi
Quantum computing, albeit readily available as hardware or emulated on the
cloud, remains far from generally accessible due to complex programming
paradigms and steep learning curves. This vision paper introduces
$Classi|Q\rangle$, a translation framework idea to bridge Classical and Quantum
Computing by translating high-level programming languages, e.g., Python or C++,
into a low-level language, e.g., Quantum Assembly. Our idea paper serves as a
blueprint for ongoing efforts in quantum software engineering, offering a
roadmap for further $Classi|Q\rangle$ development to meet the diverse needs of
researchers and practitioners. $Classi|Q\rangle$ is designed to empower
researchers and practitioners with no prior quantum experience to harness the
potential of hybrid quantum computation. We also discuss future enhancements to
$Classi|Q\rangle$, including support for additional quantum languages, improved
optimization strategies, and integration with emerging quantum computing
platforms.
♻ ☆ Efficient Prompt Tuning by Multi-Space Projection and Prompt Fusion
Prompt tuning is a promising method to fine-tune a pre-trained language model
without retraining its large-scale parameters. Instead, it attaches a soft
prompt to the input text, whereby downstream tasks can be well adapted by
merely learning the embeddings of prompt tokens. Nevertheless, existing methods
still suffer from two challenges: (i) they struggle to balance accuracy and
efficiency, since a longer (shorter) soft prompt generally leads to better
(worse) accuracy but at the cost of more (less) training time; and (ii) their
performance may not be consistent across different downstream tasks, which we
attribute to a single embedding space having to serve the differing
requirements of those tasks. To address these issues, we propose an Efficient Prompt
Tuning method (EPT) by multi-space projection and prompt fusion. Specifically,
it decomposes a given soft prompt into a shorter prompt and two low-rank
matrices, significantly reducing the training time. Accuracy is also enhanced
by leveraging the low-rank matrices as additional knowledge sources that
enrich the semantics of the shortened prompt. In addition, we
project the soft prompt into multiple subspaces to improve the performance
consistency, and then adaptively learn the combination weights of different
spaces through a gating network. Experiments on 13 natural language processing
downstream tasks show that our method significantly and consistently
outperforms 11 comparison methods, with relative improvements of up to 12.9%
and a 14% reduction in training time.
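As a concrete reading of this decomposition, here is a hedged PyTorch sketch (dimensions, initialization, and the gating wiring are assumptions based on the abstract, not the authors' exact architecture):

    import torch
    import torch.nn as nn

    class EPTPrompt(nn.Module):
        def __init__(self, short_len=20, dim=768, rank=8, n_spaces=4):
            super().__init__()
            # Shorter soft prompt standing in for a longer one.
            self.short_prompt = nn.Parameter(torch.randn(short_len, dim) * 0.02)
            # Two low-rank matrices providing the extra capacity.
            self.A = nn.Parameter(torch.randn(short_len, rank) * 0.02)
            self.B = nn.Parameter(torch.randn(rank, dim) * 0.02)
            # Multiple subspace projections mixed by a gating network.
            self.spaces = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_spaces))
            self.gate = nn.Linear(dim, n_spaces)

        def forward(self) -> torch.Tensor:
            # Enrich the short prompt with the low-rank correction.
            p = self.short_prompt + self.A @ self.B            # (short_len, dim)
            weights = torch.softmax(self.gate(p), dim=-1)      # (short_len, n_spaces)
            projected = torch.stack([s(p) for s in self.spaces], dim=-2)
            # Adaptively combine the subspace projections per prompt token.
            return (weights.unsqueeze(-1) * projected).sum(dim=-2)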
♻ ☆ Assessing Logical Reasoning Capabilities of Encoder-Only Transformer Models
Logical reasoning is central to complex human activities, such as thinking,
debating, and planning; it is also a central component of many AI systems. In
this paper, we investigate the extent to which encoder-only
transformer language models (LMs) can reason according to logical rules. We ask
whether those LMs can deduce theorems in propositional calculus and first-order
logic; if their relative success in these problems reflects general logical
capabilities; and which layers contribute the most to the task. First, we show
for several encoder-only LMs that they can be trained, to a reasonable degree,
to determine logical validity on various datasets. Next, by cross-probing
fine-tuned models on these datasets, we show that LMs have difficulty in
transferring their putative logical reasoning ability, which suggests that they
may have learned dataset-specific features, instead of a general capability.
Finally, we conduct a layerwise probing experiment, which shows that the
hypothesis classification task is mostly solved through higher layers.
♻ ☆ On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
As Large Language Models (LLMs) are increasingly being employed in real-world
applications in critical domains such as healthcare, it is important to ensure
that the Chain-of-Thought (CoT) reasoning generated by these models faithfully
captures their underlying behavior.
While LLMs are known to generate CoT reasoning that is appealing to humans,
prior studies have shown that these explanations do not accurately reflect the
actual behavior of the underlying LLMs. In this work, we investigate three
broad approaches commonly employed to steer the behavior of LLMs: in-context
learning, fine-tuning, and activation editing. For each, we introduce novel
strategies aimed at improving the faithfulness of the generated CoT reasoning.
We then carry out extensive
empirical analyses with multiple benchmark datasets to explore the promise of
these strategies. Our analyses indicate that these strategies offer limited
success in improving the faithfulness of the CoT reasoning, with only slight
performance enhancements in controlled scenarios. Activation editing
demonstrated minimal success, while fine-tuning and in-context learning
achieved marginal improvements that failed to generalize across diverse
reasoning and truthful question-answering benchmarks. In summary, our work
underscores the inherent difficulty in eliciting faithful CoT reasoning from
LLMs, suggesting that the current array of approaches may not be sufficient to
address this complex challenge.
♻ ☆ First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning
Language models can solve complex reasoning tasks better by learning to
generate rationales for their predictions. Often these models know how to solve
a task but their auto-regressive decoding nature leads to incorrect results if
they start incorrectly. We observe that smaller models in particular, when
their start is corrected, can solve tasks they would otherwise struggle with. We
demonstrate this phenomenon by using a larger model to guide smaller models,
which leads to significantly improved performance (up to +24 points on the
GSM8K dataset by 7B models). To help smaller models take the right first step,
we propose QuestCoT, where a smaller model first asks itself how
to start, before proceeding with a chain of reasoning. On various multistep
mathematical reasoning datasets over multiple smaller models, we show that
getting the right start can lead to significant performance gains across all
models (gains of up to +6 points on GSM8K, +9 on SVAMP, +5 on ASDiv, and +7 on
MultiArith).
♻ ☆ Model Generation with LLMs: From Requirements to UML Sequence Diagrams
Complementing natural language (NL) requirements with graphical models can
improve stakeholders' communication and provide directions for system design.
However, creating models from requirements involves manual effort. The advent
of generative large language models (LLMs), ChatGPT being a notable example,
offers promising avenues for automated assistance in model generation. This
paper investigates the capability of ChatGPT to generate a specific type of
model, i.e., UML sequence diagrams, from NL requirements. We conduct a
qualitative study in which we examine the sequence diagrams generated by
ChatGPT for 28 requirements documents of various types and from different
domains. Observations from the analysis of the generated diagrams were
systematically captured through evaluation logs and categorized through
thematic analysis. Our results indicate that, although the models generally
conform to the standard and exhibit a reasonable level of understandability,
their completeness and correctness with respect to the specified requirements
often present challenges. This issue is particularly pronounced in the presence
of requirements smells, such as ambiguity and inconsistency. The insights
derived from this study can influence the practical utilization of LLMs in the
RE process, and open the door to novel RE-specific prompting strategies
targeting effective model generation.
♻ ☆ Recovering the Pre-Fine-Tuning Weights of Generative Models ICML 2024
The dominant paradigm in generative modeling consists of two steps: i)
pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained
model with human values via fine-tuning. This practice is considered safe, as
no current method can recover the unsafe, pre-fine-tuning model weights. In
this paper, we demonstrate that this assumption is often false. Concretely, we
present Spectral DeTuning, a method that can recover the weights of the
pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In
contrast to previous attacks that attempt to recover pre-fine-tuning
capabilities, our method aims to recover the exact pre-fine-tuning weights. Our
approach exploits this new vulnerability against large-scale models such as a
personalized Stable Diffusion and an aligned Mistral.
comment: ICML 2024. Project page: https://vision.huji.ac.il/spectral_detuning/
♻ ☆ Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
Ensuring the verifiability of model answers is a fundamental challenge for
retrieval-augmented generation (RAG) in the question answering (QA) domain.
Recently, self-citation prompting was proposed to make large language models
(LLMs) generate citations to supporting documents along with their answers.
However, self-citing LLMs often struggle to match the required format, refer to
non-existent sources, and fail to faithfully reflect LLMs' context usage
throughout the generation. In this work, we present MIRAGE -- Model
Internals-based RAG Explanations -- a plug-and-play approach using model
internals for faithful answer attribution in RAG applications. MIRAGE detects
context-sensitive answer tokens and pairs them with retrieved documents
contributing to their prediction via saliency methods. We evaluate our proposed
approach on a multilingual extractive QA dataset, finding high agreement with
human answer attribution. On open-ended QA, MIRAGE achieves citation quality
and efficiency comparable to self-citation while also allowing for a
finer-grained control of attribution parameters. Our qualitative evaluation
highlights the faithfulness of MIRAGE's attributions and underscores the
promising application of model internals for RAG answer attribution.
comment: Under review. Code and data released at
https://github.com/Betswish/MIRAGE
♻ ☆ Paraphrase Types for Generation and Detection EMNLP 2023
Current approaches in paraphrase generation and detection heavily rely on a
single general similarity score, ignoring the intricate linguistic properties
of language. This paper introduces two new tasks to address this shortcoming by
considering paraphrase types - specific linguistic perturbations at particular
text positions. We name these tasks Paraphrase Type Generation and Paraphrase
Type Detection. Our results suggest that while current techniques perform well
in a binary classification scenario, i.e., paraphrased or not, the inclusion of
fine-grained paraphrase types poses a significant challenge. While most
approaches are good at generating and detecting generally semantically similar
content, they fail to understand the intrinsic linguistic variables they
manipulate. Models trained in generating and identifying paraphrase types also
show improvements in tasks without them. In addition, scaling these models
further improves their ability to understand paraphrase types. We believe
paraphrase types can unlock a new paradigm for developing paraphrase models and
solving tasks in the future.
comment: Published at EMNLP 2023
♻ ☆ We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields EMNLP 2023
Natural Language Processing (NLP) is poised to substantially influence the
world. However, significant progress comes hand-in-hand with substantial risks.
Addressing them requires broad engagement with various fields of study. Yet,
little empirical work examines the state of such engagement (past or current).
In this paper, we quantify the degree of influence between 23 fields of study
and NLP (on each other). We analyzed ~77k NLP papers, ~3.1m citations from NLP
papers to other papers, and ~1.8m citations from other papers to NLP papers. We
show that, unlike most fields, the cross-field engagement of NLP, measured by
our proposed Citation Field Diversity Index (CFDI), has declined from 0.58 in
1980 to 0.31 in 2022 (an all-time low). In addition, we find that NLP has grown
more insular -- citing increasingly more NLP papers and having fewer papers
that act as bridges between fields. NLP citations are dominated by computer
science; less than 8% of NLP citations are to linguistics, and less than 3% are
to math and psychology. These findings underscore NLP's urgent need to reflect
on its engagement with various fields.
comment: Published at EMNLP 2023
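The abstract does not spell out the CFDI formula; purely as an illustration, a standard Gini-Simpson diversity index over a field-level citation distribution behaves in the way described (dominance by one field drives the score down):

    def field_diversity(citation_counts: dict[str, int]) -> float:
        # Gini-Simpson diversity over citations grouped by field.
        # Illustrative only: the paper defines its own CFDI.
        total = sum(citation_counts.values())
        proportions = [c / total for c in citation_counts.values()]
        return 1.0 - sum(p * p for p in proportions)

    # CS-dominated citations yield a low diversity score.
    print(field_diversity({"computer science": 90, "linguistics": 7, "math": 3}))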
♻ ☆ The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research ACL 2023
Mohamed Abdalla, Jan Philip Wahle, Terry Ruas, Aurélie Névéol, Fanny Ducel, Saif M. Mohammad, Karën Fort
Recent advances in deep learning methods for natural language processing
(NLP) have created new business opportunities and made NLP research critical
for industry development. As industry is one of the big players in the field
of NLP, alongside governments and universities, it is important to track its
influence on research. In this study, we seek to quantify and
characterize industry presence in the NLP community over time. Using a corpus
with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP
publication authors, we explore the industry presence in the field since the
early 90s. We find that industry presence among NLP authors has been steady
before a steep increase over the past five years (180% growth from 2017 to
2022). A few companies account for most of the publications and provide funding
to academic researchers through grants and internships. Our study shows that
the presence and impact of the industry on natural language processing research
are significant and fast-growing. This work calls for increased transparency of
industry influence in the field.
comment: Published at ACL 2023
♻ ☆ Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As
Eden Avnat, Michal Levy, Daniel Herstain, Elia Yanko, Daniel Ben Joya, Michal Tzuchman Katz, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev, Raja-Elie E. Abdulnour
Clinical problem-solving requires processing of semantic medical knowledge
such as illness scripts and numerical medical knowledge of diagnostic tests for
evidence-based decision-making. As large language models (LLMs) show promising
results in many aspects of language-based clinical practice, their ability to
generate non-language evidence-based answers to clinical questions is
inherently limited by tokenization. Therefore, we evaluated LLMs' performance
on two question types: numeric (correlating findings) and semantic
(differentiating entities) while examining differences within and between LLMs
in medical aspects and comparing their performance to humans. To generate
straightforward multi-choice questions and answers (QAs) based on
evidence-based medicine (EBM), we used a comprehensive medical knowledge graph
(encompassing data from more than 50,000 peer-reviewed articles) and created the
"EBMQA". EBMQA contains 105,000 QAs labeled with medical and non-medical topics
and classified into numerical or semantic questions. We benchmarked this
dataset using more than 24,500 QAs on two state-of-the-art LLMs: Chat-GPT4 and
Claude3-Opus. We evaluated the LLMs' accuracy on semantic and numerical question
types and according to sub-labeled topics. For validation, six medical experts
were tested on 100 numerical EBMQA questions. We found that both LLMs excelled
more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical
QAs. However, both LLMs showed inter- and intra-model gaps across different
medical aspects and remained inferior to humans. Thus, their medical advice
should be treated with caution.
♻ ☆ ProTrix: Building Models for Planning and Reasoning over Tables with Sentence Context
Tables play a crucial role in conveying information in various domains. We
propose a Plan-then-Reason framework to answer different types of user queries
over tables with sentence context. The framework first plans the reasoning
paths over the context, then assigns each step to program-based or textual
reasoning to reach the final answer. This framework enhances the table
reasoning abilities for both in-context learning and fine-tuning methods.
GPT-3.5-Turbo following the Plan-then-Reason framework surpasses other
prompting baselines without self-consistency while using fewer API calls and
in-context demonstrations. We also construct an instruction tuning set,
TrixInstruct, to evaluate the effectiveness of fine-tuning with this framework.
We present the ProTrix model family, built by fine-tuning models on
TrixInstruct. Our experiments show that the ProTrix family generalizes to
diverse unseen tabular tasks with only 6k training instances. We further
demonstrate that ProTrix can generate accurate and faithful explanations to
answer complex free-form questions. Our work underscores the importance of
planning and reasoning abilities for building generalizable and interpretable
models for tabular tasks. We open-source our dataset and models at
https://github.com/WilliamZR/ProTrix.
♻ ☆ Climate Change from Large Language Models
Climate change poses grave challenges, demanding widespread understanding and
low-carbon lifestyle awareness. Large language models (LLMs) offer a powerful
tool to address this crisis, yet comprehensive evaluations of their
climate-crisis knowledge are lacking. This paper proposes an automated
evaluation framework to assess climate-crisis knowledge within LLMs. We adopt a
hybrid approach for data acquisition, combining data synthesis and manual
collection, to compile a diverse set of questions encompassing various aspects
of climate change. Utilizing prompt engineering based on the compiled
questions, we evaluate the model's knowledge by analyzing its generated
answers. Furthermore, we introduce a comprehensive set of metrics to assess
climate-crisis knowledge, encompassing indicators from 10 distinct
perspectives. These metrics provide a multifaceted evaluation, enabling a
nuanced understanding of the LLMs' climate crisis comprehension. The
experimental results demonstrate the efficacy of our proposed method. In our
evaluation utilizing diverse high-performing LLMs, we discovered that while
LLMs possess considerable climate-related knowledge, there are shortcomings in
terms of timeliness, indicating a need for continuous updating and refinement
of their climate-related content.
♻ ☆ Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts ACL2023
In the era of large language models, applying techniques such as Retrieval
Augmented Generation can better address Open-Domain Question-Answering
problems. Due to constraints including model sizes and computing resources, the
length of context is often limited, and it becomes challenging to empower the
model to cover overlong contexts while answering questions from open domains.
This paper proposes a general and convenient method for covering longer
contexts in Open-Domain Question-Answering tasks. It leverages a small encoder
language model that effectively encodes contexts, and applies cross-attention
between the encodings and the original inputs. With our method, the original
language model can cover several times longer contexts while keeping the
computing requirements close to the baseline. Our experiments demonstrate that
after fine-tuning, performance improves across two held-in datasets, four
held-out datasets, and two in-context learning settings.
comment: ACL2023 Findings
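A minimal sketch of the fusion step described above (the placement of the cross-attention, the dimensions, and the residual wiring are assumptions, not the paper's exact design):

    import torch
    import torch.nn as nn

    class ContextCrossAttention(nn.Module):
        def __init__(self, dim=768, n_heads=12):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

        def forward(self, hidden: torch.Tensor, ctx_vecs: torch.Tensor) -> torch.Tensor:
            # hidden:   (batch, q_len, dim)  states from the original input
            # ctx_vecs: (batch, c_len, dim)  overlong context, pre-encoded by
            #           a small encoder LM run over chunks beyond the window
            attended, _ = self.attn(query=hidden, key=ctx_vecs, value=ctx_vecs)
            return hidden + attended  # residual fusion of context information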
♻ ☆ Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models
Fine-tuning large language models on small, high-quality datasets can enhance
their performance on specific downstream tasks. Recent research shows that
fine-tuning on benign, instruction-following data can inadvertently undo the
safety alignment process and increase a model's propensity to comply with
harmful queries. Although critical, understanding and mitigating safety risks
in well-defined tasks remains distinct from the instruction-following context
due to structural differences in the data. Our work addresses the gap in our
understanding of these risks across diverse types of data in closed models --
where providers control how user data is utilized in the fine-tuning process.
We demonstrate how malicious actors can subtly manipulate the structure of
almost any task-specific dataset to foster significantly more dangerous model
behaviors, while maintaining an appearance of innocuity and reasonable
downstream task performance. To address this issue, we propose a novel
mitigation strategy that mixes in safety data which mimics the task format and
prompting style of the user data, showing this is more effective than existing
baselines at re-establishing safety alignment while maintaining similar task
performance.
♻ ☆ The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models
Bolei Ma, Xinpeng Wang, Tiancheng Hu, Anna-Carolina Haensch, Michael A. Hedderich, Barbara Plank, Frauke Kreuter
Recent advances in Large Language Models (LLMs) have sparked wide interest in
validating and comprehending the human-like cognitive-behavioral traits LLMs
may have. These cognitive-behavioral traits typically include Attitudes,
Opinions, and Values (AOV). However, measuring AOV embedded within LLMs remains
opaque, and different evaluation methods may yield different results. This has
led to a lack of clarity on how different studies are related to each other and
how they can be interpreted. This paper aims to bridge this gap by providing an
overview of recent works on the evaluation of AOV in LLMs. Moreover, we survey
related approaches in different stages of the evaluation pipeline in these
works. By doing so, we address the potential and challenges with respect to
understanding the model, human-AI alignment, and downstream application in
social sciences. Finally, we provide practical insights into evaluation
methods, model enhancement, and interdisciplinary collaboration, thereby
contributing to the evolving landscape of evaluating AOV in LLMs.
♻ ☆ CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay ICML'24
Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David W. Zhang, Michaël Defferrard, Taco Cohen
Large language models are increasingly solving tasks that are commonly
believed to require human-level reasoning ability. However, these models still
perform very poorly on benchmarks of general intelligence such as the
Abstraction and Reasoning Corpus (ARC). In this paper, we approach ARC as a
programming-by-examples problem, and introduce a novel and scalable method for
language model self-improvement called Code Iteration (CodeIt). Our method
iterates between 1) program sampling and hindsight relabeling, and 2) learning
from prioritized experience replay. By relabeling the goal of an episode (i.e.,
the target program output given input) to the realized output produced by the
sampled program, our method effectively deals with the extreme sparsity of
rewards in program synthesis. Applying CodeIt to the ARC dataset, we
demonstrate that prioritized hindsight replay, along with pre-training and
data-augmentation, leads to successful inter-task generalization. CodeIt is the
first neuro-symbolic approach that scales to the full ARC evaluation dataset.
Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art
performance and outperforming existing neural and symbolic baselines. Our code
is available at https://github.com/Qualcomm-AI-research/codeit .
comment: ICML'24 camera-ready version
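Hindsight relabeling is the key trick for the sparse rewards mentioned above; a schematic sketch with a hypothetical interpreter callable (not the authors' code):

    def hindsight_relabel(program: str, inputs: list, interpreter) -> dict:
        # Replace the episode's target outputs with whatever the sampled
        # program actually produced, turning a failed attempt into a valid
        # (inputs, realized outputs, program) training example.
        realized = [interpreter(program, x) for x in inputs]
        return {"inputs": inputs, "outputs": realized, "program": program}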
♻ ☆ CoCoST: Automatic Complex Code Generation with Online Searching and Correctness Testing
Large Language Models have revolutionized code generation ability by
converting natural language descriptions into executable code. However,
generating complex code within real-world scenarios remains challenging due to
intricate structures, subtle bugs, understanding of advanced data types, and a
lack of supplementary content. To address these challenges, we introduce the
CoCoST framework, which enhances complex code generation by online searching
for more information with planned queries and correctness testing for code
refinement. Moreover, CoCoST serializes the complex inputs and outputs to
improve comprehension and generates test cases to ensure adaptability to
real-world applications. CoCoST is validated through rigorous experiments on
the DS-1000 and ClassEval datasets. Experimental results show that CoCoST
substantially improves the quality of complex code generation, highlighting its
potential to enhance the practicality of LLMs in generating complex code.
♻ ☆ Exploring the Potential of Large Language Models in Computational Argumentation ACL 2024
Computational argumentation has become an essential tool in various domains,
including law, public policy, and artificial intelligence. It is an emerging
research field in natural language processing that attracts increasing
attention. Research on computational argumentation mainly involves two types of
tasks: argument mining and argument generation. As large language models (LLMs)
have demonstrated impressive capabilities in understanding context and
generating natural language, it is worthwhile to evaluate the performance of
LLMs on diverse computational argumentation tasks. This work aims to embark on
an assessment of LLMs, such as ChatGPT, Flan models, and LLaMA2 models, in both
zero-shot and few-shot settings. We organize existing tasks into six main
categories and standardize the format of fourteen openly available datasets. In
addition, we present a new benchmark dataset on counter speech generation that
aims to holistically evaluate the end-to-end performance of LLMs on argument
mining and argument generation. Extensive experiments show that LLMs exhibit
commendable performance across most of the datasets, demonstrating their
capabilities in the field of argumentation. Our analysis offers valuable
suggestions for evaluating computational argumentation and its integration with
LLMs in future research endeavors.
comment: Accepted at ACL 2024 Main
♻ ☆ Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations
Existing retrieval-based methods have made significant strides in maintaining
long-term conversations. However, these approaches face challenges in memory
database management and accurate memory retrieval, hindering their efficacy in
dynamic, real-world interactions. This study introduces a novel framework,
COmpressive Memory-Enhanced Dialogue sYstems (COMEDY), which eschews
traditional retrieval modules and memory databases. Instead, COMEDY adopts a
"One-for-All" approach, utilizing a single language model to manage memory
generation, compression, and response generation. Central to this framework is
the concept of compressive memory, which integrates session-specific
summaries, user-bot dynamics, and past events into a concise memory format. To
support COMEDY, we curated a large-scale Chinese instruction-tuning dataset,
Dolphin, derived from real user-chatbot interactions. Comparative evaluations
demonstrate COMEDY's superiority over traditional retrieval-based methods in
producing more nuanced and human-like conversational experiences. Our codes are
available at https://github.com/nuochenpku/COMEDY.
comment: 17 pages, 5 figures
♻ ☆ Textual Similarity as a Key Metric in Machine Translation Quality Estimation
Machine Translation (MT) Quality Estimation (QE) assesses translation
reliability without reference texts. This study introduces "textual similarity"
as a new metric for QE, using sentence transformers and cosine similarity to
measure semantic closeness. Analyzing data from the MLQE-PE dataset, we found
that textual similarity exhibits stronger correlations with human scores than
traditional metrics (HTER, model evaluation, sentence probability, etc.).
Employing Generalized Additive Mixed Models (GAMMs) as a statistical tool, we
demonstrated that textual similarity consistently outperforms other metrics
across multiple language pairs in predicting human scores. We also found that
HTER actually failed to predict human scores in QE. Our findings highlight the
effectiveness of textual
similarity as a robust QE metric, recommending its integration with other
metrics into QE frameworks and MT system training for improved accuracy and
usability.
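The metric as described maps onto a few lines with the sentence-transformers library; a minimal sketch, assuming the score is the cosine similarity between source and MT output under a multilingual encoder (the model choice is illustrative):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def textual_similarity(source: str, translation: str) -> float:
        # QE score as semantic closeness between source and MT output.
        emb = model.encode([source, translation], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()

    print(textual_similarity("Der Hund schläft.", "The dog is sleeping."))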
♻ ☆ Revealing User Familiarity Bias in Task-Oriented Dialogue via Interactive Evaluation ACL 2024
Most task-oriented dialogue (TOD) benchmarks assume users that know exactly
how to use the system by constraining the user behaviors within the system's
capabilities via strict user goals, namely "user familiarity" bias. This data
bias deepens when combined with data-driven TOD systems, as its effect cannot
be gauged with existing static evaluations. Hence, we conduct
an interactive user study to unveil how vulnerable TOD systems are against
realistic scenarios. In particular, we compare users with 1) detailed goal
instructions that conform to the system boundaries (closed-goal) and 2) vague
goal instructions that are often unsupported but realistic (open-goal). Our
study reveals that conversations in open-goal settings lead to catastrophic
failures of the system, in which 92% of the dialogues had significant issues.
Moreover, we conduct a thorough analysis to identify distinctive features
between the two settings through error annotation. From this, we discover a
novel "pretending" behavior, in which the system pretends to handle the user
requests even though they are beyond the system's capabilities. We discuss its
characteristics and toxicity while showing recent large language models can
also suffer from this behavior.
comment: NLP4ConvAI Workshop at ACL 2024
♻ ☆ GraphWiz: An Instruction-Following Language Model for Graph Problems
Large language models (LLMs) have achieved impressive success across several
fields, but their proficiency in understanding and resolving complex graph
problems is less explored. To bridge this gap, we introduce GraphInstruct, a
novel and comprehensive instruction-tuning dataset designed to equip language
models with the ability to tackle a broad spectrum of graph problems using
explicit reasoning paths. Utilizing GraphInstruct, we build GraphWiz, an
open-source language model capable of resolving various graph problem types
while generating clear reasoning processes. To enhance the model's capability
and reliability, we incorporate the Direct Preference Optimization (DPO)
framework into the graph problem-solving context. The enhanced model,
GraphWiz-DPO, achieves an average accuracy of 65% across nine tasks with
different complexity levels, surpassing GPT-4 which has an average accuracy of
43.8%. Moreover, our research delves into the delicate balance between training
data volume and model performance, highlighting the potential for overfitting
with increased data. We also explore the transferability of the model's
reasoning ability across different graph tasks, indicating the model's
adaptability and practical application potential. Our investigation offers a
new blueprint and valuable insights for developing LLMs specialized in graph
reasoning and problem-solving.
comment: 27 pages, 15 tables
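The DPO component follows the standard preference objective (Rafailov et al.); here is a sketch of that loss as it would apply to chosen and rejected reasoning paths (toy values, not the authors' training code):

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
        # Summed log-probabilities of the chosen (w) and rejected (l)
        # reasoning paths under the policy and a frozen reference model.
        chosen = beta * (logp_w - ref_logp_w)
        rejected = beta * (logp_l - ref_logp_l)
        return -F.logsigmoid(chosen - rejected).mean()

    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                    torch.tensor([-13.0]), torch.tensor([-14.2]))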
♻ ☆ How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
Work on instruction-tuned Large Language Models (LLMs) has used automatic
methods based on text overlap and LLM judgments as cost-effective alternatives
to human evaluation. In this paper, we perform a meta-evaluation of such
methods and assess their reliability across a broad range of tasks. We observe
that while automatic evaluation methods can approximate human ratings under
specific conditions, their validity is highly context-dependent. Specifically,
the simple ROUGE-L metric correlates well with human ratings for short-answer
English tasks but is unreliable in free-form generation tasks and cross-lingual
transfer. The effectiveness of the more advanced method of using GPT-4 as a
judge diminishes significantly if reference answers are not included in the
prompt, which is the scenario where this method has the potential to provide
the most value compared to other metrics. Our findings enhance the
understanding of how automatic methods should be applied and interpreted when
developing and evaluating instruction-tuned LLMs.
♻ ☆ Is one brick enough to break the wall of spoken dialogue state tracking?
In Task-Oriented Dialogue (TOD) systems, correctly updating the system's
understanding of the user's requests (\textit{a.k.a} dialogue state tracking)
is key to a smooth interaction. Traditionally, TOD systems perform this update
in three steps: transcription of the user's utterance, semantic extraction of
the key concepts, and contextualization with the previously identified
concepts. Such cascade approaches suffer from cascading errors and separate
optimization. End-to-End approaches have been proven helpful up to the
turn-level semantic extraction step. This paper goes one step further and
provides (1) a novel approach for completely neural spoken DST, (2) an
in-depth comparison with a state-of-the-art cascade approach, and (3) avenues
towards better context propagation. Our study highlights that jointly-optimized
approaches are also competitive for contextually dependent tasks, such as
Dialogue State Tracking (DST), especially in audio-native settings. Context
propagation in DST systems could benefit from training procedures that account
for the inherent uncertainty of the previous context.
♻ ☆ Evaluating Copyright Takedown Methods for Language Models
Boyi Wei, Weijia Shi, Yangsibo Huang, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, Peter Henderson
Language models (LMs) derive their capabilities from extensive training on
diverse data, including potentially copyrighted material. These models can
memorize and generate content similar to their training data, posing potential
concerns. Therefore, model creators are motivated to develop mitigation methods
that prevent generating protected content. We term this procedure copyright
takedowns for LMs, noting the conceptual similarity to (but legal distinction
from) the DMCA takedown. This paper introduces the first evaluation of the
feasibility and side effects of copyright takedowns for LMs. We propose
CoTaEval, an evaluation framework to assess the effectiveness of copyright
takedown methods, the impact on the model's ability to retain uncopyrightable
factual knowledge from the training data whose recitation is embargoed, and how
well the model maintains its general utility and efficiency. We examine several
strategies, including adding system prompts, decoding-time filtering
interventions, and unlearning approaches. Our findings indicate that no tested
method excels across all metrics, showing significant room for research in this
unique problem setting and indicating potential unresolved challenges for live
policy proposals.
comment: 31 pages, 9 figures, 14 tables
♻ ☆ Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
Large language models (LLMs) show inherent brittleness in their safety
mechanisms, as evidenced by their susceptibility to jailbreaking and even
non-malicious fine-tuning. This study explores this brittleness of safety
alignment by leveraging pruning and low-rank modifications. We develop methods
to identify critical regions that are vital for safety guardrails, and that are
disentangled from utility-relevant regions at both the neuron and rank levels.
Surprisingly, the isolated regions we find are sparse, comprising about $3\%$
at the parameter level and $2.5\%$ at the rank level. Removing these regions
compromises safety without significantly impacting utility, corroborating the
inherent brittleness of the model's safety mechanisms. Moreover, we show that
LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications
to the safety-critical regions are restricted. These findings underscore the
urgent need for more robust safety strategies in LLMs.
comment: 22 pages, 9 figures. Project page is available at
https://boyiwei.com/alignment-attribution/
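As a rough illustration of isolating safety-critical, utility-disentangled regions, the sketch below scores each neuron's importance on safety data and on utility data, then keeps neurons important for the former but not the latter. The SNIP-style |weight x gradient| scores and the top-k thresholds are assumptions for illustration, not necessarily the paper's exact attribution method:

```python
import numpy as np

def top_k_mask(scores: np.ndarray, frac: float) -> np.ndarray:
    # Boolean mask selecting the top `frac` fraction of scores.
    k = max(1, int(frac * scores.size))
    cutoff = np.partition(scores, -k)[-k]
    return scores >= cutoff

rng = np.random.default_rng(0)
safety_scores = rng.random(10_000)   # stand-in for |w * dL_safety/dw| per neuron
utility_scores = rng.random(10_000)  # stand-in for |w * dL_utility/dw| per neuron

# Safety-critical but utility-disentangled region (~3% of parameters
# in the paper's findings).
critical = top_k_mask(safety_scores, 0.03) & ~top_k_mask(utility_scores, 0.03)
print(f"isolated {critical.mean():.2%} of neurons")
```

Pruning the neurons selected by `critical` would then test whether safety degrades while utility is preserved, mirroring the paper's ablation logic.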
♻ ☆ Benchmarking Mental State Representations in Language Models ICML 2024
While numerous works have assessed the generative performance of language
models (LMs) on tasks requiring Theory of Mind reasoning, research into the
models' internal representation of mental states remains limited. Recent work
has used probing to demonstrate that LMs can represent beliefs of themselves
and others. However, these claims are accompanied by limited evaluation, making
it difficult to assess how mental state representations are affected by model
design and training choices. We report an extensive benchmark spanning various
LM types, model sizes, fine-tuning approaches, and prompt designs to study the
robustness of mental state representations and memorisation issues
within the probes. Our results show that the quality of models' internal
representations of the beliefs of others increases with model size and, more
crucially, with fine-tuning. We are the first to study how prompt variations
impact probing performance on Theory of Mind tasks. We demonstrate that models'
representations are sensitive to prompt variations, even when such variations
should be beneficial. Finally, we complement previous activation editing
experiments on Theory of Mind tasks and show that it is possible to improve
models' reasoning performance by steering their activations without the need to
train any probe.
comment: ICML 2024 Workshop on Mechanistic Interpretability
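The probing methodology the abstract relies on reduces to fitting a simple classifier on frozen hidden activations. Here is a minimal sketch with synthetic features standing in for an LM layer's activations (real experiments would extract activations for belief-labelled Theory of Mind passages; everything here is a placeholder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Linear probe: predict a binary belief label ("does the protagonist
# believe X?") from hidden-state features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 768))   # stand-in for layer activations
w_true = rng.normal(size=768)
y = (X @ w_true > 0).astype(int)    # synthetic belief labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```

The memorisation concern the abstract raises is that a probe with enough capacity can fit spurious regularities in the probing set itself, which is why the benchmark varies prompts and controls rather than reporting a single probe accuracy.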
♻ ☆ SeaLLMs -- Large Language Models for Southeast Asia ACL 2024
Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, Lidong Bing
Despite the remarkable achievements of large language models (LLMs) in
various tasks, there remains a linguistic bias that favors high-resource
languages, such as English, often at the expense of low-resource and regional
languages. To address this imbalance, we introduce SeaLLMs, an innovative
series of language models that specifically focuses on Southeast Asian (SEA)
languages. SeaLLMs are built upon the Llama-2 model and further advanced
through continued pre-training with an extended vocabulary, specialized
instruction and alignment tuning to better capture the intricacies of regional
languages. This allows them to respect and reflect local cultural norms,
customs, stylistic preferences, and legal considerations. Our comprehensive
evaluation demonstrates that SeaLLM-13b models exhibit superior performance
across a wide spectrum of linguistic tasks and assistant-style
instruction-following capabilities relative to comparable open-source models.
Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai,
Khmer, Lao, and Burmese, by large margins while remaining lightweight and
cost-effective to operate.
comment: Technical report, ACL 2024 DEMO TRACK
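The "extended vocabulary" step has a standard mechanical form: add subword tokens for underrepresented scripts, then resize the embedding matrix before continued pre-training. A rough sketch follows; the token list and checkpoint name are placeholders (the Llama-2 checkpoint is gated, and SeaLLMs' actual added vocabulary is far larger):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Example Khmer/Lao/Burmese pieces; illustrative only.
new_tokens = ["ភាសា", "ພາສາ", "ဘာသာ"]
num_added = tokenizer.add_tokens(new_tokens)

# The new embedding rows start untrained and are then learned during
# continued pre-training on SEA-language corpora.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size now {len(tokenizer)}")
```

Extending the vocabulary this way reduces the token count per sentence in non-Latin scripts, which is one reason such models can be both more accurate and cheaper to run on those languages.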
♻ ☆ RouteLLM: Learning to Route LLMs with Preference Data
Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica
Large language models (LLMs) exhibit impressive capabilities across a wide
range of tasks, yet the choice of which model to use often involves a trade-off
between performance and cost. More powerful models, though effective, come with
higher expenses, while less capable models are more cost-effective. To address
this dilemma, we propose several efficient router models that dynamically
select between a stronger and a weaker LLM during inference, aiming to optimize
the balance between cost and response quality. We develop a training framework
for these routers leveraging human preference data and data augmentation
techniques to enhance performance. Our evaluation on widely-recognized
benchmarks shows that our approach significantly reduces costs (by over 2x in
certain cases) without compromising the quality of responses. Interestingly,
our router models also demonstrate significant transfer learning capabilities,
maintaining their performance even when the strong and weak models are changed
at test time. This highlights the potential of these routers to provide a
cost-effective yet high-performance solution for deploying LLMs.
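Stripped to its core, this kind of router is a calibrated classifier plus a threshold. The sketch below is an editorial illustration of that control flow only; the heuristic scoring function stands in for the preference-trained routers the paper actually learns:

```python
# Toy router: estimate the probability that the cheap model suffices,
# and escalate to the expensive model only when it likely does not.

def weak_suffices_prob(query: str) -> float:
    # Placeholder heuristic; a trained router (e.g., a classifier fit
    # on human preference labels) would go here.
    return 1.0 if len(query.split()) < 20 else 0.3

def route(query: str, threshold: float = 0.5) -> str:
    return "weak-llm" if weak_suffices_prob(query) >= threshold else "strong-llm"

print(route("What is 2 + 2?"))                       # weak-llm
print(route(" ".join(["word"] * 40) + " prove it"))  # strong-llm
```

Sweeping `threshold` traces out the cost-quality trade-off curve, which is how a deployment would pick an operating point; the transfer result in the abstract says the learned scores remain useful even when the two endpoint models are swapped.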
♻ ☆ KoLA: Carefully Benchmarking World Knowledge of Large Language Models ICLR 2024
Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, Juanzi Li
The unprecedented performance of large language models (LLMs) necessitates
improvements in evaluations. Rather than merely exploring the breadth of LLM
abilities, we believe meticulous and thoughtful designs are essential to
thorough, unbiased, and applicable evaluations. Given the importance of world
knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark
(KoLA), in which we carefully design three crucial factors: (1) For
\textbf{ability modeling}, we mimic human cognition to form a four-level
taxonomy of knowledge-related abilities, covering $19$ tasks. (2) For
\textbf{data}, to ensure fair comparisons, we use both Wikipedia, a corpus on
which LLMs are prevalently pre-trained, and continuously collected emerging
corpora, aiming to evaluate the capacity to handle unseen data and evolving
knowledge. (3) For \textbf{evaluation criteria}, we adopt a contrastive system,
including overall standard scores for better numerical comparability across
tasks and models, and a unique self-contrast metric for automatically evaluating
knowledge-creating ability. We evaluate $28$ open-source and commercial LLMs
and obtain some intriguing findings. The KoLA dataset and open-participation
leaderboard are publicly released at https://kola.xlore.cn and will be
continuously updated to provide references for developing LLMs and
knowledge-related systems.
comment: Accepted by ICLR 2024
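The "standard scores" idea is the familiar per-task standardization that makes heterogeneous metrics averageable. A generic sketch (KoLA's exact standardization may differ; the numbers are made up):

```python
import numpy as np

# Z-normalize each task's raw scores across models so tasks with
# different scales are comparable, then average per model.
raw = np.array([
    [0.82, 61.0, 0.40],   # model A: three tasks on different scales
    [0.75, 70.0, 0.55],   # model B
    [0.90, 58.0, 0.35],   # model C
])
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # standardize per task (column)
overall = z.mean(axis=1)                        # comparable overall score
print(overall.round(2))
```

Without this step, a task scored in percentage points would dominate one scored on a 0-1 scale, defeating cross-task comparison.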
♻ ☆ WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models ACL 2024
To mitigate the potential misuse of large language models (LLMs), recent
research has developed watermarking algorithms, which restrict the generation
process to leave an invisible trace for watermark detection. Due to the
two-stage nature of the task, most studies evaluate the generation and
detection separately, thereby presenting a challenge in unbiased, thorough, and
applicable evaluations. In this paper, we introduce WaterBench, the first
comprehensive benchmark for LLM watermarks, in which we design three crucial
factors: (1) For benchmarking procedure, to ensure an apples-to-apples
comparison, we first adjust each watermarking method's hyper-parameter to reach
the same watermarking strength, then jointly evaluate their generation and
detection performance. (2) For task selection, we diversify the input and
output length to form a five-category taxonomy, covering $9$ tasks. (3) For
evaluation metric, we adopt the GPT4-Judge for automatically evaluating the
decline of instruction-following abilities after watermarking. We evaluate $4$
open-source watermarks on $2$ LLMs under $2$ watermarking strengths, and
observe that current methods commonly struggle to maintain generation quality.
The code and data are available at https://github.com/THU-KEG/WaterBench.
comment: 26 pages, 7 figures, accepted by ACL 2024
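For orientation, the detection side of a common "green list" watermark (one of the scheme families such benchmarks cover) reduces to a one-proportion z-test. The sketch below is generic, not WaterBench's code, and the numbers are illustrative:

```python
import math

# Count how many generated tokens fall in the green list and compare
# against the null hypothesis that tokens land there at base rate gamma.
def detection_z(green_hits: int, total_tokens: int, gamma: float = 0.25) -> float:
    expected = gamma * total_tokens
    return (green_hits - expected) / math.sqrt(total_tokens * gamma * (1 - gamma))

# Raising the watermark strength (e.g., the logit bias) raises the hit
# rate; WaterBench equalizes strength across methods before comparing.
print(f"z = {detection_z(green_hits=90, total_tokens=200):.2f}")  # ~6.5
```

Equalizing strength first is what makes the subsequent generation-quality comparison apples-to-apples: a method that detects well only because it biases generation harder would otherwise look artificially strong.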
♻ ☆ Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration ACL 2024
Despite efforts to expand the knowledge of large language models (LLMs),
knowledge gaps -- missing or outdated information in LLMs -- might always
persist given the evolving nature of knowledge. In this work, we study
approaches to identify LLM knowledge gaps and abstain from answering questions
when knowledge gaps are present. We first adapt existing approaches to model
calibration or adaptation through fine-tuning/prompting and analyze their
ability to abstain from generating low-confidence outputs. Motivated by their
failures in self-reflection and over-reliance on held-out sets, we propose two
novel approaches that are based on model collaboration, i.e., LLMs probing
other LLMs for knowledge gaps, either cooperatively or competitively. Extensive
experiments with three LLMs on four QA tasks featuring diverse knowledge
domains demonstrate that both cooperative and competitive approaches to
unveiling LLM knowledge gaps achieve up to 19.3% improvements on abstain
accuracy against the strongest baseline. Further analysis reveals that our
proposed mechanisms could help identify failure cases in retrieval augmentation
and pinpoint knowledge gaps in multi-hop reasoning.
comment: ACL 2024
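The cooperative variant of multi-LLM probing has a simple skeleton: collect judgments from other models and abstain when agreement is low. The sketch below is a schematic editorial example; the reviewer calls are stubbed out, and the agreement threshold is an assumption rather than the paper's exact protocol:

```python
from collections.abc import Callable

# Abstain when reviewer LLMs cannot agree that the proposed answer is
# correct, treating disagreement as evidence of a knowledge gap.
def abstain_or_answer(
    question: str,
    answer: str,
    reviewers: list[Callable[[str, str], bool]],
    min_agreement: float = 0.67,
) -> str:
    votes = [judge(question, answer) for judge in reviewers]
    agreement = sum(votes) / len(votes)
    return answer if agreement >= min_agreement else "[abstain: likely knowledge gap]"

# Toy reviewers standing in for probing LLMs; 2/3 agreement falls just
# below the threshold, so the system abstains.
reviewers = [lambda q, a: True, lambda q, a: True, lambda q, a: False]
print(abstain_or_answer("Who won the 2030 World Cup?", "France", reviewers))
```

The competitive variant the abstract mentions differs in the probing interaction (models challenge rather than assist each other), but the final abstention decision can take the same thresholded form.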