Computation and Language 73
☆ Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
Proprietary LMs such as GPT-4 are often employed to assess the quality of
responses from various LMs. However, concerns including transparency,
controllability, and affordability strongly motivate the development of
open-source LMs specialized in evaluation. Yet existing open evaluator LMs
exhibit critical shortcomings: 1) they issue scores that
significantly diverge from those assigned by humans, and 2) they lack the
flexibility to perform both direct assessment and pairwise ranking, the two
most prevalent forms of assessment. Additionally, they do not possess the
ability to evaluate based on custom evaluation criteria, focusing instead on
general attributes like helpfulness and harmlessness. To address these issues,
we introduce Prometheus 2, a more powerful evaluator LM than its predecessor
that closely mirrors human and GPT-4 judgements. Moreover, it can process
both direct assessment and pairwise ranking formats paired with user-defined
evaluation criteria. On four direct assessment benchmarks and four
pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and
agreement with humans and proprietary LM judges among all tested open evaluator
LMs. Our models, code, and data are all publicly available at
https://github.com/prometheus-eval/prometheus-eval.
comment: Work in Progress
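To make the two evaluation formats concrete, the sketch below shows how direct-assessment and pairwise-ranking prompts with a user-defined rubric might be composed. The templates and field names are illustrative assumptions, not the actual prometheus-eval formats.

```python
# Hypothetical prompt templates for the two evaluation formats; the actual
# prometheus-eval templates differ. Illustrative sketch only.

DIRECT_TEMPLATE = """You are an impartial evaluator.
Evaluation criteria: {rubric}
Instruction: {instruction}
Response: {response}
Give feedback, then a score from 1 to 5 as "[SCORE] n"."""

PAIRWISE_TEMPLATE = """You are an impartial evaluator.
Evaluation criteria: {rubric}
Instruction: {instruction}
Response A: {response_a}
Response B: {response_b}
Give feedback, then the better response as "[WINNER] A" or "[WINNER] B"."""

def build_prompt(mode: str, **fields) -> str:
    """Compose an evaluator prompt for 'direct' or 'pairwise' mode."""
    template = DIRECT_TEMPLATE if mode == "direct" else PAIRWISE_TEMPLATE
    return template.format(**fields)

prompt = build_prompt(
    "direct",
    rubric="Is the answer factually consistent with the source?",
    instruction="Summarize the passage.",
    response="The passage states that ...",
)
```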
☆ FLAME: Factuality-Aware Alignment for Large Language Models
Alignment is a standard procedure to fine-tune pre-trained large language
models (LLMs) to follow natural language instructions and serve as helpful AI
assistants. We have observed, however, that the conventional alignment process
fails to enhance the factual accuracy of LLMs, and often leads to the
generation of more false facts (i.e. hallucination). In this paper, we study
how to make the LLM alignment process more factual, by first identifying
factors that lead to hallucination in both alignment steps: supervised
fine-tuning (SFT) and reinforcement learning (RL). In particular, we find that
training the LLM on new knowledge or unfamiliar texts can encourage
hallucination. This makes SFT less factual as it trains on human labeled data
that may be novel to the LLM. Furthermore, reward functions used in standard RL
can also encourage hallucination, because they guide the LLM to provide more
helpful responses across a diverse set of instructions, often preferring longer and
more detailed responses. Based on these observations, we propose
factuality-aware alignment, comprised of factuality-aware SFT and
factuality-aware RL through direct preference optimization. Experiments show
that our proposed factuality-aware alignment guides LLMs to output more factual
responses while maintaining instruction-following capability.
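The factuality-aware RL described here builds on direct preference optimization. For reference, below is a minimal sketch of the standard DPO objective, assuming sequence log-probabilities are precomputed; FLAME's factuality-aware variant is not reproduced.

```python
# Generic DPO loss, not FLAME's factuality-aware variant. Inputs are batches
# of summed sequence log-probabilities under the policy and a frozen
# reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(logits).mean()
```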
☆ D2PO: Discriminator-Guided DPO with Response Evaluation Models
Varied approaches for aligning language models have been proposed, including
supervised fine-tuning, RLHF, and direct optimization methods such as DPO.
Although DPO has rapidly gained popularity due to its straightforward training
process and competitive results, there is an open question of whether there
remain practical advantages of using a discriminator, like a reward model, to
evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for
the online setting where preferences are being collected throughout learning.
As we collect gold preferences, we use these not only to train our policy, but
to train a discriminative response evaluation model to silver-label even more
synthetic data for policy training. We explore this approach across a set of
diverse tasks, including a realistic chat setting, and find that our approach
leads to higher-quality outputs than DPO with the same data budget, and
greater efficiency in terms of preference data requirements. Furthermore, we
show conditions under which silver labeling is most helpful: it is most
effective when training the policy with DPO, outperforming traditional PPO, and
benefits from maintaining a separate discriminator from the policy model.
comment: 20 pages, 12 figures
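A hedged sketch of the online loop as the abstract describes it: gold preferences train both the policy and a discriminator, and the discriminator silver-labels the remaining pairs. All callables are hypothetical stand-ins for the paper's components.

```python
# Illustrative D2PO-style round under stated assumptions; the sampling,
# labeling, and update functions are supplied by the caller.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # two candidate responses to the same prompt
Pref = Tuple[str, str]  # (chosen, rejected)

def d2po_round(sample_pair: Callable[[str], Pair],
               human_label: Callable[[Pair], Pref],
               discriminator_label: Callable[[Pair], Pref],
               dpo_update: Callable[[List[Pref]], None],
               prompts: List[str], gold_budget: int) -> None:
    """One online round: gold labels train the discriminator (assumed done
    by the caller), which then silver-labels the rest of the pairs to
    stretch the preference-data budget."""
    pairs = [sample_pair(p) for p in prompts]
    gold = [human_label(pair) for pair in pairs[:gold_budget]]
    silver = [discriminator_label(pair) for pair in pairs[gold_budget:]]
    dpo_update(gold + silver)
```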
☆ Analyzing the Role of Semantic Representations in the Era of Large Language Models NAACL 2024
Zhijing Jin, Yuen Chen, Fernando Gonzalez, Jiarui Liu, Jiayi Zhang, Julian Michael, Bernhard Schölkopf, Mona Diab
Traditionally, natural language processing (NLP) models often use a rich set
of features created by linguistic expertise, such as semantic representations.
However, in the era of large language models (LLMs), more and more tasks are
turned into generic, end-to-end sequence generation problems. In this paper, we
investigate the question: what is the role of semantic representations in the
era of LLMs? Specifically, we investigate the effect of Abstract Meaning
Representation (AMR) across five diverse NLP tasks. We propose an AMR-driven
chain-of-thought prompting method, which we call AMRCoT, and find that it
generally hurts performance more than it helps. To investigate what AMR may
have to offer on these tasks, we conduct a series of analysis experiments. We
find that it is difficult to predict which input examples AMR may help or hurt
on, but errors tend to arise with multi-word expressions, named entities, and
in the final inference step where the LLM must connect its reasoning over the
AMR to its prediction. We recommend focusing on these areas for future work in
semantic representations for LLMs. Our code:
https://github.com/causalNLP/amr_llm.
comment: NAACL 2024
☆ Controllable Text Generation in the Instruction-Tuning Era
While most research on controllable text generation has focused on steering
base Language Models, the emerging instruction-tuning and prompting paradigm
offers an alternate approach to controllability. We compile and release
ConGenBench, a testbed of 17 different controllable generation tasks, using a
subset of it to benchmark the performance of 9 different baselines and methods
on Instruction-tuned Language Models. To our surprise, we find that
prompting-based approaches outperform controllable text generation methods on
most datasets and tasks, highlighting a need for research on controllable text
generation with Instruction-tuned Language Models in particular. Prompt-based
approaches match human performance on most stylistic tasks while lagging on
structural tasks, foregrounding a need to study more varied constraints and
more challenging stylistic tasks. To facilitate such research, we provide an
algorithm that uses only a task dataset and a Large Language Model with
in-context capabilities to automatically generate a constraint dataset. This
method eliminates the field's dependence on pre-curated constraint datasets,
hence vastly expanding the range of constraints that can be studied in the
future.
☆ MANTIS: Interleaved Multi-Image Instruction Tuning
Recent years have witnessed a great array of large multimodal models (LMMs)
that effectively solve single-image vision-language tasks. However, their
ability to solve multi-image vision-language tasks remains limited. Existing
multi-image LMMs (e.g. OpenFlamingo, Emu, Idefics, etc.) mostly gain their
multi-image ability through pre-training on hundreds of millions of noisy
interleaved image-text pairs from the web, which is neither efficient nor
effective.
In this paper, we aim at building strong multi-image LMMs via instruction
tuning with academic-level resources. Therefore, we meticulously construct
Mantis-Instruct containing 721K instances from 14 multi-image datasets. We
design Mantis-Instruct to cover different multi-image skills such as
co-reference, reasoning, comparison, and temporal understanding. We combine
Mantis-Instruct with
several single-image visual-language datasets to train our model Mantis to
handle any interleaved image-text inputs. We evaluate the trained Mantis on
five multi-image benchmarks and eight single-image benchmarks. Though only
requiring academic-level resources (i.e. 36 hours on 16xA100-40G), Mantis-8B
can achieve state-of-the-art performance on all the multi-image benchmarks and
beats the existing best multi-image LMM Idefics2-8B by an average of 9 absolute
points. We observe that Mantis performs equally well on the held-in and
held-out evaluation benchmarks. We further evaluate Mantis on single-image
benchmarks and demonstrate that Mantis can maintain a strong single-image
performance on par with CogVLM and Emu2. Our results are particularly
encouraging as they show that low-cost instruction tuning is indeed much more
effective than intensive pre-training for building multi-image LMMs.
comment: 9 pages, 3 figures
☆ NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, Markel Sanz Ausin, Ashwath Aithal, Oleksii Kuchaiev
Aligning Large Language Models (LLMs) with human values and preferences is
essential for making them helpful and safe. However, building efficient tools
to perform alignment can be challenging, especially for the largest and most
competent LLMs which often contain tens or hundreds of billions of parameters.
We create NeMo-Aligner, a toolkit for model alignment that can efficiently
scale to using hundreds of GPUs for training. NeMo-Aligner comes with highly
optimized and scalable implementations for major paradigms of model alignment
such as: Reinforcement Learning from Human Feedback (RLHF), Direct Preference
Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally,
our toolkit supports running most of the alignment techniques in a Parameter
Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for
extensibility, allowing support for other alignment techniques with minimal
effort. It is open-sourced under the Apache 2.0 License, and we invite
community contributions at https://github.com/NVIDIA/NeMo-Aligner
comment: 13 pages, 4 figures
☆ V-FLUTE: Visual Figurative Language Understanding with Textual Explanations
Large Vision-Language models (VLMs) have demonstrated strong reasoning
capabilities in tasks requiring a fine-grained understanding of literal images
and text, such as visual question-answering or visual entailment. However,
there has been little exploration of these models' capabilities when presented
with images and captions containing figurative phenomena such as metaphors or
humor, the meaning of which is often implicit. To close this gap, we propose a
new task and a high-quality dataset: Visual Figurative Language Understanding
with Textual Explanations (V-FLUTE). We frame the visual figurative language
understanding problem as an explainable visual entailment task, where the model
has to predict whether the image (premise) entails a claim (hypothesis) and
justify the predicted label with a textual explanation. Using a human-AI
collaboration framework, we build a high-quality dataset, V-FLUTE, that
contains 6,027 instances spanning five
diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm,
and humor. The figurative phenomena can be present either in the image, the
caption, or both. We further conduct both automatic and human evaluations to
assess current VLMs' capabilities in understanding figurative phenomena.
☆ WildChat: 1M ChatGPT Interaction Logs in the Wild ICLR 2024
Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite
their widespread use, there remains a lack of public datasets showcasing how
these tools are used by a population of users in practice. To bridge this gap,
we offered free access to ChatGPT for online users in exchange for their
affirmative, consensual opt-in to anonymously collect their chat transcripts
and request headers. From this, we compiled WildChat, a corpus of 1 million
user-ChatGPT conversations, which consists of over 2.5 million interaction
turns. We compare WildChat with other popular user-chatbot interaction
datasets, and find that our dataset offers the most diverse user prompts,
contains the largest number of languages, and presents the richest variety of
potentially toxic use-cases for researchers to study. In addition to
timestamped chat transcripts, we enrich the dataset with demographic data,
including state, country, and hashed IP addresses, alongside request headers.
This augmentation allows for more detailed analysis of user behaviors across
different geographical regions and temporal dimensions. Finally, because it
captures a broad range of use cases, we demonstrate the dataset's potential
utility in fine-tuning instruction-following models. WildChat is released at
https://wildchat.allen.ai under AI2 ImpACT Licenses.
comment: accepted by ICLR 2024
☆ UQA: Corpus for Urdu Question Answering
This paper introduces UQA, a novel dataset for question answering and text
comprehension in Urdu, a low-resource language with over 70 million native
speakers. UQA is generated by translating the Stanford Question Answering
Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called
EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in
the translated context paragraphs. The paper describes the process of selecting
and evaluating the best translation model among two candidates: Google
Translator and Seamless M4T. The paper also benchmarks several state-of-the-art
multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and
reports promising results. XLM-RoBERTa-XL attains an F1 score of 85.99 and an
EM of 74.56. UQA is a valuable resource for developing and testing multilingual
NLP systems for Urdu and for enhancing the cross-lingual transferability of
existing models. Further, the paper demonstrates the effectiveness of EATS for
creating high-quality datasets for other languages and domains. The UQA dataset
and the code are publicly available at www.github.com/sameearif/UQA.
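A minimal sketch of the EATS idea as described: enclose the answer span in anchor markers, translate the whole paragraph, then seek the markers to recover the span's new offsets. The `translate` callable is a hypothetical stand-in for any MT system (the paper compares Google Translator and Seamless M4T).

```python
# Sketch of Enclose to Anchor, Translate, Seek (EATS) under stated
# assumptions: the MT system preserves the marker characters verbatim.
from typing import Callable, Tuple

def eats(context: str, answer_start: int, answer_text: str,
         translate: Callable[[str], str],
         open_m: str = "«", close_m: str = "»") -> Tuple[str, int, str]:
    end = answer_start + len(answer_text)
    # Enclose: anchor the answer span with rarely translated markers.
    marked = context[:answer_start] + open_m + answer_text + close_m + context[end:]
    translated = translate(marked)          # Translate the whole paragraph.
    s = translated.index(open_m)            # Seek the markers back.
    e = translated.index(close_m)
    span = translated[s + len(open_m):e]
    clean = translated.replace(open_m, "").replace(close_m, "")
    return clean, s, span  # translated context, new start offset, translated answer

# Demo with an identity "translator" so the example runs standalone.
ctx, start, ans = eats("The capital is Paris.", 15, "Paris", lambda s: s)
```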
☆ MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors
Large 2D vision-language models (2D-LLMs) have gained significant attention
by bridging Large Language Models (LLMs) with images using a simple projector.
Inspired by their success, large 3D point cloud-language models (3D-LLMs) also
integrate point clouds into LLMs. However, directly aligning point clouds with
LLMs incurs expensive training costs, typically hundreds of GPU-hours on A100
GPUs, which hinders the development of 3D-LLMs. In this paper, we introduce
MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA
results while training for only 27 hours on one RTX 3090. Specifically, we
propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which
can leverage the similarity between 2D and 3D visual information. We introduce
a novel four-stage training strategy for modality alignment in a cascaded way,
and a mixture of query experts module to adaptively aggregate features with
high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods
LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which
is up to 260x fewer than existing methods. Extensive experiments show that
MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with
significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12
increase on GPT-4 evaluation score for the challenging object captioning task
compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800.
We are the first to explore efficient 3D-LLMs, offering new insights to the
community. Code and weights are available at
https://github.com/TangYuan96/MiniGPT-3D.
comment: 17 pages, 9 figures
☆ Unsupervised Flow Discovery from Task-oriented Dialogues
The design of dialogue flows is a critical but time-consuming task when
developing task-oriented dialogue (TOD) systems. We propose an approach for the
unsupervised discovery of flows from dialogue history, thus making the process
applicable to any domain for which such a history is available. Briefly,
utterances are represented in a vector space and clustered according to their
semantic similarity. Clusters, which can be seen as dialogue states, are then
used as the vertices of a transition graph for representing the flows visually.
We present concrete examples of flows, discovered from MultiWOZ, a public TOD
dataset. We further elaborate on their significance and relevance for the
underlying conversations and introduce an automatic validation metric for their
assessment. Experimental results demonstrate the potential of the proposed
approach for extracting meaningful flows from task-oriented conversations.
comment: 12 pages, 4 figures
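A minimal sketch of the described pipeline: embed utterances, cluster them into candidate dialogue states, and count cluster-to-cluster transitions to form a flow graph. TF-IDF stands in for the paper's actual utterance representations, which are an assumption here.

```python
# Embed -> cluster -> count transitions. Edge counts can then be rendered
# as a transition graph over the discovered "states".
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def discover_flows(dialogues, n_states=8):
    utterances = [u for d in dialogues for u in d]
    X = TfidfVectorizer().fit_transform(utterances)          # stand-in embeddings
    labels = KMeans(n_clusters=n_states, n_init=10).fit_predict(X)
    # Map utterances back to per-dialogue state sequences, then count edges.
    edges, i = Counter(), 0
    for d in dialogues:
        states = labels[i:i + len(d)]
        i += len(d)
        edges.update(zip(states[:-1], states[1:]))
    return edges  # (state_a, state_b) -> transition count

flows = discover_flows([
    ["i need a hotel", "what price range?", "cheap please", "booked!"],
    ["find me a hotel", "which area?", "city centre", "done"],
], n_states=3)
```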
☆ Verification and Refinement of Natural Language Explanations through LLM-Symbolic Theorem Proving
Natural language explanations have become a proxy for evaluating explainable
and multi-step Natural Language Inference (NLI) models. However, assessing the
validity of explanations for NLI is challenging as it typically involves the
crowd-sourcing of apposite datasets, a process that is time-consuming and prone
to logical errors. To address existing limitations, this paper investigates the
verification and refinement of natural language explanations through the
integration of Large Language Models (LLMs) and Theorem Provers (TPs).
Specifically, we present a neuro-symbolic framework, named Explanation-Refiner,
that augments a TP with LLMs to generate and formalise explanatory sentences
and suggest potential inference strategies for NLI. In turn, the TP is employed
to provide formal guarantees on the logical validity of the explanations and to
generate feedback for subsequent improvements. We demonstrate how
Explanation-Refiner can be jointly used to evaluate explanatory reasoning,
autoformalisation, and error correction mechanisms of state-of-the-art LLMs as
well as to automatically enhance the quality of human-annotated explanations of
variable complexity in different domains.
☆ Topics in the Study of the Pragmatic Functions of Phonetic Reduction in Dialog
Reduced articulatory precision is common in speech, but for dialog its
acoustic properties and pragmatic functions have been little studied. We here
try to remedy this gap. This technical report contains content that was omitted
from the journal article (Ward et al. 2024, submitted). Specifically, we here
report 1) lessons learned about annotating for perceived reduction, 2) the
finding that, unlike in read speech, the correlates of reduction in dialog
include high pitch, wide pitch range, and intensity, and 3) a baseline model
for predicting reduction in dialog, using simple acoustic/prosodic features,
that achieves correlations with human perceptions of 0.24 for English, and 0.17
for Spanish. We also provide examples of additional possible pragmatic
functions of reduction in English, along with various discussion, observations,
and speculations.
☆ GAIA: A General AI Assistant for Intelligent Accelerator Operations
Large-scale machines like particle accelerators are usually run by a team of
experienced operators. In the case of a particle accelerator, these operators
possess suitable background knowledge of both accelerator physics and the
technology comprising the machine. Due to the complexity of the machine,
particular subsystems are taken care of by experts, to whom the operators can
turn. In this work, the reasoning and action (ReAct) prompting
paradigm is used to couple an open-weights large language model (LLM) with a
high-level machine control system framework and other tools, e.g. the
electronic logbook or machine design documentation. By doing so, a multi-expert
retrieval augmented generation (RAG) system is implemented, which assists
operators in knowledge retrieval tasks, interacts with the machine directly if
needed, or writes high level control system scripts. This consolidation of
expert knowledge and machine interaction can simplify and speed up machine
operation tasks for both new and experienced human operators.
☆ The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights
Bridging the significant gap between large language models' English and
non-English performance presents a great challenge. While some previous studies
attempt to mitigate this gap with translated training data, the recently
proposed question alignment approach leverages the model's English expertise to
improve multilingual performance with minimum usage of expensive, error-prone
translation. In this paper, we explore how broadly this method can be applied
by examining its effects in reasoning with executable code and reasoning with
common sense. We also explore how to apply this approach efficiently to
extremely large language models using proxy-tuning. Experiment results on
multilingual reasoning benchmarks mGSM, mSVAMP and xCSQA demonstrate that the
question alignment approach can be used to boost multilingual performance
across diverse reasoning scenarios, model families, and sizes. For instance,
when applied to the LLaMA2 models, our method brings an average accuracy
improvement of 12.2% on mGSM, even with the 70B model. To understand the
mechanism of its success, we analyze representation space, chain-of-thought and
translation data scales, which reveals how question translation training
strengthens language alignment within LLMs and shapes their working patterns.
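Proxy-tuning, mentioned above for extremely large models, amounts to simple logit arithmetic at decoding time: steer a large base model by the offset between a small tuned "expert" and its untuned counterpart. A minimal sketch, assuming next-token logits over a shared vocabulary (model loading omitted):

```python
# Sketch of proxy-tuning logit arithmetic; alpha scales the steering offset.
import torch

def proxy_tuned_logits(large_base_logits: torch.Tensor,
                       small_expert_logits: torch.Tensor,
                       small_base_logits: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Apply the expert-vs-base offset to the large model's logits."""
    return large_base_logits + alpha * (small_expert_logits - small_base_logits)

next_token = torch.argmax(
    proxy_tuned_logits(torch.randn(32000), torch.randn(32000), torch.randn(32000))
)
```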
☆ Overcoming LLM Challenges using RAG-Driven Precision in Coffee Leaf Disease Remediation
Dr. Selva Kumar S, Afifah Khan Mohammed Ajmal Khan, Imadh Ajaz Banday, Manikantha Gada, Vibha Venkatesh Shanbhag
This research introduces an innovative AI-driven precision agriculture
system, leveraging YOLOv8 for disease identification and Retrieval Augmented
Generation (RAG) for context-aware diagnosis. Focused on addressing the
challenges of diseases affecting the coffee production sector in Karnataka, the
system integrates sophisticated object detection techniques with language
models to address the inherent constraints associated with Large Language
Models (LLMs). Our methodology not only tackles the issue of hallucinations in
LLMs, but also introduces dynamic disease identification and remediation
strategies. Real-time monitoring, collaborative dataset expansion, and
organizational involvement ensure the system's adaptability in diverse
agricultural settings. The effect of the suggested system extends beyond
automation, aiming to secure food supplies, protect livelihoods, and promote
eco-friendly farming practices. By facilitating precise disease identification,
the system contributes to sustainable and environmentally conscious
agriculture, reducing reliance on pesticides. Looking to the future, the
project envisions continuous development in RAG-integrated object detection
systems, emphasizing scalability, reliability, and usability. This research
strives to be a beacon for positive change in agriculture, aligning with global
efforts toward sustainable and technologically enhanced food production.
comment: 6 pages, 3 figures
☆ The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation LREC
Large Language Models (LLMs) have emerged as powerful support tools across
various natural language tasks and a range of application domains. Recent
studies focus on exploring their capabilities for data annotation. This paper
provides a comparative overview of twelve studies investigating the potential
of LLMs in labelling data. While the models demonstrate promising cost and
time-saving benefits, there exist considerable limitations, such as
representativeness, bias, sensitivity to prompt variations, and an English-language
preference. Leveraging insights from these studies, our empirical analysis
further examines the alignment between human and GPT-generated opinion
distributions across four subjective datasets. In contrast to the studies
examining representation, our methodology directly obtains the opinion
distribution from GPT. Our analysis thereby supports the minority of studies
that consider diverse perspectives when evaluating data annotation tasks
and highlights the need for further research in this direction.
comment: LREC-COLING NLPerspectives workshop
☆ Low-resource speech recognition and dialect identification of Irish in a multi-task framework
This paper explores the use of Hybrid CTC/Attention encoder-decoder models
trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech
recognition (ASR) and dialect identification (DID). Results are compared to the
current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN).
An optimal InterCTC setting is initially established using a Conformer encoder.
This setting is then used to train a model with an E-branchformer encoder and
the performance of the two architectures is compared. A multi-task fine-tuning
approach is adopted for language model (LM) shallow fusion. The experiments
yielded an improvement in DID accuracy of 10.8% relative to a baseline
ECAPA-TDNN, and WER performance approaching the TDNN-HMM model. This multi-task
approach emerges as a promising strategy for Irish low-resource ASR and DID.
comment: 7 pages. Accepted to Odyssey 2024 - The Speaker and Language
Recognition Workshop
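A minimal sketch of an Intermediate CTC objective of the kind used here: auxiliary CTC losses from intermediate encoder layers are interpolated with the final-layer loss. Shapes follow torch.nn.CTCLoss conventions; the weight 0.3 is illustrative, not the paper's setting.

```python
# InterCTC interpolation: (1 - w) * final CTC + w * mean of intermediate
# CTC losses computed from auxiliary encoder-layer outputs.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def interctc_loss(final_log_probs, inter_log_probs_list, targets,
                  input_lengths, target_lengths, w: float = 0.3):
    final = ctc(final_log_probs, targets, input_lengths, target_lengths)
    inter = torch.stack([
        ctc(lp, targets, input_lengths, target_lengths)
        for lp in inter_log_probs_list
    ]).mean()
    return (1 - w) * final + w * inter
```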
☆ Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation
Non-autoregressive (NAR) language models are known for their low latency in
neural machine translation (NMT). However, a performance gap exists between NAR
and autoregressive models due to the large decoding space and difficulty in
capturing dependencies between target words accurately. Compounding this,
preparing appropriate training data for NAR models is a non-trivial task, often
exacerbating exposure bias. To address these challenges, we apply reinforcement
learning (RL) to Levenshtein Transformer, a representative edit-based NAR
model, demonstrating that RL with self-generated data can enhance the
performance of edit-based NAR models. We explore two RL approaches: stepwise
reward maximization and episodic reward maximization. We discuss the respective
pros and cons of these two approaches and empirically verify them. Moreover, we
experimentally investigate the impact of temperature setting on performance,
confirming the importance of proper temperature setting for NAR models'
training.
☆ Identification of Entailment and Contradiction Relations between Natural Language Sentences: A Neurosymbolic Approach
Natural language inference (NLI), also known as Recognizing Textual
Entailment (RTE), is an important aspect of natural language understanding.
Most research now uses machine learning and deep learning to perform this task
on specific datasets, meaning the resulting solutions are neither explainable
nor explicit.
To address the need for an explainable approach to RTE, we propose a novel
pipeline that is based on translating text into an Abstract Meaning
Representation (AMR) graph. For this, we use a pre-trained AMR parser. We then
translate the AMR graph into propositional logic and use a SAT solver for
automated reasoning. In text, commonsense often suggests that an entailment (or
contradiction) relationship holds between a premise and a claim, but because
different wordings are used, this is not identified from their logical
representations. To address this, we introduce relaxation methods to allow
replacement or forgetting of some propositions. Our experimental results show
this pipeline performs well on four RTE datasets.
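The logical core of the pipeline can be made concrete: a premise entails a claim iff "premise AND NOT claim" is unsatisfiable. The self-contained brute-force truth-table check below stands in for the SAT solver used in the paper; formulas are Python callables over a dict of atomic propositions.

```python
# Propositional entailment by exhaustive model checking (sketch; a SAT
# solver replaces this in practice for larger formulas).
from itertools import product

def entails(premise, claim, atoms):
    """Return True if every model of `premise` also satisfies `claim`."""
    for values in product([False, True], repeat=len(atoms)):
        m = dict(zip(atoms, values))
        if premise(m) and not claim(m):  # counter-model: P AND NOT C is SAT
            return False
    return True

# "The cat sleeps and purrs" entails "the cat sleeps".
premise = lambda m: m["sleeps"] and m["purrs"]
claim = lambda m: m["sleeps"]
print(entails(premise, claim, ["sleeps", "purrs"]))  # True
```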
☆ Prompt engineering paradigms for medical applications: scoping review and recommendations for better practices
Prompt engineering is crucial for harnessing the potential of large language
models (LLMs), especially in the medical domain where specialized terminology
and phrasing is used. However, the efficacy of prompt engineering in the
medical domain remains to be explored. In this work, 114 recent studies
(2022-2024) applying prompt engineering in medicine, covering prompt learning
(PL), prompt tuning (PT), and prompt design (PD) are reviewed. PD is the most
prevalent (78 articles). In 12 papers, the terms PD, PL, and PT were used
interchangeably. ChatGPT is the most commonly used LLM, with seven papers using
it for processing sensitive clinical data. Chain-of-Thought emerges as the most
common prompt engineering technique. While PL and PT articles typically provide
a baseline for evaluating prompt-based approaches, 64% of PD studies lack
non-prompt-related baselines. We provide tables and figures summarizing
existing work, and reporting recommendations to guide future research
contributions.
☆ Boosting Jailbreak Attack with Momentum ICLR 2024
Large Language Models (LLMs) have achieved remarkable success across diverse
tasks, yet they remain vulnerable to adversarial attacks, notably the
well-documented jailbreak attack. Recently, the Greedy Coordinate
Gradient (GCG) attack has demonstrated efficacy in exploiting this
vulnerability by optimizing adversarial prompts through a combination of
gradient heuristics and greedy search. However, the efficiency of this attack
has become a bottleneck in the attacking process. To mitigate this limitation,
in this paper we rethink the generation of adversarial prompts through an
optimization lens, aiming to stabilize the optimization process and harness
more heuristic insights from previous iterations. Specifically, we introduce
the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term
into the gradient heuristic. Experimental results showcase the notable
enhancement achieved by MAC in gradient-based
attacks on aligned language models. Our code is available at
https://github.com/weizeming/momentum-attack-llm.
comment: ICLR 2024 Workshop on Reliable and Responsible Foundation Models
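A sketch of the momentum idea the abstract describes, applied to a GCG-style token-gradient heuristic: candidate scoring uses an exponential moving average of one-hot token gradients rather than the raw current gradient. The model-specific gradient computation is omitted; the decay value is illustrative.

```python
# Momentum buffer over per-position, per-vocabulary token gradients.
import torch

class MomentumGradient:
    def __init__(self, mu: float = 0.9):
        self.mu, self.buf = mu, None

    def update(self, grad: torch.Tensor) -> torch.Tensor:
        # grad: (suffix_len, vocab_size) gradient w.r.t. one-hot suffix tokens
        self.buf = grad if self.buf is None else self.mu * self.buf + grad
        return self.buf

mom = MomentumGradient(mu=0.9)
smoothed = mom.update(torch.randn(20, 32000))
# GCG-style heuristic: pick tokens whose gradient most decreases the loss.
top_candidates = (-smoothed).topk(k=256, dim=-1).indices
```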
☆ DMON: A Simple yet Effective Approach for Argument Structure Learning COLING 2024
Argument structure learning (ASL) entails predicting relations between
arguments. Because it can structure a document to facilitate its understanding,
it has been widely applied in many fields (medical, commercial, and scientific
domains). Despite its broad utilization, ASL remains a challenging task because
it involves examining the complex relationships between the sentences in a
potentially unstructured discourse. To resolve this problem, we have developed
a simple yet effective approach called Dual-tower Multi-scale cOnvolution
neural Network (DMON) for the ASL task. Specifically, we organize arguments
into a relationship matrix that, together with the argument embeddings, forms a
relationship tensor, and we design a mechanism to capture relations with
contextual arguments. Experimental results on three different-domain argument
mining
datasets demonstrate that our framework outperforms state-of-the-art models.
The code is available at https://github.com/VRCMF/DMON.git.
comment: COLING 2024
☆ TartuNLP at EvaLatin 2024: Emotion Polarity Detection
This paper presents the TartuNLP team submission to the EvaLatin 2024 shared
task on emotion polarity detection for historical Latin texts. Our system relies
on two distinct approaches to annotating training data for supervised learning:
1) creating heuristics-based labels by adopting the polarity lexicon provided
by the organizers, and 2) generating labels with GPT-4. We employed parameter-
efficient fine-tuning using the adapters framework and experimented with both
monolingual and cross-lingual knowledge transfer for training language and task
adapters. Our submission with the LLM-generated labels achieved the overall
first place in the emotion polarity detection task. Our results show that
LLM-based annotation is a promising approach for texts in Latin.
comment: Accepted to The Third Workshop on Language Technologies for
Historical and Ancient Languages (LT4HALA 2024)
☆ It Couldn't Help But Overhear: On the Limits of Modelling Meta-Communicative Grounding Acts with Supervised Learning
Active participation in a conversation is key to building common ground,
since understanding is jointly tailored by producers and recipients.
Overhearers are deprived of the privilege of performing grounding acts and can
only conjecture about intended meanings. Still, data generation and annotation,
modelling, training and evaluation of NLP dialogue models place reliance on the
overhearing paradigm. How much of the underlying grounding process is thereby
forfeited? As we show, there is evidence pointing to the impossibility
of properly modelling human meta-communicative acts with data-driven learning
models. In this paper, we discuss this issue and provide a preliminary analysis
on the variability of human decisions for requesting clarification. Most
importantly, we wish to bring this topic back to the community's table,
encouraging discussion on the consequences of having models designed to only
"listen in".
comment: work in progress
☆ Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts
Existing methods for creating source-grounded information-seeking dialog
datasets are often costly and hard to implement due to their sole reliance on
human annotators. We propose combining large language model (LLM) prompting
with human expertise for more efficient and reliable data generation. Instead
of the labor-intensive Wizard-of-Oz (WOZ) method, where two annotators generate
a dialog from scratch, role-playing the agent and the user, we use LLM
generation to simulate the two roles. Annotators then verify the output and
augment it with
attribution data. We demonstrate our method by constructing MISeD -- Meeting
Information Seeking Dialogs dataset -- the first information-seeking dialog
dataset focused on meeting transcripts. Models finetuned with MISeD demonstrate
superior performance on our test set, as well as on a novel fully-manual WOZ
test set and an existing query-based summarization benchmark, suggesting the
utility of our approach.
☆ Silencing the Risk, Not the Whistle: A Semi-automated Text Sanitization Tool for Mitigating the Risk of Whistleblower Re-Identification
Whistleblowing is essential for ensuring transparency and accountability in
both public and private sectors. However, (potential) whistleblowers often fear
or face retaliation, even when reporting anonymously. The specific content of
their disclosures and their distinct writing style may re-identify them as the
source. Legal measures, such as the EU WBD, are limited in their scope and
effectiveness. Therefore, computational methods to prevent re-identification
are important complementary tools for encouraging whistleblowers to come
forward. However, current text sanitization tools follow a one-size-fits-all
approach and take an overly limited view of anonymity. They aim to mitigate
identification risk by replacing typical high-risk words (such as person names
and other NE labels) and combinations thereof with placeholders. Such an
approach, however, is inadequate for the whistleblowing scenario since it
neglects further re-identification potential in textual features, including
writing style. Therefore, we propose, implement, and evaluate a novel
classification and mitigation strategy for rewriting texts that involves the
whistleblower in the assessment of the risk and utility. Our prototypical tool
semi-automatically evaluates risk at the word/term level and applies
risk-adapted anonymization techniques to produce a grammatically disjointed yet
appropriately sanitized text. We then use an LLM that we fine-tuned for
paraphrasing to render this text coherent and style-neutral. We evaluate our
tool's effectiveness using court cases from the ECHR and excerpts from a
real-world whistleblower testimony and measure the protection against
authorship attribution (AA) attacks and utility loss statistically using the
popular IMDb62 movie reviews dataset. Our method can significantly reduce AA
accuracy from 98.81% to 31.22%, while preserving up to 73.1% of the original
content's semantics.
comment: Accepted for publication at the ACM Conference on Fairness,
Accountability, and Transparency 2024 (ACM FAccT'24). This is a preprint
manuscript (authors' own version before final copy-editing)
☆ Few Shot Class Incremental Learning using Vision-Language models
Recent advancements in deep learning have demonstrated remarkable performance
comparable to human capabilities across various supervised computer vision
tasks. However, the prevalent assumption of having an extensive pool of
training data encompassing all classes prior to model training often diverges
from real-world scenarios, where limited data availability for novel classes is
the norm. The challenge emerges in seamlessly integrating new classes with few
samples into the training data, demanding the model to adeptly accommodate
these additions without compromising its performance on base classes. To
address this exigency, the research community has introduced several solutions
under the realm of few-shot class incremental learning (FSCIL).
In this study, we introduce an innovative FSCIL framework that utilizes a
language regularizer and a subspace regularizer. During base training, the
language regularizer helps incorporate semantic information extracted from a
Vision-Language model. The subspace regularizer helps in facilitating the
model's acquisition of nuanced connections between image and text semantics
inherent to base classes during incremental training. Our proposed framework
not only empowers the model to embrace novel classes with limited data, but
also ensures the preservation of performance on base classes. To substantiate
the efficacy of our approach, we conduct comprehensive experiments on three
distinct FSCIL benchmarks, where our framework attains state-of-the-art
performance.
comment: under review at Pattern Recognition Letters
☆ UniGen: Universal Domain Generalization for Sentiment Classification via Zero-shot Dataset Generation
Although pre-trained language models have exhibited great flexibility and
versatility with prompt-based few-shot learning, they suffer from extensive
parameter sizes and limited applicability for inference. Recent studies have
suggested that PLMs be used as dataset generators and a tiny task-specific
model be trained to achieve efficient inference. However, their applicability
to various domains is limited because they tend to generate domain-specific
datasets. In this work, we propose a novel approach to universal domain
generalization that generates a dataset regardless of the target domain. This
allows for generalization of the tiny task model to any domain that shares the
label space, thus enhancing the real-world applicability of the dataset
generation paradigm. Our experiments indicate that the proposed method
accomplishes generalizability across various domains while using a parameter
set that is orders of magnitude smaller than PLMs.
☆ The IgboAPI Dataset: Empowering Igbo Language Technologies through Multi-dialectal Enrichment LREC
Chris Chinenye Emezue, Ifeoma Okoh, Chinedu Mbonu, Chiamaka Chukwuneke, Daisy Lal, Ignatius Ezeani, Paul Rayson, Ijemma Onwuzulike, Chukwuma Okeke, Gerald Nweya, Bright Ogbonna, Chukwuebuka Oraegbunam, Esther Chidinma Awo-Ndubuisi, Akudo Amarachukwu Osuagwu, Obioha Nmezi
The Igbo language is facing a risk of becoming endangered, as indicated by a
UNESCO projection for 2025. This highlights the need to develop language
technologies
for Igbo to foster communication, learning and preservation. To create robust,
impactful, and widely adopted language technologies for Igbo, it is essential
to incorporate the multi-dialectal nature of the language. The primary obstacle
in achieving dialectal-aware language technologies is the lack of comprehensive
dialectal datasets. In response, we present the IgboAPI dataset, a
multi-dialectal Igbo-English dictionary dataset, developed with the aim of
enhancing the representation of Igbo dialects. Furthermore, we illustrate the
practicality of the IgboAPI dataset through two distinct studies: one focusing
on Igbo semantic lexicon and the other on machine translation. In the semantic
lexicon project, we successfully establish an initial Igbo semantic lexicon for
the Igbo semantic tagger, while in the machine translation study, we
demonstrate that by finetuning existing machine translation systems using the
IgboAPI dataset, we significantly improve their ability to handle dialectal
variations in sentences.
comment: Accepted to the LREC-COLING 2024 conference
☆ Context-Aware Clustering using Large Language Models
Sindhu Tipirneni, Ravinarayana Adkathimar, Nurendra Choudhary, Gaurush Hiranandani, Rana Ali Amjad, Vassilis N. Ioannidis, Changhe Yuan, Chandan K. Reddy
Despite the remarkable success of Large Language Models (LLMs) in text
understanding and generation, their potential for text clustering tasks remains
underexplored. We observed that powerful closed-source LLMs provide good
quality clusterings of entity sets but are not scalable due to the massive
compute power required and the associated costs. Thus, we propose CACTUS
(Context-Aware ClusTering with aUgmented triplet losS), a systematic approach
that leverages open-source LLMs for efficient and effective supervised
clustering of entity subsets, particularly focusing on text-based entities.
Existing text clustering methods fail to effectively capture the context
provided by the entity subset. Moreover, though there are several language
modeling based approaches for clustering, very few are designed for the task of
supervised clustering. This paper introduces a novel approach towards
clustering entity subsets using LLMs by capturing context via a scalable
inter-entity attention mechanism. We propose a novel augmented triplet loss
function tailored for supervised clustering, which addresses the inherent
challenges of directly applying the triplet loss to this problem. Furthermore,
we introduce a self-supervised clustering task based on text augmentation
techniques to improve the generalization of our model. For evaluation, we
collect ground truth clusterings from a closed-source LLM and transfer this
knowledge to an open-source LLM under the supervised clustering framework,
allowing a faster and cheaper open-source model to perform the same task.
Experiments on various e-commerce query and product clustering datasets
demonstrate that our proposed approach significantly outperforms existing
unsupervised and supervised baselines under various external clustering
evaluation metrics.
comment: 16 pages
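For orientation, here is the standard triplet margin objective that the proposed loss builds on: pull an entity embedding toward another entity from the same cluster and away from one in a different cluster. The paper's augmented variant and inter-entity attention are not reproduced here.

```python
# Standard triplet margin loss over batches of entity embeddings.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.5):
    d_pos = F.pairwise_distance(anchor, positive)   # same-cluster distance
    d_neg = F.pairwise_distance(anchor, negative)   # cross-cluster distance
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

a, p, n = (torch.randn(16, 768) for _ in range(3))  # batch of embeddings
loss = triplet_loss(a, p, n)
```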
☆ On the Evaluation of Machine-Generated Reports SIGIR 2024
James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler
Large Language Models (LLMs) have enabled new ways to satisfy information
needs. Although great strides have been made in applying them to settings like
document ranking and short-form text generation, they still struggle to compose
complete, accurate, and verifiable long-form reports. Reports with these
qualities are necessary to satisfy the complex, nuanced, or multi-faceted
information needs of users. In this perspective paper, we draw together
opinions from industry and academia, and from a variety of related research
areas, to present our vision for automatic report generation, and -- critically
-- a flexible framework by which such reports can be evaluated. In contrast
with other summarization tasks, automatic report generation starts with a
detailed description of an information need, stating the necessary background,
requirements, and scope of the report. Further, the generated reports should be
complete, accurate, and verifiable. These qualities, which are desirable -- if
not required -- in many analytic report-writing settings, require rethinking
how to build and evaluate systems that exhibit these qualities. To foster new
efforts in building these systems, we present an evaluation framework that
draws on ideas found in various evaluations. To test completeness and accuracy,
the framework uses nuggets of information, expressed as questions and answers,
that need to be part of any high-quality generated report. Additionally,
evaluation of citations that map claims made in the report to their source
documents ensures verifiability.
comment: 12 pages, 4 figures, accepted at SIGIR 2024 as perspective paper
☆ Bayesian Optimization with LLM-Based Acquisition Functions for Natural Language Preference Elicitation
Designing preference elicitation (PE) methodologies that can quickly
ascertain a user's top item preferences in a cold-start setting is a key
challenge for building effective and personalized conversational recommendation
(ConvRec) systems. While large language models (LLMs) constitute a novel
technology that enables fully natural language (NL) PE dialogues, we
hypothesize that monolithic LLM NL-PE approaches lack the multi-turn,
decision-theoretic reasoning required to effectively balance the NL exploration
and exploitation of user preferences towards an arbitrary item set. In
contrast, traditional Bayesian optimization PE methods define theoretically
optimal PE strategies, but fail to use NL item descriptions or generate NL
queries, unrealistically assuming users can express preferences with direct
item ratings and comparisons. To overcome the limitations of both approaches,
we formulate NL-PE in a Bayesian Optimization (BO) framework that seeks to
generate NL queries which actively elicit natural language feedback, reducing
uncertainty over item utilities to identify the best recommendation. We
demonstrate our framework in a novel NL-PE algorithm, PEBOL, which uses Natural
Language Inference (NLI) between user preference utterances and NL item
descriptions to maintain preference beliefs, and BO strategies such as Thompson
Sampling (TS) and Upper Confidence Bound (UCB) to guide LLM query generation.
We numerically evaluate our methods in controlled experiments, finding that
PEBOL achieves up to 131% improvement in MAP@10 after 10 turns of cold start
NL-PE dialogue compared to monolithic GPT-3.5, despite relying on a much
smaller 400M parameter NLI model for preference inference.
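A hedged sketch of the Bayesian machinery described above: Beta beliefs over item utilities, Thompson Sampling to choose which item to ask about next, and NLI-style entailment probabilities as noisy observations. The query and NLI steps are random stand-ins for the LLM components.

```python
# Beta-Bernoulli Thompson Sampling loop for preference elicitation (sketch).
import numpy as np

rng = np.random.default_rng(0)
n_items = 100
alpha, beta = np.ones(n_items), np.ones(n_items)  # Beta(1, 1) utility priors

for turn in range(10):
    sampled = rng.beta(alpha, beta)     # Thompson Sampling over utilities
    item = int(np.argmax(sampled))      # ask about this item's description
    # Stand-in for: generate NL query, get user utterance, score with NLI.
    p_entail = rng.uniform()            # P(utterance entails item description)
    alpha[item] += p_entail             # soft (fractional) Beta update
    beta[item] += 1.0 - p_entail

recommendation = int(np.argmax(alpha / (alpha + beta)))  # posterior mean
```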
☆ A Hong Kong Sign Language Corpus Collected from Sign-interpreted TV News LREC
This paper introduces TVB-HKSL-News, a new Hong Kong sign language (HKSL)
dataset collected from a TV news program over a period of 7 months. The dataset
is collected to enrich resources for HKSL and support research in
large-vocabulary continuous sign language recognition (SLR) and translation
(SLT). It consists of 16.07 hours of sign videos of two signers with a
vocabulary of 6,515 glosses (for SLR) and 2,850 Chinese characters or 18K
Chinese words (for SLT). One signer has 11.66 hours of sign videos and the
other has 4.41 hours. One objective in building the dataset is to support the
investigation of how well large-vocabulary continuous sign language
recognition/translation can be done for a single signer given a (relatively)
large amount of his/her training data, which could potentially lead to the
development of new modeling methods. Besides, most parts of the data collection
pipeline are automated with little human intervention; we believe that our
collection method can be scaled up to collect more sign language data easily
for SLT in the future for any sign languages if such sign-interpreted videos
are available. We also run a SOTA SLR/SLT model on the dataset and get a
baseline SLR word error rate of 34.08% and a baseline SLT BLEU-4 score of 23.58
for benchmarking future research on the dataset.
comment: Accepted by LREC-COLING 2024
☆ Language Fairness in Multilingual Information Retrieval SIGIR 2024
Multilingual information retrieval (MLIR) considers the problem of ranking
documents in several languages for a query expressed in a language that may
differ from any of those languages. Recent work has observed that approaches
such as combining ranked lists representing a single document language each or
using multilingual pretrained language models demonstrate a preference for one
language over others. This results in systematic unfair treatment of documents
in different languages. This work proposes a language fairness metric to
evaluate whether documents across different languages are fairly ranked through
statistical equivalence testing using the Kruskal-Wallis test. In contrast to
most prior work in group fairness, we do not consider any language to be an
unprotected group. Thus our proposed measure, PEER (Probability of Equal
Expected Rank), is the first fairness metric specifically designed to
capture the language fairness of MLIR systems. We demonstrate the behavior of
PEER on artificial ranked lists. We also evaluate real MLIR systems on two
publicly available benchmarks and show that the PEER scores align with prior
analytical findings on MLIR fairness. Our implementation is compatible with
ir-measures and is available at http://github.com/hltcoe/peer_measure.
comment: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper
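A minimal sketch of the statistical test behind the proposed metric: group the ranks of relevant documents by language and test the rank distributions with Kruskal-Wallis. The aggregation into the PEER score itself follows the paper and is not reproduced; the rank values below are invented for illustration.

```python
# Kruskal-Wallis H-test over per-language rank samples (illustrative data).
from scipy.stats import kruskal

ranks_by_language = {
    "en": [1, 3, 4, 9, 12],
    "fa": [2, 6, 10, 15, 18],
    "ru": [5, 7, 11, 14, 20],
}
stat, p_value = kruskal(*ranks_by_language.values())
print(f"H={stat:.2f}, p={p_value:.3f}")  # large p: no detected language bias
```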
☆ Distillation for Multilingual Information Retrieval SIGIR 2024
Recent work in cross-language information retrieval (CLIR), where queries and
documents are in different languages, has shown the benefit of the
Translate-Distill framework that trains a cross-language neural dual-encoder
model using translation and distillation. However, Translate-Distill only
supports a single document language. Multilingual information retrieval (MLIR),
which ranks a multilingual document collection, is harder to train than CLIR
because the model must assign comparable relevance scores to documents in
different languages. This work extends Translate-Distill and proposes
Multilingual Translate-Distill (MTD) for MLIR. We show that ColBERT-X models
trained with MTD outperform their counterparts trained with Multilingual
Translate-Train, the previous state-of-the-art training approach, by
5% to 25% in nDCG@20 and 15% to 45% in MAP. We also show that the model is
robust to the way languages are mixed in training batches. Our implementation
is available on GitHub.
comment: 6 pages, 1 figure, accepted at SIGIR 2024 as short paper
☆ PLAID SHIRTTT for Large-Scale Streaming Dense Retrieval SIGIR 2024
PLAID, an efficient implementation of the ColBERT late interaction bi-encoder
using pretrained language models for ranking, consistently achieves
state-of-the-art performance in monolingual, cross-language, and multilingual
retrieval. PLAID differs from ColBERT by assigning terms to clusters and
representing those terms as cluster centroids plus compressed residual vectors.
While PLAID is effective in batch experiments, its performance degrades in
streaming settings where documents arrive over time because representations of
new tokens may be poorly modeled by the earlier tokens used to select cluster
centroids. PLAID Streaming Hierarchical Indexing that Runs on Terabytes of
Temporal Text (PLAID SHIRTTT) addresses this concern using multi-phase
incremental indexing based on hierarchical sharding. Experiments on ClueWeb09
and the multilingual NeuCLIR collection demonstrate the effectiveness of this
approach on the largest collection indexed to date by the ColBERT architecture
and in the multilingual setting, respectively.
comment: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper
☆ CACTUS: Chemistry Agent Connecting Tool-Usage to Science
Andrew D. McNaughton, Gautham Ramalaxmi, Agustin Kruel, Carter R. Knutson, Rohith A. Varikoti, Neeraj Kumar
Large language models (LLMs) have shown remarkable potential in various
domains, but they often lack the ability to access and reason over
domain-specific knowledge and tools. In this paper, we introduce CACTUS
(Chemistry Agent Connecting Tool-Usage to Science), an LLM-based agent that
integrates cheminformatics tools to enable advanced reasoning and
problem-solving in chemistry and molecular discovery. We evaluate the
performance of CACTUS using a diverse set of open-source LLMs, including
Gemma-7b, Falcon-7b, MPT-7b, Llama2-7b, and Mistral-7b, on a benchmark of
thousands of chemistry questions. Our results demonstrate that CACTUS
significantly outperforms baseline LLMs, with the Gemma-7b and Mistral-7b
models achieving the highest accuracy regardless of the prompting strategy
used. Moreover, we explore the impact of domain-specific prompting and hardware
configurations on model performance, highlighting the importance of prompt
engineering and the potential for deploying smaller models on consumer-grade
hardware without significant loss in accuracy. By combining the cognitive
capabilities of open-source LLMs with domain-specific tools, CACTUS can assist
researchers in tasks such as molecular property prediction, similarity
searching, and drug-likeness assessment. Furthermore, CACTUS represents a
significant milestone in the field of cheminformatics, offering an adaptable
tool for researchers engaged in chemistry and molecular discovery. By
integrating the strengths of open-source LLMs with domain-specific tools,
CACTUS has the potential to accelerate scientific advancement and unlock new
frontiers in the exploration of novel, effective, and safe therapeutic
candidates, catalysts, and materials. Moreover, CACTUS's ability to integrate
with automated experimentation platforms and make data-driven decisions in real
time opens up new possibilities for autonomous discovery.
☆ How Can I Get It Right? Using GPT to Rephrase Incorrect Trainee Responses
Jionghao Lin, Zifei Han, Danielle R. Thomas, Ashish Gurung, Shivang Gupta, Vincent Aleven, Kenneth R. Koedinger
One-on-one tutoring is widely acknowledged as an effective instructional
method, conditioned on qualified tutors. However, the high demand for qualified
tutors remains a challenge, often necessitating the training of novice tutors
(i.e., trainees) to ensure effective tutoring. Research suggests that providing
timely explanatory feedback can facilitate the training process for trainees.
However, it presents challenges due to the time-consuming nature of assessing
trainee performance by human experts. Inspired by the recent advancements of
large language models (LLMs), our study employed the GPT-4 model to build an
explanatory feedback system. This system identifies trainees' responses in
binary form (i.e., correct/incorrect) and automatically provides template-based
feedback with responses appropriately rephrased by the GPT-4 model. We
conducted our study on 410 responses from trainees across three training
lessons: Giving Effective Praise, Reacting to Errors, and Determining What
Students Know. Our findings indicate that: 1) using a few-shot approach, the
GPT-4 model effectively identifies correct/incorrect trainees' responses from
three training lessons with an average F1 score of 0.84 and an AUC score of
0.85; and 2) using the few-shot approach, the GPT-4 model adeptly rephrases
incorrect trainees' responses into desired responses, achieving performance
comparable to that of human experts.
comment: International Journal of Artificial Intelligence in Education
☆ Efficient Compression of Multitask Multilingual Speech Models
Whisper is a multitask and multilingual speech model covering 99 languages.
It yields commendable automatic speech recognition (ASR) results in a subset of
its covered languages, but the model still underperforms on a non-negligible
number of under-represented languages, a problem exacerbated in smaller model
versions. In this work, we examine its limitations, demonstrating the presence
of speaker-related (gender, age) and model-related (resourcefulness and model
size) biases. Despite that, we show that only model-related biases are
amplified by quantization, impacting low-resource languages and smaller models
the most.
Searching for a better compression approach, we propose DistilWhisper, which
is able to bridge the performance gap in ASR for these languages
while retaining the advantages of multitask and multilingual capabilities. Our
approach involves two key strategies: lightweight modular ASR fine-tuning of
whisper-small using language-specific experts, and knowledge distillation from
whisper-large-v2. This dual approach allows us to effectively boost ASR
performance while keeping the robustness inherited from the multitask and
multilingual pre-training. Results demonstrate that our approach is more
effective than standard fine-tuning or LoRA adapters, boosting performance in
the targeted languages for both in- and out-of-domain test sets, while
introducing only a negligible parameter overhead at inference.
comment: Master Thesis
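A minimal sketch of the distillation objective at the core of such an
approach: a cross-entropy term on gold transcripts combined with a KL term
toward the teacher's distribution. The gated language-specific expert modules
of DistilWhisper are omitted here; this is an illustrative simplification,
not the authors' exact loss.
```python
import torch.nn.functional as F

def distil_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Combine supervised CE with distillation from a larger teacher."""
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # skip padding positions
    )
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients for temperature T
    return alpha * ce + (1 - alpha) * kd
```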
☆ The Role of Model Architecture and Scale in Predicting Molecular Properties: Insights from Fine-Tuning RoBERTa, BART, and LLaMA
This study introduces a systematic framework to compare the efficacy of Large
Language Models (LLMs) for fine-tuning across various cheminformatics tasks.
Employing a uniform training methodology, we assessed three well-known
models-RoBERTa, BART, and LLaMA-on their ability to predict molecular
properties using the Simplified Molecular Input Line Entry System (SMILES) as a
universal molecular representation format. Our comparative analysis involved
pre-training 18 configurations of these models, with varying parameter sizes
and dataset scales, followed by fine-tuning them on six benchmarking tasks from
DeepChem. We maintained consistent training environments across models to
ensure reliable comparisons. This approach allowed us to assess the influence
of model type, size, and training dataset size on model performance.
Specifically, we found that LLaMA-based models generally offered the lowest
validation loss, suggesting their superior adaptability across tasks and
scales. However, contrary to previous research, we observed that absolute
validation loss is not a definitive indicator of model performance, at least
for fine-tuning tasks; instead, model size plays a crucial role. Through rigorous
replication and validation, involving multiple training and fine-tuning cycles,
our study not only delineates the strengths and limitations of each model type
but also provides a robust methodology for selecting the most suitable LLM for
specific cheminformatics applications. This research underscores the importance
of considering model architecture and dataset characteristics in deploying AI
for molecular property prediction, paving the way for more informed and
effective utilization of AI in drug discovery and related fields.
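As an illustration of this experimental setup, a hedged sketch of fine-tuning
an encoder on SMILES strings for property regression with Hugging Face
Transformers; the checkpoint and the property targets are placeholders, not
the models or data used in the study.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in checkpoint; the study pre-trains its own RoBERTa/BART/LLaMA variants.
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1  # single output -> MSE regression head
)

batch = tok(["CCO", "c1ccccc1O"], padding=True, return_tensors="pt")
labels = torch.tensor([[-0.24], [1.39]])  # illustrative property targets
loss = model(**batch, labels=labels).loss  # MSE when num_labels == 1
loss.backward()  # an optimizer step would follow in a training loop
```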
☆ Modeling Empathetic Alignment in Conversation NAACL 2024
Empathy requires perspective-taking: empathetic responses require a person to
reason about what another has experienced and communicate that understanding in
language. However, most NLP approaches to empathy do not explicitly model this
alignment process. Here, we introduce a new approach to recognizing alignment
in empathetic speech, grounded in Appraisal Theory. We introduce a new dataset
of over 9.2K span-level annotations of different types of appraisals of a
person's experience and over 3K empathetic alignments between a speaker's and
observer's speech. Through computational experiments, we show that these
appraisals and alignments can be accurately recognized. In experiments on over
9.2M Reddit conversations, we find that appraisals capture meaningful groupings
of behavior but that most responses have minimal alignment. However, we find
that mental health professionals engage with substantially more empathetic
alignment.
comment: Camera-ready version for NAACL 2024
☆ LLaVA Finds Free Lunch: Teaching Human Behavior Improves Content Understanding Abilities Of LLMs
Somesh Singh, Harini S I, Yaman K Singla, Veeky Baths, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy
Communication is defined as ``Who says what to whom with what effect.'' A
message from a communicator generates downstream receiver effects, also known
as behavior. Receiver behavior, being a downstream effect of the message,
carries rich signals about it. Yet despite carrying these signals, behavior
data is often ignored when training large language models. We
show that training LLMs on receiver behavior can actually help improve their
content-understanding abilities. Specifically, we show that training LLMs to
predict the receiver behavior of likes and comments improves the LLM's
performance on a wide variety of downstream content understanding tasks. We
show this performance increase over 40 video and image understanding tasks over
23 benchmark datasets across both 0-shot and fine-tuning settings,
outperforming many supervised baselines. Moreover, since receiver behavior,
such as likes and comments, is collected by default on the internet and does
not need any human annotations to be useful, the performance improvement we get
after training on this data is essentially a free lunch. We release the
cleaned receiver-behavior data (comments and likes) for 750k images and
videos collected from multiple platforms, along with our instruction-tuning
data.
♻ ☆ Large language models can accurately predict searcher preferences
Relevance labels, which indicate whether a search result is valuable to a
searcher, are key to evaluating and optimising search systems. The best way to
capture the true preferences of users is to ask them for their careful feedback
on which results would be useful, but this approach does not scale to produce a
large number of labels. Getting relevance labels at scale is usually done with
third-party labellers, who judge on behalf of the user, but there is a risk of
low-quality data if the labeller doesn't understand user needs. To improve
quality, one standard approach is to study real users through interviews, user
studies and direct feedback, find areas where labels are systematically
disagreeing with users, then educate labellers about user needs through judging
guidelines, training and monitoring. This paper introduces an alternate
approach for improving label quality. It takes careful feedback from real
users, which by definition is the highest-quality first-party gold data that
can be derived, and develops a large language model prompt that agrees with
that data.
We present ideas and observations from deploying language models for
large-scale relevance labelling at Bing, and illustrate with data from TREC. We
have found large language models can be effective, with accuracy as good as
human labellers and similar capability to pick the hardest queries, best runs,
and best groups. Systematic changes to the prompts make a difference in
accuracy, but so too do simple paraphrases. Measuring agreement with real
searchers requires high-quality ``gold'' labels, but with these we find that
models produce better labels than third-party workers, for a fraction of the
cost, and these labels let us train notably better rankers.
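An illustrative sketch of prompt-based relevance labelling in this spirit;
the prompt wording and model are placeholders, not the deployed Bing system.
```python
from openai import OpenAI

client = OpenAI()

PROMPT = """You are a search quality rater. Given a query and a result,
assign a relevance label: 2 (highly relevant), 1 (partially relevant),
or 0 (not relevant). Answer with the number only.

Query: {query}
Result: {result}
Label:"""

def label(query: str, result: str) -> int:
    """Ask the LLM for a graded relevance label."""
    out = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic labels for repeatability
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, result=result)}],
    )
    return int(out.choices[0].message.content.strip())
```
Prompt variants would then be scored against the first-party gold labels and
the best-agreeing variant kept, as the paper describes.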
♻ ☆ Towards Real-time Learning in Large Language Models: A Critical Review
Real-time learning concerns the ability of learning systems to acquire
knowledge over time, enabling their adaptation and generalization to novel
tasks. It is a critical ability for intelligent, real-world systems, especially
when data may be insufficient or difficult to obtain. This review provides a
comprehensive analysis of real-time learning in Large Language Models. It
synthesizes the state-of-the-art real-time learning paradigms, including
continual learning, meta-learning, parameter-efficient learning, and
mixture-of-experts learning. We demonstrate their utility for real-time
learning by describing specific achievements from these related topics and
their critical factors. Finally, the paper highlights current problems and
challenges for future research in the field. By consolidating the latest
relevant research developments, this review offers a comprehensive
understanding of real-time learning and its implications for designing and
developing LLM-based learning systems addressing real-world problems.
♻ ☆ A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, Summer Yue
Large language models (LLMs) have achieved impressive success on many
benchmarks for mathematical reasoning. However, there is growing concern that
some of this performance actually reflects dataset contamination rather than
true reasoning ability, with data closely resembling benchmark questions
leaking into the training data. To investigate this claim rigorously, we commission
Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and
complexity of the established GSM8k benchmark, the gold standard for measuring
elementary mathematical reasoning. We ensure that the two benchmarks are
comparable across important metrics such as human solve rates, number of steps
in solution, answer magnitude, and more. When evaluating leading open- and
closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with
several families of models (e.g., Phi and Mistral) showing evidence of
systematic overfitting across almost all model sizes. At the same time, many
models, especially those on the frontier (e.g., Gemini/GPT/Claude), show
minimal signs of overfitting. Further analysis suggests a positive relationship
(Spearman's r^2=0.32) between a model's probability of generating an example
from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that
many models may have partially memorized GSM8k.
♻ ☆ Fewer Truncations Improve Language Modeling ICML 2024
In large language model training, input documents are typically concatenated
together and then split into sequences of equal length to avoid padding tokens.
Despite its efficiency, the concatenation approach compromises data integrity
-- it inevitably breaks many documents into incomplete pieces, leading to
excessive truncations that hinder the model from learning to compose logically
coherent and factually consistent content that is grounded on the complete
context. To address the issue, we propose Best-fit Packing, a scalable and
efficient method that packs documents into training sequences through
length-aware combinatorial optimization. Our method completely eliminates
unnecessary truncations while retaining the same training efficiency as
concatenation. Empirical results from both text and code pre-training show that
our method achieves superior performance (e.g., relatively +4.7% on reading
comprehension; +16.8% in context following; and +9.2% on program synthesis),
and reduces closed-domain hallucination effectively by up to 58.3%.
comment: ICML 2024
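To make the packing step concrete, a minimal sketch of best-fit packing over
document lengths (simplified; the paper's implementation also splits
documents longer than the context length and optimizes for scale):
```python
def best_fit_pack(doc_lens, max_len):
    """Pack whole documents into sequences of capacity max_len
    using the best-fit-decreasing heuristic."""
    bins = []  # each bin: [remaining_capacity, [doc indices]]
    for i in sorted(range(len(doc_lens)), key=lambda j: -doc_lens[j]):
        n = doc_lens[i]
        # the open bin whose remaining space fits this document most tightly
        best = min((b for b in bins if b[0] >= n),
                   key=lambda b: b[0] - n, default=None)
        if best is None:
            bins.append([max_len - n, [i]])
        else:
            best[0] -= n
            best[1].append(i)
    return [docs for _, docs in bins]

# best_fit_pack([9, 5, 4, 3, 2], 10) -> [[0], [1, 2], [3, 4]]
```
No document is broken merely to fill a sequence, which is exactly the kind of
truncation the method eliminates.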
♻ ☆ BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine
Large Language Models (LLMs) have swiftly emerged as vital resources for
different applications in the biomedical and healthcare domains; however, these
models encounter issues such as generating inaccurate information or
hallucinations. Retrieval-augmented generation provides a way for these
models to update their knowledge and enhance their performance. In contrast to
previous retrieval-augmented LMs, which utilize specialized cross-attention
mechanisms to help LLM encode retrieved text, BiomedRAG adopts a simpler
approach by directly inputting the retrieved chunk-based documents into the
LLM. This straightforward design is easily applicable to existing retrieval and
language models, effectively bypassing noise information in retrieved
documents, particularly in noise-intensive tasks. Moreover, we demonstrate the
potential for utilizing the LLM to supervise the retrieval model in the
biomedical domain, enabling it to retrieve the document that assists the LM in
improving its predictions. Our experiments reveal that with the tuned
scorer, BiomedRAG attains superior performance across 5 biomedical NLP
tasks, encompassing information extraction (triple extraction, relation
extraction), text classification, link prediction, and question-answering,
leveraging over 9 datasets. For instance, in the triple extraction task,
BiomedRAG outperforms other triple extraction systems with micro-F1
scores of 81.42 and 88.83 on GIT and ChemProt corpora, respectively.
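A hedged sketch of the direct-input design: retrieved chunks are scored and
the top ones concatenated straight into the prompt, with no architectural
changes to the LLM. The scorer interface and prompt wording are placeholders.
```python
def build_prompt(question, chunks, scorer, k=3):
    """Select the k highest-scoring chunks and prepend them to the query."""
    top = sorted(chunks, key=lambda c: scorer(question, c), reverse=True)[:k]
    context = "\n\n".join(top)
    return ("Use the following biomedical context to answer.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```
The tuned scorer mentioned above would play the role of `scorer`, trained
with supervision from the LLM itself.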
♻ ☆ CPLLM: Clinical Prediction with Large Language Models
We present Clinical Prediction with Large Language Models (CPLLM), a method
that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical
disease and readmission prediction. We utilized quantization and fine-tuned the
LLM using prompts. For diagnosis prediction, we predict whether patients will
be diagnosed with a target disease during their next visit or in the subsequent
diagnosis, leveraging their historical diagnosis records. We compared our
results to various baselines, including RETAIN and Med-BERT, the current
state-of-the-art model for disease prediction using temporal structured EHR
data. In addition, we evaluated CPLLM for patient hospital readmission
prediction and compared our method's performance with benchmark baselines. Our
experiments have shown that our proposed method, CPLLM, surpasses all the
tested models in terms of PR-AUC and ROC-AUC metrics, showing state-of-the-art
results for diagnosis prediction and patient hospital readmission prediction.
Such a method can be easily implemented and integrated into the clinical
process to help care providers estimate the next steps of patients.
comment: v2
♻ ☆ Sparse is Enough in Fine-tuning Pre-trained Large Language Models ICML 2024
With the prevalence of the pre-training-fine-tuning paradigm, how to efficiently
adapt the pre-trained model to the downstream tasks has been an intriguing
issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for
low-cost adaptation. Although PEFT has demonstrated effectiveness and been
widely applied, the underlying principles are still unclear. In this paper, we
adopt the PAC-Bayesian generalization error bound, viewing pre-training as a
shift of prior distribution which leads to a tighter bound for generalization
error. We validate this shift from the perspectives of oscillations in the loss
landscape and the quasi-sparsity in gradient distribution. Based on this, we
propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment
Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks
including the GLUE Benchmark and Instruction-tuning. The code is accessible at
https://github.com/song-wx/SIFT/.
comment: Accepted at ICML 2024
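A simplified sketch of the idea behind SIFT: keep only the gradient entries
with the largest magnitude (chosen on a probe batch) and zero the rest at
every step, so updates stay sparse. Details differ from the official
implementation linked above.
```python
import torch

def make_sparse_masks(model, probe_loss, density=0.01):
    """Select the top `density` fraction of entries per tensor by |grad|."""
    probe_loss.backward()
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        k = max(1, int(p.numel() * density))
        thresh = p.grad.abs().flatten().topk(k).values.min()
        masks[name] = (p.grad.abs() >= thresh).float()
        p.grad = None  # clear probe gradients
    return masks

def apply_masks(model, masks):
    """Call after loss.backward() and before optimizer.step()."""
    for name, p in model.named_parameters():
        if name in masks and p.grad is not None:
            p.grad.mul_(masks[name])
```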
♻ ☆ Strong Priority and Determinacy in Timed CCS
Building on the standard theory of process algebra with priorities, we
identify a new scheduling mechanism, called "constructive reduction", which is
designed to capture the essence of synchronous programming. The distinctive
property of this evaluation strategy is to achieve determinacy-by-construction
for multi-cast concurrent communication with shared memory. In the technical
setting of CCS extended by clocks and priorities, we prove for a large class of
"coherent" processes a confluence property for constructive reductions. We show
that under some restrictions, called "pivotability", coherence is preserved by
the operators of prefix, summation, parallel composition, restriction and
hiding. Since this permits memory and sharing, we are able to cover a strictly
larger class of processes compared to those in Milner's classical confluence
theory for CCS without priorities.
♻ ☆ StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization ICML 2024
In this paper, we investigate the long-term memory learning capabilities of
state-space models (SSMs) from the perspective of parameterization. We prove
that state-space models without any reparameterization exhibit a memory
limitation similar to that of traditional RNNs: the target relationships that
can be stably approximated by state-space models must have an exponential
decaying memory. Our analysis identifies this ``curse of memory'' as a result
of the recurrent weights converging to a stability boundary, suggesting that a
reparameterization technique can be effective. To this end, we introduce a
class of reparameterization techniques for SSMs that effectively lift their
memory limitations. Besides improving approximation capabilities, we further
illustrate that a principled choice of reparameterization scheme can also
enhance optimization stability. We validate our findings using synthetic
datasets, language modeling, and image classification tasks.
comment: 27 pages, 7 figures, ICML 2024
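A minimal sketch of stable reparameterization for a diagonal linear SSM:
learn unconstrained parameters and map them into the stable region
|lambda| < 1, so the recurrent weights cannot cross the stability boundary.
The mapping below is one illustrative choice, not necessarily the paper's
exact scheme.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StableDiagonalSSM(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(dim))       # unconstrained
        self.b = nn.Parameter(torch.randn(dim) * 0.1)

    def forward(self, u):  # u: (batch, time, dim)
        lam = torch.exp(-F.softplus(self.w))  # in (0, 1): stable by construction
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):
            h = lam * h + self.b * u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)
```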
♻ ☆ Language Models As Semantic Indexers
Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li, Hanqing Lu, Suhang Wang, Jiawei Han, Xianfeng Tang
Semantic identifier (ID) is an important concept in information retrieval
that aims to preserve the semantics of objects such as documents and items
inside their IDs. Previous studies typically adopt a two-stage pipeline to
learn semantic IDs by first procuring embeddings using off-the-shelf text
encoders and then deriving IDs based on the embeddings. However, each step
introduces potential information loss, and there is usually an inherent
mismatch between the distribution of embeddings within the latent space
produced by text encoders and the anticipated distribution required for
semantic indexing. It is non-trivial to design a method that can learn the
document's semantic representations and its hierarchical structure
simultaneously, given that semantic IDs are discrete and sequentially
structured, and the semantic supervision is deficient. In this paper, we
introduce LMIndexer, a self-supervised framework to learn semantic IDs with a
generative language model. We tackle the challenge of sequential discrete IDs by
introducing a semantic indexer capable of generating neural sequential discrete
representations with progressive training and contrastive learning. In response
to the semantic supervision deficiency, we propose to train the model with a
self-supervised document reconstruction objective. We show the high quality of
the learned IDs and demonstrate their effectiveness on three tasks including
recommendation, product search, and document retrieval on five datasets from
various domains. Code is available at
https://github.com/PeterGriffinJin/LMIndexer.
comment: 10 pages, 5 appendix pages
♻ ☆ LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, Duen Horng Chau
Large language models (LLMs) are popular for high-quality text generation but
can produce harmful content, even when aligned with human values through
reinforcement learning. Adversarial prompts can bypass their safety measures.
We propose LLM Self Defense, a simple approach to defend against these attacks
by having an LLM screen the induced responses. Our method does not require any
fine-tuning, input preprocessing, or iterative output generation. Instead, we
incorporate the generated content into a pre-defined prompt and employ another
instance of an LLM to analyze the text and predict whether it is harmful. We
test LLM Self Defense on GPT 3.5 and Llama 2, two of the current most prominent
LLMs, against various types of attacks, such as forcefully inducing affirmative
responses to prompts and prompt engineering attacks. Notably, LLM Self Defense
succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5
and Llama 2. The code is publicly available at
https://github.com/poloclub/llm-self-defense
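The defense reduces to a single screening call; a minimal sketch follows
(the prompt wording is illustrative, not necessarily the paper's):
```python
from openai import OpenAI

client = OpenAI()

SCREEN = ("Does the following text contain harmful content? Answer "
          "'Yes, this is harmful' or 'No, this is not harmful'.\n\nText: {text}")

def is_harmful(generated_text: str) -> bool:
    """Ask a second LLM instance to screen another model's output."""
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": SCREEN.format(text=generated_text)}],
    )
    return out.choices[0].message.content.strip().lower().startswith("yes")
```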
♻ ☆ Generating Feedback-Ladders for Logical Errors in Programming using Large Language Models
In feedback generation for logical errors in programming assignments, large
language model (LLM)-based methods have shown great promise. These methods ask
the LLM to generate feedback given the problem statement and a student's
(buggy) submission. There are several issues with these types of methods.
First, the generated feedback messages are often too direct in revealing the
error in the submission and thus diminish valuable opportunities for the
student to learn. Second, they do not consider the student's learning context,
i.e., their previous submissions, current knowledge, etc. Third, they are not
layered since existing methods use a single, shared prompt for all student
submissions. In this paper, we explore using LLMs to generate a
"feedback-ladder", i.e., multiple levels of feedback for the same
problem-submission pair. We evaluate the quality of the generated
feedback-ladder via a user study with students, educators, and researchers. We
observed in the study that feedback effectiveness diminishes at higher
levels and for higher-scoring submissions overall. In practice, our method
enables teachers to select an appropriate level of feedback to show to a
student based on their personal learning context, or to proceed
progressively, offering more detailed feedback whenever a higher-level hint
fails to correct the student's error.
comment: Published on the 17th EDM 2024 - Posters and Demos Track
♻ ☆ Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU EMNLP 2021
Incremental processing allows interactive systems to respond based on partial
inputs, which is a desirable property e.g. in dialogue agents. The currently
popular Transformer architecture inherently processes sequences as a whole,
abstracting away the notion of time. Recent work attempts to apply Transformers
incrementally via restart-incrementality by repeatedly feeding, to an unchanged
model, increasingly longer input prefixes to produce partial outputs. However,
this approach is computationally costly and does not scale efficiently for long
sequences. In parallel, we witness efforts to make Transformers more efficient,
e.g. the Linear Transformer (LT) with a recurrence mechanism. In this work, we
examine the feasibility of LT for incremental NLU in English. Our results show
that the recurrent LT model has better incremental performance and faster
inference speed compared to the standard Transformer and LT with
restart-incrementality, at the cost of part of the non-incremental (full
sequence) quality. We show that the performance drop can be mitigated by
training the model to wait for right context before committing to an output and
that training with input prefixes is beneficial for delivering correct partial
outputs.
comment: Accepted at EMNLP 2021 (contains corrigendum)
♻ ☆ CreoleVal: Multilingual Multitask Benchmarks for Creoles ACL
Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loïc Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, Johannes Bjerva
Creoles represent an under-explored and marginalized group of languages, with
few available resources for NLP research. While the genealogical ties between
Creoles and a number of highly-resourced languages imply a significant
potential for transfer learning, this potential is hampered due to this lack of
annotated data. In this work we present CreoleVal, a collection of benchmark
datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it
is an aggregate of novel development datasets for reading comprehension,
relation classification, and machine translation for Creoles, in addition to a
practical gateway to a handful of preexisting benchmarks. For each benchmark,
we conduct baseline experiments in a zero-shot setting in order to further
ascertain the capabilities and limitations of transfer learning for Creoles.
Ultimately, we see CreoleVal as an opportunity to empower research on Creoles
in NLP and computational linguistics, and in general, a step towards more
equitable language technology around the globe.
comment: Accepted to TACL
♻ ☆ Towards A Structured Overview of Use Cases for Natural Language Processing in the Legal Domain: A German Perspective
In recent years, the field of Legal Tech has risen in prevalence, as the
Natural Language Processing (NLP) and legal disciplines have combined forces to
digitalize legal processes. Amidst the steady flow of research solutions
stemming from the NLP domain, the study of use cases has fallen behind, leading
to a number of innovative technical methods without a place in practice. In
this work, we aim to build a structured overview of Legal Tech use cases,
grounded in NLP literature, but also supplemented by voices from legal practice
in Germany. Based upon a Systematic Literature Review, we identify seven
categories of NLP technologies for the legal domain, which are then studied in
juxtaposition to 22 legal use cases. In the investigation of these use cases,
we identify 15 ethical, legal, and social aspects (ELSA), shedding light on the
potential concerns of digitally transforming the legal domain.
comment: 10 pages, 6 tables, 30th Americas Conference on Information Systems
(AMCIS 2024)
♻ ☆ An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing
Ziwei Chai, Guoyin Wang, Jing Su, Tianjie Zhang, Xuanwen Huang, Xuwu Wang, Jingjing Xu, Jianbo Yuan, Hongxia Yang, Fei Wu, Yang Yang
We present Expert-Token-Routing, a unified generalist framework that
facilitates seamless integration of multiple expert LLMs. Our framework
represents expert LLMs as special expert tokens within the vocabulary of a meta
LLM. The meta LLM can route to an expert LLM as if generating a new token.
Expert-Token-Routing not only supports learning the implicit expertise of
expert LLMs from existing instruction datasets but also allows for dynamic
extension of new expert LLMs in a plug-and-play manner. It also conceals the
detailed collaboration process from the user's perspective, facilitating
interaction as though it were a singular LLM. Our framework outperforms various
existing multi-LLM collaboration paradigms across benchmarks that incorporate
six diverse expert domains, demonstrating effectiveness and robustness in
building a generalist LLM system via synergizing multiple expert LLMs.
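A hedged sketch of the routing mechanism: expert tokens live in the meta
LLM's vocabulary, and generating one dispatches the query to the matching
expert. The interfaces below are abstractions for illustration; training the
expert-token embeddings is omitted.
```python
from typing import Dict, Protocol

class LM(Protocol):
    def next_token(self, prompt: str) -> str: ...
    def complete(self, prompt: str) -> str: ...

EXPERT_TOKENS = {"<expert:med>": "medical", "<expert:law>": "legal"}

def route(meta: LM, experts: Dict[str, LM], prompt: str) -> str:
    tok = meta.next_token(prompt)  # expert tokens compete with ordinary ones
    if tok in EXPERT_TOKENS:
        return experts[EXPERT_TOKENS[tok]].complete(prompt)
    return meta.complete(prompt)
```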
♻ ☆ Evaluating Generative Ad Hoc Information Retrieval
Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein, Matthias Hagen, Martin Potthast
Recent advances in large language models have enabled the development of
viable generative retrieval systems. Instead of a traditional document ranking,
many generative retrieval systems directly return a grounded generated text as
an answer to an information need expressed as a query or question. Quantifying
the utility of the textual responses is essential for appropriately evaluating
such generative ad hoc retrieval. Yet, the established evaluation methodology
for ranking-based retrieval is not suited for reliable, repeatable, and
reproducible evaluation of generated answers. In this paper, we survey the
relevant literature from the fields of information retrieval and natural
language processing, we identify search tasks and system architectures in
generative retrieval, we develop a corresponding user model, and we study its
operationalization. Our analysis provides a foundation and new insights for the
evaluation of generative retrieval systems, focusing on ad hoc retrieval.
comment: 14 pages, 6 figures, 1 table
♻ ☆ ReactGenie: A Development Framework for Complex Multimodal Interactions Using Large Language Models
Jackie Junrui Yang, Yingtian Shi, Yuhan Zhang, Karina Li, Daniel Wan Rosli, Anisha Jain, Shuning Zhang, Tianshi Li, James A. Landay, Monica S. Lam
By combining voice and touch interactions, multimodal interfaces can surpass
the efficiency of either modality alone. Traditional multimodal frameworks
require laborious developer work to support rich multimodal commands, where a
user's command may involve exponentially many combinations of
actions/function invocations. This paper presents ReactGenie, a programming
framework that better separates multimodal input from the computational model
to enable developers to create efficient and capable multimodal interfaces with
ease. ReactGenie translates multimodal user commands into NLPL (Natural
Language Programming Language), a programming language we created, using a
neural semantic parser based on large-language models. The ReactGenie runtime
interprets the parsed NLPL and composes primitives in the computational model
to implement complex user commands. As a result, ReactGenie allows easy
implementation and unprecedented richness in commands for end-users of
multimodal apps. Our evaluation showed that 12 developers can learn and build a
nontrivial ReactGenie application in under 2.5 hours on average. In addition,
compared with a traditional GUI, end-users can complete tasks faster and with
less task load using ReactGenie apps.
♻ ☆ On the Learnability of Watermarks for Language Models ICLR 2024
Watermarking of language model outputs enables statistical detection of
model-generated text, which can mitigate harms and misuses of language models.
Existing watermarking strategies operate by altering the decoder of an existing
language model. In this paper, we ask whether language models can directly
learn to generate watermarked text, which would have significant implications
for the real-world deployment of watermarks. First, learned watermarks could be
used to build open models that naturally generate watermarked text, enabling
watermarking for open models, where users can control the decoding procedure.
Second, if watermarking is used to determine the provenance of generated text,
an adversary can hurt the reputation of a victim model by spoofing its
watermark and generating damaging watermarked text. To investigate the
learnability of watermarks, we propose watermark distillation, which trains a
student model to behave like a teacher model that uses decoding-based
watermarking. We test our approach on three decoding-based watermarking
strategies and various hyperparameter settings, finding that models can learn
to generate watermarked text with high detectability. We also find limitations
to learnability, including the loss of watermarking capabilities under
fine-tuning on normal text and high sample complexity when learning
low-distortion watermarks.
comment: Accepted at ICLR 2024
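A minimal sketch of the distillation step: fine-tune the student with the
ordinary next-token objective on text sampled from a teacher that applies
decoding-based watermarking, so the watermark signal is absorbed into the
weights. Teacher sampling is abstracted away; this is illustrative, not the
paper's training code.
```python
def distill_step(student, tokenizer, watermarked_texts, optimizer):
    """One causal-LM training step on watermarked teacher samples."""
    batch = tokenizer(watermarked_texts, padding=True, return_tensors="pt")
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
    loss = student(**batch, labels=labels).loss  # standard cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```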
♻ ☆ HANS, are you clever? Clever Hans Effect Analysis of Neural Systems
Instruction-tuned Large Language Models (It-LLMs) exhibit outstanding
abilities to reason about the cognitive states, intentions, and
reactions of all people involved, letting humans guide and comprehend
day-to-day social interactions effectively. In fact, several multiple-choice
questions (MCQ) benchmarks have been proposed to construct solid assessments of
the models' abilities. However, earlier works demonstrate the presence of
inherent "order bias" in It-LLMs, posing challenges to the appropriate
evaluation. In this paper, we investigate It-LLMs' resilience to a series of
probing tests using four MCQ benchmarks. Introducing adversarial
examples, we show a significant performance gap, mainly when varying the order
of the choices, which reveals a selection bias and calls the models'
reasoning abilities into question. Observing a correlation between first
positions and model choices due to positional bias, we hypothesized the presence of structural
heuristics in the decision-making process of the It-LLMs, strengthened by
including significant examples in few-shot scenarios. Finally, by using the
Chain-of-Thought (CoT) technique, we elicit reasoning from the model and
mitigate the bias, obtaining more robust models.
comment: This paper contains erroneous evaluations and we would like to
withdraw it
♻ ☆ GRAMMAR: Grounded and Modular Methodology for Assessment of Domain-Specific Retrieval-Augmented Language Model
Retrieval-augmented Generation (RAG) systems have been actively studied and
deployed across various industries to query domain-specific knowledge bases.
However, evaluating these systems presents unique challenges due to the
scarcity of domain-specific queries and corresponding ground truths, as well as
a lack of systematic approaches to diagnosing the cause of failure cases --
whether they stem from knowledge deficits or issues related to system
robustness. To address these challenges, we introduce GRAMMAR (GRounded And
Modular Methodology for Assessment of RAG), an evaluation framework comprising
two key elements: 1) a data generation process that leverages relational
databases and LLMs to efficiently produce scalable query-answer pairs. This
method facilitates the separation of query logic from linguistic variations for
enhanced debugging capabilities; and 2) an evaluation framework that
differentiates knowledge gaps from robustness and enables the identification of
defective modules. Our empirical results underscore the limitations of current
reference-free evaluation approaches and the reliability of GRAMMAR to
accurately identify model vulnerabilities.
♻ ☆ Structured Probabilistic Coding AAAI 2024
This paper presents a new supervised representation learning framework,
namely structured probabilistic coding (SPC), to learn compact and informative
representations from input related to the target task. SPC is an encoder-only
probabilistic coding technology with a structured regularization from the
target space. It can enhance the generalization ability of pre-trained language
models for better language understanding. Specifically, our probabilistic
coding simultaneously performs information encoding and task prediction in one
module to more fully utilize the effective information from input data. It uses
variational inference in the output space to reduce randomness and uncertainty.
Besides, to better control the learning process of probabilistic
representations, a structured regularization is proposed to promote uniformity
across classes in the latent space. With the regularization term, SPC can
preserve the Gaussian structure of the latent code and achieve better coverage
of the hidden space with uniformly distributed classes. Experimental results on 12 natural
language understanding tasks demonstrate that our SPC effectively improves the
performance of pre-trained language models for classification and regression.
Extensive experiments show that SPC can enhance the generalization capability,
robustness to label noise, and clustering quality of output representations.
comment: 11 pages, accepted by AAAI 2024 (Oral)
♻ ☆ Adversarial Attacks and Defense for Conversation Entailment Task
As the deployment of NLP systems in critical applications grows, ensuring the
robustness of large language models (LLMs) against adversarial attacks becomes
increasingly important. Large language models excel in various NLP tasks but
remain vulnerable to low-cost adversarial attacks. Focusing on the domain of
conversation entailment, where multi-turn dialogues serve as premises to verify
hypotheses, we fine-tune a transformer model to accurately discern the
truthfulness of these hypotheses. Adversaries manipulate hypotheses through
synonym swapping, aiming to deceive the model into making incorrect
predictions. To counteract these attacks, we implemented innovative fine-tuning
techniques and introduced an embedding perturbation loss method to
significantly bolster the model's robustness. Our findings not only emphasize
the importance of defending against adversarial attacks in NLP but also
highlight the real-world implications, suggesting that enhancing model
robustness is critical for reliable NLP applications.
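One plausible reading of an embedding perturbation loss, sketched below:
perturb the input embeddings with small noise and penalize divergence
between clean and perturbed predictions, encouraging robustness to
synonym-style substitutions. The authors' exact formulation may differ.
```python
import torch
import torch.nn.functional as F

def perturbation_loss(model, input_ids, attention_mask, eps=0.01):
    """Consistency loss between clean and noise-perturbed embeddings."""
    emb = model.get_input_embeddings()(input_ids)
    clean = model(inputs_embeds=emb, attention_mask=attention_mask).logits
    noisy = model(inputs_embeds=emb + eps * torch.randn_like(emb),
                  attention_mask=attention_mask).logits
    return F.kl_div(F.log_softmax(noisy, dim=-1),
                    F.softmax(clean.detach(), dim=-1),
                    reduction="batchmean")
```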
♻ ☆ CIC: A framework for Culturally-aware Image Captioning IJCAI 2024
Image Captioning generates descriptive sentences from images using
Vision-Language Pre-trained models (VLPs) such as BLIP, which has improved
greatly. However, current methods lack the generation of detailed descriptive
captions for the cultural elements depicted in the images, such as the
traditional clothing worn by people from Asian cultural groups. In this paper,
we propose a new framework, \textbf{Culturally-aware Image Captioning (CIC)},
that generates captions and describes cultural elements extracted from cultural
visual elements in images representing cultures. Inspired by methods combining
visual modality and Large Language Models (LLMs) through appropriate prompts,
our framework (1) generates questions based on cultural categories from images,
(2) extracts cultural visual elements from Visual Question Answering (VQA)
using generated questions, and (3) generates culturally-aware captions using
LLMs with the prompts. Our human evaluation conducted on 45 participants from 4
different cultural groups with a high understanding of the corresponding
culture shows that our proposed framework generates more culturally descriptive
captions when compared to the image captioning baseline based on VLPs. Our code
and dataset will be made publicly available upon acceptance.
comment: Accepted in IJCAI 2024
♻ ☆ More Compute Is What You Need
Large language model pre-training has become increasingly expensive, with
most practitioners relying on scaling laws to allocate compute budgets for
model size and training tokens, commonly referred to as Compute-Optimal or
Chinchilla Optimal. In this paper, we hypothesize a new scaling law that
suggests model performance depends mostly on the amount of compute spent for
transformer-based models, independent of the specific allocation to model size
and dataset size. Using this unified scaling law, we predict that (a) for
inference efficiency, training should prioritize smaller model sizes and larger
training datasets, and (b) assuming the exhaustion of available web datasets,
scaling the model size might be the only way to further improve model
performance.
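One way to formalize the hypothesis (an illustrative rendering, not
necessarily the paper's exact functional form): with the common approximation
that training compute is C ≈ 6ND for N parameters and D tokens, the claim is
that loss depends on C alone rather than on N and D separately:
```latex
L(N, D) \approx f(C), \qquad C \approx 6ND,
\qquad f(C) = L_{\infty} + \frac{a}{C^{b}}
```
Under such a law, any (N, D) split with the same product yields the same
loss, which is what licenses trading model size for training tokens.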
♻ ☆ Large Language Models as Zero-shot Dialogue State Tracker through Function Calling
Zekun Li, Zhiyu Zoey Chen, Mike Ross, Patrick Huber, Seungwhan Moon, Zhaojiang Lin, Xin Luna Dong, Adithya Sagar, Xifeng Yan, Paul A. Crook
Large language models (LLMs) are increasingly prevalent in conversational
systems due to their advanced understanding and generative capabilities in
general contexts. However, their effectiveness in task-oriented dialogues
(TOD), which requires not only response generation but also effective dialogue
state tracking (DST) within specific tasks and domains, remains less
satisfying. In this work, we propose a novel approach FnCTOD for solving DST
with LLMs through function calling. This method improves zero-shot DST,
allowing adaptation to diverse domains without extensive data collection or
model tuning. Our experimental results demonstrate that our approach achieves
exceptional performance with both modestly sized open-source LLMs and
proprietary ones: with in-context prompting it enables various 7B or 13B
parameter models to surpass the previous state-of-the-art (SOTA) achieved by
ChatGPT, and improves ChatGPT's performance, beating the SOTA by 5.6% average
joint goal accuracy (JGA). Individual model results for GPT-3.5 and GPT-4 are
boosted by 4.8% and 14%, respectively. We also show that by fine-tuning on a
small collection of diverse task-oriented dialogues, we can equip modestly
sized models, specifically a 13B parameter LLaMA2-Chat model, with
function-calling capabilities and DST performance comparable to ChatGPT while
maintaining their chat capabilities. We have made the code publicly available
at https://github.com/facebookresearch/FnCTOD
comment: More results in the next version. Code available at:
https://github.com/facebookresearch/FnCTOD
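An illustrative function-calling schema for DST in this spirit: each domain
is exposed as a function whose arguments are the slots, so the LLM emits the
dialogue state as a function call. Slot names echo MultiWOZ conventions, but
this simplified schema is ours, not the paper's.
```python
FIND_RESTAURANT = {
    "name": "find_restaurant",
    "description": "Track the user's restaurant booking constraints.",
    "parameters": {
        "type": "object",
        "properties": {
            "area": {"type": "string",
                     "enum": ["north", "south", "centre", "east", "west"]},
            "food": {"type": "string", "description": "cuisine type"},
            "pricerange": {"type": "string",
                           "enum": ["cheap", "moderate", "expensive"]},
        },
    },
}

# The arguments of the predicted call, e.g. {"area": "centre",
# "food": "italian"}, are read off as the dialogue state for this turn.
```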
♻ ☆ A Primer on the Inner Workings of Transformer-based Language Models
The rapid progress of research aimed at interpreting the inner workings of
advanced language models has highlighted a need for contextualizing the
insights gained from years of work in this area. This primer provides a concise
technical introduction to the current techniques used to interpret the inner
workings of Transformer-based language models, focusing on the generative
decoder-only architecture. We conclude by presenting a comprehensive overview
of the known internal mechanisms implemented by these models, uncovering
connections across popular approaches and active research directions in this
area.
♻ ☆ AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
As large language models (LLMs) become increasingly prevalent and integrated
into autonomous systems, ensuring their safety is imperative. Despite
significant strides toward safety alignment, recent work
GCG (Zou et al., 2023) proposes a discrete token optimization algorithm
and selects the single suffix with the lowest loss to successfully jailbreak
aligned LLMs. In this work, we first discuss the drawbacks of solely picking
the suffix with the lowest loss during GCG optimization for jailbreaking and
uncover the missed successful suffixes during the intermediate steps. Moreover,
we utilize those successful suffixes as training data to learn a generative
model, named AmpleGCG, which captures the distribution of adversarial suffixes
given a harmful query and enables the rapid generation of hundreds of suffixes
for any harmful queries in seconds. AmpleGCG achieves near 100% attack success
rate (ASR) on two aligned LLMs (Llama-2-7B-chat and Vicuna-7B), surpassing two
strongest attack baselines. More interestingly, AmpleGCG also transfers
seamlessly to attack different models, including closed-source LLMs, achieving
a 99% ASR on the latest GPT-3.5. To summarize, our work amplifies the impact
of GCG by training a generative model of adversarial suffixes that is universal
to any harmful queries and transferable from attacking open-source LLMs to
closed-source LLMs. In addition, it can generate 200 adversarial suffixes for
one harmful query in only 4 seconds, rendering it more challenging to defend.
♻ ☆ OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari
The reproducibility and transparency of large language models are crucial for
advancing open research, ensuring the trustworthiness of results, and enabling
investigations into data and model biases, as well as potential risks. To this
end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a
layer-wise scaling strategy to efficiently allocate parameters within each
layer of the transformer model, leading to enhanced accuracy. For example, with
a parameter budget of approximately one billion parameters, OpenELM exhibits a
2.36% improvement in accuracy compared to OLMo while requiring 2x fewer
pre-training tokens.
Diverging from prior practices that only provide model weights and inference
code, and pre-train on private datasets, our release includes the complete
framework for training and evaluation of the language model on publicly
available datasets, including training logs, multiple checkpoints, and
pre-training configurations. We also release code to convert models to MLX
library for inference and fine-tuning on Apple devices. This comprehensive
release aims to empower and strengthen the open research community, paving the
way for future open research endeavors.
Our source code along with pre-trained model weights and training recipes is
available at https://github.com/apple/corenet. Additionally, OpenELM
models can be found on HuggingFace at:
https://huggingface.co/apple/OpenELM.
comment: Minor corrections