Computation and Language 48
☆ Predictive Insights into LGBTQ+ Minority Stress: A Transductive Exploration of Social Media Discourse
S. Chapagain, Y. Zhao, T. K. Rohleen, S. M. Hamdi, S. F. Boubrahimi, R. E. Flinn, E. M. Lund, D. Klooster, J. R. Scheer, C. J. Cascalheira
Individuals who identify as sexual and gender minorities, including lesbian,
gay, bisexual, transgender, queer, and others (LGBTQ+) are more likely to
experience poorer health than their heterosexual and cisgender counterparts.
One primary source that drives these health disparities is minority stress
(i.e., chronic and social stressors unique to LGBTQ+ communities' experiences
adapting to the dominant culture). This stress is frequently expressed in
LGBTQ+ users' posts on social media platforms. However, these expressions are
not just straightforward manifestations of minority stress. They involve
linguistic complexity (e.g., idiom or lexical diversity), rendering them
challenging for many traditional natural language processing methods to detect.
In this work, we designed a hybrid model using Graph Neural Networks (GNN) and
Bidirectional Encoder Representations from Transformers (BERT), a pre-trained
deep language model, to improve the classification performance of minority
stress detection. We experimented with our model on a benchmark social media
dataset for minority stress detection (LGBTQ+ MiSSoM+). The dataset comprises
5,789 human-annotated Reddit posts from LGBTQ+ subreddits. Our
approach enables the extraction of hidden linguistic nuances through
pretraining on a vast amount of raw data, while also engaging in transductive
learning to jointly develop representations for both labeled training data and
unlabeled test data. The RoBERTa-GCN model achieved an accuracy of 0.86 and an
F1 score of 0.86, surpassing the performance of other baseline models in
predicting LGBTQ+ minority stress. Improved prediction of minority stress
expressions on social media could lead to digital health interventions to
improve the wellbeing of LGBTQ+ people, a community with high rates of
stress-sensitive health problems.
comment: This paper is accepted in 2024 IEEE 11th International Conference on
Data Science and Advanced Analytics (DSAA)
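As a rough illustration of how transformer sentence embeddings can feed a transductive GCN over a document graph: the sketch below uses random stand-in features and a hypothetical adjacency, and is not the authors' RoBERTa-GCN implementation.

```python
# Minimal transductive GCN sketch: document nodes carry transformer-style
# embeddings; all nodes join message passing but only labeled nodes drive the loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCN(nn.Module):
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, num_classes)

    def forward(self, x, adj_norm):
        h = F.relu(self.w1(adj_norm @ x))   # one round of propagation + transform
        return self.w2(adj_norm @ h)        # raw class logits

def normalize_adjacency(adj):
    adj = adj + torch.eye(adj.size(0))                 # add self-loops
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)

# Toy data: 6 documents with 768-dim features standing in for RoBERTa [CLS] vectors.
num_docs, emb_dim = 6, 768
doc_emb = torch.randn(num_docs, emb_dim)
adj = (torch.rand(num_docs, num_docs) > 0.6).float()
adj = ((adj + adj.T) > 0).float()                      # symmetric document graph
adj_norm = normalize_adjacency(adj)

# Transductive setting: labeled and unlabeled nodes all participate in message
# passing, but the loss is computed only on the labeled training nodes.
labels = torch.tensor([0, 1, 0, 1, 0, 1])
train_mask = torch.tensor([True, True, True, True, False, False])

model = GCN(emb_dim, 128, num_classes=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(50):
    logits = model(doc_emb, adj_norm)
    loss = F.cross_entropy(logits[train_mask], labels[train_mask])
    opt.zero_grad(); loss.backward(); opt.step()
print(logits[~train_mask].argmax(dim=-1))   # predictions for the unlabeled nodes
```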
☆ Advancing Complex Medical Communication in Arabic with Sporo AraSum: Surpassing Existing Large Language Models
The increasing demand for multilingual capabilities in healthcare underscores
the need for AI models adept at processing diverse languages, particularly in
clinical documentation and decision-making. Arabic, with its complex
morphology, syntax, and diglossia, poses unique challenges for natural language
processing (NLP) in medical contexts. This case study evaluates Sporo AraSum, a
language model tailored for Arabic clinical documentation, against JAIS, the
leading Arabic NLP model. Using synthetic datasets and PDQI-9 metrics that we
modified to assess model performance in a different language, the study
evaluated the models' ability to summarize patient-physician interactions,
focusing on accuracy, comprehensiveness, clinical utility, and
linguistic-cultural competence.
Results indicate that Sporo AraSum significantly outperforms JAIS in
AI-centric quantitative metrics and all qualitative attributes measured in our
modified version of the PDQI-9. AraSum's architecture enables precise and
culturally sensitive documentation, addressing the linguistic nuances of Arabic
while mitigating risks of AI hallucinations. These findings suggest that Sporo
AraSum is better suited to meet the demands of Arabic-speaking healthcare
environments, offering a transformative solution for multilingual clinical
workflows. Future research should incorporate real-world data to further
validate these findings and explore broader integration into healthcare
systems.
comment: arXiv admin note: text overlap with arXiv:2411.06713
☆ Disentangling Memory and Reasoning Ability in Large Language Models
Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, Yongfeng Zhang
Large Language Models (LLMs) have demonstrated strong performance in handling
complex tasks requiring both extensive knowledge and reasoning abilities.
However, the existing LLM inference pipeline operates as an opaque process
without explicit separation between knowledge retrieval and reasoning steps,
making the model's decision-making process unclear and disorganized. This
ambiguity can lead to issues such as hallucinations and knowledge forgetting,
which significantly impact the reliability of LLMs in high-stakes domains. In
this paper, we propose a new inference paradigm that decomposes the complex
inference process into two distinct and clear actions: (1) memory recall, which
retrieves relevant knowledge, and (2) reasoning, which performs logical steps
based on the recalled knowledge. To facilitate this decomposition, we introduce
two special tokens, memory and reason, guiding the model to distinguish between
steps that require knowledge retrieval and those that involve reasoning. Our
experimental results show that this decomposition not only improves model
performance but also enhances the interpretability of the inference process,
enabling users to identify sources of error and refine model responses
effectively. The code is available at
https://github.com/MingyuJ666/Disentangling-Memory-and-Reasoning.
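A minimal sketch of how such control tokens might be registered and used; the token names, placeholder backbone, and trace format below are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch: register explicit "memory" and "reason" control tokens so generated
# traces can be split into recall vs. reasoning spans. Illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.add_special_tokens({"additional_special_tokens": ["<memory>", "<reason>"]})
model.resize_token_embeddings(len(tokenizer))  # new embedding rows for the new tokens

prompt = (
    "Question: Who wrote The Origin of Species?\n"
    "<memory> Charles Darwin published On the Origin of Species in 1859. "
    "<reason> The question asks for the author, so the answer is Charles Darwin."
)
ids = tokenizer(prompt, return_tensors="pt")

# After fine-tuning on traces in this format, spans between the control tokens
# can be attributed to knowledge recall or to reasoning when inspecting outputs.
out = model.generate(**ids, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
```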
☆ Utilizing Large Language Models to Synthesize Product Desirability Datasets
This research explores the application of large language models (LLMs) to
generate synthetic datasets for Product Desirability Toolkit (PDT) testing, a
key component in evaluating user sentiment and product experience. Utilizing
gpt-4o-mini, a cost-effective alternative to larger commercial LLMs, three
methods, Word+Review, Review+Word, and Supply-Word, were each used to
synthesize 1000 product reviews. The generated datasets were assessed for
sentiment alignment, textual diversity, and data generation cost. Results
demonstrated high sentiment alignment across all methods, with Pearson
correlations ranging from 0.93 to 0.97. Supply-Word exhibited the highest
diversity and coverage of PDT terms, although with increased generation costs.
Despite minor biases toward positive sentiment, LLM-generated synthetic data
offers significant advantages in situations with limited test data, including
scalability, cost savings, and flexibility in dataset production.
comment: 9 pages, 2 figures, 6 tables
☆ PatentEdits: Framing Patent Novelty as Textual Entailment
A patent must be deemed novel and non-obvious in order to be granted by the
US Patent Office (USPTO). If it is not, a US patent examiner will cite the
prior work, or prior art, that invalidates the novelty and issue a non-final
rejection. Predicting what claims of the invention should change given the
prior art is a crucial step in securing invention rights, yet
has not been studied before as a learnable task. In this work we introduce the
PatentEdits dataset, which contains 105K examples of successful revisions that
overcome objections to novelty. We design algorithms to label edits sentence by
sentence, then establish how well these edits can be predicted with large
language models (LLMs). We demonstrate that evaluating textual entailment
between cited references and draft sentences is especially effective in
predicting which inventive claims remained unchanged or were novel in relation
to prior art.
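A minimal sketch of scoring entailment between a cited prior-art passage and a draft claim sentence with an off-the-shelf NLI model; the model choice, example text, and interpretation are illustrative assumptions, not the PatentEdits pipeline.

```python
# Sketch: entailment between prior art (premise) and a draft claim (hypothesis).
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

prior_art = "The device uses a capacitive sensor to detect touch input on a flat panel."
draft_claim = "A display apparatus comprising a capacitive touch sensor on its surface."

# The text-classification pipeline accepts premise/hypothesis pairs as a dict.
result = nli({"text": prior_art, "text_pair": draft_claim})
print(result)  # e.g. a label such as ENTAILMENT with a score

# A high entailment score would suggest the claim is anticipated by the prior art
# and likely needs revision; low entailment or contradiction suggests novelty.
```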
☆ When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Extending context window sizes allows large language models (LLMs) to process
longer sequences and handle more complex tasks. Rotary Positional Embedding
(RoPE) has become the de facto standard due to its relative positional encoding
properties that benefit long-context training. However, we observe that using
RoPE with BFloat16 format results in numerical issues, causing it to deviate
from its intended relative positional encoding, especially in long-context
scenarios. This issue arises from BFloat16's limited precision and accumulates
as context length increases, with the first token contributing significantly to
this problem. To address this, we develop AnchorAttention, a plug-and-play
attention method that alleviates numerical issues caused by BFloat16, improves
long-context capabilities, and speeds up training. AnchorAttention reduces
unnecessary attention computations, maintains semantic coherence, and boosts
computational efficiency by treating the first token as a shared anchor with a
consistent position ID, making it visible to all documents within the training
context. Experiments on three types of LLMs demonstrate that AnchorAttention
significantly improves long-context performance and reduces training time by
over 50\% compared to standard full attention mechanisms, while preserving the
original LLM's capabilities on general tasks. Our code is available at
https://github.com/haonan3/AnchorContext.
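A minimal sketch of the shared-anchor idea for packed documents: the first token keeps a fixed position ID and is visible to every document, while cross-document attention is otherwise blocked. The packing layout and mask construction are assumptions for illustration, not the released AnchorContext implementation.

```python
# Sketch: position ids and attention mask with token 0 as a shared anchor.
import torch

doc_lengths = [4, 3, 5]          # three documents packed into one training sequence
seq_len = 1 + sum(doc_lengths)   # +1 for the shared anchor token at index 0

position_ids = torch.zeros(seq_len, dtype=torch.long)   # anchor stays at position 0
doc_id = torch.zeros(seq_len, dtype=torch.long)         # anchor belongs to no document
cursor = 1
for d, length in enumerate(doc_lengths):
    position_ids[cursor:cursor + length] = torch.arange(1, length + 1)
    doc_id[cursor:cursor + length] = d + 1
    cursor += length

# Causal mask restricted to within-document attention, plus a column that lets
# every token attend to the shared anchor.
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
same_doc = doc_id.unsqueeze(0) == doc_id.unsqueeze(1)
anchor_visible = torch.zeros(seq_len, seq_len, dtype=torch.bool)
anchor_visible[:, 0] = True
attention_mask = causal & (same_doc | anchor_visible)

print(position_ids)
print(attention_mask.int())
```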
☆ LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models
Salvatore Mario Carta, Stefano Chessa, Giulia Contu, Andrea Corriga, Andrea Deidda, Gianni Fenu, Luca Frigau, Alessandro Giuliani, Luca Grassi, Marco Manolo Manca, Mirko Marras, Francesco Mola, Bastianino Mossa, Piergiorgio Mura, Marco Ortu, Leonardo Piano, Simone Pisano, Alessia Pisu, Alessandro Sebastian Podda, Livio Pompianu, Simone Seu, Sandro Gabriele Tiddia
Minority languages are vital to preserving cultural heritage, yet they face
growing risks of extinction due to limited digital resources and the dominance
of artificial intelligence models trained on high-resource languages. This
white paper proposes a framework to generate linguistic tools for low-resource
languages, focusing on data creation to support the development of language
models that can aid in preservation efforts. Sardinian, an endangered language,
serves as the case study to demonstrate the framework's effectiveness. By
addressing the data scarcity that hinders intelligent applications for such
languages, we contribute to promoting linguistic diversity and support ongoing
efforts in language standardization and revitalization through modern
technologies.
☆ AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations NeurIPS 2024
State-of-the-art multimodal web agents, powered by Multimodal Large Language
Models (MLLMs), can autonomously execute many web tasks by processing user
instructions and interacting with graphical user interfaces (GUIs). Current
strategies for building web agents rely on (i) the generalizability of
underlying MLLMs and their steerability via prompting, and (ii) large-scale
fine-tuning of MLLMs on web-related tasks. However, web agents still struggle
to automate tasks on unseen websites and domains, limiting their applicability
to enterprise-specific and proprietary platforms. Beyond generalization from
large-scale pre-training and fine-tuning, we propose building agents for
few-shot adaptability using human demonstrations. We introduce the AdaptAgent
framework that enables both proprietary and open-weights multimodal web agents
to adapt to new websites and domains using a few human demonstrations (up to 2).
Our experiments on two popular benchmarks -- Mind2Web & VisualWebArena -- show
that using in-context demonstrations (for proprietary models) or
meta-adaptation demonstrations (for meta-learned open-weights models) boosts
task success rate by 3.36% to 7.21% over non-adapted state-of-the-art models,
corresponding to a relative increase of 21.03% to 65.75%. Furthermore, our
additional analyses (a) show the effectiveness of multimodal demonstrations
over text-only ones, (b) shed light on the influence of different data
selection strategies during meta-learning on the generalization of the agent,
and (c) demonstrate the effect of the number of few-shot examples on the web
agent's success rate. Overall, our results unlock a complementary axis for
developing widely applicable multimodal web agents beyond large-scale
pre-training and fine-tuning, emphasizing few-shot adaptability.
comment: 18 pages, 3 figures, an abridged version to appear in NeurIPS 2024
AFM Workshop
☆ WaterPark: A Robustness Assessment of Language Model Watermarking
To mitigate the misuse of large language models (LLMs), such as
disinformation, automated phishing, and academic cheating, there is a pressing
need for the capability of identifying LLM-generated texts. Watermarking
emerges as one promising solution: it plants statistical signals into LLMs'
generative processes and subsequently verifies whether LLMs produce given
texts. Various watermarking methods (``watermarkers'') have been proposed; yet,
due to the lack of unified evaluation platforms, many critical questions remain
under-explored: i) What are the strengths/limitations of various watermarkers,
especially their attack robustness? ii) How do various design choices impact
their robustness? iii) How to optimally operate watermarkers in adversarial
environments?
To fill this gap, we systematize existing LLM watermarkers and watermark
removal attacks, mapping out their design spaces. We then develop WaterPark, a
unified platform that integrates 10 state-of-the-art watermarkers and 12
representative attacks. More importantly, leveraging WaterPark, we conduct a
comprehensive assessment of existing watermarkers, unveiling the impact of
various design choices on their attack robustness. For instance, a
watermarker's resilience to increasingly intensive attacks hinges on its
context dependency. We further explore the best practices to operate
watermarkers in adversarial environments. For instance, using a generic
detector alongside a watermark-specific detector improves the security of
vulnerable watermarkers. We believe our study sheds light on current LLM
watermarking techniques while WaterPark serves as a valuable testbed to
facilitate future research.
comment: 22 pages
☆ CAFE: A Novel Code-Switching Dataset for Algerian Dialect, French, and English
Houssam Eddine-Othman Lachemat, Akli Abbas, Nourredine Oukas, Yassine El Kheir, Samia Haboussi, Absar Showdhury Shammur
The paper introduces and publicly releases (Data download link available
after acceptance) CAFE -- the first Code-switching dataset between Algerian
dialect, French, and English. The CAFE speech data is unique for (a) its
spontaneous speaking style in in vivo human-human conversation, capturing
phenomena like code-switching and overlapping speech; (b) its coverage of
distinct linguistic challenges in North African Arabic dialects; and (c) its
coverage of dialectal variations from various parts of Algeria within different
sociolinguistic contexts. CAFE contains approximately 37 hours of speech, with
a subset, CAFE-small, of 2 hours and 36 minutes released with manual human
annotation, including speech segmentation, transcription, explicit annotation
of code-switching points, overlapping speech, and other events such as noises
and laughter. The remaining approximately 34.58 hours contain pseudo-label
transcriptions. In addition to the data release, the paper also highlights the
challenges of using state-of-the-art Automatic Speech Recognition (ASR) models
such as Whisper large-v2, large-v3, and PromptingWhisper to handle such
content. We then benchmark CAFE with the aforementioned Whisper models and show
how well-designed data processing pipelines and advanced decoding techniques
can improve ASR performance, achieving a Mixed Error Rate (MER) of 0.310, a
Character Error Rate (CER) of 0.329, and a Word Error Rate (WER) of 0.538.
comment: 24 pages, submitted to TALLIP
☆ Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology SC
The language called Balti belongs to the Sino-Tibetan, specifically the
Tibeto-Burman language family. It is understood, with variations, across
populations in India, China, Pakistan, Nepal, Tibet, Burma, and Bhutan,
influenced by local cultures and producing various dialects. Considering these
diverse cultural, socio-political, religious, and geographical influences, it
is vital to step forward and unify the dialects on the basis of their common
roots, lexica, and phonological perspectives. In the era of globalization
and the increasingly frequent developments in AI technology, understanding the
diversity and the efforts of dialect unification is important to understanding
commonalities and shortening the gaps impacted by unavoidable circumstances.
This article analyzes and examines how artificial intelligence (AI), in the
form of Large Language Models (LLMs), can assist in analyzing, documenting,
and standardizing the endangered Balti Language, based on the efforts made in
different dialects so far.
comment: Accepted by IEEE conference ISCSLP 2024
☆ Transformer-Based Contextualized Language Models Joint with Neural Networks for Natural Language Inference in Vietnamese
Natural Language Inference (NLI) is a task within Natural Language Processing
(NLP) that holds value for various AI applications. However, there have been
limited studies on Natural Language Inference in Vietnamese that explore the
concept of joint models. Therefore, we conducted experiments using various
combinations of contextualized language models (CLM) and neural networks. We
use the CLM to create contextualized word representations and use Neural Networks for
classification. Furthermore, we have evaluated the strengths and weaknesses of
each joint model and identified the model failure points in the Vietnamese
context. The highest F1 score in this experiment reaches 82.78\% on the
benchmark dataset (ViNLI). Among the various models tested, the largest CLM is
XLM-R (355M). That combination has
consistently demonstrated superior performance compared to fine-tuning strong
pre-trained language models like PhoBERT (+6.58\%), mBERT (+19.08\%), and XLM-R
(+0.94\%) in terms of F1-score. This article aims to introduce a novel approach
or model that attains improved performance for Vietnamese NLI. Overall, we find
that the joint approach of CLM and neural networks is simple yet capable of
achieving high-quality performance, which makes it suitable for applications
that require efficient resource utilization.
☆ On the Way to LLM Personalization: Learning to Remember User Conversations
Large Language Models (LLMs) have quickly become invaluable assistants for a
variety of tasks. However, their effectiveness is constrained by their
ability to tailor responses to human preferences and behaviors via
personalization. Prior work in LLM personalization has largely focused on style
transfer or incorporating small factoids about the user, as knowledge injection
remains an open challenge. In this paper, we explore injecting knowledge of
prior conversations into LLMs to enable future work on less redundant,
personalized conversations. We identify two real-world constraints: (1)
conversations are sequential in time and must be treated as such during
training, and (2) per-user personalization is only viable in
parameter-efficient settings. To this end, we propose PLUM, a pipeline that
performs data augmentation to up-sample conversations as question-answer
pairs, which are then used to fine-tune a low-rank adaptation (LoRA) adapter
with a weighted cross-entropy loss. Even in this first exploration of the problem, we
perform competitively with baselines such as RAG, attaining an accuracy of
81.5% across 100 conversations.
comment: 16 pages, 6 tables, 3 figures
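A minimal sketch of a token-weighted cross-entropy loss of the kind described, up-weighting answer tokens relative to question/context tokens; the weighting values and shapes are illustrative assumptions, not PLUM's exact scheme.

```python
# Sketch: per-token weighted causal LM loss for QA pairs derived from conversations.
import torch
import torch.nn.functional as F

def weighted_causal_lm_loss(logits, labels, token_weights):
    """logits: (B, T, V); labels: (B, T); token_weights: (B, T)."""
    # Shift so each position predicts the next token, as in a standard causal LM loss.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    weights = token_weights[:, 1:]
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).view(labels.shape)
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)

# Toy usage: answer tokens get weight 1.0, question/context tokens weight 0.2.
B, T, V = 1, 8, 50
logits = torch.randn(B, T, V, requires_grad=True)
labels = torch.randint(0, V, (B, T))
token_weights = torch.tensor([[0.2, 0.2, 0.2, 0.2, 1.0, 1.0, 1.0, 1.0]])
loss = weighted_causal_lm_loss(logits, labels, token_weights)
loss.backward()
print(float(loss))
```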
☆ Executable QR codes with Machine Learning for Industrial Applications
Executable QR codes, also known as eQR codes or just sQRy, are a special kind
of QR codes that embed programs conceived to run on mobile devices like
smartphones. Since the program is directly encoded in binary form within the QR
code, it can be executed even when the reading device is not provided with
Internet access. The applications of this technology are manifold, and range
from smart user guides to advisory systems. The first programming language made
available for eQR is QRtree, which enables the implementation of decision trees
aimed, for example, at guiding the user in operating/maintaining a complex
machinery or for reaching a specific location.
In this work, an additional language is proposed, we term QRind, which was
specifically devised for Industry. It permits to integrate distinct
computational blocks into the QR code, e.g., machine learning models to enable
predictive maintenance and algorithms to ease machinery usage. QRind permits
the Industry 4.0/5.0 paradigms to be implemented, in part, also in those cases
where Internet is unavailable.
comment: preprint, 4 pages, 2024
☆ Fact-Level Confidence Calibration and Self-Correction
Confidence calibration in LLMs, i.e., aligning their self-assessed confidence
with the actual accuracy of their responses, enables them to self-evaluate the
correctness of their outputs. However, current calibration methods for LLMs
typically estimate two scalars to represent overall response confidence and
correctness, which is inadequate for long-form generation where the response
includes multiple atomic facts and may be partially confident and correct.
These methods also overlook the relevance of each fact to the query. To address
these challenges, we propose a Fact-Level Calibration framework that operates
at a finer granularity, calibrating confidence to relevance-weighted
correctness at the fact level. Furthermore, comprehensive analysis under the
framework inspired the development of Confidence-Guided Fact-level
Self-Correction (ConFix), which uses high-confidence facts within a
response as additional knowledge to improve low-confidence ones. Extensive
experiments across four datasets and six models demonstrate that ConFix
effectively mitigates hallucinations without requiring external knowledge
sources such as retrieval systems.
comment: Code is available at https://github.com/yuanyige/fact-calibration
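A minimal sketch of relevance-weighted, fact-level correctness and the resulting calibration gap for a single response; all facts, relevance scores, and confidences below are illustrative values, not the paper's formulation.

```python
# Sketch: aggregate fact-level confidence and correctness, weighted by relevance.
facts = [
    {"text": "Paris is the capital of France.", "confidence": 0.95, "correct": 1, "relevance": 0.9},
    {"text": "France joined the EU in 1993.",   "confidence": 0.70, "correct": 0, "relevance": 0.6},
    {"text": "The Seine flows through Paris.",  "confidence": 0.80, "correct": 1, "relevance": 0.3},
]

total_rel = sum(f["relevance"] for f in facts)
weighted_correctness = sum(f["relevance"] * f["correct"] for f in facts) / total_rel
weighted_confidence = sum(f["relevance"] * f["confidence"] for f in facts) / total_rel

print(f"relevance-weighted correctness: {weighted_correctness:.3f}")
print(f"relevance-weighted confidence:  {weighted_confidence:.3f}")
print(f"calibration gap:                {abs(weighted_confidence - weighted_correctness):.3f}")
```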
☆ Combining Autoregressive and Autoencoder Language Models for Text Classification
This paper presents CAALM-TC (Combining Autoregressive and Autoencoder
Language Models for Text Classification), a novel method that enhances text
classification by integrating autoregressive and autoencoder language models.
Autoregressive large language models such as OpenAI's GPT, Meta's Llama, or
Microsoft's Phi offer promising prospects for content analysis practitioners,
but they generally underperform supervised BERT-based models for text
classification. CAALM leverages autoregressive models to generate contextual
information based on input texts, which is then combined with the original text
and fed into an autoencoder model for classification. This hybrid approach
capitalizes on the extensive contextual knowledge of autoregressive models and
the efficient classification capabilities of autoencoders. Experimental results
on four benchmark datasets demonstrate that CAALM consistently outperforms
existing methods, particularly in tasks with smaller datasets and more abstract
classification objectives. The findings indicate that CAALM offers a scalable
and effective solution for automated content analysis in social science
research that minimizes sample size requirements.
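A minimal sketch of the generate-then-classify pipeline: an autoregressive model elaborates on the input, the elaboration is concatenated with the original text, and an encoder-based classifier makes the prediction. The model names, prompt, and stand-in sentiment classifier are placeholders, not CAALM's actual components.

```python
# Sketch: autoregressive context generation feeding an encoder classifier.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")            # stand-in AR model
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

text = "The new policy reduces subsidies for rooftop solar installations."
prompt = f"Briefly explain the topic and stance of this statement: {text}\n"
context = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]

# Feed the original text together with the generated context to the classifier.
combined = f"{text} [SEP] {context}"
print(classifier(combined[:512]))
```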
☆ VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Large multimodal models (LMMs) with advanced video analysis capabilities have
recently garnered significant attention. However, most evaluations rely on
traditional methods like multiple-choice questions in benchmarks such as
VideoMME and LongVideoBench, which often lack the depth needed to
capture the complex demands of real-world users. To address this limitation,
and given the prohibitive cost and slow pace of human annotation for video
tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS
Chatbot Arena's framework, designed to automatically assess LMMs' video
analysis abilities. VideoAutoArena utilizes user simulation to generate
open-ended, adaptive questions that rigorously assess model performance in
video understanding. The benchmark features an automated, scalable evaluation
framework, incorporating a modified ELO Rating System for fair and continuous
comparisons across multiple LMMs. To validate our automated judging system, we
construct a 'gold standard' using a carefully curated subset of human
annotations, demonstrating that our arena strongly aligns with human judgment
while maintaining scalability. Additionally, we introduce a fault-driven
evolution strategy, progressively increasing question complexity to push models
toward handling more challenging video analysis scenarios. Experimental results
demonstrate that VideoAutoArena effectively differentiates among
state-of-the-art LMMs, providing insights into model strengths and areas for
improvement. To further streamline our evaluation, we introduce VideoAutoBench
as an auxiliary benchmark, where human annotators label winners in a subset of
VideoAutoArena battles. We use GPT-4o as a judge to compare responses against
these human-validated answers. Together, VideoAutoArena and VideoAutoBench
offer a cost-effective and scalable framework for evaluating LMMs in
user-centric video analysis.
comment: Project Page: https://videoautoarena.github.io/
☆ Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL
Large Language Models (LLMs) exhibit impressive problem-solving skills across
many tasks, but they still underperform compared to humans in various
downstream applications, such as text-to-SQL. On the BIRD benchmark
leaderboard, human performance achieves an accuracy of 92.96\%, whereas the
top-performing method reaches only 72.39\%. Notably, these state-of-the-art
(SoTA) methods predominantly rely on in-context learning to simulate human-like
reasoning. However, they overlook a critical human skill: continual learning.
Inspired by the educational practice of maintaining mistake notebooks during
our formative years, we propose LPE-SQL (Leveraging Prior Experience: An
Expandable Auxiliary Knowledge Base for Text-to-SQL), a novel framework
designed to augment LLMs by enabling continual learning without requiring
parameter fine-tuning. LPE-SQL consists of four modules that i) retrieve
relevant entries, ii) generate SQL efficiently, iii) produce the final result
through a cross-consistency mechanism, and iv) log successful and failed tasks
along with their reasoning
processes or reflection-generated tips. Importantly, the core module of LPE-SQL
is the fourth one, while the other modules employ foundational methods,
allowing LPE-SQL to be easily integrated with SoTA technologies to further
enhance performance. Our experimental results demonstrate that this continual
learning approach yields substantial performance gains, with the smaller
Llama-3.1-70B model surpassing the performance of the larger
Llama-3.1-405B model using SoTA methods.
☆ BIPro: Zero-shot Chinese Poem Generation via Block Inverse Prompting Constrained Generation Framework
Recently, generative pre-trained models have made significant strides,
particularly highlighted by the release of ChatGPT and GPT-4, which exhibit
superior cross-domain capabilities. However, these models still face challenges
on constrained writing tasks like poem generation under open-domain titles. In
response to this challenge, we introduce Block Inverse Prompting (BIPro)
constrained generation framework. BIPro leverages two block inverse prompting
methods, revise and rewrite, that mimic the process of human text writing using
block generative models. It significantly improves the zero-shot generation
quality on the formidable constrained generation task of open-domain
traditional-form Chinese poem generation. Based on a less powerful block
generative model GLM-10B-Chinese, poems composed via BIPro without priming or
additional training outperform both the most advanced direct generative systems
like GPT-4 or GLM-4 and the best domain-specific systems such as Yusheng,
Shisanbai, or Baidu Poetry Helper in human evaluation by proficient poets.
Finally, BIPro considerably narrows the gap between AI-generated works and
short-listed human literary arts in another human evaluation, unveiling the
promising potential of block generative models in improving the quality of
constrained generation.
☆ AIDBench: A benchmark for evaluating the authorship identification capability of large language models
As large language models (LLMs) rapidly advance and integrate into daily
life, the privacy risks they pose are attracting increasing attention. We focus
on a specific privacy risk where LLMs may help identify the authorship of
anonymous texts, which challenges the effectiveness of anonymity in real-world
systems such as anonymous peer review systems. To investigate these risks, we
present AIDBench, a new benchmark that incorporates several author
identification datasets, including emails, blogs, reviews, articles, and
research papers. AIDBench utilizes two evaluation methods: one-to-one
authorship identification, which determines whether two texts are from the same
author; and one-to-many authorship identification, which, given a query text
and a list of candidate texts, identifies the candidate most likely written by
the same author as the query text. We also introduce a Retrieval-Augmented
Generation (RAG)-based method to enhance the large-scale authorship
identification capabilities of LLMs, particularly when input lengths exceed the
models' context windows, thereby establishing a new baseline for authorship
identification using LLMs. Our experiments with AIDBench demonstrate that LLMs
can correctly guess authorship at rates well above random chance, revealing new
privacy risks posed by these powerful models. The source code and data will be
made publicly available after acceptance.
comment: 21 pages, 7 figures
☆ Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM
Jiawei Yu, Yuang Li, Xiaosong Qiao, Huan Zhao, Xiaofeng Zhao, Wei Tang, Min Zhang, Hao Yang, Jinsong Su
Text-to-speech (TTS) models have been widely adopted to enhance automatic
speech recognition (ASR) systems using text-only corpora, thereby reducing the
cost of labeling real speech data. Existing research primarily utilizes
additional text data and predefined speech styles supported by TTS models. In
this paper, we propose Hard-Synth, a novel ASR data augmentation method that
leverages large language models (LLMs) and advanced zero-shot TTS. Our approach
employs LLMs to generate diverse in-domain text through rewriting, without
relying on additional text data. Rather than using predefined speech styles, we
introduce a hard prompt selection method with zero-shot TTS to clone speech
styles that the ASR model finds challenging to recognize. Experiments
demonstrate that Hard-Synth significantly enhances the Conformer model,
achieving relative word error rate (WER) reductions of 6.5\%/4.4\% on
LibriSpeech dev/test-other subsets. Additionally, we show that Hard-Synth is
data-efficient and capable of reducing bias in ASR.
☆ Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
Efficient inference in large language models (LLMs) has become a critical
focus as their scale and complexity grow. Traditional autoregressive decoding,
while effective, suffers from computational inefficiencies due to its
sequential token generation process. Speculative decoding addresses this
bottleneck by introducing a two-stage framework: drafting and verification. A
smaller, efficient model generates a preliminary draft, which is then refined
by a larger, more sophisticated model. This paper provides a comprehensive
survey of speculative decoding methods, categorizing them into draft-centric
and model-centric approaches. We discuss key ideas associated with each method,
highlighting their potential for scaling LLM inference. This survey aims to
guide future research in optimizing speculative decoding and its integration
into real-world LLM applications.
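A minimal sketch of the draft-then-verify loop with greedy acceptance, which simplifies the probabilistic accept/reject rule of true speculative sampling; the drafter and verifier models are placeholders, not taken from any surveyed method.

```python
# Sketch: a small model drafts k tokens, the large model verifies them in one pass,
# and the longest agreeing prefix is accepted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")   # small, fast drafter
target = AutoModelForCausalLM.from_pretrained("gpt2")        # larger verifier

@torch.no_grad()
def speculative_step(ids, k=4):
    # 1) The drafter proposes k tokens greedily.
    draft_ids = draft.generate(ids, max_new_tokens=k, do_sample=False,
                               pad_token_id=tok.eos_token_id)
    proposed = draft_ids[:, ids.shape[1]:]
    # 2) The target model scores all proposals in a single forward pass.
    logits = target(draft_ids).logits
    preds = logits[:, ids.shape[1] - 1:-1, :].argmax(dim=-1)  # target's greedy picks
    # 3) Accept the longest prefix on which drafter and target agree.
    agree = (preds == proposed)[0].long()
    n_accept = int(agree.cumprod(dim=0).sum())
    if n_accept == 0:
        return torch.cat([ids, preds[:, :1]], dim=1)  # fall back to target's own token
    return torch.cat([ids, proposed[:, :n_accept]], dim=1)

ids = tok("The capital of France is", return_tensors="pt").input_ids
for _ in range(3):
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```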
☆ Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control
Lyrics generation presents unique challenges, particularly in achieving
precise syllable control while adhering to song form structures such as verses
and choruses. Conventional line-by-line approaches often lead to unnatural
phrasing, underscoring the need for more granular syllable management. We
propose a framework for lyrics generation that enables multi-level syllable
control at the word, phrase, line, and paragraph levels, aware of song form.
Our approach generates complete lyrics conditioned on input text and song form,
ensuring alignment with specified syllable constraints. Generated lyrics
samples are available at: https://tinyurl.com/lyrics9999
☆ Patience Is The Key to Large Language Model Reasoning
Recent advancements in the field of large language models, particularly
through the Chain of Thought (CoT) approach, have demonstrated significant
improvements in solving complex problems. However, existing models either tend
to sacrifice detailed reasoning for brevity due to user preferences, or require
extensive and expensive training data to learn complicated reasoning ability,
limiting their potential in solving complex tasks. To bridge this gap,
following the concept of test-time scaling, we propose a simple method that
encourages models to adopt a more patient reasoning style without the need to
introduce new knowledge or skills. Using a preference optimization
approach, we generate detailed reasoning processes as positive examples and
simple answers as negative examples, thereby training the model to favor
thoroughness in its responses. Our results demonstrate a performance increase
of up to 6.7% on GSM8k with training just on a lightweight dataset.
comment: The dataset and model are available at
https://huggingface.co/datasets/yuyijiong/patient-math-cot
☆ Explainable LLM-driven Multi-dimensional Distillation for E-Commerce Relevance Learning WWW 2025
Effective query-item relevance modeling is pivotal for enhancing user
experience and safeguarding user satisfaction in e-commerce search systems.
Recently, benefiting from their vast inherent knowledge, Large Language Model
(LLM) approaches have demonstrated strong performance and long-tail generalization
ability compared with previous neural-based specialized relevance learning
methods. Though promising, current LLM-based methods encounter the following
inadequacies in practice: First, their massive parameters and computational
demands make them difficult to deploy online. Second, distilling LLM models
to online models is a feasible direction, but the LLM relevance modeling is a
black box, and its rich intrinsic knowledge is difficult to extract and apply
online. To improve the interpretability of LLM and boost the performance of
online relevance models via LLM, we propose an Explainable LLM-driven
Multi-dimensional Distillation framework for e-commerce relevance learning,
which comprises two core components: (1) An Explainable LLM for relevance
modeling (ELLM-rele), which decomposes the relevance learning into intermediate
steps and models relevance learning as a Chain-of-Thought (CoT) reasoning,
thereby enhancing both interpretability and performance of LLM. (2) A
Multi-dimensional Knowledge Distillation (MKD) architecture that transfers the
knowledge of ELLM-rele to current deployable interaction-based and
representation-based student models from both the relevance score distribution
and CoT reasoning aspects. Through distilling the probabilistic and CoT
reasoning knowledge, MKD improves both the semantic interaction and long-tail
generalization abilities of student models. Extensive offline evaluations and
online experiments on Taobao search ad scene demonstrate that our proposed
framework significantly enhances e-commerce relevance learning performance and
user experience.
comment: Submitted to WWW 2025
☆ Breaking the Cycle of Recurring Failures: Applying Generative AI to Root Cause Analysis in Legacy Banking Systems
Traditional banks face significant challenges in digital transformation,
primarily due to legacy system constraints and fragmented ownership. Recent
incidents show that such fragmentation often results in superficial incident
resolutions, leaving root causes unaddressed and causing recurring failures. We
introduce a novel approach to post-incident analysis, integrating
knowledge-based GenAI agents with the "Five Whys" technique to examine problem
descriptions and change request data. This method uncovered that approximately
70% of the incidents previously attributed to management or vendor failures
were due to underlying internal code issues. We present a case study to show
the impact of our method. By scanning over 5,000 projects, we identified over
400 files with a similar root cause. Overall, we leverage the knowledge-based
agents to automate and elevate root cause analysis, transforming it into a more
proactive process. These agents can be applied across other phases of the
software development lifecycle, further improving development processes.
☆ LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
As large language models (LLMs) show impressive performance on complex tasks,
they still struggle with longer contextual understanding and high computational
costs. To balance efficiency and quality, we introduce LLMSteer, a
fine-tuning-free framework that enhances LLMs through query-independent
attention steering. Tested on popular LLMs and datasets, LLMSteer narrows the
performance gap with baselines by 65.9% and reduces the runtime delay by up to
4.8x compared to recent attention steering methods.
☆ MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers NeurIPS2024
In order to reduce the computational complexity of large language models,
great efforts have been made to improve the efficiency of transformer models
such as linear attention and flash-attention. However, the model size and
corresponding computational complexity are constantly scaled up in pursuit of
higher performance. In this work, we present MemoryFormer, a novel transformer
architecture which significantly reduces the computational complexity (FLOPs)
from a new perspective. We eliminate nearly all the computations of the
transformer model except for the necessary computation required by the
multi-head attention operation. This is made possible by utilizing an
alternative method for feature transformation to replace the linear projection
of fully-connected layers. Specifically, we first construct a group of
in-memory lookup tables that store a large amount of discrete vectors to
replace the weight matrix used in linear projection. We then use a hash
algorithm to retrieve a correlated subset of vectors dynamically based on the
input embedding. The retrieved vectors are combined to form the output
embedding, which provides an estimation of the result of the matrix multiplication
operation in a fully-connected layer. Compared to conducting matrix
multiplication, retrieving data blocks from memory is a much cheaper operation
that requires little computation. We train MemoryFormer from scratch and
conduct extensive experiments on various benchmarks to demonstrate the
effectiveness of the proposed model.
comment: NeurIPS2024
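A minimal sketch of replacing a linear projection with hash-based table lookups, using a simple random-hyperplane hash; the table sizes and hashing scheme are illustrative assumptions, not the MemoryFormer design.

```python
# Sketch: bucket each input with LSH-style hashes, then sum the retrieved vectors
# from learnable tables instead of performing a dense matmul with a weight matrix.
import torch
import torch.nn as nn

class HashedLookupProjection(nn.Module):
    def __init__(self, in_dim, out_dim, num_tables=4, bits=8):
        super().__init__()
        self.num_tables = num_tables
        # Fixed random hyperplanes used only to compute bucket indices.
        self.register_buffer("planes", torch.randn(num_tables, bits, in_dim))
        self.register_buffer("powers", 2 ** torch.arange(bits))
        # Learnable tables: one output vector per bucket per table.
        self.tables = nn.Parameter(torch.randn(num_tables, 2 ** bits, out_dim) * 0.02)

    def forward(self, x):                            # x: (..., in_dim)
        # The sign of the projection onto each hyperplane gives one bit of the index.
        proj = torch.einsum("...d,tbd->...tb", x, self.planes)
        bit_codes = (proj > 0).long()                # (..., num_tables, bits)
        idx = (bit_codes * self.powers).sum(dim=-1)  # (..., num_tables)
        out = 0
        for t in range(self.num_tables):
            out = out + self.tables[t][idx[..., t]]  # gather one vector per table
        return out

layer = HashedLookupProjection(in_dim=64, out_dim=32)
x = torch.randn(2, 10, 64)
print(layer(x).shape)   # torch.Size([2, 10, 32])
```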
☆ Training Bilingual LMs with Data Constraints in the Targeted Language
Large language models are trained on massive scrapes of the web, as required
by current scaling laws. Most progress is made for English, given its abundance
of high-quality pretraining data. For most other languages, however, such high
quality pretraining data is unavailable. In this work, we study how to boost
pretrained model performance in a data constrained target language by enlisting
data from an auxiliary language for which high quality data is available. We
study this by quantifying the performance gap between training with data in a
data-rich auxiliary language compared with training in the target language,
exploring the benefits of translation systems, studying the limitations of
model scaling for data constrained languages, and proposing new methods for
upsampling data from the auxiliary language. Our results show that stronger
auxiliary datasets result in performance gains without modification to the
model or training objective for close languages, and, in particular, that
performance gains due to the development of more information-rich English
pretraining datasets can extend to targeted language settings with limited
data.
comment: 22 pages, 14 figures, 15 tables
☆ MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning
Contemporary embodied agents, such as Voyager in Minecraft, have demonstrated
promising capabilities in open-ended individual learning. However, when powered
with open large language models (LLMs), these agents often struggle with
rudimentary tasks, even when fine-tuned on domain-specific knowledge. Inspired
by human cultural learning, we present MindForge, a novel framework that
enhances Voyager with lifelong collaborative learning through explicit
perspective-taking. MindForge introduces three key innovations: (1) theory
of mind representations linking percepts, beliefs, desires, and actions; (2)
natural language communication between agents; and (3) semantic memory of task
and environment knowledge and episodic memory of collaboration episodes. These
advancements enable agents to reason about their and others' mental states,
empirically addressing two prevalent failure modes: false beliefs and faulty
task executions. In mixed-expertise Minecraft experiments, MindForge
agents outperform Voyager counterparts, significantly improving task completion
rate by $66.6\% (+39.4\%)$ for collecting one block of dirt and $70.8\%
(+20.8\%)$ for collecting one wood block. They exhibit emergent behaviors like
knowledge transfer from expert to novice agents and collaborative code
correction. MindForge agents also demonstrate the ability to adapt to
out-of-distribution tasks by using their previous experiences and beliefs
obtained through collaboration. In this open-ended social learning paradigm,
MindForge paves the way for the democratic development of embodied AI,
where agents learn in deployment from both peer and environmental feedback.
☆ A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Large Language Models are prone to off-topic misuse, where users may prompt
these models to perform tasks beyond their intended scope. Current guardrails,
which often rely on curated examples or custom classifiers, suffer from high
false-positive rates, limited adaptability, and the impracticality of requiring
real-world data that is not available in pre-production. In this paper, we
introduce a flexible, data-free guardrail development methodology that
addresses these challenges. By thoroughly defining the problem space
qualitatively and passing this to an LLM to generate diverse prompts, we
construct a synthetic dataset to benchmark and train off-topic guardrails that
outperform heuristic approaches. Additionally, by framing the task as
classifying whether the user prompt is relevant with respect to the system
prompt, our guardrails effectively generalize to other misuse categories,
including jailbreak and harmful prompts. Lastly, we further contribute to the
field by open-sourcing both the synthetic dataset and the off-topic guardrail
models, providing valuable resources for developing guardrails in
pre-production environments and supporting future research and development in
LLM safety.
comment: 8 pages, 5 figures
♻ ☆ Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks
Computational models of syntax are predominantly text-based. Here we propose
that the most basic first step in the evolution of syntax can be modeled
directly from raw speech in a fully unsupervised way. We focus on one of the
most ubiquitous and elementary suboperations of syntax -- concatenation. We
introduce spontaneous concatenation: a phenomenon where convolutional neural
networks (CNNs) trained on acoustic recordings of individual words start
generating outputs with two or even three words concatenated without ever
accessing data with multiple words in the input. We replicate this finding in
several independently trained models with different hyperparameters and
training data. Additionally, networks trained on two words learn to embed words
into novel unobserved word combinations. We also show that the concatenated
outputs contain precursors to compositionality. To our knowledge, this is a
previously unreported property of CNNs trained in the ciwGAN/fiwGAN setting on
raw speech and has implications both for our understanding of how these
architectures learn as well as for modeling syntax and its evolution in the
brain from raw acoustic inputs. We also propose a potential neural mechanism
called disinhibition that outlines a possible neural pathway towards
concatenation and compositionality and suggests our modeling is useful for
generating testable predictions for biological and artificial neural processing
of speech.
♻ ☆ From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui
One of the most striking findings in modern research on large language models
(LLMs) is that scaling up compute during training leads to better results.
However, less attention has been given to the benefits of scaling compute
during inference. This survey focuses on these inference-time approaches. We
explore three areas under a unified mathematical formalism: token-level
generation algorithms, meta-generation algorithms, and efficient generation.
Token-level generation algorithms, often called decoding algorithms, operate by
sampling a single token at a time or constructing a token-level search space
and then selecting an output. These methods typically assume access to a
language model's logits, next-token distributions, or probability scores.
Meta-generation algorithms work on partial or full sequences, incorporating
domain knowledge, enabling backtracking, and integrating external information.
Efficient generation methods aim to reduce token costs and improve the speed of
generation. Our survey unifies perspectives from three research communities:
traditional natural language processing, modern LLMs, and machine learning
systems.
♻ ☆ When Context Leads but Parametric Memory Follows in Large Language Models EMNLP 2024
Large language models (LLMs) have demonstrated remarkable progress in
leveraging diverse knowledge sources. This study investigates how nine widely
used LLMs allocate knowledge between local context and global parameters when
answering open-ended questions in knowledge-consistent scenarios. We introduce
a novel dataset, WikiAtomic, and systematically vary context sizes to analyze
how LLMs prioritize and utilize the provided information and their parametric
knowledge in knowledge-consistent scenarios. Additionally, we study their
tendency to hallucinate under varying context sizes. Our findings reveal
consistent patterns across models, including a consistent reliance on both
contextual (around 70%) and parametric (around 30%) knowledge, and a decrease
in hallucinations with increasing context. These insights highlight the
importance of more effective context organization and developing models that
use input more deterministically for robust performance.
comment: Accepted by EMNLP 2024 Main Conference
♻ ☆ Neuron Patching: Semantic-based Neuron-level Language Model Repair for Code Generation
Language Models (LMs) have become widely used in software engineering,
especially for tasks such as code generation, where they are referred to as
code LMs. These models have proven effective in generating code, making it
easier for developers to automate coding activities. However, research has
highlighted a significant limitation: despite their effectiveness, LMs often
produce code that is incorrect, buggy, or not fully functional. Updating these
models with limited data can be prohibitively challenging, yet it is essential
to maximize their utility. This may require hot-fix techniques (updating models
with limited data) to resolve. In this paper, we propose \ul{M}odel
\ul{I}mprovement via \ul{N}euron \ul{T}argeting (\textsc{MINT}), a novel
approach for repairing code LMs. MINT leverages the semantic property of
language models to perform neuron-level repairs in a novel way. Further, by
analyzing the relationships between the model's latent representations, the
incorrect outputs, and the desired outputs, MINT determines which
neurons are worth updating. This approach ensures that only the neurons crucial
to the model's failure are targeted, avoiding unnecessary changes and allowing
for a more efficient and precise repair process. MINT is effective,
efficient, and reliable, capable of correcting a neural model by patching a
minimum number of neurons (usually one or two neurons). Our approach is
evaluated on three coding tasks: line-level code generation, shellcode
generation, and intent-to-bash translation. The experimental results
demonstrate that the proposed approach significantly outperforms the
state-of-the-art in both effectiveness and efficiency measures. In addition, we
analyze and discuss the side effects of model repair techniques, including the
balance between generalization and specificity, and the performance after
multiple repairs in succession.
comment: 13 pages, 7 figures, 7 tables, under peer-review
♻ ☆ Predicting User Intents and Musical Attributes from Music Discovery Conversations
Intent classification is a text understanding task that identifies user needs
from input text queries. While intent classification has been extensively
studied in various domains, it has not received much attention in the music
domain. In this paper, we investigate intent classification models for music
discovery conversation, focusing on pre-trained language models. Rather than
only predicting functional needs (intent classification), we also include a task
for classifying musical needs (musical attribute classification). Additionally,
we propose a method of concatenating previous chat history with just
single-turn user queries in the input text, allowing the model to understand
the overall conversation context better. Our proposed model significantly
improves the F1 score for both user intent and musical attribute
classification, and surpasses the zero-shot and few-shot performance of the
pretrained Llama 3 model.
comment: 8 pages, 4 figures
♻ ☆ Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
In this paper, we focus on monolithic Multimodal Large Language Models
(MLLMs) that integrate visual encoding and language decoding into a single LLM.
In particular, we identify that existing pre-training strategies for monolithic
MLLMs often suffer from unstable optimization or catastrophic forgetting. To
address this issue, our core idea is to embed a new visual parameter space into
a pre-trained LLM, thereby stably learning visual knowledge from noisy data
while freezing the LLM. Based on this principle, we present Mono-InternVL, a
novel monolithic MLLM that seamlessly integrates a set of visual experts via a
multimodal mixture-of-experts structure. Moreover, we propose an innovative
pre-training strategy to maximize the visual capability of Mono-InternVL,
namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed
as a progressive learning process for visual experts, which aims to fully
exploit the visual knowledge from noisy data to high-quality data. To validate
our approach, we conduct extensive experiments on 16 benchmarks. Experimental
results confirm the superior performance of Mono-InternVL over existing
monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3
on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5,
Mono-InternVL still retains comparable multimodal performance while reducing up
to 67% first token latency. Code and model are released at
https://huggingface.co/OpenGVLab/Mono-InternVL-2B.
♻ ☆ TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs NeurIPS 2024
Zhuofeng Li, Zixing Gou, Xiangnan Zhang, Zhongyuan Liu, Sirui Li, Yuntong Hu, Chen Ling, Zheng Zhang, Liang Zhao
Text-Attributed Graphs (TAGs) augment graph structures with natural language
descriptions, facilitating detailed depictions of data and their
interconnections across various real-world settings. However, existing TAG
datasets predominantly feature textual information only at the nodes, with
edges typically represented by mere binary or categorical attributes. This lack
of rich textual edge annotations significantly limits the exploration of
contextual relationships between entities, hindering deeper insights into
graph-structured data. To address this gap, we introduce Textual-Edge Graphs
Datasets and Benchmark (TEG-DB), a comprehensive and diverse collection of
benchmark textual-edge datasets featuring rich textual descriptions on nodes
and edges. The TEG-DB datasets are large-scale and encompass a wide range of
domains, from citation networks to social networks. In addition, we conduct
extensive benchmark experiments on TEG-DB to assess the extent to which current
techniques, including pre-trained language models, graph neural networks, and
their combinations, can utilize textual node and edge information. Our goal is
to elicit advancements in textual-edge graph research, specifically in
developing methodologies that exploit rich textual node and edge descriptions
to enhance graph analysis and provide deeper insights into complex real-world
networks. The entire TEG-DB project is publicly available as an open-source
repository on GitHub at
https://github.com/Zhuofeng-Li/TEG-Benchmark.
comment: Accepted by NeurIPS 2024
♻ ☆ Neon: News Entity-Interaction Extraction for Enhanced Question Answering
Capturing fresh information in near real-time and using it to augment
existing large language models (LLMs) is essential to generate up-to-date,
grounded, and reliable output. This problem becomes particularly challenging
when LLMs are used for informational tasks in rapidly evolving fields, such as
Web search related to recent or unfolding events involving entities, where
generating temporally relevant responses requires access to up-to-the-hour news
sources. However, the information modeled by the parametric memory of LLMs is
often outdated, and Web results from prototypical retrieval systems may fail to
capture the latest relevant information and struggle to handle conflicting
reports in evolving news. To address this challenge, we present the NEON
framework, designed to extract emerging entity interactions -- such as events
or activities -- as described in news articles. NEON constructs an
entity-centric timestamped knowledge graph that captures such interactions,
thereby facilitating enhanced QA capabilities related to news events. Our
framework innovates by integrating open Information Extraction (openIE) style
tuples into LLMs to enable in-context retrieval-augmented generation. This
integration demonstrates substantial improvements in QA performance when
tackling temporal, entity-centric search queries. Through NEON, LLMs can
deliver more accurate, reliable, and up-to-date responses.
♻ ☆ Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models NeurIPS 2024
Bowen Ping, Shuo Wang, Hanqing Wang, Xu Han, Yuzhuang Xu, Yukun Yan, Yun Chen, Baobao Chang, Zhiyuan Liu, Maosong Sun
Fine-tuning is a crucial process for adapting large language models (LLMs) to
diverse applications. In certain scenarios, such as multi-tenant serving,
deploying multiple LLMs becomes necessary to meet complex demands. Recent
studies suggest decomposing a fine-tuned LLM into a base model and
corresponding delta weights, which are then compressed using low-rank or
low-bit approaches to reduce costs. In this work, we observe that existing
low-rank and low-bit compression methods can significantly harm the model
performance for task-specific fine-tuned LLMs (e.g., WizardMath for math
problems). Motivated by the long-tail distribution of singular values in the
delta weights, we propose a delta quantization approach using mixed-precision.
This method employs higher-bit representation for singular vectors
corresponding to larger singular values. We evaluate our approach on various
fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs.
Experimental results demonstrate that our approach performs comparably to full
fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a
considerable margin. Additionally, we show that our method is compatible with
various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its
generalizability.
comment: NeurIPS 2024
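As a rough illustration of the mixed-precision idea, the sketch below decomposes a delta weight matrix by SVD and spends more bits on the components with the largest singular values. The bit allocation and the uniform quantizer are simplified placeholders, not the paper's actual scheme.

```python
# Illustrative sketch of mixed-precision delta compression: higher-bit
# representation for the head of the singular-value spectrum, aggressive
# low-bit quantization for the long tail.
import torch

def quantize(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Naive symmetric uniform quantization (placeholder)."""
    if x.numel() == 0:
        return x
    scale = x.abs().max() / (2 ** (n_bits - 1) - 1)
    return torch.round(x / scale) * scale

def compress_delta(w_finetuned, w_base, high_rank=64, high_bits=8, low_bits=2):
    delta = w_finetuned - w_base
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    # Components with the largest singular values get more bits...
    head = quantize(U[:, :high_rank] * S[:high_rank], high_bits) @ Vh[:high_rank]
    # ...while the remaining tail is quantized aggressively.
    tail = quantize(U[:, high_rank:] * S[high_rank:], low_bits) @ Vh[high_rank:]
    return head + tail  # approximate reconstruction of the delta weights
```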
♻ ☆ SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Enhanced Code Generation
Large language models demonstrate exceptional performance in simple code
generation tasks but still face challenges in tackling complex problems. These
challenges may stem from insufficient reasoning and problem decomposition
capabilities. To address this issue, we propose a reasoning-augmented data
generation process, SRA-MCTS, which guides the model to autonomously generate
high-quality intermediate reasoning paths. This creates a positive feedback
loop, enabling continuous improvement. Our method operates entirely through the
model itself without requiring additional supervision. By synthesizing natural
language reasoning paths and translating them into executable code, the
approach ensures analytical accuracy and enhances the success rate in solving
complex tasks. Experimental results show that, even without additional
supervisory signals, our method achieves performance improvements across
different model scales, demonstrating the significant potential of
self-improvement in small models. Furthermore, the method remains robust when
traditional Chain-of-Thought (CoT) approaches exhibit performance degradation,
with notable improvements observed in diversity metrics such as pass@10. We
encourage further exploration of reasoning processes within training data to
enhance the ability of language models to address complex problems.
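The loop below is a loose sketch of the self-driven data-generation idea (propose a reasoning path, translate it to code, keep only verified solutions). `llm_generate` and `run_tests` are hypothetical placeholders, and the paper's MCTS-guided search over reasoning steps is considerably more involved.

```python
# Loose sketch of reasoning-augmented data synthesis: only solutions that pass
# the problem's tests are kept, forming a positive feedback loop of training data.
def synthesize_training_data(problems, llm_generate, run_tests, samples_per_problem=8):
    new_examples = []
    for problem in problems:
        for _ in range(samples_per_problem):
            # 1. Ask the model for a natural-language reasoning path (plan).
            plan = llm_generate(f"Decompose and reason about:\n{problem['statement']}")
            # 2. Translate the plan into executable code.
            code = llm_generate(f"Implement this plan as code:\n{plan}")
            # 3. Keep only verified solutions for later fine-tuning.
            if run_tests(code, problem["tests"]):
                new_examples.append({"problem": problem["statement"],
                                     "reasoning": plan, "solution": code})
    return new_examples
```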
♻ ☆ SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models
In this paper, we propose Singular Values and Orthonormal Regularized
Singular Vectors Adaptation (SORSA), a novel parameter-efficient fine-tuning
(PEFT) method. Each SORSA adapter
consists of two main parts: trainable principal singular weights $W_p = U_p
\text{diag}(S_p) V^\top_p$, and frozen residual weights $W_r = U_r
\text{diag}(S_r) V^\top_r$. These parts are initialized by performing singular
value decomposition (SVD) on the pre-trained weights. Moreover, we implement
and analyze an orthonormal regularizer, which we prove decreases the condition
number of $W_p$ and makes the optimization more efficient. SORSA adapters can
be merged back into the weights for inference, thus eliminating any inference
latency. We also introduce an SVD-based method for analyzing how the
parameters change during adaptation and show that SORSA is superior in
minimizing deviation from the pre-trained singular values and vectors.
Finally, SORSA converges faster than LoRA and PiSSA in our experiments. On
the GSM-8K benchmark, Llama 2 7B adapted using SORSA
achieved 56.03% accuracy, surpassing LoRA (42.30%), AdaLoRA (47.30%), Full FT
(49.05%), and PiSSA (53.07%). On the MATH benchmark, SORSA achieved 10.36%
accuracy, outperforming LoRA (5.50%), AdaLoRA (6.48%), Full FT (7.22%), and
PiSSA (7.44%). We conclude that SORSA offers a new perspective on
parameter-efficient fine-tuning, demonstrating remarkable performance.
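A minimal PyTorch sketch of a SORSA-style adapter following the decomposition above: the top singular components form the trainable part, the remainder is frozen, and an orthonormal penalty regularizes U_p and V_p. The rank and the exact regularizer form are illustrative, not the authors' reference implementation.

```python
# SORSA-style adapter sketch: trainable principal SVD components plus a frozen
# residual, with an orthonormality penalty on the principal singular vectors.
import torch
import torch.nn as nn

class SORSALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 16):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Trainable principal part W_p = U_p diag(S_p) V_p^T (top-`rank` components).
        self.U_p = nn.Parameter(U[:, :rank].clone())
        self.S_p = nn.Parameter(S[:rank].clone())
        self.V_p = nn.Parameter(Vh[:rank].clone())
        # Frozen residual part W_r = U_r diag(S_r) V_r^T (remaining components).
        self.register_buffer("W_r", (U[:, rank:] * S[rank:]) @ Vh[rank:])

    def weight(self) -> torch.Tensor:
        # Merged weight; after training this can be folded back into the model,
        # so no extra inference latency is incurred.
        return (self.U_p * self.S_p) @ self.V_p + self.W_r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight().T

    def orthonormal_penalty(self) -> torch.Tensor:
        """Encourage U_p (columns) and V_p (rows) to stay orthonormal."""
        eye = torch.eye(self.U_p.shape[1], device=self.U_p.device)
        return ((self.U_p.T @ self.U_p - eye) ** 2).sum() + \
               ((self.V_p @ self.V_p.T - eye) ** 2).sum()
```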
♻ ☆ Beyond Isolation: Multi-Agent Synergy for Improving Knowledge Graph Construction
This paper introduces CooperKGC, a novel framework challenging the
conventional solitary approach of large language models (LLMs) in knowledge
graph construction (KGC). CooperKGC establishes a collaborative processing
network, assembling a team capable of concurrently addressing entity, relation,
and event extraction tasks. Experimentation demonstrates that fostering
collaboration within CooperKGC enhances knowledge selection, correction, and
aggregation capabilities across multiple rounds of interactions.
comment: Accepted by CCKS 2024; Best English Paper candidate
♻ ☆ Rich Semantic Knowledge Enhanced Large Language Models for Few-shot Chinese Spell Checking ACL 2024
Chinese Spell Checking (CSC) is a widely used technology that plays a vital
role in speech-to-text (STT) and optical character recognition (OCR). Most
existing CSC approaches rely on the BERT architecture and achieve excellent
performance. However, limited by the scale of the foundation model, BERT-based
methods do not work well in few-shot scenarios, which restricts their
practical applicability. In this paper, we explore an in-context learning
method named RS-LLM (Rich Semantic based LLMs) that introduces large language
models (LLMs) as the foundation model. We also study the impact of
incorporating various forms of Chinese rich semantic information into our
framework. We find that by introducing a small number of specific Chinese
rich semantic structures, LLMs outperform the BERT-based model on the
few-shot CSC task. Furthermore, experiments on multiple datasets verify the
superiority of our proposed framework.
comment: This paper is accepted by Findings of the Association for
Computational Linguistics: ACL 2024
♻ ☆ Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods
Large language model unlearning aims to remove harmful information that LLMs
have learnt, in order to prevent their use for malicious purposes. LLMU and RMU have been
proposed as two methods for LLM unlearning, achieving impressive results on
unlearning benchmarks. We study in detail the efficacy of these methods by
evaluating their impact on general model capabilities on the WMDP benchmark as
well as a biology benchmark we create. Our experiments show that RMU generally
leads to better preservation of model capabilities, for similar or better
unlearning. We further test the robustness of these methods and find that
applying 5-shot prompting or simply rephrasing the question can lead to a more
than ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that
training on unrelated data can almost completely recover pre-unlearning
performance, demonstrating that these methods fail at truly unlearning. The
code is available at: https://github.com/JaiDoshi/Knowledge-Erasure.
comment: 9 pages, 2 figures
♻ ☆ Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models NeurIPS 2024
Large language models (LLMs) have rapidly advanced and demonstrated
impressive capabilities. In-Context Learning (ICL) and Parameter-Efficient
Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to
downstream tasks. ICL typically constructs a few-shot learning scenario, either
manually or by setting up a Retrieval-Augmented Generation (RAG) system,
helping models quickly grasp domain knowledge or question-answering patterns
without changing model parameters. However, this approach involves trade-offs,
such as slower inference speed and increased memory footprint. PEFT adapts the
model to tasks through minimal parameter modifications, but the training
process still imposes substantial hardware requirements, even when only a
small number of parameters is involved. To address these challenges, we propose
Reference Trustable Decoding (RTD), a paradigm that allows models to quickly
adapt to new tasks without fine-tuning, maintaining low inference costs. RTD
constructs a reference datastore from the provided training examples and
optimizes the LLM's final vocabulary distribution by flexibly selecting
suitable references based on the input, resulting in more trustable responses
and enabling the model to adapt to downstream tasks at a low cost. Experimental
evaluations on various LLMs using different benchmarks demonstrate that RTD
establishes a new paradigm for augmenting models to downstream tasks.
Furthermore, our method exhibits strong orthogonality with traditional methods,
allowing for concurrent usage. Our code can be found at
https://github.com/ShiLuohe/ReferenceTrustableDecoding
comment: Accepted by the Thirty-Eighth Annual Conference on Neural Information
Processing Systems (NeurIPS 2024)
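The sketch below illustrates the general pattern of building a reference datastore from the provided training examples and interpolating the model's output distribution with references retrieved for the current input. It assumes a Hugging Face-style causal LM, and the retrieval and weighting rule shown are illustrative rather than RTD's exact formulation.

```python
# Hedged sketch of datastore-augmented decoding in the spirit of RTD: store
# (hidden state, next token) pairs from reference examples, then mix the LLM's
# vocabulary distribution with one induced by the nearest references.
import torch
import torch.nn.functional as F

def build_datastore(model, tokenizer, examples):
    """Store (hidden-state, next-token-id) pairs from the provided examples."""
    keys, values = [], []
    for text in examples:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            hidden = model(ids, output_hidden_states=True).hidden_states[-1][0]
        keys.append(hidden[:-1])     # context representation at each position
        values.append(ids[0, 1:])    # the token that actually followed
    return torch.cat(keys), torch.cat(values)

def rtd_like_distribution(model_logits, query_hidden, keys, values,
                          vocab_size, k=8, alpha=0.3, temperature=1.0):
    """Interpolate the LLM's distribution with one induced by nearest references."""
    dists = torch.cdist(query_hidden.unsqueeze(0), keys).squeeze(0)  # (num_refs,)
    top = torch.topk(-dists, k)                       # k closest references
    weights = F.softmax(top.values / temperature, dim=-1)
    ref_probs = torch.zeros(vocab_size)
    ref_probs.index_add_(0, values[top.indices], weights)
    return (1 - alpha) * F.softmax(model_logits, dim=-1) + alpha * ref_probs
```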
♻ ☆ Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
Large Language Models (LLMs), epitomized by ChatGPT's release in late 2022,
have revolutionized various industries with their advanced language
comprehension. However, their efficiency is challenged by the Transformer
architecture's struggle with handling long texts. KV Cache has emerged as a
pivotal solution to this issue, converting the time complexity of token
generation from quadratic to linear, albeit with increased GPU memory overhead
proportional to conversation length. In response, industry and academia have
proposed a variety of KV Cache compression methods. In this
review, we dissect the various properties of KV Cache and elaborate on various
methods currently used to optimize the KV Cache space usage of LLMs. These
methods span the pre-training phase, deployment phase, and inference phase, and
we summarize the commonalities and differences among these methods.
Additionally, we list some metrics for evaluating the long-text capabilities of
large language models, from both efficiency and capability perspectives. Our
review thus sheds light on the evolving landscape of LLM optimization, offering
insights into future advancements in this dynamic field. Links to the papers
mentioned in this review can be found in our GitHub repository:
https://github.com/zcli-charlie/Awesome-KV-Cache.
comment: Published on the First Conference on Language Modeling (COLM 2024)
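For readers new to the mechanism, the toy single-head attention below shows why a KV Cache makes per-token generation linear: keys and values for past tokens are cached and reused, so each new token attends over the cache instead of re-encoding the whole prefix. The growing cache is exactly the memory overhead that the surveyed compression methods target. This is a pedagogical sketch (no batching, masking, or multiple heads), not any particular model's implementation.

```python
# Minimal single-head attention with a KV Cache: O(t) work per generated token
# instead of O(t^2) from re-processing the full prefix at every step.
import torch
import torch.nn.functional as F

class TinyKVCacheAttention:
    def __init__(self, d_model: int):
        self.Wq = torch.randn(d_model, d_model) / d_model ** 0.5
        self.Wk = torch.randn(d_model, d_model) / d_model ** 0.5
        self.Wv = torch.randn(d_model, d_model) / d_model ** 0.5
        self.k_cache = []   # grows linearly with sequence length:
        self.v_cache = []   # this memory growth is what compression targets

    def step(self, x_t: torch.Tensor) -> torch.Tensor:
        """Process ONE new token embedding x_t of shape (d_model,)."""
        q = x_t @ self.Wq
        self.k_cache.append(x_t @ self.Wk)
        self.v_cache.append(x_t @ self.Wv)
        K = torch.stack(self.k_cache)                        # (t, d_model)
        V = torch.stack(self.v_cache)                        # (t, d_model)
        attn = F.softmax(K @ q / K.shape[-1] ** 0.5, dim=0)  # (t,)
        return attn @ V                                      # attended output
```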
♻ ☆ Demystifying Large Language Models for Medicine: A Primer
Qiao Jin, Nicholas Wan, Robert Leaman, Shubo Tian, Zhizheng Wang, Yifan Yang, Zifeng Wang, Guangzhi Xiong, Po-Ting Lai, Qingqing Zhu, Benjamin Hou, Maame Sarfo-Gyamfi, Gongbo Zhang, Aidan Gilson, Balu Bhasuran, Zhe He, Aidong Zhang, Jimeng Sun, Chunhua Weng, Ronald M. Summers, Qingyu Chen, Yifan Peng, Zhiyong Lu
Large language models (LLMs) represent a transformative class of AI tools
capable of revolutionizing various aspects of healthcare by generating
human-like responses across diverse contexts and adapting to novel tasks
following human instructions. Their potential application spans a broad range
of medical tasks, such as clinical documentation, matching patients to clinical
trials, and answering medical questions. In this primer paper, we propose an
actionable guideline to help healthcare professionals more efficiently utilize
LLMs in their work, along with a set of best practices. This approach consists
of several main phases, including formulating the task, choosing LLMs, prompt
engineering, fine-tuning, and deployment. We begin by discussing critical
considerations in identifying healthcare tasks that align with the core
capabilities of LLMs and in selecting models based on the task, the data,
performance requirements, and the model interface. We then review the
strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs
to specialized medical tasks. Deployment considerations, including regulatory
compliance, ethical guidelines, and continuous monitoring for fairness and
bias, are also discussed. By providing a structured step-by-step methodology,
this tutorial aims to equip healthcare professionals with the tools necessary
to effectively integrate LLMs into clinical practice, ensuring that these
powerful technologies are applied in a safe, reliable, and impactful manner.
comment: Under review