arXiv daily

Computation and Language (cs.CL)

Wed, 24 May 2023

Other arXiv digests in this category:Thu, 14 Sep 2023; Wed, 13 Sep 2023; Tue, 12 Sep 2023; Mon, 11 Sep 2023; Fri, 08 Sep 2023; Tue, 05 Sep 2023; Fri, 01 Sep 2023; Thu, 31 Aug 2023; Wed, 30 Aug 2023; Tue, 29 Aug 2023; Mon, 28 Aug 2023; Fri, 25 Aug 2023; Thu, 24 Aug 2023; Wed, 23 Aug 2023; Tue, 22 Aug 2023; Mon, 21 Aug 2023; Fri, 18 Aug 2023; Thu, 17 Aug 2023; Wed, 16 Aug 2023; Tue, 15 Aug 2023; Mon, 14 Aug 2023; Fri, 11 Aug 2023; Thu, 10 Aug 2023; Wed, 09 Aug 2023; Tue, 08 Aug 2023; Mon, 07 Aug 2023; Fri, 04 Aug 2023; Thu, 03 Aug 2023; Wed, 02 Aug 2023; Tue, 01 Aug 2023; Mon, 31 Jul 2023; Fri, 28 Jul 2023; Thu, 27 Jul 2023; Wed, 26 Jul 2023; Tue, 25 Jul 2023; Mon, 24 Jul 2023; Fri, 21 Jul 2023; Thu, 20 Jul 2023; Wed, 19 Jul 2023; Tue, 18 Jul 2023; Mon, 17 Jul 2023; Fri, 14 Jul 2023; Thu, 13 Jul 2023; Wed, 12 Jul 2023; Tue, 11 Jul 2023; Mon, 10 Jul 2023; Fri, 07 Jul 2023; Thu, 06 Jul 2023; Wed, 05 Jul 2023; Tue, 04 Jul 2023; Mon, 03 Jul 2023; Fri, 30 Jun 2023; Thu, 29 Jun 2023; Wed, 28 Jun 2023; Tue, 27 Jun 2023; Mon, 26 Jun 2023; Fri, 23 Jun 2023; Thu, 22 Jun 2023; Wed, 21 Jun 2023; Tue, 20 Jun 2023; Fri, 16 Jun 2023; Thu, 15 Jun 2023; Tue, 13 Jun 2023; Mon, 12 Jun 2023; Fri, 09 Jun 2023; Thu, 08 Jun 2023; Wed, 07 Jun 2023; Tue, 06 Jun 2023; Mon, 05 Jun 2023; Fri, 02 Jun 2023; Thu, 01 Jun 2023; Wed, 31 May 2023; Tue, 30 May 2023; Mon, 29 May 2023; Fri, 26 May 2023; Thu, 25 May 2023; Tue, 23 May 2023; Mon, 22 May 2023; Fri, 19 May 2023; Thu, 18 May 2023; Wed, 17 May 2023; Tue, 16 May 2023; Mon, 15 May 2023; Fri, 12 May 2023; Thu, 11 May 2023; Wed, 10 May 2023; Tue, 09 May 2023; Mon, 08 May 2023; Fri, 05 May 2023; Thu, 04 May 2023; Wed, 03 May 2023; Tue, 02 May 2023; Mon, 01 May 2023; Fri, 28 Apr 2023; Thu, 27 Apr 2023; Wed, 26 Apr 2023; Tue, 25 Apr 2023; Mon, 24 Apr 2023; Fri, 21 Apr 2023; Thu, 20 Apr 2023; Wed, 19 Apr 2023; Tue, 18 Apr 2023; Mon, 17 Apr 2023; Fri, 14 Apr 2023; Thu, 13 Apr 2023; Wed, 12 Apr 2023; Tue, 11 Apr 2023; Mon, 10 Apr 2023
1.I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

Authors:Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, Marianna Apidianaki, Smaranda Muresan

Abstract: Visual metaphors are powerful rhetorical devices used to persuade or communicate creative ideas through images. Similar to linguistic metaphors, they convey meaning implicitly through symbolism and juxtaposition of the symbols. We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models, such as DALL$\cdot$E 2, since it requires the ability to model implicit meaning and compositionality. We propose to solve the task through the collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates text that represents a visual elaboration of the linguistic metaphor containing the implicit meaning and relevant objects, which is then used as input to the diffusion-based text-to-image models.Using a human-AI collaboration framework, where humans interact both with the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations. Evaluation by professional illustrators shows the promise of LLM-Diffusion Model collaboration for this task.To evaluate the utility of our Human-AI collaboration framework and the quality of our dataset, we perform both an intrinsic human-based evaluation and an extrinsic evaluation using visual entailment as a downstream task.

2.AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes

Authors:Barry Menglong Yao, Yu Chen, Qifan Wang, Sijia Wang, Minqian Liu, Zhiyang Xu, Licheng Yu, Lifu Huang

Abstract: We propose attribute-aware multimodal entity linking, where the input is a mention described with a text and image, and the goal is to predict the corresponding target entity from a multimodal knowledge base (KB) where each entity is also described with a text description, a visual image and a set of attributes and values. To support this research, we construct AMELI, a large-scale dataset consisting of 18,472 reviews and 35,598 products. To establish baseline performance on AMELI, we experiment with the current state-of-the-art multimodal entity linking approaches and our enhanced attribute-aware model and demonstrate the importance of incorporating the attribute information into the entity linking process. To be best of our knowledge, we are the first to build benchmark dataset and solutions for the attribute-aware multimodal entity linking task. Datasets and codes will be made publicly available.

3.In-Context Demonstration Selection with Cross Entropy Difference

Authors:Dan Iter, Reid Pryzant, Ruochen Xu, Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu

Abstract: Large language models (LLMs) can use in-context demonstrations to improve performance on zero-shot tasks. However, selecting the best in-context examples is challenging because model performance can vary widely depending on the selected examples. We present a cross-entropy difference (CED) method for selecting in-context demonstrations. Our method is based on the observation that the effectiveness of in-context demonstrations negatively correlates with the perplexity of the test example by a language model that was finetuned on that demonstration. We utilize parameter efficient finetuning to train small models on training data that are used for computing the cross-entropy difference between a test example and every candidate in-context demonstration. This metric is used to rank and select in-context demonstrations independently for each test input. We evaluate our method on a mix-domain dataset that combines 8 benchmarks, representing 4 text generation tasks, showing that CED for in-context demonstration selection can improve performance for a variety of LLMs.

4.SenteCon: Leveraging Lexicons to Learn Human-Interpretable Language Representations

Authors:Victoria Lin, Louis-Philippe Morency

Abstract: Although deep language representations have become the dominant form of language featurization in recent years, in many settings it is important to understand a model's decision-making process. This necessitates not only an interpretable model but also interpretable features. In particular, language must be featurized in a way that is interpretable while still characterizing the original text well. We present SenteCon, a method for introducing human interpretability in deep language representations. Given a passage of text, SenteCon encodes the text as a layer of interpretable categories in which each dimension corresponds to the relevance of a specific category. Our empirical evaluations indicate that encoding language with SenteCon provides high-level interpretability at little to no cost to predictive performance on downstream tasks. Moreover, we find that SenteCon outperforms existing interpretable language representations with respect to both its downstream performance and its agreement with human characterizations of the text.

5.Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation

Authors:Bashar Alhafni, Go Inoue, Christian Khairallah, Nizar Habash

Abstract: Grammatical error correction (GEC) is a well-explored problem in English with many existing models and datasets. However, research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC by using two newly developed Transformer-based pretrained sequence-to-sequence models. We address the task of multi-class Arabic grammatical error detection (GED) and present the first results on multi-class Arabic GED. We show that using GED information as auxiliary input in GEC models improves GEC performance across three datasets spanning different genres. Moreover, we also investigate the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve state-of-the-art results on two Arabic GEC shared tasks datasets and establish a strong benchmark on a newly created dataset.

6.Centering the Margins: Outlier-Based Identification of Harmed Populations in Toxicity Detection

Authors:Vyoma Raman, Eve Fleisig, Dan Klein

Abstract: A standard method for measuring the impacts of AI on marginalized communities is to determine performance discrepancies between specified demographic groups. These approaches aim to address harms toward vulnerable groups, but they obscure harm patterns faced by intersectional subgroups or shared across demographic groups. We instead operationalize "the margins" as data points that are statistical outliers due to having demographic attributes distant from the "norm" and measure harms toward these outliers. We propose a Group-Based Performance Disparity Index (GPDI) that measures the extent to which a subdivision of a dataset into subgroups identifies those facing increased harms. We apply our approach to detecting disparities in toxicity detection and find that text targeting outliers is 28% to 86% more toxic for all types of toxicity examined. We also discover that model performance is consistently worse for demographic outliers, with disparities in error between outliers and non-outliers ranging from 28% to 71% across toxicity types. Our outlier-based analysis has comparable or higher GPDI than traditional subgroup-based analyses, suggesting that outlier analysis enhances identification of subgroups facing greater harms. Finally, we find that minoritized racial and religious groups are most associated with outliers, which suggests that outlier analysis is particularly beneficial for identifying harms against those groups.

7.Trusting Your Evidence: Hallucinate Less with Context-aware Decoding

Authors:Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, Scott Wen-tau Yih

Abstract: Language models (LMs) often struggle to pay enough attention to the input context, and generate texts that are unfaithful or contain hallucinations. To mitigate this issue, we present context-aware decoding (CAD), which follows a contrastive output distribution that amplifies the difference between the output probabilities when a model is used with and without context. Our experiments show that CAD, without additional training, significantly improves the faithfulness of different LM families, including OPT, GPT, LLaMA and FLAN-T5 for summarization tasks (e.g., 14.3% gain for LLaMA in factuality metrics). Furthermore, CAD is particularly effective in overriding a model's prior knowledge when it contradicts the provided context, leading to substantial improvements in tasks where resolving the knowledge conflict is essential.

8.Mastering the ABCDs of Complex Questions: Answer-Based Claim Decomposition for Fine-grained Self-Evaluation

Authors:Nishant Balepur, Jie Huang, Samraj Moorjani, Hari Sundaram, Kevin Chen-Chuan Chang

Abstract: When answering complex questions, large language models (LLMs) may produce answers that do not satisfy all criteria of the question. While existing self-evaluation techniques aim to detect if such answers are correct, these techniques are unable to determine which criteria of the question are satisfied by the generated answers. To address this issue, we propose answer-based claim decomposition (ABCD), a prompting strategy that decomposes questions into a series of true/false claims that can be used to verify which criteria of the input question an answer satisfies. Using the decomposed ABCD claims, we perform fine-grained self-evaluation. Through preliminary experiments on three datasets, including a newly-collected challenge dataset ObscureQA, we find that GPT-3.5 has some ability to determine to what extent its answer satisfies the criteria of the input question, and can give insights into the errors and knowledge gaps of the model.

9.DialogVCS: Robust Natural Language Understanding in Dialogue System Upgrade

Authors:Zefan Cai, Xin Zheng, Tianyu Liu, Xu Wang, Haoran Meng, Jiaqi Han, Gang Yuan, Binghuai Lin, Baobao Chang, Yunbo Cao

Abstract: In the constant updates of the product dialogue systems, we need to retrain the natural language understanding (NLU) model as new data from the real users would be merged into the existent data accumulated in the last updates. Within the newly added data, new intents would emerge and might have semantic entanglement with the existing intents, e.g. new intents that are semantically too specific or generic are actually subset or superset of some existing intents in the semantic space, thus impairing the robustness of the NLU model. As the first attempt to solve this problem, we setup a new benchmark consisting of 4 Dialogue Version Control dataSets (DialogVCS). We formulate the intent detection with imperfect data in the system update as a multi-label classification task with positive but unlabeled intents, which asks the models to recognize all the proper intents, including the ones with semantic entanglement, in the inference. We also propose comprehensive baseline models and conduct in-depth analyses for the benchmark, showing that the semantically entangled intents can be effectively recognized with an automatic workflow.

10.Don't Take This Out of Context! On the Need for Contextual Models and Evaluations for Stylistic Rewriting

Authors:Akhila Yerukola, Xuhui Zhou, Maarten Sap

Abstract: Most existing stylistic text rewriting methods operate on a sentence level, but ignoring the broader context of the text can lead to generic, ambiguous, and incoherent rewrites. In this paper, we propose the integration of preceding textual context into both the rewriting and evaluation stages of stylistic text rewriting, focusing on formality, toxicity, and sentiment transfer tasks. We conduct a comparative evaluation of rewriting through few-shot prompting of GPT-3.5 and GPT NeoX, comparing non-contextual rewrites to contextual rewrites. Our experiments show that humans often prefer contextual rewrites over non-contextual ones, but automatic metrics (e.g., BLEU, sBERT) do not. To bridge this gap, we propose context-infused versions of common automatic metrics, and show that these better reflect human preferences. Overall, our paper highlights the importance of integrating preceding textual context into both the rewriting and evaluation stages of stylistic text rewriting.

11.Human-Centered Metrics for Dialog System Evaluation

Authors:Salvatore Giorgi, Shreya Havaldar, Farhan Ahmed, Zuhaib Akhtar, Shalaka Vaidya, Gary Pan, Lyle H. Ungar, H. Andrew Schwartz, Joao Sedoc

Abstract: We present metrics for evaluating dialog systems through a psychologically-grounded "human" lens: conversational agents express a diversity of both states (short-term factors like emotions) and traits (longer-term factors like personality) just as people do. These interpretable metrics consist of five measures from established psychology constructs that can be applied both across dialogs and on turns within dialogs: emotional entropy, linguistic style and emotion matching, as well as agreeableness and empathy. We compare these human metrics against 6 state-of-the-art automatic metrics (e.g. BARTScore and BLEURT) on 7 standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate the proposed human metrics offer novel information, are uncorrelated with automatic metrics, and lead to increased accuracy beyond existing automatic metrics for predicting crowd-sourced dialog judgements. The interpretability and unique signal of our proposed human-centered framework make it a valuable tool for evaluating and improving dialog systems.

12.Bi-Drop: Generalizable Fine-tuning for Pre-trained Language Models via Adaptive Subnetwork Optimization

Authors:Shoujie Tong, Heming Xia, Damai Dai, Tianyu Liu, Binghuai Lin, Yunbo Cao, Zhifang Sui

Abstract: Pretrained language models have achieved remarkable success in a variety of natural language understanding tasks. Nevertheless, finetuning large pretrained models on downstream tasks is susceptible to overfitting if the training set is limited, which will lead to diminished performance. In this work, we propose a dynamic fine-tuning strategy for pretrained language models called Bi-Drop. It utilizes the gradient information of various sub-models generated by dropout to update the model parameters selectively. Experiments on the GLUE benchmark show that Bi-Drop outperforms previous fine-tuning methods by a considerable margin, and exhibits consistent superiority over vanilla fine-tuning across various pretrained models. Furthermore, empirical results indicate that Bi-Drop yields substantial improvements in the multiple task or domain transfer, data imbalance, and low-resource scenarios, demonstrating superb generalization ability and robustness.

13.UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning

Authors:Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, Shafiq Joty

Abstract: Charts are very popular for analyzing data, visualizing key insights and answering complex reasoning questions about data. To facilitate chart-based data analysis using natural language, several downstream tasks have been introduced recently such as chart question answering and chart summarization. However, most of the methods that solve these tasks use pretraining on language or vision-language tasks that do not attempt to explicitly model the structure of the charts (e.g., how data is visually encoded and how chart elements are related to each other). To address this, we first build a large corpus of charts covering a wide variety of topics and visual styles. We then present UniChart, a pretrained model for chart comprehension and reasoning. UniChart encodes the relevant text, data, and visual elements of charts and then uses a chart-grounded text decoder to generate the expected output in natural language. We propose several chart-specific pretraining tasks that include: (i) low-level tasks to extract the visual elements (e.g., bars, lines) and data from charts, and (ii) high-level tasks to acquire chart understanding and reasoning skills. We find that pretraining the model on a large corpus with chart-specific low- and high-level tasks followed by finetuning on three down-streaming tasks results in state-of-the-art performance on three downstream tasks.

14.Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models

Authors:Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz

Abstract: The escalating debate on AI's capabilities warrants developing reliable metrics to assess machine "intelligence". Recently, many anecdotal examples were used to suggest that newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM); however, prior work reached conflicting conclusions regarding those abilities. We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust. We further examine the factors impacting performance on N-ToM tasks and discover that LLMs struggle with adversarial examples, indicating reliance on shallow heuristics rather than robust ToM abilities. We caution against drawing conclusions from anecdotal examples, limited benchmark testing, and using human-designed psychological tests to evaluate models.

15.BeamSearchQA: Large Language Models are Strong Zero-Shot QA Solver

Authors:Hao Sun, Xiao Liu, Yeyun Gong, Yan Zhang, Nan Duan

Abstract: Open-domain question answering is a crucial task that often requires accessing external information. Existing methods typically adopt a single-turn retrieve-then-read approach, where relevant documents are first retrieved, and questions are then answered based on the retrieved information. However, there are cases where answering a question requires implicit knowledge that is not directly retrievable from the question itself. In this work, we propose a novel question-answering pipeline called eamSearchQA. Our approach leverages large language models(LLMs) to iteratively generate new questions about the original question, enabling an iterative reasoning process. By iteratively refining and expanding the scope of the question, our method aims to capture and utilize hidden knowledge that may not be directly obtainable through retrieval. We evaluate our approach on the widely-used open-domain NQ and WebQ datasets. The experimental results demonstrate that BeamSearchQA significantly outperforms other zero-shot baselines, indicating its effectiveness in tackling the challenges of open-domain question answering.

16.Using Natural Language Explanations to Rescale Human Judgments

Authors:Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett

Abstract: The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over the judgments of multiple crowdworkers. However, different annotators may have different interpretations of labeling schemes unless given extensive training, and for subjective NLP tasks, even trained expert annotators can diverge heavily. We show that these nuances can be captured by high quality natural language explanations, and propose a method to rescale ordinal annotation in the presence of disagreement using LLMs. Specifically, we feed Likert ratings and corresponding natural language explanations into an LLM and prompt it to produce a numeric score. This score should reflect the underlying assessment of the example by the annotator. The presence of explanations allows the LLM to homogenize ratings across annotators in spite of scale usage differences. We explore our technique in the context of a document-grounded question answering task on which large language models achieve near-human performance. Among questions where annotators identify incompleteness in the answers, our rescaling improves correlation between nearly all annotator pairs, improving pairwise correlation on these examples by an average of 0.2 Kendall's tau.

17.SSD-2: Scaling and Inference-time Fusion of Diffusion Language Models

Authors:Xiaochuang Han, Sachin Kumar, Yulia Tsvetkov, Marjan Ghazvininejad

Abstract: Diffusion-based language models (LMs) have been shown to be competent generative models that are easy to control at inference and are a promising alternative to autoregressive LMs. While autoregressive LMs have benefited immensely from scaling and instruction-based learning, existing studies on diffusion LMs have been conducted on a relatively smaller scale. Starting with a recently proposed diffusion model SSD-LM, in this work we explore methods to scale it from 0.4B to 13B parameters, proposing several techniques to improve its training and inference efficiency. We call the new model SSD-2. We further show that this model can be easily finetuned to follow instructions. Finally, leveraging diffusion models' capability at inference-time control, we show that SSD-2 facilitates novel ensembles with 100x smaller models that can be customized and deployed by individual users. We find that compared to autoregressive models, the collaboration between diffusion models is more effective, leading to higher-quality and more relevant model responses due to their ability to incorporate bi-directional contexts.

18.A Controllable QA-based Framework for Decontextualization

Authors:Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle Lo

Abstract: Many real-world applications require surfacing extracted snippets to users, whether motivated by assistive tools for literature surveys or document cross-referencing, or needs to mitigate and recover from model generated inaccuracies., Yet, these passages can be difficult to consume when divorced from their original document context. In this work, we explore the limits of LLMs to perform decontextualization of document snippets in user-facing scenarios, focusing on two real-world settings - question answering and citation context previews for scientific documents. We propose a question-answering framework for decontextualization that allows for better handling of user information needs and preferences when determining the scope of rewriting. We present results showing state-of-the-art LLMs under our framework remain competitive with end-to-end approaches. We also explore incorporating user preferences into the system, finding our framework allows for controllability.

19.Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models

Authors:Amirhossein Kazemnejad, Mehdi Rezagholizadeh, Prasanna Parthasarathi, Sarath Chandar

Abstract: While pre-trained language models (PLMs) have shown evidence of acquiring vast amounts of knowledge, it remains unclear how much of this parametric knowledge is actually usable in performing downstream tasks. We propose a systematic framework to measure parametric knowledge utilization in PLMs. Our framework first extracts knowledge from a PLM's parameters and subsequently constructs a downstream task around this extracted knowledge. Performance on this task thus depends exclusively on utilizing the model's possessed knowledge, avoiding confounding factors like insufficient signal. As an instantiation, we study factual knowledge of PLMs and measure utilization across 125M to 13B parameter PLMs. We observe that: (1) PLMs exhibit two gaps - in acquired vs. utilized knowledge, (2) they show limited robustness in utilizing knowledge under distribution shifts, and (3) larger models close the acquired knowledge gap but the utilized knowledge gap remains. Overall, our study provides insights into PLMs' capabilities beyond their acquired knowledge.

20.Disentangled Phonetic Representation for Chinese Spelling Correction

Authors:Zihong Liang, Xiaojun Quan, Qifan Wang

Abstract: Chinese Spelling Correction (CSC) aims to detect and correct erroneous characters in Chinese texts. Although efforts have been made to introduce phonetic information (Hanyu Pinyin) in this task, they typically merge phonetic representations with character representations, which tends to weaken the representation effect of normal texts. In this work, we propose to disentangle the two types of features to allow for direct interaction between textual and phonetic information. To learn useful phonetic representations, we introduce a pinyin-to-character objective to ask the model to predict the correct characters based solely on phonetic information, where a separation mask is imposed to disable attention from phonetic input to text. To avoid overfitting the phonetics, we further design a self-distillation module to ensure that semantic information plays a major role in the prediction. Extensive experiments on three CSC benchmarks demonstrate the superiority of our method in using phonetic information.

21.ChatGPT and Simple Linguistic Inferences: Blind Spots and Blinds

Authors:Victoria Basmov, Yoav Goldberg, Reut Tsarfaty

Abstract: This paper sheds light on the limitations of ChatGPT's understanding capabilities, focusing on simple inference tasks that are typically easy for humans but appear to be challenging for the model. Specifically, we target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. We present expert-designed evaluation sets for these inference types and conduct experiments in a zero-shot setup. Our results show that the model struggles with these types of inferences, exhibiting moderate to low accuracy. Moreover, while ChatGPT demonstrates knowledge of the underlying linguistic concepts when prompted directly, it often fails to incorporate this knowledge to make correct inferences. Even more strikingly, further experiments show that embedding the premise under presupposition triggers or non-factive verbs causes the model to predict entailment more frequently {regardless} of the correct semantic label. Overall these results suggest that, despite GPT's celebrated language understanding capacity, ChatGPT has blindspots with respect to certain types of entailment, and that certain entailment-cancelling features act as ``blinds'' overshadowing the semantics of the embedded premise. Our analyses emphasize the need for further research into the linguistic comprehension and reasoning capabilities of LLMs, in order to improve their reliability, and establish their trustworthiness for real-world applications.

22.Adapting Language Models to Compress Contexts

Authors:Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen

Abstract: Transformer-based language models (LMs) are powerful and widely-applicable tools, but their usefulness is constrained by a finite context window and the expensive computational cost of processing long text documents. We propose to adapt pre-trained LMs into AutoCompressors. These models are capable of compressing long contexts into compact summary vectors, which are then accessible to the model as soft prompts. Summary vectors are trained with an unsupervised objective, whereby long documents are processed in segments and summary vectors from all previous segments are used in language modeling. We fine-tune OPT models on sequences of up to 30,720 tokens and show that AutoCompressors can utilize long contexts to improve perplexity. We evaluate AutoCompressors on in-context learning by compressing task demonstrations. We find that summary vectors are good substitutes for plain-text demonstrations, increasing accuracy while reducing inference cost. Finally, we explore the benefits of pre-computing summary vectors for large corpora by applying summary vectors to retrieval-augmented language modeling. Overall, AutoCompressors emerge as a simple and inexpensive solution for extending the context window of LMs while speeding up inference over long contexts.

23.Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark

Authors:Feng Jiang, Weihao Liu, Xiaomin Chu, Peifeng Li, Qiaoming Zhu, Haizhou Li

Abstract: Topic segmentation and outline generation strive to divide a document into coherent topic sections and generate corresponding subheadings. Such a process unveils the discourse topic structure of a document that benefits quickly grasping and understanding the overall context of the document from a higher level. However, research and applications in this field have been restrained due to the lack of proper paragraph-level topic representations and large-scale, high-quality corpora in Chinese compared to the success achieved in English. Addressing these issues, we introduce a hierarchical paragraph-level topic structure representation with title, subheading, and paragraph that comprehensively models the document discourse topic structure. In addition, we ensure a more holistic representation of topic distribution within the document by using sentences instead of keywords to represent sub-topics. Following this representation, we construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), four times larger than the previously largest one. We also employ a two-stage man-machine collaborative annotation method to ensure the high quality of the corpus both in form and semantics. Finally, we validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation) by several strong baselines, and its efficacy has been preliminarily confirmed on the downstream task: discourse parsing. The representation, corpus, and benchmark we established will provide a solid foundation for future studies.

24.Large Language Models as Counterfactual Generator: Strengths and Weaknesses

Authors:Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, Tieyun Qian

Abstract: Large language models (LLMs) have demonstrated remarkable performance in a range of natural language understanding and generation tasks. Yet, their ability to generate counterfactuals, which can be used for areas like data augmentation, remains under-explored. This study aims to investigate the counterfactual generation capabilities of LLMs and analysis factors that influence this ability. First, we evaluate how effective are LLMs in counterfactual generation through data augmentation experiments for small language models (SLMs) across four tasks: sentiment analysis, natural language inference, named entity recognition, and relation extraction. While LLMs show promising enhancements in various settings, they struggle in complex tasks due to their self-limitations and the lack of logical guidance to produce counterfactuals that align with commonsense. Second, our analysis reveals the pivotal role of providing accurate task definitions and detailed step-by-step instructions to LLMs in generating counterfactuals. Interestingly, we also find that LLMs can generate reasonable counterfactuals even with unreasonable demonstrations, which illustrates that demonstrations are primarily to regulate the output format.This study provides the first comprehensive insight into counterfactual generation abilities of LLMs, and offers a novel perspective on utilizing LLMs for data augmentation to enhance SLMs.

25.Faithful Low-Resource Data-to-Text Generation through Cycle Training

Authors:Zhuoer Wang, Marcus Collins, Nikhita Vedula, Simone Filice, Shervin Malmasi, Oleg Rokhlenko

Abstract: Methods to generate text from structured data have advanced significantly in recent years, primarily due to fine-tuning of pre-trained language models on large datasets. However, such models can fail to produce output faithful to the input data, particularly on out-of-domain data. Sufficient annotated data is often not available for specific domains, leading us to seek an unsupervised approach to improve the faithfulness of output text. Since the problem is fundamentally one of consistency between the representations of the structured data and text, we evaluate the effectiveness of cycle training in this work. Cycle training uses two models which are inverses of each other: one that generates text from structured data, and one which generates the structured data from natural language text. We show that cycle training, when initialized with a small amount of supervised data (100 samples in our case), achieves nearly the same performance as fully supervised approaches for the data-to-text generation task on the WebNLG, E2E, WTQ, and WSQL datasets. We perform extensive empirical analysis with automated evaluation metrics and a newly designed human evaluation schema to reveal different cycle training strategies' effectiveness of reducing various types of generation errors. Our code is publicly available at https://github.com/Edillower/CycleNLG.

26.Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak Supervision for Text Classification

Authors:Chengyu Dong, Zihan Wang, Jingbo Shang

Abstract: Recent advances in weakly supervised text classification mostly focus on designing sophisticated methods to turn high-level human heuristics into quality pseudo-labels. In this paper, we revisit the seed matching-based method, which is arguably the simplest way to generate pseudo-labels, and show that its power was greatly underestimated. We show that the limited performance of seed matching is largely due to the label bias injected by the simple seed-match rule, which prevents the classifier from learning reliable confidence for selecting high-quality pseudo-labels. Interestingly, simply deleting the seed words present in the matched input texts can mitigate the label bias and help learn better confidence. Subsequently, the performance achieved by seed matching can be improved significantly, making it on par with or even better than the state-of-the-art. Furthermore, to handle the case when the seed words are not made known, we propose to simply delete the word tokens in the input text randomly with a high deletion ratio. Remarkably, seed matching equipped with this random deletion method can often achieve even better performance than that with seed deletion.

27.MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions

Authors:Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, Danqi Chen

Abstract: The information stored in large language models (LLMs) falls out of date quickly, and retraining from scratch is often not an option. This has recently given rise to a range of techniques for injecting new facts through updating model weights. Current evaluation paradigms are extremely limited, mainly validating the recall of edited facts, but changing one fact should cause rippling changes to the model's related beliefs. If we edit the UK Prime Minister to now be Rishi Sunak, then we should get a different answer to Who is married to the British Prime Minister? In this work, we present a benchmark MQuAKE (Multi-hop Question Answering for Knowledge Editing) comprising multi-hop questions that assess whether edited models correctly answer questions where the answer should change as an entailed consequence of edited facts. While we find that current knowledge-editing approaches can recall edited facts accurately, they fail catastrophically on the constructed multi-hop questions. We thus propose a simple memory-based approach, MeLLo, which stores all edited facts externally while prompting the language model iteratively to generate answers that are consistent with the edited facts. While MQuAKE remains challenging, we show that MeLLo scales well with LLMs (up to 175B) and outperforms previous model editors by a large margin.

28.Estimating Large Language Model Capabilities without Labeled Test Data

Authors:Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, Robin Jia

Abstract: Large Language Models (LLMs) have exhibited an impressive ability to perform in-context learning (ICL) from only a few examples, but the success of ICL varies widely from task to task. Thus, it is important to quickly determine whether ICL is applicable to a new task, but directly evaluating ICL accuracy can be expensive in situations where test data is expensive to annotate -- the exact situations where ICL is most appealing. In this paper, we propose the task of ICL accuracy estimation, in which we predict the accuracy of an LLM when doing in-context learning on a new task given only unlabeled data for that task. To perform ICL accuracy estimation, we propose a method that trains a meta-model using LLM confidence scores as features. We compare our method to several strong accuracy estimation baselines on a new benchmark that covers 4 LLMs and 3 task collections. On average, the meta-model improves over all baselines and achieves the same estimation performance as directly evaluating on 40 labeled test examples per task, across the total 12 settings. We encourage future work to improve on our methods and evaluate on our ICL accuracy estimation benchmark to deepen our understanding of when ICL works.

29.AWESOME: GPU Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content

Authors:Shuyang Cao, Lu Wang

Abstract: Long document summarization systems are critical for domains with lengthy and jargonladen text, yet they present significant challenges to researchers and developers with limited computing resources. Existing solutions mainly focus on efficient attentions or divide-and-conquer strategies. The former reduces theoretical time complexity, but is still memory-heavy. The latter methods sacrifice global context, leading to uninformative and incoherent summaries. This work aims to leverage the memory-efficient nature of divide-and-conquer methods while preserving global context. Concretely, our framework AWESOME uses two novel mechanisms: (1) External memory mechanisms track previously encoded document segments and their corresponding summaries, to enhance global document understanding and summary coherence. (2) Global salient content is further identified beforehand to augment each document segment to support its summarization. Extensive experiments on diverse genres of text, including government reports, transcripts, scientific papers, and novels, show that AWESOME produces summaries with improved informativeness, faithfulness, and coherence than competitive baselines on longer documents, while having a similar or smaller GPU memory footprint.

30.Machine Reading Comprehension using Case-based Reasoning

Authors:Dung Thai, Dhruv Agarwal, Mudit Chaudhary, Rajarshi Das, Manzil Zaheer, Jay-Yoon Lee, Hannaneh Hajishirzi, Andrew McCallum

Abstract: We present an accurate and interpretable method for answer extraction in machine reading comprehension that is reminiscent of case-based reasoning (CBR) from classical AI. Our method (CBR-MRC) builds on the hypothesis that contextualized answers to similar questions share semantic similarities with each other. Given a target question, CBR-MRC retrieves a set of similar questions from a memory of observed cases and predicts an answer by selecting the span in the target context that is most similar to the contextualized representations of answers in the retrieved cases. The semi-parametric nature of our approach allows CBR-MRC to attribute a prediction to the specific set of cases used during inference, making it a desirable choice for building reliable and debuggable QA systems. We show that CBR-MRC achieves high test accuracy comparable with large reader models, outperforming baselines by 11.5 and 8.4 EM on NaturalQuestions and NewsQA, respectively. Further, we also demonstrate the ability of CBR-MRC in identifying not just the correct answer tokens but also the span with the most relevant supporting evidence. Lastly, we observe that contexts for certain question types show higher lexical diversity than others and find CBR-MRC to be robust to these variations while performance using fully-parametric methods drops.

31.Mitigating Temporal Misalignment by Discarding Outdated Facts

Authors:Michael J. Q. Zhang, Eunsol Choi

Abstract: While large language models are able to retain vast amounts of world knowledge seen during pretraining, such knowledge is prone to going out of date and is nontrivial to update. Furthermore, these models are often used under temporal misalignment, tasked with answering questions about the present, despite having only been trained on data collected in the past. To mitigate the effects of temporal misalignment, we propose fact duration prediction: the task of predicting how long a given fact will remain true. In our experiments, we demonstrate how identifying facts that are prone to rapid change can help models avoid from reciting outdated information and identify which predictions require seeking out up-to-date knowledge sources. We also show how modeling fact duration improves calibration for knowledge-intensive tasks, such as open-retrieval question answering, under temporal misalignment by discarding volatile facts. Our data and code will be released publicly at https://github.com/mikejqzhang/mitigating_misalignment.

32.Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners

Authors:Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, Muhan Zhang

Abstract: The emergent few-shot reasoning capabilities of Large Language Models (LLMs) have excited the natural language and machine learning community over recent years. Despite of numerous successful applications, the underlying mechanism of such in-context capabilities still remains unclear. In this work, we hypothesize that the learned \textit{semantics} of language tokens do the most heavy lifting during the reasoning process. Different from human's symbolic reasoning process, the semantic representations of LLMs could create strong connections among tokens, thus composing a superficial logical chain. To test our hypothesis, we decouple semantics from the language reasoning process and evaluate three kinds of reasoning abilities, i.e., deduction, induction and abduction. Our findings reveal that semantics play a vital role in LLMs' in-context reasoning -- LLMs perform significantly better when semantics are consistent with commonsense but struggle to solve symbolic or counter-commonsense reasoning tasks by leveraging in-context new knowledge. The surprising observations question whether modern LLMs have mastered the inductive, deductive and abductive reasoning abilities as in human intelligence, and motivate research on unveiling the magic existing within the black-box LLMs. On the whole, our analysis provides a novel perspective on the role of semantics in developing and evaluating language models' reasoning abilities. Code is available at {\url{https://github.com/XiaojuanTang/ICSR}}.

33.Pre-training Intent-Aware Encoders for Zero- and Few-Shot Intent Classification

Authors:Mujeen Sung, James Gung, Elman Mansimov, Nikolaos Pappas, Raphael Shu, Salvatore Romeo, Yi Zhang, Vittorio Castelli

Abstract: Intent classification (IC) plays an important role in task-oriented dialogue systems as it identifies user intents from given utterances. However, models trained on limited annotations for IC often suffer from a lack of generalization to unseen intent classes. We propose a novel pre-training method for text encoders that uses contrastive learning with intent psuedo-labels to produce embeddings that are well-suited for IC tasks. By applying this pre-training strategy, we also introduce the pre-trained intent-aware encoder (PIE). Specifically, we first train a tagger to identify key phrases within utterances that are crucial for interpreting intents. We then use these extracted phrases to create examples for pre-training a text encoder in a contrastive manner. As a result, our PIE model achieves up to 5.4% and 4.0% higher accuracy than the previous state-of-the-art pre-trained sentence encoder for the N-way zero- and one-shot settings on four IC datasets.

34.Towards Few-shot Entity Recognition in Document Images: A Graph Neural Network Approach Robust to Image Manipulation

Authors:Prashant Krishnan, Zilong Wang, Yangkun Wang, Jingbo Shang

Abstract: Recent advances of incorporating layout information, typically bounding box coordinates, into pre-trained language models have achieved significant performance in entity recognition from document images. Using coordinates can easily model the absolute position of each token, but they might be sensitive to manipulations in document images (e.g., shifting, rotation or scaling), especially when the training data is limited in few-shot settings. In this paper, we propose to further introduce the topological adjacency relationship among the tokens, emphasizing their relative position information. Specifically, we consider the tokens in the documents as nodes and formulate the edges based on the topological heuristics from the k-nearest bounding boxes. Such adjacency graphs are invariant to affine transformations including shifting, rotations and scaling. We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings, leading to a novel model LAGER. Extensive experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings and also demonstrate better robustness to manipulations.

35.SummIt: Iterative Text Summarization via ChatGPT

Authors:Haopeng Zhang, Xiao Liu, Jiawei Zhang

Abstract: Existing text summarization systems have made significant progress in recent years but typically generates summaries in a single step. The one-shot summarization setting is sometimes inadequate, however, as the generated summary may contain hallucinations or overlook important details related to the reader's interests. In this paper, we address this limitation by proposing SummIt, an iterative text summarization framework based on large language models like ChatGPT. Our framework enables the model to refine the generated summary iteratively through self-evaluation and feedback, closely resembling the iterative process humans undertake when drafting and revising summaries. We also explore using in-context learning to guide the rationale generation and summary refinement. Furthermore, we explore the potential benefits of integrating knowledge and topic extractors into the framework to enhance summary faithfulness and controllability. We evaluate the performance of our framework on three benchmark summarization datasets through empirical and qualitative analyses. We also conduct a human evaluation to validate the effectiveness of the model's refinements and find a potential issue of over-correction. Our code is available at \url{https://github.com/hpzhang94/summ_it}.

36.ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Authors:Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Michael Zeng, Xuedong Huang

Abstract: Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.

37.PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts

Authors:Yunshui Li, Binyuan Hui, ZhiChao Yin, Min Yang, Fei Huang, Yongbin Li

Abstract: Perceiving multi-modal information and fulfilling dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, due to the limited availability of multi-modal dialogue data, there is still scarce research on multi-modal dialogue pre-training. Yet another intriguing challenge emerges from the encompassing nature of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future. Hence, it is essential for designed multi-modal dialogue models to possess sufficient flexibility to adapt to such scenarios. This paper proposes \textbf{PaCE}, a unified, structured, compositional multi-modal dialogue pre-training framework. It utilizes a combination of several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue and extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method where old experts from the past can assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialog benchmarks.

38.Exploring Sentiment Analysis Techniques in Natural Language Processing: A Comprehensive Review

Authors:Karthick Prasad Gunasekaran

Abstract: Sentiment analysis (SA) is the automated process of detecting and understanding the emotions conveyed through written text. Over the past decade, SA has gained significant popularity in the field of Natural Language Processing (NLP). With the widespread use of social media and online platforms, SA has become crucial for companies to gather customer feedback and shape their marketing strategies. Additionally, researchers rely on SA to analyze public sentiment on various topics. In this particular research study, a comprehensive survey was conducted to explore the latest trends and techniques in SA. The survey encompassed a wide range of methods, including lexicon-based, graph-based, network-based, machine learning, deep learning, ensemble-based, rule-based, and hybrid techniques. The paper also addresses the challenges and opportunities in SA, such as dealing with sarcasm and irony, analyzing multi-lingual data, and addressing ethical concerns. To provide a practical case study, Twitter was chosen as one of the largest online social media platforms. Furthermore, the researchers shed light on the diverse application areas of SA, including social media, healthcare, marketing, finance, and politics. The paper also presents a comparative and comprehensive analysis of existing trends and techniques, datasets, and evaluation metrics. The ultimate goal is to offer researchers and practitioners a systematic review of SA techniques, identify existing gaps, and suggest possible improvements. This study aims to enhance the efficiency and accuracy of SA processes, leading to smoother and error-free outcomes.

39.Meta-Learning For Vision-and-Language Cross-lingual Transfer

Authors:Hanxu Hu, Frank Keller

Abstract: Current pre-trained vison-language models (PVLMs) achieve excellent performance on a range of multi-modal datasets. Recent work has aimed at building multilingual models, and a range of novel multilingual multi-modal datasets have been proposed. Current PVLMs typically perform poorly on these datasets when used for multi-modal zero-shot or few-shot cross-lingual transfer, especially for low-resource languages. To alleviate this problem, we propose a novel meta-learning fine-tuning framework. Our framework makes current PVLMs rapidly adaptive to new languages in vision-language scenarios by designing MAML in a cross-lingual multi-modal manner. Experiments show that our method boosts the performance of current state-of-the-art PVLMs in both zero-shot and few-shot cross-lingual transfer on a range of vision-language understanding tasks and datasets (XVNLI, xGQA, MaRVL, xFlicker&Co

40.Drafting Event Schemas using Language Models

Authors:Anisha Gunjal, Greg Durrett

Abstract: Past work has studied event prediction and event language modeling, sometimes mediated through structured representations of knowledge in the form of event schemas. Such schemas can lead to explainable predictions and forecasting of unseen events given incomplete information. In this work, we look at the process of creating such schemas to describe complex events. We use large language models (LLMs) to draft schemas directly in natural language, which can be further refined by human curators as necessary. Our focus is on whether we can achieve sufficient diversity and recall of key events and whether we can produce the schemas in a sufficiently descriptive style. We show that large language models are able to achieve moderate recall against schemas taken from two different datasets, with even better results when multiple prompts and multiple samples are combined. Moreover, we show that textual entailment methods can be used for both matching schemas to instances of events as well as evaluating overlap between gold and predicted schemas. Our method paves the way for easier distillation of event knowledge from large language model into schemas.

41.BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer

Authors:Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, Hannaneh Hajishirzi

Abstract: Despite remarkable advancements in few-shot generalization in natural language processing, most models are developed and evaluated primarily in English. To facilitate research on few-shot cross-lingual transfer, we introduce a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples and instructions. BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer across a broad range of tasks and languages. Using BUFFET, we perform thorough evaluations of state-of-the-art multilingual large language models with different transfer methods, namely in-context learning and fine-tuning. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer. In particular, ChatGPT with in-context learning often performs worse than much smaller mT5-base models fine-tuned on English task data and few-shot in-language examples. Our analysis suggests various avenues for future research in few-shot cross-lingual transfer, such as improved pretraining, understanding, and future evaluations.

42.Large Language Model Distillation Doesn't Need a Teacher

Authors:Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, Iz Beltagy

Abstract: Knowledge distillation trains a smaller student model to match the output distribution of a larger teacher to maximize the end-task performance under computational constraints. However, existing literature on language model distillation primarily focuses on compressing encoder-only models that are then specialized by task-specific supervised finetuning. We need to rethink this setup for more recent large language models with tens to hundreds of billions of parameters. Task-specific finetuning is impractical at this scale, and model performance is often measured using zero/few-shot prompting. Thus, in this work, we advocate for task-agnostic zero-shot evaluated distillation for large language models without access to end-task finetuning data. We propose a teacher-free task-agnostic distillation method, which uses a truncated version of the larger model for initialization, and continues pretraining this model using a language modeling objective. Our teacher-free method shines in a distillation regime where it is infeasible to fit both the student and teacher into the GPU memory. Despite its simplicity, our method can effectively reduce the model size by 50\%, matching or outperforming the vanilla distillation method on perplexity and accuracy on 13 zero-shot end-tasks while being 1.5x computationally efficient.

43.CAR: Conceptualization-Augmented Reasoner for Zero-Shot Commonsense Question Answering

Authors:Weiqi Wang, Tianqing Fang, Wenxuan Ding, Baixuan Xu, Xin Liu, Yangqiu Song, Antoine Bosselut

Abstract: The task of zero-shot commonsense question answering evaluates models on their capacity to reason about general scenarios beyond those presented in specific datasets. Existing approaches for tackling this task leverage external knowledge from CommonSense Knowledge Bases (CSKBs) by pretraining the model on synthetic QA pairs constructed from CSKBs. In these approaches, negative examples (distractors) are formulated by randomly sampling from CSKBs using fairly primitive keyword constraints. However, two bottlenecks limit these approaches: the inherent incompleteness of CSKBs limits the semantic coverage of synthetic QA pairs, and the lack of human annotations makes the sampled negative examples potentially uninformative and contradictory. To tackle these limitations above, we propose Conceptualization-Augmented Reasoner (CAR), a zero-shot commonsense question-answering framework that fully leverages the power of conceptualization. Specifically, CAR abstracts a commonsense knowledge triple to many higher-level instances, which increases the coverage of CSKB and expands the ground-truth answer space, reducing the likelihood of selecting false-negative distractors. Extensive experiments demonstrate that CAR more robustly generalizes to answering questions about zero-shot commonsense scenarios than existing methods, including large language models, such as GPT3.5 and ChatGPT. Our codes, data, and model checkpoints are available at https://github.com/HKUST-KnowComp/CAR.

44.ClusterLLM: Large Language Models as a Guide for Text Clustering

Authors:Yuwei Zhang, Zihan Wang, Jingbo Shang

Abstract: We introduce ClusterLLM, a novel text clustering framework that leverages feedback from an instruction-tuned large language model, such as ChatGPT. Compared with traditional unsupervised methods that builds upon "small" embedders, ClusterLLM exhibits two intriguing advantages: (1) it enjoys the emergent capability of LLM even if its embeddings are inaccessible; and (2) it understands the user's preference on clustering through textual instruction and/or a few annotated data. First, we prompt ChatGPT for insights on clustering perspective by constructing hard triplet questions <does A better correspond to B than C>, where A, B and C are similar data points that belong to different clusters according to small embedder. We empirically show that this strategy is both effective for fine-tuning small embedder and cost-efficient to query ChatGPT. Second, we prompt ChatGPT for helps on clustering granularity by carefully designed pairwise questions <do A and B belong to the same category>, and tune the granularity from cluster hierarchies that is the most consistent with the ChatGPT answers. Extensive experiments on 14 datasets show that ClusterLLM consistently improves clustering quality, at an average cost of ~$0.6 per dataset.

45.From Words to Wires: Generating Functioning Electronic Devices from Natural Language Descriptions

Authors:Peter Jansen

Abstract: In this work, we show that contemporary language models have a previously unknown skill -- the capacity for electronic circuit design from high-level textual descriptions, akin to code generation. We introduce two benchmarks: Pins100, assessing model knowledge of electrical components, and Micro25, evaluating a model's capability to design common microcontroller circuits and code in the Arduino ecosystem that involve input, output, sensors, motors, protocols, and logic -- with models such as GPT-4 and Claude-V1 achieving between 60% to 96% Pass@1 on generating full devices. We include six case studies of using language models as a design assistant for moderately complex devices, such as a radiation-powered random number generator, an emoji keyboard, a visible spectrometer, and several assistive devices, while offering a qualitative analysis performance, outlining evaluation challenges, and suggesting areas of development to improve complex circuit design and practical utility. With this work, we aim to spur research at the juncture of natural language processing and electronic design.

46.Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis

Authors:Sohee Yang, Jonghyeon Kim, Joel Jang, Seonghyeon Ye, Hyunji Lee, Minjoon Seo

Abstract: Large Language Models (LLMs) have demonstrated great capabilities in solving a wide range of tasks in a resource-efficient manner through prompting, which does not require task-specific training, but suffers from performance fluctuation when there are multiple prompt candidates. Previous works have introduced gradient-free probability-based prompt selection methods that aim to choose the optimal prompt among the candidates for a given task but fail to provide a comprehensive and fair comparison between each other. In this paper, we propose a unified framework to interpret and evaluate the existing probability-based prompt selection methods by performing extensive experiments on 13 common NLP tasks. We find that all existing methods can be unified into some variant of the method that maximizes the mutual information between the input and the corresponding model output (denoted as MI). Using the finding, we develop several variants of MI and increases the effectiveness of the best prompt selection method from 87.79% to 94.98%, measured as the ratio of the performance of the selected prompt to that of the optimal oracle prompt. Furthermore, we propose a novel calibration method called Calibration by Marginalization (CBM) that is orthogonal to existing methods and helps increase the prompt selection effectiveness of the best method by 99.44%. The code and datasets used in our work will be released at https://github.com/soheeyang/unified-prompt-selection.

47.Leveraging GPT-4 for Automatic Translation Post-Editing

Authors:Vikas Raunak, Amr Sharaf, Hany Hassan Awadallah, Arul Menezes

Abstract: While Neural Machine Translation (NMT) represents the leading approach to Machine Translation (MT), the outputs of NMT models still require translation post-editing to rectify errors and enhance quality, particularly under critical settings. In this work, we formalize the task of translation post-editing with Large Language Models (LLMs) and explore the use of GPT-4 to automatically post-edit NMT outputs across several language pairs. Our results demonstrate that GPT-4 is adept at translation post-editing and produces meaningful edits even when the target language is not English. Notably, we achieve state-of-the-art performance on WMT-22 English-Chinese, English-German, Chinese-English and German-English language pairs using GPT-4 based post-editing, as evaluated by state-of-the-art MT quality metrics.

48.ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games

Authors:Ruoyao Wang, Graham Todd, Eric Yuan, Ziang Xiao, Marc-Alexandre Côté, Peter Jansen

Abstract: In this work we examine the ability of language models to generate explicit world models of scientific and common-sense reasoning tasks by framing this as a problem of generating text-based games. To support this, we introduce ByteSized32, a corpus of 32 highly-templated text games written in Python totaling 24k lines of code, each centered around a particular task, and paired with a set of 16 unseen text game specifications for evaluation. We propose a suite of automatic and manual metrics for assessing simulation validity, compliance with task specifications, playability, winnability, and alignment with the physical world. In a single-shot evaluation of GPT-4 on this simulation-as-code-generation task, we find it capable of producing runnable games in 27% of cases, highlighting the difficulty of this challenge task. We discuss areas of future improvement, including GPT-4's apparent capacity to perform well at simulating near canonical task solutions, with performance dropping off as simulations include distractors or deviate from canonical solutions in the action space.

49.Interpretable by Design Visual Question Answering

Authors:Xingyu Fu, Ben Zhou, Sihao Chen, Mark Yatskar, Dan Roth

Abstract: Model interpretability has long been a hard problem for the AI community especially in the multimodal setting, where vision and language need to be aligned and reasoned at the same time. In this paper, we specifically focus on the problem of Visual Question Answering (VQA). While previous researches try to probe into the network structures of black-box multimodal models, we propose to tackle the problem from a different angle -- to treat interpretability as an explicit additional goal. Given an image and question, we argue that an interpretable VQA model should be able to tell what conclusions it can get from which part of the image, and show how each statement help to arrive at an answer. We introduce InterVQA: Interpretable-by-design VQA, where we design an explicit intermediate dynamic reasoning structure for VQA problems and enforce symbolic reasoning that only use the structure for final answer prediction to take place. InterVQA produces high-quality explicit intermediate reasoning steps, while maintaining similar to the state-of-the-art (sota) end-task performance.

50.Privacy Implications of Retrieval-Based Language Models

Authors:Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, Danqi Chen

Abstract: Retrieval-based language models (LMs) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts, by incorporating retrieved text from external datastores. While it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. In this work, we present the first study of privacy risks in retrieval-based LMs, particularly $k$NN-LMs. Our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. Crucially, we find that $k$NN-LMs are more susceptible to leaking private information from their private datastore than parametric models. We further explore mitigations of privacy risks. When privacy information is targeted and readily detected in the text, we find that a simple sanitization step would completely eliminate the risks, while decoupling query and key encoders achieves an even better utility-privacy trade-off. Otherwise, we consider strategies of mixing public and private data in both datastore and encoder training. While these methods offer modest improvements, they leave considerable room for future work. Together, our findings provide insights for practitioners to better understand and mitigate privacy risks in retrieval-based LMs. Our code is available at: https://github.com/Princeton-SysML/kNNLM_privacy .

51.Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective

Authors:Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao

Abstract: We address the fundamental challenge in Natural Language Generation (NLG) model evaluation, the design and validation of evaluation metrics. Recognizing the limitations of existing metrics and issues with human judgment, we propose using measurement theory, the foundation of test design, as a framework for conceptualizing and evaluating the validity and reliability of NLG evaluation metrics. This approach offers a systematic method for defining "good" metrics, developing robust metrics, and assessing metric performance. In this paper, we introduce core concepts in measurement theory in the context of NLG evaluation and key methods to evaluate the performance of NLG metrics. Through this framework, we aim to promote the design, evaluation, and interpretation of valid and reliable metrics, ultimately contributing to the advancement of robust and effective NLG models in real-world settings.

52.Extracting Psychological Indicators Using Question Answering

Authors:Luka Pavlović

Abstract: In this work, we propose a method for extracting text spans that may indicate one of the BIG5 psychological traits using a question-answering task with examples that have no answer for the asked question. We utilized the RoBERTa model fine-tuned on SQuAD 2.0 dataset. The model was further fine-tuned utilizing comments from Reddit. We examined the effect of the percentage of examples with no answer in the training dataset on the overall performance. The results obtained in this study are in line with the SQuAD 2.0 benchmark and present a good baseline for further research.

53.Text encoders are performance bottlenecks in contrastive vision-language models

Authors:Amita Kamath, Jack Hessel, Kai-Wei Chang

Abstract: Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach doesn't require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multi-modal matching performance on ControlledImCaps: a new evaluation benchmark we collect+release consisting of fine-grained compositional images+captions. Specifically -- our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive vision+language models. We release data+code.

54.PIVOINE: Instruction Tuning for Open-world Information Extraction

Authors:Keming Lu, Xiaoman Pan, Kaiqiang Song, Hongming Zhang, Dong Yu, Jianshu Chen

Abstract: We consider the problem of Open-world Information Extraction (Open-world IE), which extracts comprehensive entity profiles from unstructured texts. Different from the conventional closed-world setting of Information Extraction (IE), Open-world IE considers a more general situation where entities and relations could be beyond a predefined ontology. More importantly, we seek to develop a large language model (LLM) that is able to perform Open-world IE to extract desirable entity profiles characterized by (possibly fine-grained) natural language instructions. We achieve this by finetuning LLMs using instruction tuning. In particular, we construct INSTRUCTOPENWIKI, a substantial instruction tuning dataset for Open-world IE enriched with a comprehensive corpus, extensive annotations, and diverse instructions. We finetune the pretrained BLOOM models on INSTRUCTOPENWIKI and obtain PIVOINE, an LLM for Open-world IE with strong instruction-following capabilities. Our experiments demonstrate that PIVOINE significantly outperforms traditional closed-world methods and other LLM baselines, displaying impressive generalization capabilities on both unseen instructions and out-of-ontology cases. Consequently, PIVOINE emerges as a promising solution to tackle the open-world challenge in IE effectively.

55.Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering

Authors:Wang Zhu, Jesse Thomason, Robin Jia

Abstract: We train a language model (LM) to robustly answer multistep questions by generating and answering sub-questions. We propose Chain-of-Questions, a framework that trains a model to generate sub-questions and sub-answers one at a time by leveraging human annotated question decomposition meaning representation (QDMR). The key technical challenge is that QDMR only contains sub-questions but not answers to those sub-questions, so we treat sub-answers as latent variables and optimize them using a novel dynamic mixture of Hard-EM and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods by 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA adversarial set, thus demonstrating the effectiveness and robustness of our framework.

56.M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

Authors:Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Alham Fikri Aji, Preslav Nakov

Abstract: Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries, but this has also resulted in concerns regarding the potential misuse of such texts in journalism, educational, and academic context. In this work, we aim to develop automatic systems to identify machine-generated text and to detect potential misuse. We first introduce a large-scale benchmark M4, which is multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Using the dataset, we experiment with a number of methods and we show that it is challenging for detectors to generalize well on unseen examples if they are either from different domains or are generated by different large language models. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and there is a lot of room for improvement. We believe that our dataset M4, which covers different generators, domains and languages, will enable future research towards more robust approaches for this pressing societal problem. The M4 dataset is available at https://github.com/mbzuai-nlp/M4.

57.Identifying Informational Sources in News Articles

Authors:Alexander Spangher, Nanyun Peng, Jonathan May, Emilio Ferrara

Abstract: News articles are driven by the informational sources journalists use in reporting. Modeling when, how and why sources get used together in stories can help us better understand the information we consume and even help journalists with the task of producing it. In this work, we take steps toward this goal by constructing the largest and widest-ranging annotated dataset, to date, of informational sources used in news writing. We show that our dataset can be used to train high-performing models for information detection and source attribution. We further introduce a novel task, source prediction, to study the compositionality of sources in news articles. We show good performance on this task, which we argue is an important proof for narrative science exploring the internal structure of news articles and aiding in planning-based language generation, and an important step towards a source-recommendation system to aid journalists.

58.Coverage-based Example Selection for In-Context Learning

Authors:Shivanshu Gupta, Sameer Singh, Matt Gardner

Abstract: In-context learning (ICL), the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples, requires demonstrations that are informative about the test instance. The standard approach of independently selecting the most similar examples selects redundant demonstrations while overlooking important information. This work proposes a framework for assessing the informativeness of demonstrations based on their coverage of salient aspects (e.g., reasoning patterns) of the test input. Using this framework, we show that contextual token embeddings effectively capture these salient aspects, and their recall measured using BERTScore-Recall (BSR) yields a reliable measure of informativeness. Further, we extend recall metrics like BSR to propose their set versions to find maximally informative sets of demonstrations. On 6 complex compositional generation tasks and 7 diverse LLMs, we show that Set-BSR outperforms the standard similarity-based approach by up to 16% on average and, despite being learning-free, often surpasses methods that leverage task or LLM-specific training.

59.PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions

Authors:Anthony Chen, Panupong Pasupat, Sameer Singh, Hongrae Lee, Kelvin Guu

Abstract: The remarkable capabilities of large language models have been accompanied by a persistent drawback: the generation of false and unsubstantiated claims commonly known as "hallucinations". To combat this issue, recent research has introduced approaches that involve editing and attributing the outputs of language models, particularly through prompt-based editing. However, the inference cost and speed of using large language models for editing currently bottleneck prompt-based methods. These bottlenecks motivate the training of compact editors, which is challenging due to the scarcity of training data for this purpose. To overcome these challenges, we exploit the power of large language models to introduce corruptions (i.e., noise) into text and subsequently fine-tune compact editors to denoise the corruptions by incorporating relevant evidence. Our methodology is entirely unsupervised and provides us with faux hallucinations for training in any domain. Our Petite Unsupervised Research and Revision model, PURR, not only improves attribution over existing editing methods based on fine-tuning and prompting, but also achieves faster execution times by orders of magnitude.

60.From Shortcuts to Triggers: Backdoor Defense with Denoised PoE

Authors:Qin Liu, Fei Wang, Chaowei Xiao, Muhao Chen

Abstract: Language models are often at risk of diverse backdoor attacks, especially data poisoning. Thus, it is important to investigate defense solutions for addressing them. Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers, leaving a universal defense against various backdoor attacks with diverse triggers largely unexplored. In this paper, we propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised Product-of-Experts), which is inspired by the shortcut nature of backdoor attacks, to defend various backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning the backdoor shortcuts. To address the label flip caused by backdoor attackers, DPoE incorporates a denoising design. Experiments on SST-2 dataset show that DPoE significantly improves the defense performance against various types of backdoor triggers including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE is also effective under a more challenging but practical setting that mixes multiple types of trigger.

61.CoLaDa: A Collaborative Label Denoising Framework for Cross-lingual Named Entity Recognition

Authors:Tingting Ma, Qianhui Wu, Huiqiang Jiang, Börje F. Karlsson, Tiejun Zhao, Chin-Yew Lin

Abstract: Cross-lingual named entity recognition (NER) aims to train an NER system that generalizes well to a target language by leveraging labeled data in a given source language. Previous work alleviates the data scarcity problem by translating source-language labeled data or performing knowledge distillation on target-language unlabeled data. However, these methods may suffer from label noise due to the automatic labeling process. In this paper, we propose CoLaDa, a Collaborative Label Denoising Framework, to address this problem. Specifically, we first explore a model-collaboration-based denoising scheme that enables models trained on different data sources to collaboratively denoise pseudo labels used by each other. We then present an instance-collaboration-based strategy that considers the label consistency of each token's neighborhood in the representation space for denoising. Experiments on different benchmark datasets show that the proposed CoLaDa achieves superior results compared to previous methods, especially when generalizing to distant languages.

62.Structural Ambiguity and its Disambiguation in Language Model Based Parsers: the Case of Dutch Clause Relativization

Authors:Gijs Wijnholds, Michael Moortgat

Abstract: This paper addresses structural ambiguity in Dutch relative clauses. By investigating the task of disambiguation by grounding, we study how the presence of a prior sentence can resolve relative clause ambiguities. We apply this method to two parsing architectures in an attempt to demystify the parsing and language model components of two present-day neural parsers. Results show that a neurosymbolic parser, based on proof nets, is more open to data bias correction than an approach based on universal dependencies, although both setups suffer from a comparable initial data bias.

63.Frugal Prompting for Dialog Models

Authors:Bishal Santra, Sakya Basak, Abhinandan De, Manish Gupta, Pawan Goyal

Abstract: The use of large language models (LLMs) in natural language processing (NLP) tasks is rapidly increasing, leading to changes in how researchers approach problems in the field. To fully utilize these models' abilities, a better understanding of their behavior for different input protocols is required. With LLMs, users can directly interact with the models through a text-based interface to define and solve various tasks. Hence, understanding the conversational abilities of these LLMs, which may not have been specifically trained for dialog modeling, is also important. This study examines different approaches for building dialog systems using LLMs by considering various aspects of the prompt. As part of prompt tuning, we experiment with various ways of providing instructions, exemplars, current query and additional context. The research also analyzes the representations of dialog history that have the optimal usable-information density. Based on the findings, the paper suggests more compact ways of providing dialog history information while ensuring good performance and reducing model's inference-API costs. The research contributes to a better understanding of how LLMs can be effectively used for building interactive systems.

64.Universal Self-adaptive Prompting

Authors:Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Martin Eisenschlos, Sercan O. Arik, Tomas Pfister

Abstract: A hallmark of modern large language models (LLMs) is their impressive general zero-shot and few-shot abilities, often elicited through prompt-based and/or in-context learning. However, while highly coveted and being the most general, zero-shot performances in LLMs are still typically weaker due to the lack of guidance and the difficulty of applying existing automatic prompt design methods in general tasks when ground-truth labels are unavailable. In this study, we address this by presenting Universal Self-adaptive Prompting (USP), an automatic prompt design approach specifically tailored for zero-shot learning (while compatible with few-shot). Requiring only a small amount of unlabeled data & an inference-only LLM, USP is highly versatile: to achieve universal prompting, USP categorizes a possible NLP task into one of the three possible task types, and then uses a corresponding selector to select the most suitable queries & zero-shot model-generated responses as pseudo-demonstrations, thereby generalizing ICL to the zero-shot setup in a fully automated way. We evaluate zero-shot USP with two PaLM models, and demonstrate performances that are considerably stronger than standard zero-shot baselines and are comparable to or even superior than few-shot baselines across more than 20 natural language understanding (NLU) and natural language generation (NLG) tasks.

65.Towards Reliable Misinformation Mitigation: Generalization, Uncertainty, and GPT-4

Authors:Kellin Pelrine, Meilina Reksoprodjo, Caleb Gupta, Joel Christoph, Reihaneh Rabbany

Abstract: Misinformation poses a critical societal challenge, and current approaches have yet to produce an effective solution. We propose focusing on generalization, soft classification, and leveraging recent large language models to create more practical tools in contexts where perfect predictions remain unattainable. We begin by demonstrating that GPT-4 and other language models can outperform existing methods in the literature. Next, we explore their generalization, revealing that GPT-4 and RoBERTa-large exhibit critical differences in failure modes, which offer potential for significant performance improvements. Finally, we show that these models can be employed in soft classification frameworks to better quantify uncertainty. We find that models with inferior hard classification results can achieve superior soft classification performance. Overall, this research lays groundwork for future tools that can drive real-world progress on misinformation.

66.Aligning Language Models to User Opinions

Authors:EunJeong Hwang, Bodhisattwa Prasad Majumder, Niket Tandon

Abstract: An important aspect of developing LLMs that interact with humans is to align models' behavior to their users. It is possible to prompt an LLM into behaving as a certain persona, especially a user group or ideological persona the model captured during its pertaining stage. But, how to best align an LLM with a specific user and not a demographic or ideological group remains an open question. Mining public opinion surveys (by Pew Research), we find that the opinions of a user and their demographics and ideologies are not mutual predictors. We use this insight to align LLMs by modeling both user opinions as well as user demographics and ideology, achieving up to 7 points accuracy gains in predicting public opinions from survey questions across a broad set of topics. In addition to the typical approach of prompting LLMs with demographics and ideology, we discover that utilizing the most relevant past opinions from individual users enables the model to predict user opinions more accurately.

67.Discriminator-Guided Multi-step Reasoning with Language Models

Authors:Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang

Abstract: In the context of multi-step reasoning, language models (LMs) probabilities are often miscalibrated -- solutions with high probabilities are not always correct. Therefore, greedy decoding, which is the standard decoding method for reasoning tasks, often yields incorrect solutions. In addition, methods such as self-consistency and verifiers rely on sampling from the LM distribution and do not tackle the underlying issue. To address this, we introduce Guiding Multi-step ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise decoding approach that nudges the model towards producing correct reasoning steps. GRACE employs a discriminator model, which is trained to differentiate correct steps from invalid ones, to adjust decoding preferences based on the correctness of each reasoning step. Importantly, GRACE does not require fine-tuning or re-training the LMs. When compared with conventional decoding strategies over four popular math reasoning benchmarks, GRACE exhibits significant improvements in both final answer accuracy and step correctness, outperforming both greedy decoding and self-consistency.\footnote{Our code can be found at \url{https://github.com/mukhal/grace.}}

68.Modeling Appropriate Language in Argumentation

Authors:Timon Ziegenbein, Shahbaz Syed, Felix Lange, Martin Potthast, Henning Wachsmuth

Abstract: Online discussion moderators must make ad-hoc decisions about whether the contributions of discussion participants are appropriate or should be removed to maintain civility. Existing research on offensive language and the resulting tools cover only one aspect among many involved in such decisions. The question of what is considered appropriate in a controversial discussion has not yet been systematically addressed. In this paper, we operationalize appropriate language in argumentation for the first time. In particular, we model appropriateness through the absence of flaws, grounded in research on argument quality assessment, especially in aspects from rhetoric. From these, we derive a new taxonomy of 14 dimensions that determine inappropriate language in online discussions. Building on three argument quality corpora, we then create a corpus of 2191 arguments annotated for the 14 dimensions. Empirical analyses support that the taxonomy covers the concept of appropriateness comprehensively, showing several plausible correlations with argument quality dimensions. Moreover, results of baseline approaches to assessing appropriateness suggest that all dimensions can be modeled computationally on the corpus.

69.Trade-Offs Between Fairness and Privacy in Language Modeling

Authors:Cleo Matzken, Steffen Eger, Ivan Habernal

Abstract: Protecting privacy in contemporary NLP models is gaining in importance. So does the need to mitigate social biases of such models. But can we have both at the same time? Existing research suggests that privacy preservation comes at the price of worsening biases in classification tasks. In this paper, we explore the extent to which this tradeoff really holds when we incorporate both privacy preservation and de-biasing techniques into training text generation models. How does improving the model along one dimension affect the other dimension as well as the utility of the model? We conduct an extensive set of experiments that include bias detection, privacy attacks, language modeling, and performance on downstream tasks.

70.A Fair and In-Depth Evaluation of Existing End-to-End Entity Linking Systems

Authors:Hannah Bast, Matthias Hertel, Natalie Prange

Abstract: Existing evaluations of entity linking systems often say little about how the system is going to perform for a particular application. There are four fundamental reasons for this: many benchmarks focus on named entities; it is hard to define which other entities to include; there are ambiguities in entity recognition and entity linking; many benchmarks have errors or artifacts that invite overfitting or lead to evaluation results of limited meaningfulness. We provide a more meaningful and fair in-depth evaluation of a variety of existing end-to-end entity linkers. We characterize the strengths and weaknesses of these linkers and how well the results from the respective publications can be reproduced. Our evaluation is based on several widely used benchmarks, which exhibit the problems mentioned above to various degrees, as well as on two new benchmarks, which address these problems.

71.Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark

Authors:Minje Choi, Jiaxin Pei, Sagar Kumar, Chang Shu, David Jurgens

Abstract: Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks. While LLMs are increasingly deployed in many forms including conversational agents that interact with humans, we lack a grounded benchmark to measure how well LLMs understand \textit{social} language. Here, we introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge which we group into five categories: humor & sarcasm, offensiveness, sentiment & emotion, and trustworthiness. In tests on the benchmark, we demonstrate that current models attain only moderate performance but reveal significant potential for task transfer among different types and categories of tasks, which were predicted from theory. Through zero-shot evaluations, we show that pretrained models already possess some innate but limited capabilities of social language understanding and training on one category of tasks can improve zero-shot testing on others. Our benchmark provides a systematic way to analyze model performance on an important dimension of language and points to clear room for improvement to build more socially-aware LLMs. The associated resources are released at https://github.com/minjechoi/SOCKET.

72.How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench

Authors:Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, Robin Jia

Abstract: We investigate the predictability of large language model (LLM) capabilities: given records of past experiments using different model families, numbers of parameters, tasks, and numbers of in-context examples, can we accurately predict LLM performance on new experiment configurations? Answering this question has practical implications for LLM users (e.g., deciding which models to try), developers (e.g., prioritizing evaluation on representative tasks), and the research community (e.g., identifying hard-to-predict capabilities that warrant further investigation). We study the performance prediction problem on experiment records from BIG-bench. On a random train-test split, an MLP-based predictor achieves RMSE below 5%, demonstrating the presence of learnable patterns within the experiment records. Further, we formulate the problem of searching for "small-bench," an informative subset of BIG-bench tasks from which the performance of the full set can be maximally recovered, and find a subset as informative for evaluating new model families as BIG-bench Hard, while being 3x smaller.

73.Cross-lingual Data Augmentation for Document-grounded Dialog Systems in Low Resource Languages

Authors:Qi Gou, Zehua Xia, Wenzhe Du

Abstract: This paper proposes a framework to address the issue of data scarcity in Document-Grounded Dialogue Systems(DGDS). Our model leverages high-resource languages to enhance the capability of dialogue generation in low-resource languages. Specifically, We present a novel pipeline CLEM (Cross-Lingual Enhanced Model) including adversarial training retrieval (Retriever and Re-ranker), and Fid (fusion-in-decoder) generator. To further leverage high-resource language, we also propose an innovative architecture to conduct alignment across different languages with translated training. Extensive experiment results demonstrate the effectiveness of our model and we achieved 4th place in the DialDoc 2023 Competition. Therefore, CLEM can serve as a solution to resource scarcity in DGDS and provide useful guidance for multi-lingual alignment tasks.

74.Adversarial Demonstration Attacks on Large Language Models

Authors:Jiongxiao Wang, Zichen Liu, Keun Hee Park, Muhao Chen, Chaowei Xiao

Abstract: With the emergence of more powerful large language models (LLMs), such as ChatGPT and GPT-4, in-context learning (ICL) has gained significant prominence in leveraging these models for specific tasks by utilizing data-label pairs as precondition prompts. While incorporating demonstrations can greatly enhance the performance of LLMs across various tasks, it may introduce a new security concern: attackers can manipulate only the demonstrations without changing the input to perform an attack. In this paper, we investigate the security concern of ICL from an adversarial perspective, focusing on the impact of demonstrations. We propose an ICL attack based on TextAttack, which aims to only manipulate the demonstration without changing the input to mislead the models. Our results demonstrate that as the number of demonstrations increases, the robustness of in-context learning would decreases. Furthermore, we also observe that adversarially attacked demonstrations exhibit transferability to diverse input examples. These findings emphasize the critical security risks associated with ICL and underscore the necessity for extensive research on the robustness of ICL, particularly given its increasing significance in the advancement of LLMs.

75.Editing Commonsense Knowledge in GPT

Authors:Anshita Gupta, Debanjan Mondal, Akshay Krishna Sheshadri, Wenlong Zhao, Xiang Lorraine Li, Sarah Wiegreffe, Niket Tandon

Abstract: Memory editing methods for updating encyclopedic knowledge in transformers have received increasing attention for their efficacy, specificity, and generalization advantages. However, it remains unclear if such methods can be adapted for the more nuanced domain of commonsense knowledge. We propose $MEMIT_{CSK}$, an adaptation of MEMIT to edit commonsense mistakes in GPT-2 Large and XL. We extend editing to various token locations and employ a robust layer selection strategy. Models edited by $MEMIT_{CSK}$ outperforms the fine-tuning baselines by 10.97% and 10.73% F1 scores on subsets of PEP3k and 20Q. We further propose a novel evaluation dataset, MEMIT-CSK-PROBE, that contains unaffected neighborhood, affected neighborhood, affected paraphrase, and affected reasoning challenges. $MEMIT_{CSK}$ demonstrates favorable semantic generalization, outperforming fine-tuning baselines by 13.72% and 5.57% overall scores on MEMIT-CSK-PROBE. These results suggest a compelling future direction of incorporating context-specific user feedback concerning commonsense in GPT by direct model editing, rectifying and customizing model behaviors via human-in-the-loop systems.

76.PESCO: Prompt-enhanced Self Contrastive Learning for Zero-shot Text Classification

Authors:Yau-Shian Wang, Ta-Chung Chi, Ruohong Zhang, Yiming Yang

Abstract: We present PESCO, a novel contrastive learning framework that substantially improves the performance of zero-shot text classification. We formulate text classification as a neural text matching problem where each document is treated as a query, and the system learns the mapping from each query to the relevant class labels by (1) adding prompts to enhance label matching, and (2) using retrieved labels to enrich the training set in a self-training loop of contrastive learning. PESCO achieves state-of-the-art performance on four benchmark text classification datasets. On DBpedia, we achieve 98.5\% accuracy without any labeled data, which is close to the fully-supervised result. Extensive experiments and analyses show all the components of PESCO are necessary for improving the performance of zero-shot text classification.

77.Detecting and Characterizing Political Incivility on Social Media

Authors:Sagi Penzel, Nir Lotan, Alon Zoizner, Einat Minkov

Abstract: Researchers of political communication study the impact and perceptions of political incivility on social media. Yet, so far, relatively few works attempted to automatically detect and characterize political incivility. In our work, we study political incivility in Twitter, presenting several research contributions. First, we present state-of-the-art incivility detection results using a large dataset, which we collected and labeled via crowd sourcing. Importantly, we distinguish between uncivil political speech that is impolite and intolerant anti-democratic discourse. Applying political incivility detection at large-scale, we derive insights regarding the prevalence of this phenomenon across users, and explore the network characteristics of users who are susceptible to disseminating uncivil political content online. Finally, we propose an approach for modeling social context information about the tweet author alongside the tweet content, showing that this leads to significantly improved performance on the task of political incivility detection. This result holds promise for related tasks, such as hate speech and stance detection.

78.Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks

Authors:Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury

Abstract: Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating the prompts; resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. Limited formal studies have been carried out to formalize and analyze these attacks and their mitigations. We bridge this gap by proposing a formalism and a taxonomy of known (and possible) jailbreaks. We perform a survey of existing jailbreak methods and their effectiveness on open-source and commercial LLMs (such as GPT 3.5, OPT, BLOOM, and FLAN-T5-xxl). We further propose a limited set of prompt guards and discuss their effectiveness against known attack types.

79.Getting Sick After Seeing a Doctor? Diagnosing and Mitigating Knowledge Conflicts in Event Temporal Reasoning

Authors:Tianqing Fang, Zhaowei Wang, Wenxuan Zhou, Hongming Zhang, Yangqiu Song, Muhao Chen

Abstract: Event temporal reasoning aims at identifying the temporal relations between two or more events. However, knowledge conflicts arise when there is a mismatch between the actual temporal relations of events in the context and the prior knowledge or biases learned by the model. We first systematically define distinct kinds of bias in event temporal reasoning, which include event relation prior bias, tense bias, narrative bias, and dependency bias, as indicators to study knowledge conflicts. To mitigate such event-related knowledge conflict, we introduce a Counterfactual Data Augmentation based method that can be applied to both Pre-trained Language Models (PLMs) and Large Language Models (LLMs) either as additional training data or demonstrations for In-Context Learning. Experiments suggest the importance of mitigating knowledge conflicts in event temporal reasoning tasks for reducing hallucination and highlight the potential of counterfactual data augmentation for improving model performance.

80.OverPrompt: Enhancing ChatGPT Capabilities through an Efficient In-Context Learning Approach

Authors:Jiazheng Li, Runcong Zhao, Yulan He, Lin Gui

Abstract: The exceptional performance of pre-trained large language models has revolutionised various applications, but their adoption in production environments is hindered by prohibitive costs and inefficiencies, particularly when utilising long prompts. This paper proposes OverPrompt, an in-context learning method aimed at improving LLM efficiency and performance by processing multiple inputs in parallel. Evaluated across diverse datasets, OverPrompt enhances task efficiency and integrates a diverse range of examples for improved performance. Particularly, it amplifies fact-checking and sentiment analysis tasks when supplemented with contextual information. Synthetic data grouping further enhances performance, suggesting a viable approach for data augmentation.

81.Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

Authors:Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning

Abstract: A trustworthy real-world prediction system should be well-calibrated; that is, its confidence in an answer is indicative of the likelihood that the answer is correct, enabling deferral to a more expensive expert in cases of low-confidence predictions. While recent studies have shown that unsupervised pre-training produces large language models (LMs) that are remarkably well-calibrated, the most widely-used LMs in practice are fine-tuned with reinforcement learning with human feedback (RLHF-LMs) after the initial unsupervised pre-training stage, and results are mixed as to whether these models preserve the well-calibratedness of their ancestors. In this paper, we conduct a broad evaluation of computationally feasible methods for extracting confidence scores from LLMs fine-tuned with RLHF. We find that with the right prompting strategy, RLHF-LMs verbalize probabilities that are much better calibrated than the model's conditional probabilities, enabling fairly well-calibrated predictions. Through a combination of prompting strategy and temperature scaling, we find that we can reduce the expected calibration error of RLHF-LMs by over 50%.

82.GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP

Authors:Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

Abstract: The recent emergence of ChatGPT has brought a revolutionary change in the landscape of NLP. Although ChatGPT has consistently shown impressive performance on English benchmarks, its exact capabilities on most other languages remain largely unknown. To better understand ChatGPT's capabilities on Arabic, we present a large-scale evaluation of the model on a broad range of Arabic NLP tasks. Namely, we evaluate ChatGPT on 32 diverse natural language understanding and generation tasks on over 60 different datasets. To the best of our knowledge, our work offers the first performance analysis of ChatGPT on Arabic NLP at such a massive scale. Our results show that, despite its success on English benchmarks, ChatGPT trained in-context (few-shot) is consistently outperformed by much smaller dedicated models finetuned on Arabic. These results suggest that there is significant place for improvement for instruction-tuned LLMs such as ChatGPT.

83.Improving Factuality of Abstractive Summarization without Sacrificing Summary Quality

Authors:Tanay Dixit, Fei Wang, Muhao Chen

Abstract: Improving factual consistency of abstractive summarization has been a widely studied topic. However, most of the prior works on training factuality-aware models have ignored the negative effect it has on summary quality. We propose EFACTSUM (i.e., Effective Factual Summarization), a candidate summary generation and ranking technique to improve summary factuality without sacrificing summary quality. We show that using a contrastive learning framework with our refined candidate summaries leads to significant gains on both factuality and similarity-based metrics. Specifically, we propose a ranking strategy in which we effectively combine two metrics, thereby preventing any conflict during training. Models trained using our approach show up to 6 points of absolute improvement over the base model with respect to FactCC on XSUM and 11 points on CNN/DM, without negatively affecting either similarity-based metrics or absractiveness.

84.Benchmarking Arabic AI with Large Language Models

Authors:Ahmed Abdelali, Hamdy Mubarak, Shammur Absar Chowdhury, Maram Hasanain, Basel Mousi, Sabri Boughorbel, Yassine El Kheir, Daniel Izham, Fahim Dalvi, Majd Hawasly, Nizi Nazar, Yousseif Elshahawy, Ahmed Ali, Nadir Durrani, Natasa Milic-Frayling, Firoj Alam

Abstract: With large Foundation Models (FMs), language technologies (AI in general) are entering a new paradigm: eliminating the need for developing large-scale task-specific datasets and supporting a variety of tasks through set-ups ranging from zero-shot to few-shot learning. However, understanding FMs capabilities requires a systematic benchmarking effort by comparing FMs performance with the state-of-the-art (SOTA) task-specific models. With that goal, past work focused on the English language and included a few efforts with multiple languages. Our study contributes to ongoing research by evaluating FMs performance for standard Arabic NLP and Speech processing, including a range of tasks from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets resulting in 96 test setups. For a few tasks, FMs performs on par or exceeds the performance of the SOTA models but for the majority it under-performs. Given the importance of prompt for the FMs performance, we discuss our prompt strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.

85.Large Language Models are Effective Table-to-Text Generators, Evaluators, and Feedback Providers

Authors:Yilun Zhao, Haowei Zhang, Shengyun Si, Linyong Nan, Xiangru Tang, Arman Cohan

Abstract: Large language models (LLMs) have shown remarkable ability on controllable text generation. However, the potential of LLMs in generating text from structured tables remains largely under-explored. In this paper, we study the capabilities of LLMs for table-to-text generation tasks, particularly aiming to investigate their performance in generating natural language statements that can be logically entailed by a provided table. First, we investigate how LLMs compare to state-of-the-art table-to-text fine-tuned models, and demonstrate that LLMs can generate statements with higher faithfulness compared with previous state-of-the-art fine-tuned models. Given this finding, we next explore whether LLMs can serve as faithfulness-level automated evaluation metrics. Through human evaluation, we show that evaluation metrics adopted from LLMs correlates better with human judgments compared with existing faithfulness-level metrics. Finally, we demonstrate that LLMs using chain-of-thought prompting can generate high-fidelity natural language feedback for other table-to-text models' generations, provide insights for future work regarding the distillation of text generation capabilities from LLMs to smaller models.

86.Dolphin: A Challenging and Diverse Benchmark for Arabic NLG

Authors:El Moatez Billah Nagoudi, Ahmed El-Shangiti, AbdelRahim Elmadany, Muhammad Abdul-Mageed

Abstract: We present Dolphin, a novel benchmark that addresses the need for an evaluation framework for the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including text summarization, machine translation, question answering, and dialogue generation, among others. Dolphin comprises a substantial corpus of 40 diverse and representative public datasets across 50 test splits, carefully curated to reflect real-world scenarios and the linguistic richness of Arabic. It sets a new standard for evaluating the performance and generalization capabilities of Arabic and multilingual models, promising to enable researchers to push the boundaries of current methodologies. We provide an extensive analysis of Dolphin, highlighting its diversity and identifying gaps in current Arabic NLG research. We also evaluate several Arabic and multilingual models on our benchmark, allowing us to set strong baselines against which researchers can compare.

87.MuLER: Detailed and Scalable Reference-based Evaluation

Authors:Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend

Abstract: We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT) into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis which can lead to targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability in MT evaluation, and other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags. However, they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few are not thus correlated (their identity changes from language to language). Preliminary experiments with summarization reveal similar trends.

88.Reasoning with Language Model is Planning with World Model

Authors:Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, Zhiting Hu

Abstract: Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the key fact that LLMs lack an internal $\textit{world model}$ to predict the world $\textit{state}$ (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, $\underline{R}\textit{easoning vi}\underline{a} \underline{P}\textit{lanning}$ $\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monto Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and obtains a high-reward reasoning path efficiently with a proper balance between exploration $\textit{vs.}$ exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.

89.How To Control Text Simplification? An Empirical Study of Control Tokens for Meaning Preserving Controlled Simplification

Authors:Sweta Agrawal, Marine Carpuat

Abstract: Text simplification rewrites text to be more readable for a specific audience, while preserving its meaning. However, determining what makes a text easy to read depends on who are the intended readers. Recent work has introduced a wealth of techniques to control output simplicity, ranging from specifying the desired reading grade level to providing control tokens that directly encode low-level simplification edit operations. However, it remains unclear how to set the input parameters that control simplification in practice. Existing approaches set them at the corpus level, disregarding the complexity of individual source text, and do not directly evaluate them at the instance level. In this work, we conduct an empirical study to understand how different control mechanisms impact the adequacy and simplicity of model outputs. Based on these insights, we introduce a simple method for predicting control tokens at the sentence level to enhance the quality of the simplified text. Predicting control token values using features extracted from the original complex text and a user-specified degree of complexity improves the quality of the simplified outputs over corpus-level search-based heuristics.

90.RefGPT: Reference -> Truthful & Customized Dialogues Generation by GPTs and for GPTs

Authors:Dongjie Yang, Ruifeng Yuan, YuanTao Fan, YiFei Yang, Zili Wang, Shushen Wang, Hai Zhao

Abstract: General chat models, like ChatGPT, have attained impressive capability to resolve a wide range of NLP tasks by tuning Large Language Models (LLMs) with high-quality instruction data. However, collecting human-written high-quality data, especially multi-turn dialogues, is expensive and unattainable for most people. Though previous studies have used powerful LLMs to generate the dialogues automatically, but they all suffer from generating untruthful dialogues because of the LLMs hallucination. Therefore, we propose a method called RefGPT to generate enormous truthful and customized dialogues without worrying about factual errors caused by the model hallucination. RefGPT solves the model hallucination in dialogue generation by restricting the LLMs to leverage the given reference instead of reciting their own knowledge to generate dialogues. Additionally, RefGPT adds detailed controls on every utterances to enable highly customization capability, which previous studies have ignored. On the basis of RefGPT, we also propose two high-quality dialogue datasets generated by GPT-4, namely RefGPT-Fact and RefGPT-Code. RefGPT-Fact is 100k multi-turn dialogue datasets based on factual knowledge and RefGPT-Code is 76k multi-turn dialogue dataset covering a wide range of coding scenarios. Our code and datasets are released in https://github.com/ziliwangnlp/RefGPT

91.The ACL OCL Corpus: advancing Open science in Computational Linguistics

Authors:Shaurya Rohatgi, Yanxia Qin, Benjamin Aw, Niranjana Unnithan, Min-Yen Kan

Abstract: We present a scholarly corpus from the ACL Anthology to assist Open scientific research in the Computational Linguistics domain, named as ACL OCL. Compared with previous ARC and AAN versions, ACL OCL includes structured full-texts with logical sections, references to figures, and links to a large knowledge resource (semantic scholar). ACL OCL contains 74k scientific papers, together with 210k figures extracted up to September 2022. To observe the development in the computational linguistics domain, we detect the topics of all OCL papers with a supervised neural model. We observe ''Syntax: Tagging, Chunking and Parsing'' topic is significantly shrinking and ''Natural Language Generation'' is resurging. Our dataset is open and available to download from HuggingFace in https://huggingface.co/datasets/ACL-OCL/ACL-OCL-Corpus.

92.An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics

Authors:Saba Ahmadi, Aishwarya Agrawal

Abstract: Recently, reference-free metrics such as CLIPScore (Hessel et al., 2021) and UMIC (Lee et al., 2021) have been proposed for automatic evaluation of image captions, demonstrating a high correlation with human judgment. In this work, our focus lies in evaluating the robustness of these metrics in scenarios that require distinguishing between two captions with high lexical overlap but very different meanings. Our findings reveal that despite their high correlation with human judgment, both CLIPScore and UMIC struggle to identify fine-grained errors in captions. However, when comparing different types of fine-grained errors, both metrics exhibit limited sensitivity to implausibility of captions and strong sensitivity to lack of sufficient visual grounding. Probing further into the visual grounding aspect, we found that both CLIPScore and UMIC are impacted by the size of image-relevant objects mentioned in the caption, and that CLIPScore is also sensitive to the number of mentions of image-relevant objects in the caption. In terms of linguistic aspects of a caption, we found that both metrics lack the ability to comprehend negation, UMIC is sensitive to caption lengths, and CLIPScore is insensitive to the structure of the sentence. We hope our findings will serve as a valuable guide towards improving reference-free evaluation in image captioning.

93.The Art of SOCRATIC QUESTIONING: Zero-shot Multimodal Reasoning with Recursive Thinking and Self-Questioning

Authors:Jingyuan Qi, Zhiyang Xu, Ying Shen, Minqian Liu, Di Jin, Qifan Wang, Lifu Huang

Abstract: Chain-of-Thought prompting (CoT) enables large-scale language models to solve complex reasoning problems by decomposing the problem and tackling it step-by-step. However, Chain-of-Thought is a greedy thinking process that requires the language model to come up with a starting point and generate the next step solely based on previous steps. This thinking process is different from how humans approach a complex problem e.g., we proactively raise sub-problems related to the original problem and recursively answer them. In this work, we propose Socratic Questioning, a divide-and-conquer fashion algorithm that simulates the self-questioning and recursive thinking process. Socratic Questioning is driven by a Self-Questioning module that employs a large-scale language model to propose sub-problems related to the original problem as intermediate steps and Socratic Questioning recursively backtracks and answers the sub-problems until reaches the original problem. We apply our proposed algorithm to the visual question-answering task as a case study and by evaluating it on three public benchmark datasets, we observe a significant performance improvement over all baselines on (almost) all datasets. In addition, the qualitative analysis clearly demonstrates the intermediate thinking steps elicited by Socratic Questioning are similar to the human's recursively thinking process of a complex reasoning problem.

94.A RelEntLess Benchmark for Modelling Graded Relations between Named Entities

Authors:Asahi Ushio, Jose Camacho Collados, Steven Schockaert

Abstract: Relations such as "is influenced by", "is known for" or "is a competitor of" are inherently graded: we can rank entity pairs based on how well they satisfy these relations, but it is hard to draw a line between those pairs that satisfy them and those that do not. Such graded relations play a central role in many applications, yet they are typically not covered by existing Knowledge Graphs. In this paper, we consider the possibility of using Large Language Models (LLMs) to fill this gap. To this end, we introduce a new benchmark, in which entity pairs have to be ranked according to how much they satisfy a given graded relation. The task is formulated as a few-shot ranking problem, where models only have access to a description of the relation and five prototypical instances. We use the proposed benchmark to evaluate state-of-the-art relation embedding strategies as well as several recent LLMs, covering both publicly available LLMs and closed models such as GPT-4. Overall, we find a strong correlation between model size and performance, with smaller Language Models struggling to outperform a naive baseline. The results of the largest Flan-T5 and OPT models are remarkably strong, although a clear gap with human performance remains.

95.LLMDet: A Large Language Models Detection Tool

Authors:Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng, Tat-Seng Chua

Abstract: With the advancement of generative language models, the generated text has come remarkably close to high-quality human-authored text in terms of fluency and diversity. This calls for a highly practical detection tool that can identify the source of text, preferably pinpointing the language model it originates from. However, existing detection tools typically require access to language models and can only differentiate between machine-generated and human-authored text, failing to meet the requirements of rapid detection and text tracing. Therefore, in this paper, we propose an efficient, secure, and scalable detection tool called LLMDet, which calculates the proxy perplexity of text by utilizing the prior information of the model's next-token probabilities, obtained through pre-training. Subsequently, we use the self-watermarking information of the model, as measured by proxy perplexity, to detect the source of the text. We found that our method demonstrates impressive detection performance while ensuring speed and security, particularly achieving a recognition accuracy of 97.97\% for human-authored text. Furthermore, our detection tool also shows promising results in identifying the large language model (e.g., GPT-2, OPT, LLaMA, Vicuna...) responsible for the text. We release the code and processed data at \url{https://github.com/TrustedLLM/LLMDet}.

96.Sentiment Analysis in the Era of Large Language Models: A Reality Check

Authors:Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, Lidong Bing

Abstract: Sentiment analysis (SA) has been a long-standing research area in natural language processing. It can offer rich insights into human sentiments and opinions and has thus seen considerable interest from both academia and industry. With the advent of large language models (LLMs) such as ChatGPT, there is a great potential for their employment on SA problems. However, the extent to which existing LLMs can be leveraged for different sentiment analysis tasks remains unclear. This paper aims to provide a comprehensive investigation into the capabilities of LLMs in performing various sentiment analysis tasks, from conventional sentiment classification to aspect-based sentiment analysis and multifaceted analysis of subjective texts. We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets. Our study reveals that while LLMs demonstrate satisfactory performance in simpler tasks, they lag behind in more complex tasks requiring deeper understanding or structured sentiment information. However, LLMs significantly outperform SLMs in few-shot learning settings, suggesting their potential when annotation resources are limited. We also highlight the limitations of current evaluation practices in assessing LLMs' SA abilities and propose a novel benchmark, \textsc{SentiEval}, for a more comprehensive and realistic evaluation. Data and code during our investigations are available at \url{https://github.com/DAMO-NLP-SG/LLM-Sentiment}.

97.Are Chatbots Ready for Privacy-Sensitive Applications? An Investigation into Input Regurgitation and Prompt-Induced Sanitization

Authors:Aman Priyanshu, Supriti Vijay, Ayush Kumar, Rakshit Naidu, Fatemehsadat Mireshghallah

Abstract: LLM-powered chatbots are becoming widely adopted in applications such as healthcare, personal assistants, industry hiring decisions, etc. In many of these cases, chatbots are fed sensitive, personal information in their prompts, as samples for in-context learning, retrieved records from a database, or as part of the conversation. The information provided in the prompt could directly appear in the output, which might have privacy ramifications if there is sensitive information there. As such, in this paper, we aim to understand the input copying and regurgitation capabilities of these models during inference and how they can be directly instructed to limit this copying by complying with regulations such as HIPAA and GDPR, based on their internal knowledge of them. More specifically, we find that when ChatGPT is prompted to summarize cover letters of a 100 candidates, it would retain personally identifiable information (PII) verbatim in 57.4% of cases, and we find this retention to be non-uniform between different subgroups of people, based on attributes such as gender identity. We then probe ChatGPT's perception of privacy-related policies and privatization mechanisms by directly instructing it to provide compliant outputs and observe a significant omission of PII from output.

98.Injecting Knowledge into Biomedical Pre-trained Models via Polymorphism and Synonymous Substitution

Authors:Hongbo Zhang, Xiang Wan, Benyou Wang

Abstract: Pre-trained language models (PLMs) were considered to be able to store relational knowledge present in the training data. However, some relational knowledge seems to be discarded unsafely in PLMs due to \textbf{report bias}: low-frequency relational knowledge might be underexpressed compared to high-frequency one in PLMs. This gives us a hint that relational knowledge might not be redundant to the stored knowledge of PLMs, but rather be complementary. To additionally inject relational knowledge into PLMs, we propose a simple-yet-effective approach to inject relational knowledge into PLMs, which is inspired by three observations (namely, polymorphism, synonymous substitution, and association). In particular, we switch entities in the training corpus to related entities (either hypernyms/hyponyms/synonyms, or arbitrarily-related concepts). Experimental results show that the proposed approach could not only better capture relational knowledge, but also improve the performance in various biomedical downstream tasks. Our model is available in \url{https://github.com/StevenZHB/BioPLM_InjectingKnowledge}.

99.Bactrian-X : A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation

Authors:Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, Timothy Baldwin

Abstract: Instruction tuning has shown great promise in the field of natural language processing. However, the research on multilingual instruction tuning has been limited due to the scarcity of high-quality instruction-response datasets. To address this gap, we present Bactrian-X, a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. Leveraging this dataset, we train a set of adapters using low-rank adaptation (LoRA), which are lightweight components seamlessly integrated with foundational models. These adapters have a significantly smaller parameter count than the base model, making them easily replaceable and usable as plug-ins for different languages or language groups. Through extensive experiments on 52 languages, we demonstrate the superior performance of our models in various multilingual evaluation settings. Our proposed models outperform both the vanilla models and the existing instruction-tuned models. The code and models are publicly available at https://github.com/mbzuai-nlp/bactrian-x.

100.Unlocking Temporal Question Answering for Large Language Models Using Code Execution

Authors:Xingxuan Li, Liying Cheng, Qingyu Tan, Hwee Tou Ng, Shafiq Joty, Lidong Bing

Abstract: Large language models (LLMs) have made significant progress in natural language processing (NLP), and are utilized extensively in various applications. Recent works, such as chain-of-thought (CoT), have shown that intermediate reasoning steps can improve the performance of LLMs for complex reasoning tasks, such as math problems and symbolic question-answering tasks. However, we notice the challenge that LLMs face when it comes to temporal reasoning. Our preliminary experiments show that generating intermediate reasoning steps does not always boost the performance of complex temporal question-answering tasks. Therefore, we propose a novel framework that combines the extraction capability of LLMs and the logical reasoning capability of a Python solver to tackle this issue. Extensive experiments and analysis demonstrate the effectiveness of our framework in handling intricate time-bound reasoning tasks.

101.An Efficient Multilingual Language Model Compression through Vocabulary Trimming

Authors:Asahi Ushio, Yi Zhou, Jose Camacho-Collados

Abstract: Multilingual language model (LM) have become a powerful tool in NLP especially for non-English languages. Nevertheless, model parameters of multilingual LMs remain large due to the larger embedding matrix of the vocabulary covering tokens in different languages. On the contrary, monolingual LMs can be trained in a target language with the language-specific vocabulary only, but this requires a large budget and availability of reliable corpora to achieve a high-quality LM from scratch. In this paper, we propose vocabulary-trimming (VT), a method to reduce a multilingual LM vocabulary to a target language by deleting irrelevant tokens from its vocabulary. In theory, VT can compress any existing multilingual LM to build monolingual LMs in any language covered by the multilingual LM. In our experiments, we show that VT can retain the original performance of the multilingual LM, while being smaller in size (in general around 50% of the original vocabulary size is enough) than the original multilingual LM. The evaluation is performed over four NLP tasks (two generative and two classification tasks) among four widely used multilingual LMs in seven languages. Finally, we show that this methodology can keep the best of both monolingual and multilingual worlds by keeping a small size as monolingual models without the need for specifically retraining them, and even limiting potentially harmful social biases.

102.ChatAgri: Exploring Potentials of ChatGPT on Cross-linguistic Agricultural Text Classification

Authors:Biao Zhao, Weiqiang Jin, Javier Del Ser, Guang Yang

Abstract: In the era of sustainable smart agriculture, a massive amount of agricultural news text is being posted on the Internet, in which massive agricultural knowledge has been accumulated. In this context, it is urgent to explore effective text classification techniques for users to access the required agricultural knowledge with high efficiency. Mainstream deep learning approaches employing fine-tuning strategies on pre-trained language models (PLMs), have demonstrated remarkable performance gains over the past few years. Nonetheless, these methods still face many drawbacks that are complex to solve, including: 1. Limited agricultural training data due to the expensive-cost and labour-intensive annotation; 2. Poor domain transferability, especially of cross-linguistic ability; 3. Complex and expensive large models deployment.Inspired by the extraordinary success brought by the recent ChatGPT (e.g. GPT-3.5, GPT-4), in this work, we systematically investigate and explore the capability and utilization of ChatGPT applying to the agricultural informatization field. ....(shown in article).... Code has been released on Github https://github.com/albert-jin/agricultural_textual_classification_ChatGPT.

103.Dior-CVAE: Diffusion Priors in Variational Dialog Generation

Authors:Tianyu Yang, Thy Thy Tran, Iryna Gurevych

Abstract: Conditional variational autoencoders (CVAEs) have been used recently for diverse response generation, by introducing latent variables to represent the relationship between a dialog context and its potential responses. However, the diversity of the generated responses brought by a CVAE model is limited due to the oversimplified assumption of the isotropic Gaussian prior. We propose, Dior-CVAE, a hierarchical CVAE model with an informative prior produced by a diffusion model. Dior-CVAE derives a series of layer-wise latent variables using attention mechanism and infusing them into decoder layers accordingly. We propose memory dropout in the latent infusion to alleviate posterior collapse. The prior distribution of the latent variables is parameterized by a diffusion model to introduce a multimodal distribution. Overall, experiments on two popular open-domain dialog datasets indicate the advantages of our approach over previous Transformer-based variational dialog models in dialog response generation. We publicly release the code for reproducing Dior-CVAE and all baselines at https://github.com/SkyFishMoon/Latent-Diffusion-Response-Generation.

104.ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories

Authors:Heming Xia, Qingxiu Dong, Lei Li, Jingjing Xu, Ziwei Qin, Zhifang Sui

Abstract: Recently, Pretrained Language Models (PLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current PLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a fine-grained, human-annotated dataset specifically designed for zero-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve into the fundamental visual commonsense knowledge of both unimodal PLMs and VaLMs, uncovering the scaling law and the influence of the backbone model on VaLMs. Furthermore, we investigate the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.

105.How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

Authors:Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank

Abstract: Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.

106.SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models

Authors:Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Ming Liu, Bing Qin

Abstract: Despite achieving remarkable performance on various vision-language tasks, Transformer-based pretrained vision-language models (VLMs) still suffer from efficiency issues arising from long inputs and numerous parameters, limiting their real-world applications. However, the huge computation is redundant for most samples and the degree of redundancy and the respective components vary significantly depending on tasks and input instances. In this work, we propose an adaptive acceleration method SmartTrim for VLMs, which adjusts the inference overhead based on the complexity of instances. Specifically, SmartTrim incorporates lightweight trimming modules into the backbone to perform task-specific pruning on redundant inputs and parameters, without the need for additional pre-training or data augmentation. Since visual and textual representations complement each other in VLMs, we propose to leverage cross-modal interaction information to provide more critical semantic guidance for identifying redundant parts. Meanwhile, we introduce a self-distillation strategy that encourages the trimmed model to be consistent with the full-capacity model, which yields further performance gains. Experimental results demonstrate that SmartTrim significantly reduces the computation overhead (2-3 times) of various VLMs with comparable performance (only a 1-2% degradation) on various vision-language tasks. Compared to previous acceleration methods, SmartTrim attains a better efficiency-performance trade-off, demonstrating great potential for application in resource-constrained scenarios.

107.Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations

Authors:Wei-Lin Chen, Cheng-Kuang Wu, Hsin-Hsi Chen

Abstract: Large language models (LMs) have exhibited superior in-context learning (ICL) ability to adopt to target tasks by prompting with a few input-output demonstrations. Towards better ICL, different methods are proposed to select representative demonstrations from existing training corpora. However, such a setting is not aligned with real-world practices, as end-users usually query LMs without accesses to demonstration pools. Inspired by evidence suggesting LMs' zero-shot capabilities are underrated, and the role of demonstrations are primarily for exposing models' intrinsic functionalities, we introduce Self-ICL, a simple framework for zero-shot ICL. Given a test input, Self-ICL first prompts the model to generate pseudo-inputs. Next, the model predicts pseudo-labels for the pseudo-inputs via zero-shot prompting. Finally, we construct pseudo-demonstrations from pseudo-input-label pairs, and perform ICL for the test input. Evaluation on BIG-Bench Hard shows Self-ICL steadily surpasses zero-shot and zero-shot chain-of-thought baselines on head-to-head and all-task average performance. Our findings suggest the possibility to bootstrap LMs' intrinsic capabilities towards better zero-shot performance.

108.Is GPT-4 a Good Data Analyst?

Authors:Liying Cheng, Xingxuan Li, Lidong Bing

Abstract: As large language models (LLMs) have demonstrated their powerful capabilities in plenty of domains and tasks, including context understanding, code generation, language generation, data storytelling, etc., many data analysts may raise concerns if their jobs will be replaced by AI. This controversial topic has drawn a lot of attention in public. However, we are still at a stage of divergent opinions without any definitive conclusion. Motivated by this, we raise the research question of "is GPT-4 a good data analyst?" in this work and aim to answer it by conducting head-to-head comparative studies. In detail, we regard GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains. We propose a framework to tackle the problems by carefully designing the prompts for GPT-4 to conduct experiments. We also design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4. Experimental results show that GPT-4 can achieve comparable performance to humans. We also provide in-depth discussions about our results to shed light on further studies before we reach the conclusion that GPT-4 can replace data analysts.

109.Active Learning for Natural Language Generation

Authors:Yotam Perlitz, Ariel Gera, Michal Shmueli-Scheuer, Dafna Sheinwald, Noam Slonim, Liat Ein-Dor

Abstract: The field of text generation suffers from a severe shortage of labeled data due to the extremely expensive and time consuming process involved in manual annotation. A natural approach for coping with this problem is active learning (AL), a well-known machine learning technique for improving annotation efficiency by selectively choosing the most informative examples to label. However, while AL has been well-researched in the context of text classification, its application to text generation remained largely unexplored. In this paper, we present a first systematic study of active learning for text generation, considering a diverse set of tasks and multiple leading AL strategies. Our results indicate that existing AL strategies, despite their success in classification, are largely ineffective for the text generation scenario, and fail to consistently surpass the baseline of random example selection. We highlight some notable differences between the classification and generation scenarios, and analyze the selection behaviors of existing AL strategies. Our findings motivate exploring novel approaches for applying AL to NLG tasks.

110.Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

Authors:Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West

Abstract: Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.

111.Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks

Authors:Xiao Pu, Mingqi Gao, Xiaojun Wan

Abstract: Research on automated text summarization relies heavily on human and automatic evaluation. While recent work on human evaluation mainly adopted intrinsic evaluation methods, judging the generic quality of text summaries, e.g. informativeness and coherence, our work focuses on evaluating the usefulness of text summaries with extrinsic methods. We carefully design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment. We carry out experiments using system rankings and user behavior data to evaluate the performance of different summarization models. We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks. The results show that summaries generated by fine-tuned models lead to higher consistency in usefulness across all three tasks, as rankings of fine-tuned summarization systems are close across downstream tasks according to the proposed extrinsic metrics. Summaries generated by models in the zero-shot setting, however, are found to be biased towards the text classification and similarity assessment tasks, due to its general and less detailed summary style. We further evaluate the correlation of 14 intrinsic automatic metrics with human criteria and show that intrinsic automatic metrics perform well in evaluating the usefulness of summaries in the question-answering task, but are less effective in the other two tasks. This highlights the limitations of relying solely on intrinsic automatic metrics in evaluating the performance and usefulness of summaries.

112.SETI: Systematicity Evaluation of Textual Inference

Authors:Xiyan Fu, Anette Frank

Abstract: We propose SETI (Systematicity Evaluation of Textual Inference), a novel and comprehensive benchmark designed for evaluating pre-trained language models (PLMs) for their systematicity capabilities in the domain of textual inference. Specifically, SETI offers three different NLI tasks and corresponding datasets to evaluate various types of systematicity in reasoning processes. In order to solve these tasks, models are required to perform compositional inference based on known primitive constituents. We conduct experiments of SETI on six widely used PLMs. Results show that various PLMs are able to solve unseen compositional inferences when having encountered the knowledge of how to combine primitives, with good performance. However, they are considerably limited when this knowledge is unknown to the model (40-100% points decrease). Furthermore, we find that PLMs can improve drastically once exposed to crucial compositional knowledge in minimalistic shots. These findings position SETI as the first benchmark for measuring the future progress of PLMs in achieving systematicity generalization in the textual inference.

113.Ghostbuster: Detecting Text Ghostwritten by Large Language Models

Authors:Vivek Verma, Eve Fleisig, Nicholas Tomlin, Dan Klein

Abstract: We introduce Ghostbuster, a state-of-the-art system for detecting AI-generated text. Our method works by passing documents through a series of weaker language models and running a structured search over possible combinations of their features, then training a classifier on the selected features to determine if the target document was AI-generated. Crucially, Ghostbuster does not require access to token probabilities from the target model, making it useful for detecting text generated by black-box models or unknown model versions. In conjunction with our model, we release three new datasets of human and AI-generated text as detection benchmarks that cover multiple domains (student essays, creative fiction, and news) and task setups: document-level detection, author identification, and a challenge task of paragraph-level detection. Ghostbuster averages 99.1 F1 across all three datasets on document-level detection, outperforming previous approaches such as GPTZero and DetectGPT by up to 32.7 F1.

114.Ranger: A Toolkit for Effect-Size Based Multi-Task Evaluation

Authors:Mete Sertkan, Sophia Althammer, Sebastian Hofstätter

Abstract: In this paper, we introduce Ranger - a toolkit to facilitate the easy use of effect-size-based meta-analysis for multi-task evaluation in NLP and IR. We observed that our communities often face the challenge of aggregating results over incomparable metrics and scenarios, which makes conclusions and take-away messages less reliable. With Ranger, we aim to address this issue by providing a task-agnostic toolkit that combines the effect of a treatment on multiple tasks into one statistical evaluation, allowing for comparison of metrics and computation of an overall summary effect. Our toolkit produces publication-ready forest plots that enable clear communication of evaluation results over multiple tasks. Our goal with the ready-to-use Ranger toolkit is to promote robust, effect-size-based evaluation and improve evaluation standards in the community. We provide two case studies for common IR and NLP settings to highlight Ranger's benefits.

115.A Monte Carlo Language Model Pipeline for Zero-Shot Sociopolitical Event Extraction

Authors:Erica Cai, Brendan O'Connor

Abstract: We consider dyadic zero-shot event extraction (EE) to identify actions between pairs of actors. The \emph{zero-shot} setting allows social scientists or other non-computational researchers to extract any customized, user-specified set of events without training, resulting in a \emph{dyadic} event database, allowing insight into sociopolitical relational dynamics among actors and the higher level organizations or countries they represent. Unfortunately, we find that current zero-shot EE methods perform poorly for the task, with issues including word sense ambiguity, modality mismatch, and efficiency. Straightforward application of large language model prompting typically performs even worse. We address these challenges with a new fine-grained, multi-stage generative question-answer method, using a Monte Carlo approach to exploit and overcome the randomness of generative outputs. It performs 90\% fewer queries than a previous approach, with strong performance on the widely-used Automatic Content Extraction dataset. Finally, we extend our method to extract affiliations of actor arguments and demonstrate our method and findings on a dyadic international relations case study.

116.Decomposing Complex Queries for Tip-of-the-tongue Retrieval

Authors:Kevin Lin, Kyle Lo, Joseph E. Gonzalez, Dan Klein

Abstract: When re-finding items, users who forget or are uncertain about identifying details often rely on creative strategies for expressing their information needs -- complex queries that describe content elements (e.g., book characters or events), information beyond the document text (e.g., descriptions of book covers), or personal context (e.g., when they read a book). This retrieval setting, called tip of the tongue (TOT), is especially challenging for models heavily reliant on lexical and semantic overlap between query and document text. In this work, we introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results. This approach allows us to take advantage of off-the-shelf retrievers (e.g., CLIP for retrieving images of book covers) or incorporate retriever-specific logic (e.g., date constraints). We show that our framework incorportating query decompositions into retrievers can improve gold book recall up to 7% relative again for Recall@5 on a new collection of 14,441 real-world query-book pairs from an online community for resolving TOT inquiries.

117.Understanding Arithmetic Reasoning in Language Models using Causal Mediation Analysis

Authors:Alessandro Stolfo, Yonatan Belinkov, Mrinmaya Sachan

Abstract: Mathematical reasoning in large language models (LLMs) has garnered attention in recent research, but there is limited understanding of how these models process and store information related to arithmetic tasks. In this paper, we present a mechanistic interpretation of LLMs for arithmetic-based questions using a causal mediation analysis framework. By intervening on the activations of specific model components and measuring the resulting changes in predicted probabilities, we identify the subset of parameters responsible for specific predictions. We analyze two pre-trained language models with different sizes (2.8B and 6B parameters). Experimental results reveal that a small set of mid-late layers significantly affect predictions for arithmetic-based questions, with distinct activation patterns for correct and wrong predictions. We also investigate the role of the attention mechanism and compare the model's activation patterns for arithmetic queries with the prediction of factual knowledge. Our findings provide insights into the mechanistic interpretation of LLMs for arithmetic tasks and highlight the specific components involved in arithmetic reasoning.

118.Reasoning over Hierarchical Question Decomposition Tree for Explainable Question Answering

Authors:Jiajie Zhang, Shulin Cao, Tingjia Zhang, Xin Lv, Jiaxin Shi, Qi Tian, Juanzi Li, Lei Hou

Abstract: Explainable question answering (XQA) aims to answer a given question and provide an explanation why the answer is selected. Existing XQA methods focus on reasoning on a single knowledge source, e.g., structured knowledge bases, unstructured corpora, etc. However, integrating information from heterogeneous knowledge sources is essential to answer complex questions. In this paper, we propose to leverage question decomposing for heterogeneous knowledge integration, by breaking down a complex question into simpler ones, and selecting the appropriate knowledge source for each sub-question. To facilitate reasoning, we propose a novel two-stage XQA framework, Reasoning over Hierarchical Question Decomposition Tree (RoHT). First, we build the Hierarchical Question Decomposition Tree (HQDT) to understand the semantics of a complex question; then, we conduct probabilistic reasoning over HQDT from root to leaves recursively, to aggregate heterogeneous knowledge at different tree levels and search for a best solution considering the decomposing and answering probabilities. The experiments on complex QA datasets KQA Pro and Musique show that our framework outperforms SOTA methods significantly, demonstrating the effectiveness of leveraging question decomposing for knowledge integration and our RoHT framework.

119.Learning the String Partial Order

Authors:Tianyu Liu, Afra Amini, Mrinmaya Sachan, Ryan Cotterell

Abstract: We show that most structured prediction problems can be solved in linear time and space by considering them as partial orderings of the tokens in the input string. Our method computes real numbers for each token in an input string and sorts the tokens accordingly, resulting in as few as 2 total orders of the tokens in the string. Each total order possesses a set of edges oriented from smaller to greater tokens. The intersection of total orders results in a partial order over the set of input tokens, which is then decoded into a directed graph representing the desired structure. Experiments show that our method achieves 95.4 LAS and 96.9 UAS by using an intersection of 2 total orders, 95.7 LAS and 97.1 UAS with 4 on the English Penn Treebank dependency parsing benchmark. Our method is also the first linear-complexity coreference resolution model and achieves 79.2 F1 on the English OntoNotes benchmark, which is comparable with state of the art.

120.Who Wrote this Code? Watermarking for Code Generation

Authors:Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, Gunhee Kim

Abstract: Large language models for code have recently shown remarkable performance in generating executable code. However, this rapid advancement has been accompanied by many legal and ethical concerns, such as code licensing issues, code plagiarism, and malware generation, making watermarking machine-generated code a very timely problem. Despite such imminent needs, we discover that existing watermarking and machine-generated text detection methods for LLMs fail to function with code generation tasks properly. Hence, in this work, we propose a new watermarking method, SWEET, that significantly improves upon previous approaches when watermarking machine-generated code. Our proposed method selectively applies watermarking to the tokens with high enough entropy, surpassing a defined threshold. The experiments on code generation benchmarks show that our watermarked code has superior quality compared to code produced by the previous state-of-the-art LLM watermarking method. Furthermore, our watermark method also outperforms DetectGPT for the task of machine-generated code detection.

121.Lawyer LLaMA Technical Report

Authors:Quzhe Huang, Mingxu Tao, Zhenwei An, Chen Zhang, Cong Jiang, Zhibin Chen, Zirui Wu, Yansong Feng

Abstract: Large Language Models (LLMs), like LLaMA, have exhibited remarkable performances across various tasks. Nevertheless, when deployed to specific domains such as law or medicine, the models still confront the challenge of a deficiency in domain-specific knowledge and an inadequate capability to leverage that knowledge to resolve domain-related problems. In this paper, we focus on the legal domain and explore how to inject domain knowledge during the continual training stage and how to design proper supervised finetune tasks to help the model tackle practical issues. Moreover, to alleviate the hallucination problem during model's generation, we add a retrieval module and extract relevant articles before the model answers any queries. Augmenting with the extracted evidence, our model could generate more reliable responses. We release our data and model at https://github.com/AndrewZhe/lawyer-llama.

122.Prompt Optimization of Large Language Model for Interactive Tasks without Gradient and Demonstrations

Authors:Siqi Ouyang, Lei Li

Abstract: Large language models (LLMs) have demonstrated remarkable language proficiency, but they face challenges when solving interactive tasks independently. Existing methods either rely on gradient access, which is often inaccessible in state-of-the-art LLMs like GPT-4, or necessitate diverse and high-quality in-context demonstrations. In this study, we propose LLM-PO, a novel approach that enables LLMs to address these tasks without gradient access or extensive demonstrations. The key idea is to maintain a text-based plan and ask LLMs to reflect on pros and cons of the current plan based on experience collected with it, to update the plan, and to collect more experiences with the new plan. Experiments on HotpotQA demonstrate that LLM-PO achieves higher or on par success rates compared to in-context learning (ICL) baselines while requiring less inference cost.

123.Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning

Authors:Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, Nouha Dziri, Jillian Fisher, Bill Yuchen Lin, Skyler Hallinan, Xiang Ren, Sean Welleck, Yejin Choi

Abstract: Large language models excel at a variety of language tasks when prompted with examples or instructions. Yet controlling these models through prompting alone is limited. Tailoring language models through fine-tuning (e.g., via reinforcement learning) can be effective, but it is expensive and requires model access. We propose Inference-time Policy Adapters (IPA), which efficiently tailors a language model such as GPT-3 without fine-tuning it. IPA guides a large base model during decoding time through a lightweight policy adaptor trained to optimize an arbitrary user objective with reinforcement learning. On five challenging text generation tasks, such as toxicity reduction and open-domain generation, IPA consistently brings significant improvements over off-the-shelf language models. It outperforms competitive baseline methods, sometimes even including expensive fine-tuning. In particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring GPT- 3 with IPA brings a major performance boost over GPT-3 (and sometimes even over GPT-4). Our promising results highlight the potential of IPA as a lightweight alternative to tailoring extreme-scale language models.

124.Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing

Authors:Tianyi Tang, Hongyuan Lu, Yuchen Eleanor Jiang, Haoyang Huang, Dongdong Zhang, Wayne Xin Zhao, Furu Wei

Abstract: Most research about natural language generation (NLG) relies on evaluation benchmarks with limited references for a sample, which may result in poor correlations with human judgements. The underlying reason is that one semantic meaning can actually be expressed in different forms, and the evaluation with a single or few references may not accurately reflect the quality of the model's hypotheses. To address this issue, this paper presents a novel method, named Para-Ref, to enhance existing evaluation benchmarks by enriching the number of references. We leverage large language models (LLMs) to paraphrase a single reference into multiple high-quality ones in diverse expressions. Experimental results on representative NLG tasks of machine translation, text summarization, and image caption demonstrate that our method can effectively improve the correlation with human evaluation for sixteen automatic evaluation metrics by +7.82% in ratio. We release the code and data at https://github.com/RUCAIBox/Para-Ref.

125.ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind

Authors:Xiaomeng Ma, Lingyu Gao, Qihui Xu

Abstract: Theory of Mind (ToM), the capacity to comprehend the mental states of distinct individuals, is essential for numerous practical applications. With the development of large language models, there is a heated debate about whether they are able to perform ToM tasks. Previous studies have used different tasks and prompts to test the ToM on large language models and the results are inconsistent: some studies asserted these models are capable of exhibiting ToM, while others suggest the opposite. In this study, We present ToMChallenges, a dataset for comprehensively evaluating Theory of Mind based on Sally-Anne and Smarties tests. We created 30 variations of each test (e.g., changing the person's name, location, and items). For each variation, we test the model's understanding of different aspects: reality, belief, 1st order belief, and 2nd order belief. We adapt our data for various tasks by creating unique prompts tailored for each task category: Fill-in-the-Blank, Multiple Choice, True/False, Chain-of-Thought True/False, Question Answering, and Text Completion. If the model has a robust ToM, it should be able to achieve good performance for different prompts across different tests. We evaluated two GPT-3.5 models, text-davinci-003 and gpt-3.5-turbo-0301, with our datasets. Our results indicate that consistent performance in ToM tasks remains a challenge.

126.Annotation Imputation to Individualize Predictions: Initial Studies on Distribution Dynamics and Model Predictions

Authors:London Lowmanstone, Ruyuan Wan, Risako Owan, Jaehyung Kim, Dongyeop Kang

Abstract: Annotating data via crowdsourcing is time-consuming and expensive. Owing to these costs, dataset creators often have each annotator label only a small subset of the data. This leads to sparse datasets with examples that are marked by few annotators; if an annotator is not selected to label an example, their opinion regarding it is lost. This is especially concerning for subjective NLP datasets where there is no correct label: people may have different valid opinions. Thus, we propose using imputation methods to restore the opinions of all annotators for all examples, creating a dataset that does not leave out any annotator's view. We then train and prompt models with data from the imputed dataset (rather than the original sparse dataset) to make predictions about majority and individual annotations. Unfortunately, the imputed data provided by our baseline methods does not improve predictions. However, through our analysis of it, we develop a strong understanding of how different imputation methods impact the original data in order to inform future imputation techniques. We make all of our code and data publicly available.

127.Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

Authors:Daman Arora, Himanshu Gaurav Singh, Mausam

Abstract: The performance on Large Language Models (LLMs) on existing reasoning benchmarks has shot up considerably over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem solving abilities of LLMs. We curate 450 challenging pre-engineering mathematics, physics and chemistry problems from the IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation on the GPT series of models reveals that although performance improves with newer models, the best being GPT-4, the highest performance, even after using techniques like Self-Consistency and Chain-of-Thought prompting is less than 40 percent. Our analysis demonstrates that errors in algebraic manipulation and failure in retrieving relevant domain specific concepts are primary contributors to GPT4's low performance. Given the challenging nature of the benchmark, we hope that it can guide future research in problem solving using LLMs. Our code and dataset is available here.

128.HuatuoGPT, towards Taming Language Model to Be a Doctor

Authors:Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, Haizhou Li

Abstract: In this paper, we present HuatuoGPT, a large language model (LLM) for medical consultation. The core recipe of HuatuoGPT is to leverage both \textit{distilled data from ChatGPT} and \textit{real-world data from doctors} in the supervised fine-tuned stage. The responses of ChatGPT are usually detailed, well-presented and informative while it cannot perform like a doctor in many aspects, e.g. for integrative diagnosis. We argue that real-world data from doctors would be complementary to distilled data in the sense the former could tame a distilled language model to perform like doctors. To better leverage the strengths of both data, we train a reward model to align the language model with the merits that both data bring, following an RLAIF (reinforced learning from AI feedback) fashion. To evaluate and benchmark the models, we propose a comprehensive evaluation scheme (including automatic and manual metrics). Experimental results demonstrate that HuatuoGPT achieves state-of-the-art results in performing medical consultation among open-source LLMs in GPT-4 evaluation, human evaluation, and medical benchmark datasets. It is worth noting that by using additional real-world data and RLAIF, the distilled language model (i.e., HuatuoGPT) outperforms its teacher model ChatGPT in most cases. Our code, data, and models are publicly available at \url{https://github.com/FreedomIntelligence/HuatuoGPT}. The online demo is available at \url{https://www.HuatuoGPT.cn/}.

129.Meta-Learning Online Adaptation of Language Models

Authors:Nathan Hu, Eric Mitchell, Christopher D. Manning, Chelsea Finn

Abstract: Large language models encode surprisingly broad knowledge about the world into their parameters. However, the knowledge in static language models can fall out of date, limiting the model's effective "shelf life." While online fine-tuning can reduce this degradation, we find that fine-tuning on a stream of documents using standard optimizers such as Adam leads to a disappointingly low level of information uptake. We hypothesize that online fine-tuning does not sufficiently 'attend' to important information. That is, the gradient signal from important tokens representing factual information is drowned out by the gradient from inherently noisy tokens, suggesting a dynamic, context-aware learning rate may be beneficial. To test this hypothesis, we meta-train a small, autoregressive model to reweight the language modeling loss for each token during online fine-tuning, with the objective of maximizing the out-of-date base language model's ability to answer questions about a document after a single weighted gradient step. We call this approach Context-aware Meta-learned Loss Scaling (CaMeLS). Across three different distributions of documents, our experiments find that fine-tuning on streams of thousands of documents with CaMeLS substantially improves knowledge retention compared to standard online fine-tuning. Finally, we find that the meta-learned weights are general, and that a single reweighting model can be used to enhance the online adaptation of many LMs.

130.Contrastive Learning of Sentence Embeddings from Scratch

Authors:Junlei Zhang, Zhenzhong Lan, Junxian He

Abstract: Contrastive learning has been the dominant approach to train state-of-the-art sentence embeddings. Previous studies have typically learned sentence embeddings either through the use of human-annotated natural language inference (NLI) data or via large-scale unlabeled sentences in an unsupervised manner. However, even in the case of unlabeled data, their acquisition presents challenges in certain domains due to various reasons. To address these issues, we present SynCSE, a contrastive learning framework that trains sentence embeddings with synthesized data. Specifically, we explore utilizing large language models to synthesize the required data samples for contrastive learning, including (1) producing positive and negative annotations given unlabeled sentences (SynCSE-partial), and (2) generating sentences along with their corresponding annotations from scratch (SynCSE-scratch). Experimental results on sentence similarity and reranking tasks indicate that both SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines, and SynCSE-partial even achieves comparable performance to the supervised models in most settings.

131.Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Authors:Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, Seunghyun Park

Abstract: Advances in Large Language Models (LLMs) have inspired a surge of research exploring their expansion into the visual domain. While recent models exhibit promise in generating abstract captions for images and conducting natural conversations, their performance on text-rich images leaves room for improvement. In this paper, we propose the Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details typically overlooked by existing methods. Cream integrates vision and auxiliary encoders, complemented by a contrastive feature alignment technique, resulting in a more effective understanding of textual information within document images. Our approach, thus, seeks to bridge the gap between vision and language understanding, paving the way for more sophisticated Document Intelligence Assistants. Rigorous evaluations across diverse tasks, such as visual question answering on document images, demonstrate the efficacy of Cream as a state-of-the-art model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/cream

132.Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions

Authors:Jiahuan Li, Hao Zhou, Shujian Huang, Shanbo Chen, Jiajun Chen

Abstract: Large-scale Pretrained Language Models~(LLMs), such as ChatGPT and GPT4, have shown strong abilities in multilingual translations, without being explicitly trained on parallel corpora. It is interesting how the LLMs obtain their ability to carry out translation instructions for different languages. In this paper, we present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7B, to perform multilingual translation following given instructions. Firstly, we show that the multilingual LLMs have stronger translation abilities than previously demonstrated. For a certain language pair, the performance depends on both the language families and the amount of data used in the pretraining phase. Secondly, we find that LLMs' ability to carry out translation instructions relies on the understanding of translation instruction and the alignment among different languages. With proper enhancement, LLMs could perform the translation task well even for those language pairs unseen during the instruction tuning phase.

133.Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples

Authors:Philipp Sadler, David Schlangen

Abstract: NLP tasks are typically defined extensionally through datasets containing example instantiations (e.g., pairs of image i and text t), but motivated intensionally through capabilities invoked in verbal descriptions of the task (e.g., "t is a description of i, for which the content of i needs to be recognised and understood"). We present Pento-DIARef, a diagnostic dataset in a visual domain of puzzle pieces where referring expressions are generated by a well-known symbolic algorithm (the "Incremental Algorithm"), which itself is motivated by appeal to a hypothesised capability (eliminating distractors through application of Gricean maxims). Our question then is whether the extensional description (the dataset) is sufficient for a neural model to pick up the underlying regularity and exhibit this capability given the simple task definition of producing expressions from visual inputs. We find that a model supported by a vision detection step and a targeted data generation scheme achieves an almost perfect BLEU@1 score and sentence accuracy, whereas simpler baselines do not.

134.STAR: Boosting Low-Resource Event Extraction by Structure-to-Text Data Generation with Large Language Models

Authors:Mingyu Derek Ma, Xiaoxuan Wang, Po-Nien Kung, P. Jeffrey Brantingham, Nanyun Peng, Wei Wang

Abstract: Structure prediction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies, thus they still heavily rely on task-specific training data to obtain reasonable performance. Due to the high cost of human annotation, low-resource event extraction, which requires minimal human cost, is urgently needed in real-world information extraction applications. We propose to synthesize data instances given limited seed demonstrations to boost low-resource event extraction performance. We propose STAR, a structure-to-text data generation method that first generates complicated event structures (Y) and then generates input passages (X), all with Large Language Models. We design fine-grained step-by-step instructions and the error cases and quality issues identified through self-reflection can be self-refined. Our experiments indicate that data generated by STAR can significantly improve the low-resource event extraction performance and they are even more effective than human-curated data points in some cases.

135.CSTS: Conditional Semantic Textual Similarity

Authors:Ameet Deshpande, Carlos E. Jimenez, Howard Chen, Vishvak Murahari, Victoria Graf, Tanmay Rajpurohit, Ashwin Kalyan, Danqi Chen, Karthik Narasimhan

Abstract: Semantic textual similarity (STS) has been a cornerstone task in NLP that measures the degree of similarity between a pair of sentences, with applications in information retrieval, question answering, and embedding methods. However, it is an inherently ambiguous task, with the sentence similarity depending on the specific aspect of interest. We resolve this ambiguity by proposing a novel task called conditional STS (C-STS) which measures similarity conditioned on an aspect elucidated in natural language (hereon, condition). As an example, the similarity between the sentences "The NBA player shoots a three-pointer." and "A man throws a tennis ball into the air to serve." is higher for the condition "The motion of the ball." (both upward) and lower for "The size of the ball." (one large and one small). C-STS's advantages are two-fold: (1) it reduces the subjectivity and ambiguity of STS, and (2) enables fine-grained similarity evaluation using diverse conditions. C-STS contains almost 20,000 instances from diverse domains and we evaluate several state-of-the-art models to demonstrate that even the most performant fine-tuning and in-context learning models (GPT-4, Flan, SimCSE) find it challenging, with Spearman correlation scores of <50. We encourage the community to evaluate their models on C-STS to provide a more holistic view of semantic similarity and natural language understanding.

136.Dynamic Masking Rate Schedules for MLM Pretraining

Authors:Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, Matthew L. Leavitt

Abstract: Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. Our work instead dynamically schedules the masking ratio throughout training. We found that linearly decreasing the masking rate from 30% to 15% over the course of pretraining improves average GLUE accuracy by 0.46% in BERT-base, compared to a standard 15% fixed rate. Further analyses demonstrate that the gains from scheduling come from being exposed to both high and low masking rate regimes. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models and achieve up to a 1.89x speedup in pretraining.

137.Referral Augmentation for Zero-Shot Information Retrieval

Authors:Michael Tang, Shunyu Yao, John Yang, Karthik Narasimhan

Abstract: We propose Referral-Augmented Retrieval (RAR), a simple technique that concatenates document indices with referrals, i.e. text from other documents that cite or link to the given document, to provide significant performance gains for zero-shot information retrieval. The key insight behind our method is that referrals provide a more complete, multi-view representation of a document, much like incoming page links in algorithms like PageRank provide a comprehensive idea of a webpage's importance. RAR works with both sparse and dense retrievers, and outperforms generative text expansion techniques such as DocT5Query and Query2Doc a 37% and 21% absolute improvement on ACL paper retrieval Recall@10 -- while also eliminating expensive model training and inference. We also analyze different methods for multi-referral aggregation and show that RAR enables up-to-date information retrieval without re-training.

138.Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

Authors:Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, Zhouhan Lin

Abstract: The transformer model is known to be computationally demanding, and prohibitively costly for long sequences, as the self-attention module uses a quadratic time and space complexity with respect to sequence length. Many researchers have focused on designing new forms of self-attention or introducing new parameters to overcome this limitation, however a large portion of them prohibits the model to inherit weights from large pretrained models. In this work, the transformer's inefficiency has been taken care of from another perspective. We propose Fourier Transformer, a simple yet effective approach by progressively removing redundancies in hidden sequence using the ready-made Fast Fourier Transform (FFT) operator to perform Discrete Cosine Transformation (DCT). Fourier Transformer is able to significantly reduce computational costs while retain the ability to inherit from various large pretrained models. Experiments show that our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA with significant improvement in both speed and space. For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART and other efficient models. \footnote{Our code is publicly available at \url{https://github.com/LUMIA-Group/FourierTransformer}}

139.The Role of Output Vocabulary in T2T LMs for SPARQL Semantic Parsing

Authors:Debayan Banerjee, Pranav Ajit Nair, Ricardo Usbeck, Chris Biemann

Abstract: In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs) are pre-dominantly trained for human language tasks, and hence, if the query vocabulary is replaced with a vocabulary more attuned to the LM tokenizer, the performance of models may improve. We carry out carefully selected vocabulary substitutions on the queries and find absolute gains in the range of 17% on the GrailQA dataset.

140.Another Dead End for Morphological Tags? Perturbed Inputs and Parsing

Authors:Alberto Muñoz-Ortiz, David Vilares

Abstract: The usefulness of part-of-speech tags for parsing has been heavily questioned due to the success of word-contextualized parsers. Yet, most studies are limited to coarse-grained tags and high quality written content; while we know little about their influence when it comes to models in production that face lexical errors. We expand these setups and design an adversarial attack to verify if the use of morphological information by parsers: (i) contributes to error propagation or (ii) if on the other hand it can play a role to correct mistakes that word-only neural parsers make. The results on 14 diverse UD treebanks show that under such attacks, for transition- and graph-based models their use contributes to degrade the performance even faster, while for the (lower-performing) sequence labeling parsers they are helpful. We also show that if morphological tags were utopically robust against lexical perturbations, they would be able to correct parsing mistakes.

141.On Degrees of Freedom in Defining and Testing Natural Language Understanding

Authors:Saku Sugawara, Shun Tsugita

Abstract: Natural language understanding (NLU) studies often exaggerate or underestimate the capabilities of systems, thereby limiting the reproducibility of their findings. These erroneous evaluations can be attributed to the difficulty of defining and testing NLU adequately. In this position paper, we reconsider this challenge by identifying two types of researcher degrees of freedom. We revisit Turing's original interpretation of the Turing test and indicate that an NLU test does not provide an operational definition; it merely provides inductive evidence that the test subject understands the language sufficiently well to meet stakeholder objectives. In other words, stakeholders are free to arbitrarily define NLU through their objectives. To use the test results as inductive evidence, stakeholders must carefully assess if the interpretation of test scores is valid or not. However, designing and using NLU tests involve other degrees of freedom, such as specifying target skills and defining evaluation metrics. As a result, achieving consensus among stakeholders becomes difficult. To resolve this issue, we propose a validity argument, which is a framework comprising a series of validation criteria across test components. By demonstrating that current practices in NLU studies can be associated with those criteria and organizing them into a comprehensive checklist, we prove that the validity argument can serve as a coherent guideline for designing credible test sets and facilitating scientific communication.

142.Topic-Guided Self-Introduction Generation for Social Media Users

Authors:Chunpu Xu, Jing Li, Piji Li, Min Yang

Abstract: Millions of users are active on social media. To allow users to better showcase themselves and network with others, we explore the auto-generation of social media self-introduction, a short sentence outlining a user's personal interests. While most prior work profiles users with tags (e.g., ages), we investigate sentence-level self-introductions to provide a more natural and engaging way for users to know each other. Here we exploit a user's tweeting history to generate their self-introduction. The task is non-trivial because the history content may be lengthy, noisy, and exhibit various personal interests. To address this challenge, we propose a novel unified topic-guided encoder-decoder (UTGED) framework; it models latent topics to reflect salient user interest, whose topic mixture then guides encoding a user's history and topic words control decoding their self-introduction. For experiments, we collect a large-scale Twitter dataset, and extensive results show the superiority of our UTGED to the advanced encoder-decoder models without topic modeling.

143.Pre-training Multi-party Dialogue Models with Latent Discourse Inference

Authors:Yiyang Li, Xinting Huang, Wei Bi, Hai Zhao

Abstract: Multi-party dialogues are more difficult for models to understand than one-to-one two-party dialogues, since they involve multiple interlocutors, resulting in interweaving reply-to relations and information flows. To step over these obstacles, an effective way is to pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying. However, due to the lack of explicitly annotated discourse labels in multi-party dialogue corpora, previous works fail to scale up the pre-training process by putting aside the unlabeled multi-party conversational data for nothing. To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model by unsupervised latent variable inference methods. Experiments on multiple downstream tasks show that our pre-trained model outperforms strong baselines by large margins and achieves state-of-the-art (SOTA) results, justifying the effectiveness of our method. The official implementation of this paper is available at https://github.com/EricLee8/MPD_EMVI.

144.HiTIN: Hierarchy-aware Tree Isomorphism Network for Hierarchical Text Classification

Authors:He Zhu, Chong Zhang, Junjie Huang, Junran Wu, Ke Xu

Abstract: Hierarchical text classification (HTC) is a challenging subtask of multi-label classification as the labels form a complex hierarchical structure. Existing dual-encoder methods in HTC achieve weak performance gains with huge memory overheads and their structure encoders heavily rely on domain knowledge. Under such observation, we tend to investigate the feasibility of a memory-friendly model with strong generalization capability that could boost the performance of HTC without prior statistics or label semantics. In this paper, we propose Hierarchy-aware Tree Isomorphism Network (HiTIN) to enhance the text representations with only syntactic information of the label hierarchy. Specifically, we convert the label hierarchy into an unweighted tree structure, termed coding tree, with the guidance of structural entropy. Then we design a structure encoder to incorporate hierarchy-aware information in the coding tree into text representations. Besides the text encoder, HiTIN only contains a few multi-layer perceptions and linear transformations, which greatly saves memory. We conduct experiments on three commonly used datasets and the results demonstrate that HiTIN could achieve better test performance and less memory consumption than state-of-the-art (SOTA) methods.

145.Are Pre-trained Language Models Useful for Model Ensemble in Chinese Grammatical Error Correction?

Authors:Chenming Tang, Xiuyu Wu, Yunfang Wu

Abstract: Model ensemble has been in widespread use for Grammatical Error Correction (GEC), boosting model performance. We hypothesize that model ensemble based on the perplexity (PPL) computed by pre-trained language models (PLMs) should benefit the GEC system. To this end, we explore several ensemble strategies based on strong PLMs with four sophisticated single models. However, the performance does not improve but even gets worse after the PLM-based ensemble. This surprising result sets us doing a detailed analysis on the data and coming up with some insights on GEC. The human references of correct sentences is far from sufficient in the test data, and the gap between a correct sentence and an idiomatic one is worth our attention. Moreover, the PLM-based ensemble strategies provide an effective way to extend and improve GEC benchmark data. Our source code is available at https://github.com/JamyDon/PLM-based-CGEC-Model-Ensemble.

146.SciReviewGen: A Large-scale Dataset for Automatic Literature Review Generation

Authors:Tetsu Kasanishi, Masaru Isonuma, Junichiro Mori, Ichiro Sakata

Abstract: Automatic literature review generation is one of the most challenging tasks in natural language processing. Although large language models have tackled literature review generation, the absence of large-scale datasets has been a stumbling block to the progress. We release SciReviewGen, consisting of over 10,000 literature reviews and 690,000 papers cited in the reviews. Based on the dataset, we evaluate recent transformer-based summarization models on the literature review generation task, including Fusion-in-Decoder extended for literature review generation. Human evaluation results show that some machine-generated summaries are comparable to human-written reviews, while revealing the challenges of automatic literature review generation such as hallucinations and a lack of detailed information. Our dataset and code are available at https://github.com/tetsu9923/SciReviewGen.

147.Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model Fine-tuning

Authors:Zhen-Ru Zhang, Chuanqi Tan, Haiyang Xu, Chengyu Wang, Jun Huang, Songfang Huang

Abstract: Fine-tuning large pre-trained language models on various downstream tasks with whole parameters is prohibitively expensive. Hence, Parameter-efficient fine-tuning has attracted attention that only optimizes a few task-specific parameters with the frozen pre-trained model. In this work, we focus on prefix tuning, which only optimizes continuous prefix vectors (i.e. pseudo tokens) inserted into Transformer layers. Based on the observation that the learned syntax and semantics representation varies a lot at different layers, we argue that the adaptive prefix will be further tailored to each layer than the fixed one, enabling the fine-tuning more effective and efficient. Thus, we propose Adaptive Prefix Tuning (APT) to adjust the prefix in terms of both fine-grained token level and coarse-grained layer level with a gate mechanism. Experiments on the SuperGLUE and NER datasets show the effectiveness of APT. In addition, taking the gate as a probing, we validate the efficiency and effectiveness of the variable prefix.

148.Neural Summarization of Electronic Health Records

Authors:Koyena Pal, Seyed Ali Bahrainian, Laura Mercurio, Carsten Eickhoff

Abstract: Hospital discharge documentation is among the most essential, yet time-consuming documents written by medical practitioners. The objective of this study was to automatically generate hospital discharge summaries using neural network summarization models. We studied various data preparation and neural network training techniques that generate discharge summaries. Using nursing notes and discharge summaries from the MIMIC-III dataset, we studied the viability of the automatic generation of various sections of a discharge summary using four state-of-the-art neural network summarization models (BART, T5, Longformer and FLAN-T5). Our experiments indicated that training environments including nursing notes as the source, and discrete sections of the discharge summary as the target output (e.g. "History of Present Illness") improve language model efficiency and text quality. According to our findings, the fine-tuned BART model improved its ROUGE F1 score by 43.6% against its standard off-the-shelf version. We also found that fine-tuning the baseline BART model with other setups caused different degrees of improvement (up to 80% relative improvement). We also observed that a fine-tuned T5 generally achieves higher ROUGE F1 scores than other fine-tuned models and a fine-tuned FLAN-T5 achieves the highest ROUGE score overall, i.e., 45.6. For majority of the fine-tuned language models, summarizing discharge summary report sections separately outperformed the summarization the entire report quantitatively. On the other hand, fine-tuning language models that were previously instruction fine-tuned showed better performance in summarizing entire reports. This study concludes that a focused dataset designed for the automatic generation of discharge summaries by a language model can produce coherent Discharge Summary sections.

149.SAIL: Search-Augmented Instruction Learning

Authors:Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass

Abstract: Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information. In this work, we propose search-augmented instruction learning (SAIL), which grounds the language generation and instruction following abilities on complex search results generated by in-house and external search engines. With an instruction tuning corpus, we collect search results for each training case from different search APIs and domains, and construct a new search-grounded training set containing \textit{(instruction, grounding information, response)} triplets. We then fine-tune the LLaMA-7B model on the constructed training set. Since the collected results contain unrelated and disputing languages, the model needs to learn to ground on trustworthy search results, filter out distracting passages, and generate the target response. The search result-denoising process entails explicit trustworthy information selection and multi-hop reasoning, since the retrieved passages might be informative but not contain the instruction-following answer. Experiments show that the fine-tuned SAIL-7B model has a strong instruction-following ability, and it performs significantly better on transparency-sensitive tasks, including open-ended question answering and fact checking.

150.Boosting Cross-lingual Transferability in Multilingual Models via In-Context Learning

Authors:Sunkyoung Kim, Dayeon Ki, Yireun Kim, Jinsik Lee

Abstract: Existing cross-lingual transfer (CLT) prompting methods are only concerned with monolingual demonstration examples in the source language. In this paper, we propose In-CLT, a novel cross-lingual transfer prompting method that leverages both source and target languages to construct the demonstration examples. We conduct comprehensive evaluations on multilingual benchmarks, focusing on question answering tasks. Experiment results show that In-CLT prompt not only improves multilingual models' cross-lingual transferability, but also demonstrates remarkable unseen language generalization ability. In-CLT prompting, in particular, improves model performance by 10 to 20\% points on average when compared to prior cross-lingual transfer approaches. We also observe the surprising performance gain on the other multilingual benchmarks, especially in reasoning tasks. Furthermore, we investigate the relationship between lexical similarity and pre-training corpora in terms of the cross-lingual transfer gap.

151.LMs with a Voice: Spoken Language Modeling beyond Speech Tokens

Authors:Eliya Nachmani, Alon Levkovitch, Julian Salazar, Chulayutsh Asawaroengchai, Soroosh Mariooryad, RJ Skerry-Ryan, Michelle Tadmor Ramanovich

Abstract: We present SPECTRON, a novel approach to adapting pre-trained language models (LMs) to perform speech continuation. By leveraging pre-trained speech encoders, our model generates both text and speech outputs with the entire system being trained end-to-end operating directly on spectrograms. Training the entire model in the spectrogram domain simplifies our speech continuation system versus existing cascade methods which use discrete speech representations. We further show our method surpasses existing spoken language models both in semantic content and speaker preservation while also benefiting from the knowledge transferred from pre-existing models. Audio samples can be found in our website https://michelleramanovich.github.io/spectron/spectron

152.Revisiting Parallel Context Windows: A Frustratingly Simple Alternative and Chain-of-Thought Deterioration

Authors:Kejuan Yang, Xiao Liu, Kaiwen Men, Aohan Zeng, Yuxiao Dong, Jie Tang

Abstract: We identify two crucial limitations in the evaluation of recent parallel-integrated method Parallel Context Windows (PCW), which extends the maximum context lengths of language models, e.g., 2048 for LLaMA, by harnessing window-wise attention and positional embedding techniques. We first show that a simple yet strong baseline, weighted sum ensemble, is missing for the in-context few-shot classification. Moreover, on more challenging Chain-of-Thought (CoT) reasoning (e.g., HotpotQA), PCW would present unexpected deterioration regarding question miscomprehension and false inference. Based on our findings, we suggest that the existing PCW design may not guarantee sufficient improvement and practicality in handling lengthy documents in real-world applications. More community efforts on enabling language models' long context understanding ability should be paid.

153.EvEval: A Comprehensive Evaluation of Event Semantics for Large Language Models

Authors:Zhengwei Tao, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Yanlin Feng, Jia Li, Wenpeng Hu

Abstract: Events serve as fundamental units of occurrence within various contexts. The processing of event semantics in textual information forms the basis of numerous natural language processing (NLP) applications. Recent studies have begun leveraging large language models (LLMs) to address event semantic processing. However, the extent that LLMs can effectively tackle these challenges remains uncertain. Furthermore, the lack of a comprehensive evaluation framework for event semantic processing poses a significant challenge in evaluating these capabilities. In this paper, we propose an overarching framework for event semantic processing, encompassing understanding, reasoning, and prediction, along with their fine-grained aspects. To comprehensively evaluate the event semantic processing abilities of models, we introduce a novel benchmark called EVEVAL. We collect 8 datasets that cover all aspects of event semantic processing. Extensive experiments are conducted on EVEVAL, leading to several noteworthy findings based on the obtained results.

154.Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

Authors:Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, Najoung Kim, He He

Abstract: Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to longer and compositional proofs. However, they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.

155.Revisiting Token Dropping Strategy in Efficient BERT Pretraining

Authors:Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao

Abstract: Token dropping is a recently-proposed strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers. It can effectively reduce the training time without degrading much performance on downstream tasks. However, we empirically find that token dropping is prone to a semantic loss problem and falls short in handling semantic-intense tasks. Motivated by this, we propose a simple yet effective semantic-consistent learning method (ScTD) to improve the token dropping. ScTD aims to encourage the model to learn how to preserve the semantic information in the representation space. Extensive experiments on 12 tasks show that, with the help of our ScTD, token dropping can achieve consistent and significant performance gains across all task types and model sizes. More encouragingly, ScTD saves up to 57% of pretraining time and brings up to +1.56% average improvement over the vanilla token dropping.

156.Self-Evolution Learning for Discriminative Language Model Pretraining

Authors:Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao

Abstract: Masked language modeling, widely used in discriminative language model (e.g., BERT) pretraining, commonly adopts a random masking strategy. However, random masking does not consider the importance of the different words in the sentence meaning, where some of them are more worthy to be predicted. Therefore, various masking strategies (e.g., entity-level masking) are proposed, but most of them require expensive prior knowledge and generally train from scratch without reusing existing model weights. In this paper, we present Self-Evolution learning (SE), a simple and effective token masking and learning method to fully and wisely exploit the knowledge from data. SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach. Experiments on 10 tasks show that our SE brings consistent and significant improvements (+1.43~2.12 average scores) upon different PLMs. In-depth analyses demonstrate that SE improves linguistic knowledge learning and generalization.

157.A Simple and Effective Framework for Strict Zero-Shot Hierarchical Classification

Authors:Rohan Bhambhoria, Lei Chen, Xiaodan Zhu

Abstract: In recent years, large language models (LLMs) have achieved strong performance on benchmark tasks, especially in zero or few-shot settings. However, these benchmarks often do not adequately address the challenges posed in the real-world, such as that of hierarchical classification. In order to address this challenge, we propose refactoring conventional tasks on hierarchical datasets into a more indicative long-tail prediction task. We observe LLMs are more prone to failure in these cases. To address these limitations, we propose the use of entailment-contradiction prediction in conjunction with LLMs, which allows for strong performance in a strict zero-shot setting. Importantly, our method does not require any parameter updates, a resource-intensive process and achieves strong performance across multiple datasets.

158.Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

Authors:Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, Weizhu Chen

Abstract: Large language models are powerful text processors and reasoners, but are still subject to limitations including outdated knowledge and hallucinations, which necessitates connecting them to the world. Retrieval-augmented large language models have raised extensive attention for grounding model generation on external knowledge. However, retrievers struggle to capture relevance, especially for queries with complex information needs. Recent work has proposed to improve relevance modeling by having large language models actively involved in retrieval, i.e., to improve retrieval with generation. In this paper, we show that strong performance can be achieved by a method we call Iter-RetGen, which synergizes retrieval and generation in an iterative manner. A model output shows what might be needed to finish a task, and thus provides an informative context for retrieving more relevant knowledge which in turn helps generate a better output in the next iteration. Compared with recent work which interleaves retrieval with generation when producing an output, Iter-RetGen processes all retrieved knowledge as a whole and largely preserves the flexibility in generation without structural constraints. We evaluate Iter-RetGen on multi-hop question answering, fact verification, and commonsense reasoning, and show that it can flexibly leverage parametric knowledge and non-parametric knowledge, and is superior to or competitive with state-of-the-art retrieval-augmented baselines while causing fewer overheads of retrieval and generation. We can further improve performance via generation-augmented retrieval adaptation.

159.Gorilla: Large Language Model Connected with Massive APIs

Authors:Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez

Abstract: Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu

160.Learning Answer Generation using Supervision from Automatic Question Answering Evaluators

Authors:Matteo Gabburo, Siddhant Garg, Rik Koncel-Kedziorski, Alessandro Moschitti

Abstract: Recent studies show that sentence-level extractive QA, i.e., based on Answer Sentence Selection (AS2), is outperformed by Generation-based QA (GenQA) models, which generate answers using the top-k answer sentences ranked by AS2 models (a la retrieval-augmented generation style). In this paper, we propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA). Specifically, we propose three strategies to transfer knowledge from these QA evaluation models to a GenQA model: (i) augmenting training data with answers generated by the GenQA model and labelled by GAVA (either statically, before training, or (ii) dynamically, at every training epoch); and (iii) using the GAVA score for weighting the generator loss during the learning of the GenQA model. We evaluate our proposed methods on two academic and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.

161.Context-Aware Transformer Pre-Training for Answer Sentence Selection

Authors:Luca Di Liello, Siddhant Garg, Alessandro Moschitti

Abstract: Answer Sentence Selection (AS2) is a core component for building an accurate Question Answering pipeline. AS2 models rank a set of candidate sentences based on how likely they answer a given question. The state of the art in AS2 exploits pre-trained transformers by transferring them on large annotated datasets, while using local contextual information around the candidate sentence. In this paper, we propose three pre-training objectives designed to mimic the downstream fine-tuning task of contextual AS2. This allows for specializing LMs when fine-tuning for contextual AS2. Our experiments on three public and two large-scale industrial datasets show that our pre-training approaches (applied to RoBERTa and ELECTRA) can improve baseline contextual AS2 accuracy by up to 8% on some datasets.

162.ASPER: Answer Set Programming Enhanced Neural Network Models for Joint Entity-Relation Extraction

Authors:Trung Hoang Le, Huiping Cao, Tran Cao Son

Abstract: A plethora of approaches have been proposed for joint entity-relation (ER) extraction. Most of these methods largely depend on a large amount of manually annotated training data. However, manual data annotation is time consuming, labor intensive, and error prone. Human beings learn using both data (through induction) and knowledge (through deduction). Answer Set Programming (ASP) has been a widely utilized approach for knowledge representation and reasoning that is elaboration tolerant and adept at reasoning with incomplete information. This paper proposes a new approach, ASP-enhanced Entity-Relation extraction (ASPER), to jointly recognize entities and relations by learning from both data and domain knowledge. In particular, ASPER takes advantage of the factual knowledge (represented as facts in ASP) and derived knowledge (represented as rules in ASP) in the learning process of neural network models. We have conducted experiments on two real datasets and compare our method with three baselines. The results show that our ASPER model consistently outperforms the baselines.

163.Uncovering and Quantifying Social Biases in Code Generation

Authors:Yan Liu, Xiaokang Chen, Yan Gao, Zhe Su, Fengji Zhang, Daoguang Zan, Jian-Guang Lou, Pin-Yu Chen, Tsung-Yi Ho

Abstract: With the popularity of automatic code generation tools, such as Copilot, the study of the potential hazards of these tools is gaining importance. In this work, we explore the social bias problem in pre-trained code generation models. We propose a new paradigm to construct code prompts and successfully uncover social biases in code generation models. To quantify the severity of social biases in generated code, we develop a dataset along with three metrics to evaluate the overall social bias and fine-grained unfairness across different demographics. Experimental results on three pre-trained code generation models (Codex, InCoder, and CodeGen) with varying sizes, reveal severe social biases. Moreover, we conduct analysis to provide useful insights for further choice of code generation models with low social bias. (This work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.)

164.Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages

Authors:Khalid Alnajjar, Mika Hämäläinen, Jack Rueter

Abstract: In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the 4 endangered languages and Finnish. Our method reached at least 56\% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages with the only requirement being a dictionary between the endangered language and a majority language.

165.Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR

Authors:Kaushal Santosh Bhogale, Sai Sundaresan, Abhigyan Raman, Tahir Javed, Mitesh M. Khapra, Pratyush Kumar

Abstract: Improving ASR systems is necessary to make new LLM-based use-cases accessible to people across the globe. In this paper, we focus on Indian languages, and make the case that diverse benchmarks are required to evaluate and improve ASR systems for Indian languages. To address this, we collate Vistaar as a set of 59 benchmarks across various language and domain combinations, on which we evaluate 3 publicly available ASR systems and 2 commercial systems. We also train IndicWhisper models by fine-tuning the Whisper models on publicly available training datasets across 12 Indian languages totalling to 10.7K hours. We show that IndicWhisper significantly improves on considered ASR systems on the Vistaar benchmark. Indeed, IndicWhisper has the lowest WER in 39 out of the 59 benchmarks, with an average reduction of 4.1 WER. We open-source all datasets, code and models.

166.Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering

Authors:Avi Caciularu, Matthew E. Peters, Jacob Goldberger, Ido Dagan, Arman Cohan

Abstract: The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while "peeking" into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model -- termed QAmden -- and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, yielding improvements of up to 7%, and significantly outperforms zero-shot GPT-3.5 and GPT-4.

167.Comparing Humans and Models on a Similar Scale: Towards Cognitive Gender Bias Evaluation in Coreference Resolution

Authors:Gili Lior, Gabriel Stanovsky

Abstract: Spurious correlations were found to be an important factor explaining model performance in various NLP tasks (e.g., gender or racial artifacts), often considered to be ''shortcuts'' to the actual task. However, humans tend to similarly make quick (and sometimes wrong) predictions based on societal and cognitive presuppositions. In this work we address the question: can we quantify the extent to which model biases reflect human behaviour? Answering this question will help shed light on model performance and provide meaningful comparisons against humans. We approach this question through the lens of the dual-process theory for human decision-making. This theory differentiates between an automatic unconscious (and sometimes biased) ''fast system'' and a ''slow system'', which when triggered may revisit earlier automatic reactions. We make several observations from two crowdsourcing experiments of gender bias in coreference resolution, using self-paced reading to study the ''fast'' system, and question answering to study the ''slow'' system under a constrained time setting. On real-world data humans make $\sim$3\% more gender-biased decisions compared to models, while on synthetic data models are $\sim$12\% more biased.

168.AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Authors:Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao

Abstract: Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech (AV-S2ST) translation model without relying on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications: dictation or dubbing archival films. To mitigate the data scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representation, and 2) introduce cross-modal distillation with S2ST models trained on the audio-only corpus to further reduce the requirements of visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io

169.Unit-based Speech-to-Speech Translation Without Parallel Data

Authors:Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi

Abstract: We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages. Our approach maps source and target language speech signals into automatically discovered, discrete units and reformulates the problem as unsupervised unit-to-unit machine translation. We develop a three-step training procedure that involves (a) pre-training an unit-based encoder-decoder language model with a denoising objective (b) training it with word-by-word translated utterance pairs created by aligning monolingual text embedding spaces and (c) running unsupervised backtranslation bootstrapping off of the initial translation model. Our approach avoids mapping the speech signal into text and uses speech-to-unit and unit-to-speech models instead of automatic speech recognition and text to speech models. We evaluate our model on synthetic-speaker Europarl-ST English-German and German-English evaluation sets, finding that unit-based translation is feasible under this constrained scenario, achieving 9.29 ASR-BLEU in German to English and 8.07 in English to German.