arXiv daily

Computation and Language (cs.CL)

Tue, 13 Jun 2023

Other arXiv digests in this category:Thu, 14 Sep 2023; Wed, 13 Sep 2023; Tue, 12 Sep 2023; Mon, 11 Sep 2023; Fri, 08 Sep 2023; Tue, 05 Sep 2023; Fri, 01 Sep 2023; Thu, 31 Aug 2023; Wed, 30 Aug 2023; Tue, 29 Aug 2023; Mon, 28 Aug 2023; Fri, 25 Aug 2023; Thu, 24 Aug 2023; Wed, 23 Aug 2023; Tue, 22 Aug 2023; Mon, 21 Aug 2023; Fri, 18 Aug 2023; Thu, 17 Aug 2023; Wed, 16 Aug 2023; Tue, 15 Aug 2023; Mon, 14 Aug 2023; Fri, 11 Aug 2023; Thu, 10 Aug 2023; Wed, 09 Aug 2023; Tue, 08 Aug 2023; Mon, 07 Aug 2023; Fri, 04 Aug 2023; Thu, 03 Aug 2023; Wed, 02 Aug 2023; Tue, 01 Aug 2023; Mon, 31 Jul 2023; Fri, 28 Jul 2023; Thu, 27 Jul 2023; Wed, 26 Jul 2023; Tue, 25 Jul 2023; Mon, 24 Jul 2023; Fri, 21 Jul 2023; Thu, 20 Jul 2023; Wed, 19 Jul 2023; Tue, 18 Jul 2023; Mon, 17 Jul 2023; Fri, 14 Jul 2023; Thu, 13 Jul 2023; Wed, 12 Jul 2023; Tue, 11 Jul 2023; Mon, 10 Jul 2023; Fri, 07 Jul 2023; Thu, 06 Jul 2023; Wed, 05 Jul 2023; Tue, 04 Jul 2023; Mon, 03 Jul 2023; Fri, 30 Jun 2023; Thu, 29 Jun 2023; Wed, 28 Jun 2023; Tue, 27 Jun 2023; Mon, 26 Jun 2023; Fri, 23 Jun 2023; Thu, 22 Jun 2023; Wed, 21 Jun 2023; Tue, 20 Jun 2023; Fri, 16 Jun 2023; Thu, 15 Jun 2023; Mon, 12 Jun 2023; Fri, 09 Jun 2023; Thu, 08 Jun 2023; Wed, 07 Jun 2023; Tue, 06 Jun 2023; Mon, 05 Jun 2023; Fri, 02 Jun 2023; Thu, 01 Jun 2023; Wed, 31 May 2023; Tue, 30 May 2023; Mon, 29 May 2023; Fri, 26 May 2023; Thu, 25 May 2023; Wed, 24 May 2023; Tue, 23 May 2023; Mon, 22 May 2023; Fri, 19 May 2023; Thu, 18 May 2023; Wed, 17 May 2023; Tue, 16 May 2023; Mon, 15 May 2023; Fri, 12 May 2023; Thu, 11 May 2023; Wed, 10 May 2023; Tue, 09 May 2023; Mon, 08 May 2023; Fri, 05 May 2023; Thu, 04 May 2023; Wed, 03 May 2023; Tue, 02 May 2023; Mon, 01 May 2023; Fri, 28 Apr 2023; Thu, 27 Apr 2023; Wed, 26 Apr 2023; Tue, 25 Apr 2023; Mon, 24 Apr 2023; Fri, 21 Apr 2023; Thu, 20 Apr 2023; Wed, 19 Apr 2023; Tue, 18 Apr 2023; Mon, 17 Apr 2023; Fri, 14 Apr 2023; Thu, 13 Apr 2023; Wed, 12 Apr 2023; Tue, 11 Apr 2023; Mon, 10 Apr 2023
1.HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation

Authors:Qianyu He, Yikai Zhang, Jiaqing Liang, Yuncheng Huang, Yanghua Xiao, Yunwen Chen

Abstract: Similes play an imperative role in creative writing such as story and dialogue generation. Proper evaluation metrics are like a beacon guiding the research of simile generation (SG). However, it remains under-explored as to what criteria should be considered, how to quantify each criterion into metrics, and whether the metrics are effective for comprehensive, efficient, and reliable SG evaluation. To address the issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion. Through extensive experiments, we verify that our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics.

2.Question Decomposition Tree for Answering Complex Questions over Knowledge Bases

Authors:Xiang Huang, Sitao Cheng, Yiheng Shu, Yuheng Bao, Yuzhong Qu

Abstract: Knowledge base question answering (KBQA) has attracted a lot of interest in recent years, especially for complex questions which require multiple facts to answer. Question decomposition is a promising way to answer complex questions. Existing decomposition methods split the question into sub-questions according to a single compositionality type, which is not sufficient for questions involving multiple compositionality types. In this paper, we propose Question Decomposition Tree (QDT) to represent the structure of complex questions. Inspired by recent advances in natural language generation (NLG), we present a two-staged method called Clue-Decipher to generate QDT. It can leverage the strong ability of NLG model and simultaneously preserve the original questions. To verify that QDT can enhance KBQA task, we design a decomposition-based KBQA system called QDTQA. Extensive experiments show that QDTQA outperforms previous state-of-the-art methods on ComplexWebQuestions dataset. Besides, our decomposition method improves an existing KBQA system by 12% and sets a new state-of-the-art on LC-QuAD 1.0.

3.Soft Language Clustering for Multilingual Model Pre-training

Authors:Jiali Zeng, Yufan Jiang, Yongjing Yin, Yi Jing, Fandong Meng, Binghuai Lin, Yunbo Cao, Jie Zhou

Abstract: Multilingual pre-trained language models have demonstrated impressive (zero-shot) cross-lingual transfer abilities, however, their performance is hindered when the target language has distant typology from source languages or when pre-training data is limited in size. In this paper, we propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods. On the tasks of XTREME including text classification, sequence labeling, question answering, and sentence retrieval, both base- and large-size language models pre-trained with our proposed method exhibit consistent performance improvement. Furthermore, it provides substantial advantages for low-resource languages in unsupervised sentence retrieval and for target languages that differ greatly from the source language in cross-lingual transfer.

4.Rank-Aware Negative Training for Semi-Supervised Text Classification

Authors:Ahmed Murtadha, Shengfeng Pan, Wen Bo, Jianlin Su, Xinxin Cao, Wenze Zhang, Yunfeng Liu

Abstract: Semi-supervised text classification-based paradigms (SSTC) typically employ the spirit of self-training. The key idea is to train a deep classifier on limited labeled texts and then iteratively predict the unlabeled texts as their pseudo-labels for further training. However, the performance is largely affected by the accuracy of pseudo-labels, which may not be significant in real-world scenarios. This paper presents a Rank-aware Negative Training (RNT) framework to address SSTC in learning with noisy label manner. To alleviate the noisy information, we adapt a reasoning with uncertainty-based approach to rank the unlabeled texts based on the evidential support received from the labeled texts. Moreover, we propose the use of negative training to train RNT based on the concept that ``the input instance does not belong to the complementary label''. A complementary label is randomly selected from all labels except the label on-target. Intuitively, the probability of a true label serving as a complementary label is low and thus provides less noisy information during the training, resulting in better performance on the test data. Finally, we evaluate the proposed solution on various text classification benchmark datasets. Our extensive experiments show that it consistently overcomes the state-of-the-art alternatives in most scenarios and achieves competitive performance in the others. The code of RNT is publicly available at:https://github.com/amurtadha/RNT.

5.Human-Like Intuitive Behavior and Reasoning Biases Emerged in Language Models -- and Disappeared in GPT-4

Authors:Thilo Hagendorff, Sarah Fabi

Abstract: Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. Therefore, it is of great importance to evaluate their emerging abilities. In this study, we show that LLMs, most notably GPT-3, exhibit behavior that strikingly resembles human-like intuition -- and the cognitive errors that come with it. However, LLMs with higher cognitive capabilities, in particular ChatGPT and GPT-4, learned to avoid succumbing to these errors and perform in a hyperrational manner. For our experiments, we probe LLMs with the Cognitive Reflection Test (CRT) as well as semantic illusions that were originally designed to investigate intuitive decision-making in humans. Moreover, we probe how sturdy the inclination for intuitive-like decision-making is. Our study demonstrates that investigating LLMs with methods from psychology has the potential to reveal otherwise unknown emergent traits.

6.SqueezeLLM: Dense-and-Sparse Quantization

Authors:Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Abstract: Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is open-sourced and available online.

7.Hybrid lemmatization in HuSpaCy

Authors:Péter Berkecz, György Orosz, Zsolt Szántó, Gergő Szabó, Richárd Farkas

Abstract: Lemmatization is still not a trivial task for morphologically rich languages. Previous studies showed that hybrid architectures usually work better for these languages and can yield great results. This paper presents a hybrid lemmatizer utilizing both a neural model, dictionaries and hand-crafted rules. We introduce a hybrid architecture along with empirical results on a widely used Hungarian dataset. The presented methods are published as three HuSpaCy models.

8.Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation

Authors:Yuchen Han, Chen Xu, Tong Xiao, Jingbo Zhu

Abstract: Pre-training and fine-tuning is a paradigm for alleviating the data scarcity problem in end-to-end speech translation (E2E ST). The commonplace "modality gap" between speech and text data often leads to inconsistent inputs between pre-training and fine-tuning. However, we observe that this gap occurs in the early stages of fine-tuning, but does not have a major impact on the final performance. On the other hand, we find that there has another gap, which we call the "capacity gap": high resource tasks (such as ASR and MT) always require a large model to fit, when the model is reused for a low resource task (E2E ST), it will get a sub-optimal performance due to the over-fitting. In a case study, we find that the regularization plays a more important role than the well-designed modality adaption method, which achieves 29.0 for en-de and 40.3 for en-fr on the MuST-C dataset. Code and models are available at https://github.com/hannlp/TAB.

9.Is Anisotropy Inherent to Transformers?

Authors:Nathan Godey, Éric de la Clergerie, Benoît Sagot

Abstract: The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformers-based models.

10.Rethink the Effectiveness of Text Data Augmentation: An Empirical Analysis

Authors:Zhengxiang Shi, Aldo Lipani

Abstract: In recent years, language models (LMs) have made remarkable progress in advancing the field of natural language processing (NLP). However, the impact of data augmentation (DA) techniques on the fine-tuning (FT) performance of these LMs has been a topic of ongoing debate. In this study, we evaluate the effectiveness of three different FT methods in conjugation with back-translation across an array of 7 diverse NLP tasks, including classification and regression types, covering single-sentence and sentence-pair tasks. Contrary to prior assumptions that DA does not contribute to the enhancement of LMs' FT performance, our findings reveal that continued pre-training on augmented data can effectively improve the FT performance of the downstream tasks. In the most favourable case, continued pre-training improves the performance of FT by more than 10% in the few-shot learning setting. Our finding highlights the potential of DA as a powerful tool for bolstering LMs' performance.

11.NAVER LABS Europe's Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track

Authors:Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito, Ioan Calapodescu

Abstract: This paper presents NAVER LABS Europe's systems for Tamasheq-French and Quechua-Spanish speech translation in the IWSLT 2023 Low-Resource track. Our work attempts to maximize translation quality in low-resource settings using multilingual parameter-efficient solutions that leverage strong pre-trained models. Our primary submission for Tamasheq outperforms the previous state of the art by 7.5 BLEU points on the IWSLT 2022 test set, and achieves 23.6 BLEU on this year's test set, outperforming the second best participant by 7.7 points. For Quechua, we also rank first and achieve 17.7 BLEU, despite having only two hours of translation data. Finally, we show that our proposed multilingual architecture is also competitive for high-resource languages, outperforming the best unconstrained submission to the IWSLT 2021 Multilingual track, despite using much less training data and compute.

12.Tokenization with Factorized Subword Encoding

Authors:David Samuel, Lilja Øvrelid

Abstract: In recent years, language models have become increasingly larger and more complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model. The effectiveness of the proposed tokenization method, referred to as the Factorizer, is evaluated on language modeling and morpho-syntactic tasks for 7 diverse languages. Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.

13.A Cloud-based Machine Learning Pipeline for the Efficient Extraction of Insights from Customer Reviews

Authors:Robert Lakatos, Gergo Bogacsovics, Balazs Harangi, Istvan Lakatos, Attila Tiba, Janos Toth, Marianna Szabo, Andras Hajdu

Abstract: The efficiency of natural language processing has improved dramatically with the advent of machine learning models, particularly neural network-based solutions. However, some tasks are still challenging, especially when considering specific domains. In this paper, we present a cloud-based system that can extract insights from customer reviews using machine learning methods integrated into a pipeline. For topic modeling, our composite model uses transformer-based neural networks designed for natural language processing, vector embedding-based keyword extraction, and clustering. The elements of our model have been integrated and further developed to meet better the requirements of efficient information extraction, topic modeling of the extracted information, and user needs. Furthermore, our system can achieve better results than this task's existing topic modeling and keyword extraction solutions. Our approach is validated and compared with other state-of-the-art methods using publicly available datasets for benchmarking.

14.NoCoLA: The Norwegian Corpus of Linguistic Acceptability

Authors:Matias Jentoft, David Samuel

Abstract: While there has been a surge of large language models for Norwegian in recent years, we lack any tool to evaluate their understanding of grammaticality. We present two new Norwegian datasets for this task. NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences. On the other hand, NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner, i.e. without any further training. In this paper, we describe both datasets in detail, show how to use them for different flavors of language models, and conduct a comparative study of the existing Norwegian language models.

15.Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

Authors:Dmitry Karpov, Mikhail Burtsev

Abstract: This article investigates the knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large sample number (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We have prepared this dataset from the "Yandex Que" raw data. By evaluating the RuQTopics - trained models on the six matching classes of the Russian MASSIVE subset, we have proved that the RuQTopics dataset is suitable for real-world conversational tasks, as the Russian-only models trained on this dataset consistently yield an accuracy around 85\% on this subset. We also have figured out that for the multilingual BERT, trained on the RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy closely correlates (Spearman correlation 0.773 with p-value 2.997e-11) with the approximate size of the pretraining BERT's data for the corresponding language. At the same time, the correlation of the language-wise accuracy with the linguistical distance from Russian is not statistically significant.

16.ChatGPT vs Human-authored Text: Insights into Controllable Text Summarization and Sentence Style Transfer

Authors:Dongqi Pu, Vera Demberg

Abstract: Large-scale language models, like ChatGPT, have garnered significant media attention and stunned the public with their remarkable capacity for generating coherent text from short natural language prompts. In this paper, we aim to conduct a systematic inspection of ChatGPT's performance in two controllable generation tasks, with respect to ChatGPT's ability to adapt its output to different target audiences (expert vs. layman) and writing styles (formal vs. informal). Additionally, we evaluate the faithfulness of the generated text, and compare the model's performance with human-authored texts. Our findings indicate that the stylistic variations produced by humans are considerably larger than those demonstrated by ChatGPT, and the generated texts diverge from human samples in several characteristics, such as the distribution of word types. Moreover, we observe that ChatGPT sometimes incorporates factual errors or hallucinations when adapting the text to suit a specific style.

17.Adversarial Capsule Networks for Romanian Satire Detection and Sentiment Analysis

Authors:Sebastian-Vasile Echim, Răzvan-Alexandru Smădu, Andrei-Marius Avram, Dumitru-Clementin Cercel, Florin Pop

Abstract: Satire detection and sentiment analysis are intensively explored natural language processing (NLP) tasks that study the identification of the satirical tone from texts and extracting sentiments in relationship with their targets. In languages with fewer research resources, an alternative is to produce artificial examples based on character-level adversarial processes to overcome dataset size limitations. Such samples are proven to act as a regularization method, thus improving the robustness of models. In this work, we improve the well-known NLP models (i.e., Convolutional Neural Networks, Long Short-Term Memory (LSTM), Bidirectional LSTM, Gated Recurrent Units (GRUs), and Bidirectional GRUs) with adversarial training and capsule networks. The fine-tuned models are used for satire detection and sentiment analysis tasks in the Romanian language. The proposed framework outperforms the existing methods for the two tasks, achieving up to 99.08% accuracy, thus confirming the improvements added by the capsule layers and the adversarial training in NLP approaches.

18.GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Speech Emotion Recognition

Authors:Yu Pan, Yanni Hu, Yuguang Yang, Jixun Yao, Wen Fei, Lei Ma, Heng Lu

Abstract: Contrastive Language-Audio Pretraining (CLAP) has recently exhibited impressive success in diverse fields. In this paper, we propose GEmo-CLAP, a kind of efficient gender-attribute-enhanced CLAP model for speech emotion recognition (SER). Specifically, we first build an effective emotion CLAP model termed Emo-CLAP for SER, utilizing various self-supervised learning based pre-trained models. Then, considering the importance of the gender attribute in speech emotion modeling, two GEmo-CLAP approaches are further proposed to integrate the emotion and gender information of speech signals, forming more reasonable objectives. Extensive experiments conducted on the IEMOCAP corpus demonstrate that our proposed two GEmo-CLAP approaches consistently outperform the baseline Emo-CLAP with different pre-trained models, while also achieving superior recognition performance compared with other state-of-the-art methods.

19.Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks

Authors:Veniamin Veselovsky, Manoel Horta Ribeiro, Robert West

Abstract: Large language models (LLMs) are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold--standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs, as crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this concern, we conducted a case study on the prevalence of LLM usage by crowd workers. We reran an abstract summarization task from the literature on Amazon Mechanical Turk and, through a combination of keystroke detection and synthetic text classification, estimate that 33-46% of crowd workers used LLMs when completing the task. Although generalization to other, less LLM-friendly tasks is unclear, our results call for platforms, researchers, and crowd workers to find new ways to ensure that human data remain human, perhaps using the methodology proposed here as a stepping stone. Code/data: https://github.com/epfl-dlab/GPTurk

20.Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark

Authors:Łukasz Augustyniak, Szymon Woźniak, Marcin Gruza, Piotr Gramacki, Krzysztof Rajda, Mikołaj Morzy, Tomasz Kajdanowicz

Abstract: Despite impressive advancements in multilingual corpora collection and model training, developing large-scale deployments of multilingual models still presents a significant challenge. This is particularly true for language tasks that are culture-dependent. One such example is the area of multilingual sentiment analysis, where affective markers can be subtle and deeply ensconced in culture. This work presents the most extensive open massively multilingual corpus of datasets for training sentiment models. The corpus consists of 79 manually selected datasets from over 350 datasets reported in the scientific literature based on strict quality criteria. The corpus covers 27 languages representing 6 language families. Datasets can be queried using several linguistic and functional features. In addition, we present a multi-faceted sentiment classification benchmark summarizing hundreds of experiments conducted on different base models, training objectives, dataset collections, and fine-tuning strategies.

21.WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences

Authors:Xiao Liu, Hanyu Lai, Hao Yu, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, Jie Tang

Abstract: We present WebGLM, a web-enhanced question-answering system based on the General Language Model (GLM). Its goal is to augment a pre-trained large language model (LLM) with web search and retrieval capabilities while being efficient for real-world deployments. To achieve this, we develop WebGLM with strategies for the LLM-augmented retriever, bootstrapped generator, and human preference-aware scorer. Specifically, we identify and address the limitations of WebGPT (OpenAI), through which WebGLM is enabled with accuracy, efficiency, and cost-effectiveness advantages. In addition, we propose systematic criteria for evaluating web-enhanced QA systems. We conduct multi-dimensional human evaluation and quantitative ablation studies, which suggest the outperformance of the proposed WebGLM designs over existing systems. WebGLM with the 10-billion-parameter GLM (10B) is shown to perform better than the similar-sized WebGPT (13B) and even comparably to WebGPT (175B) in human evaluation. The code, demo, and data are at \url{https://github.com/THUDM/WebGLM}.

22.BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information

Authors:Mehran Kazemi, Quan Yuan, Deepti Bhatia, Najoung Kim, Xin Xu, Vaiva Imbrasaite, Deepak Ramachandran

Abstract: Automated reasoning with unstructured natural text is a key requirement for many potential applications of NLP and for developing robust AI systems. Recently, Language Models (LMs) have demonstrated complex reasoning capacities even without any finetuning. However, existing evaluation for automated reasoning assumes access to a consistent and coherent set of information over which models reason. When reasoning in the real-world, the available information is frequently inconsistent or contradictory, and therefore models need to be equipped with a strategy to resolve such conflicts when they arise. One widely-applicable way of resolving conflicts is to impose preferences over information sources (e.g., based on source credibility or information recency) and adopt the source with higher preference. In this paper, we formulate the problem of reasoning with contradictory information guided by preferences over sources as the classical problem of defeasible reasoning, and develop a dataset called BoardgameQA for measuring the reasoning capacity of LMs in this setting. BoardgameQA also incorporates reasoning with implicit background knowledge, to better reflect reasoning problems in downstream applications. We benchmark various LMs on BoardgameQA and the results reveal a significant gap in the reasoning capacity of state-of-the-art LMs on this problem, showing that reasoning with conflicting information does not surface out-of-the-box in LMs. While performance can be improved with finetuning, it nevertheless remains poor.

23.Questioning the Survey Responses of Large Language Models

Authors:Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner

Abstract: As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models with varying scientific motivations. In this work, we examine what we can learn from a model's survey responses on the basis of the well-established American Community Survey (ACS) by the U.S. Census Bureau. Evaluating more than a dozen different models, varying in size from a few hundred million to ten billion parameters, hundreds of thousands of times each on questions from the ACS, we systematically establish two dominant patterns. First, smaller models have a significant position and labeling bias, for example, towards survey responses labeled with the letter "A". This A-bias diminishes, albeit slowly, as model size increases. Second, when adjusting for this labeling bias through randomized answer ordering, models still do not trend toward US population statistics or those of any cognizable population. Rather, models across the board trend toward uniformly random aggregate statistics over survey responses. This pattern is robust to various different ways of prompting the model, including what is the de-facto standard. Our findings demonstrate that aggregate statistics of a language model's survey responses lack the signals found in human populations. This absence of statistical signal cautions about the use of survey responses from large language models at present time.

24.arXiVeri: Automatic table verification with GPT

Authors:Gyungin Shin, Weidi Xie, Samuel Albanie

Abstract: Without accurate transcription of numerical data in scientific documents, a scientist cannot draw accurate conclusions. Unfortunately, the process of copying numerical data from one paper to another is prone to human error. In this paper, we propose to meet this challenge through the novel task of automatic table verification (AutoTV), in which the objective is to verify the accuracy of numerical data in tables by cross-referencing cited sources. To support this task, we propose a new benchmark, arXiVeri, which comprises tabular data drawn from open-access academic papers on arXiv. We introduce metrics to evaluate the performance of a table verifier in two key areas: (i) table matching, which aims to identify the source table in a cited document that corresponds to a target table, and (ii) cell matching, which aims to locate shared cells between a target and source table and identify their row and column indices accurately. By leveraging the flexible capabilities of modern large language models (LLMs), we propose simple baselines for table verification. Our findings highlight the complexity of this task, even for state-of-the-art LLMs like OpenAI's GPT-4. The code and benchmark will be made publicly available.