arXiv daily

Computation and Language (cs.CL)

Thu, 22 Jun 2023

Other arXiv digests in this category:Thu, 14 Sep 2023; Wed, 13 Sep 2023; Tue, 12 Sep 2023; Mon, 11 Sep 2023; Fri, 08 Sep 2023; Tue, 05 Sep 2023; Fri, 01 Sep 2023; Thu, 31 Aug 2023; Wed, 30 Aug 2023; Tue, 29 Aug 2023; Mon, 28 Aug 2023; Fri, 25 Aug 2023; Thu, 24 Aug 2023; Wed, 23 Aug 2023; Tue, 22 Aug 2023; Mon, 21 Aug 2023; Fri, 18 Aug 2023; Thu, 17 Aug 2023; Wed, 16 Aug 2023; Tue, 15 Aug 2023; Mon, 14 Aug 2023; Fri, 11 Aug 2023; Thu, 10 Aug 2023; Wed, 09 Aug 2023; Tue, 08 Aug 2023; Mon, 07 Aug 2023; Fri, 04 Aug 2023; Thu, 03 Aug 2023; Wed, 02 Aug 2023; Tue, 01 Aug 2023; Mon, 31 Jul 2023; Fri, 28 Jul 2023; Thu, 27 Jul 2023; Wed, 26 Jul 2023; Tue, 25 Jul 2023; Mon, 24 Jul 2023; Fri, 21 Jul 2023; Thu, 20 Jul 2023; Wed, 19 Jul 2023; Tue, 18 Jul 2023; Mon, 17 Jul 2023; Fri, 14 Jul 2023; Thu, 13 Jul 2023; Wed, 12 Jul 2023; Tue, 11 Jul 2023; Mon, 10 Jul 2023; Fri, 07 Jul 2023; Thu, 06 Jul 2023; Wed, 05 Jul 2023; Tue, 04 Jul 2023; Mon, 03 Jul 2023; Fri, 30 Jun 2023; Thu, 29 Jun 2023; Wed, 28 Jun 2023; Tue, 27 Jun 2023; Mon, 26 Jun 2023; Fri, 23 Jun 2023; Wed, 21 Jun 2023; Tue, 20 Jun 2023; Fri, 16 Jun 2023; Thu, 15 Jun 2023; Tue, 13 Jun 2023; Mon, 12 Jun 2023; Fri, 09 Jun 2023; Thu, 08 Jun 2023; Wed, 07 Jun 2023; Tue, 06 Jun 2023; Mon, 05 Jun 2023; Fri, 02 Jun 2023; Thu, 01 Jun 2023; Wed, 31 May 2023; Tue, 30 May 2023; Mon, 29 May 2023; Fri, 26 May 2023; Thu, 25 May 2023; Wed, 24 May 2023; Tue, 23 May 2023; Mon, 22 May 2023; Fri, 19 May 2023; Thu, 18 May 2023; Wed, 17 May 2023; Tue, 16 May 2023; Mon, 15 May 2023; Fri, 12 May 2023; Thu, 11 May 2023; Wed, 10 May 2023; Tue, 09 May 2023; Mon, 08 May 2023; Fri, 05 May 2023; Thu, 04 May 2023; Wed, 03 May 2023; Tue, 02 May 2023; Mon, 01 May 2023; Fri, 28 Apr 2023; Thu, 27 Apr 2023; Wed, 26 Apr 2023; Tue, 25 Apr 2023; Mon, 24 Apr 2023; Fri, 21 Apr 2023; Thu, 20 Apr 2023; Wed, 19 Apr 2023; Tue, 18 Apr 2023; Mon, 17 Apr 2023; Fri, 14 Apr 2023; Thu, 13 Apr 2023; Wed, 12 Apr 2023; Tue, 11 Apr 2023; Mon, 10 Apr 2023
1.From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought

Authors:Lionel Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, Joshua B. Tenenbaum

Abstract: How does language inform our downstream thinking? In particular, how do humans make meaning from language -- and how can we leverage a theory of linguistic meaning to build machines that think in more human-like ways? In this paper, we propose \textit{rational meaning construction}, a computational framework for language-informed thinking that combines neural models of language with probabilistic models for rational inference. We frame linguistic meaning as a context-sensitive mapping from natural language into a \textit{probabilistic language of thought} (PLoT) -- a general-purpose symbolic substrate for probabilistic, generative world modeling. Our architecture integrates two powerful computational tools that have not previously come together: we model thinking with \textit{probabilistic programs}, an expressive representation for flexible commonsense reasoning; and we model meaning construction with \textit{large language models} (LLMs), which support broad-coverage translation from natural language utterances to code expressions in a probabilistic programming language. We illustrate our framework in action through examples covering four core domains from cognitive science: probabilistic reasoning, logical and relational reasoning, visual and physical reasoning, and social reasoning about agents and their plans. In each, we show that LLMs can generate context-sensitive translations that capture pragmatically-appropriate linguistic meanings, while Bayesian inference with the generated programs supports coherent and robust commonsense reasoning. We extend our framework to integrate cognitively-motivated symbolic modules to provide a unified commonsense thinking interface from language. Finally, we explore how language can drive the construction of world models themselves.

2.Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs

Authors:Mojtaba Mazoochi ICT Research Institute, Tehran, Iran, Leyla Rabiei Iran Telecommunication Research Center, Farzaneh Rahmani Iran Telecommunication Research Center, Zeinab Rajabi Iran Telecommunication Research Center

Abstract: Introduction: Microblogging websites have massed rich data sources for sentiment analysis and opinion mining. In this regard, sentiment classification has frequently proven inefficient because microblog posts typically lack syntactically consistent terms and representatives since users on these social networks do not like to write lengthy statements. Also, there are some limitations to low-resource languages. The Persian language has exceptional characteristics and demands unique annotated data and models for the sentiment analysis task, which are distinctive from text features within the English dialect. Method: This paper first constructs a user opinion dataset called ITRC-Opinion by collaborative environment and insource way. Our dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram. Second, this study proposes a new deep convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts. The constructed datasets are used to evaluate the presented model. Furthermore, some models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU with different word embeddings, including Fasttext, Glove, and Word2vec, investigated our dataset and evaluated the results. Results: The results demonstrate the benefit of our dataset and the proposed model (72% accuracy), displaying meaningful improvement in sentiment classification performance.

3.Vec2Vec: A Compact Neural Network Approach for Transforming Text Embeddings with High Fidelity

Authors:Andrew Kean Gao

Abstract: Vector embeddings have become ubiquitous tools for many language-related tasks. A leading embedding model is OpenAI's text-ada-002 which can embed approximately 6,000 words into a 1,536-dimensional vector. While powerful, text-ada-002 is not open source and is only available via API. We trained a simple neural network to convert open-source 768-dimensional MPNet embeddings into text-ada-002 embeddings. We compiled a subset of 50,000 online food reviews. We calculated MPNet and text-ada-002 embeddings for each review and trained a simple neural network to for 75 epochs. The neural network was designed to predict the corresponding text-ada-002 embedding for a given MPNET embedding. Our model achieved an average cosine similarity of 0.932 on 10,000 unseen reviews in our held-out test dataset. We manually assessed the quality of our predicted embeddings for vector search over text-ada-002-embedded reviews. While not as good as real text-ada-002 embeddings, predicted embeddings were able to retrieve highly relevant reviews. Our final model, Vec2Vec, is lightweight (<80 MB) and fast. Future steps include training a neural network with a more sophisticated architecture and a larger dataset of paired embeddings to achieve greater performance. The ability to convert between and align embedding spaces may be helpful for interoperability, limiting dependence on proprietary models, protecting data privacy, reducing costs, and offline operations.

4.Multilingual Neural Machine Translation System for Indic to Indic Languages

Authors:Sudhansu Bala Das, Divyajyoti Panda, Tapas Kumar Mishra, Bidyut Kr. Patra, Asif Ekbal

Abstract: This paper gives an Indic-to-Indic (IL-IL) MNMT baseline model for 11 ILs implemented on the Samanantar corpus and analyzed on the Flores-200 corpus. All the models are evaluated using the BLEU score. In addition, the languages are classified under three groups namely East Indo- Aryan (EI), Dravidian (DR), and West Indo-Aryan (WI). The effect of language relatedness on MNMT model efficiency is studied. Owing to the presence of large corpora from English (EN) to ILs, MNMT IL-IL models using EN as a pivot are also built and examined. To achieve this, English- Indic (EN-IL) models are also developed, with and without the usage of related languages. Results reveal that using related languages is beneficial for the WI group only, while it is detrimental for the EI group and shows an inconclusive effect on the DR group, but it is useful for EN-IL models. Thus, related language groups are used to develop pivot MNMT models. Furthermore, the IL corpora are transliterated from the corresponding scripts to a modified ITRANS script, and the best MNMT models from the previous approaches are built on the transliterated corpus. It is observed that the usage of pivot models greatly improves MNMT baselines with AS-TA achieving the minimum BLEU score and PA-HI achieving the maximum score. Among languages, AS, ML, and TA achieve the lowest BLEU score, whereas HI, PA, and GU perform the best. Transliteration also helps the models with few exceptions. The best increment of scores is observed in ML, TA, and BN and the worst average increment is observed in KN, HI, and PA, across all languages. The best model obtained is the PA-HI language pair trained on PAWI transliterated corpus which gives 24.29 BLEU.

5.Natural Language Generation for Advertising: A Survey

Authors:Soichiro Murakami, Sho Hoshino, Peinan Zhang

Abstract: Natural language generation methods have emerged as effective tools to help advertisers increase the number of online advertisements they produce. This survey entails a review of the research trends on this topic over the past decade, from template-based to extractive and abstractive approaches using neural networks. Additionally, key challenges and directions revealed through the survey, including metric optimization, faithfulness, diversity, multimodality, and the development of benchmark datasets, are discussed.

6.Generative Multimodal Entity Linking

Authors:Senbao Shi, Zhenran Xu, Baotian Hu, Min Zhang

Abstract: Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to the referent entities from a knowledge base (e.g., Wikipedia). Prior MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters, which can be prohibitively costly and difficult to scale in the era of Large Language Models (LLMs). In this work, we propose GEMEL, a simple yet effective Generative Multimodal Entity Linking method, which leverages the capabilities of LLMs from large-scale pre-training to directly generate target entity names. We keep the vision and language model frozen and only train a linear layer to enable cross-modality interactions. To adapt LLMs to the MEL task, we take advantage of the emerging in-context learning (ICL) capability of LLMs by retrieving multimodal instances as demonstrations. Extensive experiments show that with only ~0.3% of the model parameters fine-tuned, GEMEL achieves state-of-the-art results on two well-established MEL datasets (4.1% accuracy gains on WikiDiverse and 15.4% accuracy gains on WikiMEL). Our approach is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution for utilizing LLMs in the MEL task.

7.Mapping and Cleaning Open Commonsense Knowledge Bases with Generative Translation

Authors:Julien Romero, Simon Razniewski

Abstract: Structured knowledge bases (KBs) are the backbone of many know\-ledge-intensive applications, and their automated construction has received considerable attention. In particular, open information extraction (OpenIE) is often used to induce structure from a text. However, although it allows high recall, the extracted knowledge tends to inherit noise from the sources and the OpenIE algorithm. Besides, OpenIE tuples contain an open-ended, non-canonicalized set of relations, making the extracted knowledge's downstream exploitation harder. In this paper, we study the problem of mapping an open KB into the fixed schema of an existing KB, specifically for the case of commonsense knowledge. We propose approaching the problem by generative translation, i.e., by training a language model to generate fixed-schema assertions from open ones. Experiments show that this approach occupies a sweet spot between traditional manual, rule-based, or classification-based canonicalization and purely generative KB construction like COMET. Moreover, it produces higher mapping accuracy than the former while avoiding the association-based noise of the latter.

8.Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4

Authors:Mario Rodríguez-Cantelar, Chen Zhang, Chengguang Tang, Ke Shi, Sarik Ghazarian, João Sedoc, Luis Fernando D'Haro, Alexander Rudnicky

Abstract: The advent and fast development of neural networks have revolutionized the research on dialogue systems and subsequently have triggered various challenges regarding their automatic evaluation. Automatic evaluation of open-domain dialogue systems as an open challenge has been the center of the attention of many researchers. Despite the consistent efforts to improve automatic metrics' correlations with human evaluation, there have been very few attempts to assess their robustness over multiple domains and dimensions. Also, their focus is mainly on the English language. All of these challenges prompt the development of automatic evaluation metrics that are reliable in various domains, dimensions, and languages. This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submission and result details of the two proposed subtasks.

9.Natural Language Processing in Electronic Health Records in Relation to Healthcare Decision-making: A Systematic Review

Authors:Elias Hossain, Rajib Rana, Niall Higgins, Jeffrey Soar, Prabal Datta Barua, Anthony R. Pisani, Ph. D, Kathryn Turner}

Abstract: Background: Natural Language Processing (NLP) is widely used to extract clinical insights from Electronic Health Records (EHRs). However, the lack of annotated data, automated tools, and other challenges hinder the full utilisation of NLP for EHRs. Various Machine Learning (ML), Deep Learning (DL) and NLP techniques are studied and compared to understand the limitations and opportunities in this space comprehensively. Methodology: After screening 261 articles from 11 databases, we included 127 papers for full-text review covering seven categories of articles: 1) medical note classification, 2) clinical entity recognition, 3) text summarisation, 4) deep learning (DL) and transfer learning architecture, 5) information extraction, 6) Medical language translation and 7) other NLP applications. This study follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Result and Discussion: EHR was the most commonly used data type among the selected articles, and the datasets were primarily unstructured. Various ML and DL methods were used, with prediction or classification being the most common application of ML or DL. The most common use cases were: the International Classification of Diseases, Ninth Revision (ICD-9) classification, clinical note analysis, and named entity recognition (NER) for clinical descriptions and research on psychiatric disorders. Conclusion: We find that the adopted ML models were not adequately assessed. In addition, the data imbalance problem is quite important, yet we must find techniques to address this underlining problem. Future studies should address key limitations in studies, primarily identifying Lupus Nephritis, Suicide Attempts, perinatal self-harmed and ICD-9 classification.

10.Unveiling Global Narratives: A Multilingual Twitter Dataset of News Media on the Russo-Ukrainian Conflict

Authors:Sherzod Hakimov, Gullal S. Cheema

Abstract: The ongoing Russo-Ukrainian conflict has been a subject of intense media coverage worldwide. Understanding the global narrative surrounding this topic is crucial for researchers that aim to gain insights into its multifaceted dimensions. In this paper, we present a novel dataset that focuses on this topic by collecting and processing tweets posted by news or media companies on social media across the globe. We collected tweets from February 2022 to May 2023 to acquire approximately 1.5 million tweets in 60 different languages. Each tweet in the dataset is accompanied by processed tags, allowing for the identification of entities, stances, concepts, and sentiments expressed. The availability of the dataset serves as a valuable resource for researchers aiming to investigate the global narrative surrounding the ongoing conflict from various aspects such as who are the prominent entities involved, what stances are taken, where do these stances originate, and how are the different concepts related to the event portrayed.

11.xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Authors:Mingda Chen, Kevin Heffernan, Onur Çelebi, Alex Mourachko, Holger Schwenk

Abstract: We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.

12.Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation

Authors:Ran Zhang, Jihed Ouni, Steffen Eger

Abstract: While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility, information sharing, and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We build the first CLCTS corpus, leveraging historical fictive texts and Wikipedia summaries in English and German, and examine the effectiveness of popular transformer end-to-end models with different intermediate task finetuning tasks. Additionally, we explore the potential of ChatGPT for CLCTS as a summarizer and an evaluator. Overall, we report evaluations from humans, ChatGPT, and several recent automatic evaluation metrics where we find our intermediate task finetuned end-to-end models generate bad to moderate quality summaries; ChatGPT as a summarizer (without any finetuning) provides moderate to good quality outputs and as an evaluator correlates moderately with human evaluations though it is prone to giving lower scores. ChatGPT also seems to be very adept at normalizing historical text. We finally test ChatGPT in a scenario with adversarially attacked and unseen source documents and find that ChatGPT is better at omission and entity swap than negating against its prior knowledge.

13.AudioPaLM: A Large Language Model That Can Speak and Listen

Authors:Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor, Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank

Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples

14.Tracking public attitudes toward ChatGPT on Twitter using sentiment analysis and topic modeling

Authors:Ratanond Koonchanok, Yanling Pan, Hyeju Jang

Abstract: ChatGPT sets a new record with the fastest-growing user base, as a chatbot powered by a large language model (LLM). While it demonstrates state-of-the-art capabilities in a variety of language-generating tasks, it also raises widespread public concerns regarding its societal impact. In this paper, we utilize natural language processing approaches to investigate the public attitudes towards ChatGPT by applying sentiment analysis and topic modeling techniques to Twitter data. Our result shows that the overall sentiment is largely neutral to positive, which also holds true across different occupation groups. Among a wide range of topics mentioned in tweets, the most popular topics are Artificial Intelligence, Search Engines, Education, Writing, and Question Answering.

15.Conversation Derailment Forecasting with Graph Convolutional Networks

Authors:Enas Altarawneh, Ammeta Agrawal, Michael Jenkin, Manos Papagelis

Abstract: Online conversations are particularly susceptible to derailment, which can manifest itself in the form of toxic communication patterns like disrespectful comments or verbal abuse. Forecasting conversation derailment predicts signs of derailment in advance enabling proactive moderation of conversations. Current state-of-the-art approaches to address this problem rely on sequence models that treat dialogues as text streams. We propose a novel model based on a graph convolutional neural network that considers dialogue user dynamics and the influence of public perception on conversation utterances. Through empirical evaluation, we show that our model effectively captures conversation dynamics and outperforms the state-of-the-art models on the CGA and CMV benchmark datasets by 1.5\% and 1.7\%, respectively.

16.Speech Emotion Diarization: Which Emotion Appears When?

Authors:Yingzhi Wang, Mirco Ravanelli, Alaa Nfissi, Alya Yacoubi

Abstract: Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?". To facilitate the evaluation of the performance and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually-annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.

17.Towards Explainable Evaluation Metrics for Machine Translation

Authors:Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

Abstract: Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, mediately, also contribute to better and more transparent machine translation systems.

18.CamChoice: A Corpus of Multiple Choice Questions and Candidate Response Distributions

Authors:Adian Liusie, Vatsal Raina, Andrew Mullooly, Kate Knill, Mark J. F. Gales

Abstract: Multiple Choice examinations are a ubiquitous form of assessment that is used to measure the ability of candidates across various domains and tasks. Maintaining the quality of proposed questions is of great importance to test designers, and therefore newly proposed questions go through several pre-test evaluation stages before they can be deployed into real-world exams. This process is currently quite manual, which can lead to time lags in the question development cycle. Automating this process would lead to a large improvement in efficiency, however, current datasets do not contain sufficient pre-test analysis information. In this paper, we introduce CamChoice; a multiple-choice comprehension dataset with questions at different target levels, where questions have the true candidate selected options distributions. We introduce the task of candidate distribution matching, propose several evaluation metrics for the task, and demonstrate that automatic systems trained on RACE++ can be leveraged as baselines for our task. We further demonstrate that these automatic systems can be used for practical pre-test evaluation tasks such as detecting underperforming distractors, where our detection systems can automatically identify poor distractors that few candidates select. We release the data publicly for future research.

19.Named entity recognition in resumes

Authors:Ege Kesim, Aysu Deliahmetoglu

Abstract: Named entity recognition (NER) is used to extract information from various documents and texts such as names and dates. It is important to extract education and work experience information from resumes in order to filter them. Considering the fact that all information in a resume has to be entered to the companys system manually, automatizing this process will save time of the companies. In this study, a deep learning-based semi-automatic named entity recognition system has been implemented with a focus on resumes in the field of IT. Firstly, resumes of employees from five different IT related fields has been annotated. Six transformer based pre-trained models have been adapted to named entity recognition problem using the annotated data. These models have been selected among popular models in the natural language processing field. The obtained system can recognize eight different entity types which are city, date, degree, diploma major, job title, language, country and skill. Models used in the experiments are compared using micro, macro and weighted F1 scores and the performance of the methods was evaluated. Taking these scores into account for test set the best micro and weighted F1 score is obtained by RoBERTa and the best macro F1 score is obtained by Electra model.

20.Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Authors:Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, Bryan Hooi

Abstract: The task of empowering large language models (LLMs) to accurately express their confidence, referred to as confidence elicitation, is essential in ensuring reliable and trustworthy decision-making processes. Previous methods, which primarily rely on model logits, have become less suitable for LLMs and even infeasible with the rise of closed-source LLMs (e.g., commercialized LLM APIs). This leads to a growing need to explore the untapped area of \emph{non-logit-based} approaches to estimate the uncertainty of LLMs. Hence, in this study, we investigate approaches for confidence elicitation that do not require model fine-tuning or access to proprietary information. We introduce three categories of methods: verbalize-based, consistency-based, and their hybrid methods for benchmarking, and evaluate their performance across five types of datasets and four widely-used LLMs. Our analysis of these methods uncovers several key insights: 1) LLMs often exhibit a high degree of overconfidence when verbalizing their confidence; 2) Prompting strategies such as CoT, Top-K and Multi-step confidences improve calibration of verbalized confidence; 3) Consistency-based methods outperform the verbalized confidences in most cases, with particularly notable improvements on the arithmetic reasoning task; 4) Hybrid methods consistently deliver the best performance over their baselines, thereby emerging as a promising state-of-the-art approach; 5) Despite these advancements, all investigated methods continue to struggle with challenging tasks, such as those requiring professional knowledge, leaving significant scope for improvement of confidence elicitation.

21.Semi-automated extraction of research topics and trends from NCI funding in radiological sciences from 2000-2020

Authors:Mark Nguyen, Peter Beidler, Joseph Tsai, August Anderson, Daniel Chen, Paul Kinahan, John Kang

Abstract: Investigators, funders, and the public desire knowledge on topics and trends in publicly funded research but current efforts in manual categorization are limited in scale and understanding. We developed a semi-automated approach to extract and name research topics, and applied this to \$1.9B of NCI funding over 21 years in the radiological sciences to determine micro- and macro-scale research topics and funding trends. Our method relies on sequential clustering of existing biomedical-based word embeddings, naming using subject matter experts, and visualization to discover trends at a macroscopic scale above individual topics. We present results using 15 and 60 cluster topics, where we found that 2D projection of grant embeddings reveals two dominant axes: physics-biology and therapeutic-diagnostic. For our dataset, we found that funding for therapeutics- and physics-based research have outpaced diagnostics- and biology-based research, respectively. We hope these results may (1) give insight to funders on the appropriateness of their funding allocation, (2) assist investigators in contextualizing their work and explore neighboring research domains, and (3) allow the public to review where their tax dollars are being allocated.