1. An Empirical Study on the Effectiveness of Noisy Label Learning for Program Understanding

Authors: Wenhan Wang, Yanzhou Li, Anran Li, Jian Zhang, Wei Ma, Yang Liu

Abstract: Recently, deep learning models have been widely applied to program understanding tasks, achieving state-of-the-art results on many benchmark datasets. A major challenge of deep learning for program understanding is that the effectiveness of these approaches depends on the quality of their datasets, which often contain noisy samples. A typical kind of noise in program understanding datasets is label noise, meaning that the target outputs for some inputs are mislabeled. Label noise can degrade the performance of deep learning models, so researchers have proposed various approaches to alleviate its impact, forming a new research topic: noisy label learning (NLL). In this paper, we conduct an empirical study on the effectiveness of noisy label learning for deep learning on program understanding datasets. We evaluate various noisy label learning approaches and deep learning models on two tasks: program classification and code summarization. From the evaluation results, we find that the impact of label noise and NLL approaches differs between small deep learning models and large pre-trained models: small models are prone to label noise in program classification, and NLL approaches can improve their robustness, while large pre-trained models are robust against label noise and NLL does not significantly improve their performance. On the other hand, NLL approaches show satisfying results in identifying mislabeled samples for both tasks, indicating that these techniques can help researchers build high-quality program understanding datasets.
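To make the notion of label noise studied here concrete, below is a minimal sketch of symmetric label-noise injection, the standard corruption model used in much of the NLL literature. This is a generic illustration under our own assumptions, not the paper's actual experimental setup; the function name and rates are ours.

```python
import numpy as np

def inject_symmetric_label_noise(labels, noise_rate, num_classes, seed=0):
    """Flip each label to a different, uniformly chosen class with
    probability `noise_rate` (symmetric label noise)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape[0]) < noise_rate
    for i in np.flatnonzero(flip):
        # choose uniformly among the *other* classes
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels

# Example: corrupt 20% of the labels of a 10-class program classification set
clean = np.random.randint(0, 10, size=1000)
noisy = inject_symmetric_label_noise(clean, noise_rate=0.2, num_classes=10)
print((clean != noisy).mean())  # roughly 0.2
```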

2. Crème de la Crem: Composable Representable Executable Machines (Architectural Pearl)

Authors: Marco Perone, Georgios Karachalias

Abstract: In this paper we describe how to build software architectures as a composition of state machines, using ideas and principles from the field of Domain-Driven Design. By definition, our approach is modular, allowing one to compose independent subcomponents to create bigger systems, and representable, allowing the implementation of a system to be kept in sync with its graphical representation. In addition to the design itself, we introduce the Crem library, which provides a concrete state machine implementation that is both compositional and representable. Crem uses Haskell's advanced type-level features to allow users to specify allowed and forbidden state transitions, and to encode complex state machine -- and therefore domain-specific -- properties. Moreover, since Crem's state machines are representable, Crem can automatically generate graphical representations of systems from their domain implementations.

3. Rule-based Graph Repair using Minimally Restricted Consistency-Improving Transformations

Authors: Alexander Lauer

Abstract: Model-driven software engineering is a suitable method for dealing with the ever-increasing complexity of software development processes. Graphs and graph transformations have proven useful for representing such models and changes to them. These models must satisfy certain sets of constraints; an example is the multiplicities of a class structure. During the development process, a change to a model may result in an inconsistent model that must at some point be repaired. This problem is called model repair. In particular, we consider rule-based graph repair, which is defined as follows: given a graph $G$, a constraint $c$ such that $G$ does not satisfy $c$, and a set of rules $\mathcal{R}$, use the rules of $\mathcal{R}$ to transform $G$ into a graph that satisfies $c$. Known notions of consistency have either viewed consistency as a binary property (a graph either satisfies a constraint $c$ or it does not) or only considered the number of violations of the first graph of a constraint. In this thesis, we introduce new notions of consistency, which we call consistency-maintaining and consistency-increasing transformations and rules, respectively. These are based on the observation that a constraint can be satisfied up to a certain nesting level. We present constructions for direct consistency-maintaining and direct consistency-increasing application conditions. Finally, we present a rule-based graph repair approach that is able to repair so-called \emph{circular conflict-free constraints} and circular conflict-free sets of constraints. Intuitively, a set of constraints $C$ is circular conflict-free if there is an ordering $c_1, \ldots, c_n$ of all constraints of $C$ such that there is no $j < i$ such that a repair of $c_i$ at a graph satisfying $c_j$ leads to a graph not satisfying $c_j$.
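Read formally, and using our own notation rather than the thesis's, that prose definition can be sketched as: $C$ is circular conflict-free iff there exists an ordering $c_1, \ldots, c_n$ of $C$ such that
\[
  \forall i \;\; \forall j < i \;\; \forall G:\quad G \models c_j \;\Rightarrow\; \mathrm{repair}_{c_i}(G) \models c_j,
\]
where $\mathrm{repair}_{c_i}(G)$ denotes any graph obtained by repairing $c_i$ at $G$. In words: later repairs in the ordering never break constraints that come earlier.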

4. Generative Type Inference for Python

Authors: Yun Peng, Chaozheng Wang, Wenxuan Wang, Cuiyun Gao, Michael R. Lyu

Abstract: Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to potential type errors, leading researchers to explore automatic type inference approaches for Python programs. Rule-based type inference approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage. Supervised type inference approaches, while feature-agnostic, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, cloze-style approaches reformulate type inference as a fill-in-the-blank problem, but their performance is limited. This paper introduces TypeGen, a few-shot generative type inference approach that incorporates domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on type dependency graphs (TDGs), enabling language models to learn how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen requires only a few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of its results through an input-explanation-output strategy. Experiments show that TypeGen outperforms the best baseline, Type4Py, by 10.0% for argument type prediction and 22.5% for return value type prediction in terms of top-1 Exact Match, using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% over the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match.
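To illustrate the general shape of few-shot chain-of-thought prompting for type inference, here is a small sketch of assembling such a prompt. The template, helper names, and example content are our own assumptions for illustration, not TypeGen's actual prompt format.

```python
# Hypothetical few-shot COT prompt builder for type inference.
# Each example pairs a code slice with a reasoning chain and a final type,
# loosely mirroring the input-explanation-output strategy described above.

FEW_SHOT_EXAMPLES = [
    {
        "slice": "def area(r):\n    return 3.14 * r * r",
        "cot": ("Step 1: `r` is multiplied by a float literal, so `r` is "
                "numeric. Step 2: a product involving a float is a float."),
        "answer": "float",
    },
]

def build_prompt(examples, query_slice, target):
    """Concatenate worked examples, then ask for reasoning about `target`."""
    parts = []
    for ex in examples:
        parts.append(f"Code:\n{ex['slice']}\n"
                     f"Reasoning: {ex['cot']}\n"
                     f"Type: {ex['answer']}\n")
    parts.append(f"Code:\n{query_slice}\n"
                 f"Reasoning for the type of `{target}`:")
    return "\n".join(parts)

print(build_prompt(FEW_SHOT_EXAMPLES,
                   "def scale(xs, k):\n    return [x * k for x in xs]",
                   "k"))
```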

5. Need-driven decision-making and prototyping for DLT: Framework and web-based tool

Authors: Tomas Bueno Momčilović, Matthias Buchinger, Dian Balta

Abstract: In its 14 years, distributed ledger technology has attracted increasing attention, investment, enthusiasm, and users. However, ongoing doubts about its usefulness and recent losses of trust in prominent cryptocurrencies have fueled deeply skeptical assessments. Multiple groups have attempted to disentangle the technology from the associated hype and controversy by building workflows for rapid prototyping and informed decision-making, but their mostly isolated work still leaves users with unresolved questions. To bridge the gaps between these contributions, we develop a holistic analytical framework and an open-source web tool for making evidence-based decisions. Consisting of three stages - evaluation, elicitation, and design - the framework relies on input from the users' domain knowledge, maps their choices, and outputs the needed technology bundles. We apply it to an example clinical use case to clarify the directions our contribution charts for prototyping, hopefully driving the conversation towards ways to enhance further tools and approaches.

6. CertPri: Certifiable Prioritization for Deep Neural Networks via Movement Cost in Feature Space

Authors: Haibin Zheng, Jinyin Chen, Haibo Jin

Abstract: Deep neural networks (DNNs) have demonstrated outstanding performance in various software systems, but they can also misbehave and even cause irreversible disasters. Therefore, it is crucial to identify the misbehavior of DNN-based software and improve DNNs' quality. Test input prioritization is one of the most appealing ways to guarantee DNNs' quality; it prioritizes test inputs so that more bug-revealing inputs can be identified earlier with limited time and manual labeling effort. However, existing prioritization methods are still limited in three respects: certifiability, effectiveness, and generalizability. To overcome these challenges, we propose CertPri, a test input prioritization technique designed from a movement cost perspective on test inputs in DNNs' feature space. CertPri differs from previous works in three key aspects: (1) certifiable: it provides a formal robustness guarantee for the movement cost; (2) effective: it leverages formally guaranteed movement costs to identify malicious bug-revealing inputs; and (3) generic: it can be applied to various tasks, data, models, and scenarios. Extensive evaluations across 2 tasks (i.e., classification and regression), 6 data forms, 4 model structures, and 2 scenarios (i.e., white-box and black-box) demonstrate CertPri's superior performance. For instance, it improves prioritization effectiveness by 53.97% on average compared with baselines. Its robustness and generalizability are, on average, 1.41 to 2.00 times and 1.33 to 3.39 times those of baselines, respectively.
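To show what test input prioritization means operationally, here is a toy sketch that ranks inputs by the margin between their top-2 predicted class probabilities. This simple uncertainty score is only a stand-in to illustrate the task; it is not CertPri's certified movement cost, and all names here are ours.

```python
import numpy as np

def prioritize_by_margin(probs):
    """Rank test inputs so likely-misclassified ones come first, using the
    top-2 probability margin as a (much simpler) proxy for a prioritization
    score such as CertPri's movement cost."""
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]  # small margin = uncertain
    return np.argsort(margin)                   # ascending: uncertain first

# Example: softmax outputs for 4 test inputs over 3 classes
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.34, 0.33, 0.33],
                  [0.60, 0.30, 0.10]])
print(prioritize_by_margin(probs))  # [2, 1, 3, 0]: label these first
```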

7. Is this Snippet Written by ChatGPT? An Empirical Study with a CodeBERT-Based Classifier

Authors: Phuong T. Nguyen, Juri Di Rocco, Claudio Di Sipio, Riccardo Rubei, Davide Di Ruscio, Massimiliano Di Penta

Abstract: Since its launch in November 2022, ChatGPT has gained popularity among users, especially programmers who use it as a tool to solve development problems. However, while offering a practical solution to programming problems, ChatGPT should mainly be used as a supporting tool (e.g., in software education) rather than as a replacement for human developers. Thus, automatically detecting source code generated by ChatGPT is necessary, and tools for identifying AI-generated content may need to be adapted to work effectively with source code. This paper presents an empirical study investigating the feasibility of automated identification of AI-generated code snippets and the factors that influence this ability. To this end, we propose a novel approach called GPTSniffer, which builds on top of CodeBERT to detect source code written by AI. The results show that GPTSniffer can accurately classify whether code is human-written or AI-generated, and it outperforms two baselines, GPTZero and OpenAI Text Classifier. The study also shows how similar training data or a classification context with paired snippets helps to boost classification performance.
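For readers unfamiliar with the underlying setup, below is a minimal sketch of the kind of CodeBERT-based binary classifier such an approach builds on, using the Hugging Face transformers API. The classification head here is untrained and the label convention is our assumption; the paper's actual training data, preprocessing, and hyperparameters are not reproduced.

```python
# Minimal CodeBERT sequence-classification sketch (not GPTSniffer itself).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = human, 1 = AI (assumed)

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # meaningless until the head is fine-tuned
```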