1. VFFINDER: A Graph-based Approach for Automated Silent Vulnerability-Fix Identification

Authors: Son Nguyen, Thanh Trong Vu, Hieu Dinh Vo

Abstract: The increasing reliance of software projects on third-party libraries has raised concerns about the security of these libraries due to hidden vulnerabilities. Managing these vulnerabilities is challenging due to the time gap between fixes and public disclosures. Moreover, a significant portion of open-source projects silently fix vulnerabilities without disclosure, impacting vulnerability management. Existing tools like OWASP rely heavily on public disclosures, which limits their effectiveness in detecting unknown vulnerabilities. To tackle this problem, automated identification of vulnerability-fixing commits has emerged. However, identifying silent vulnerability fixes remains challenging. This paper presents VFFINDER, a novel graph-based approach for automated silent vulnerability-fix identification. VFFINDER captures structural changes using Abstract Syntax Trees (ASTs) and represents them as annotated ASTs. VFFINDER distinguishes vulnerability-fixing commits from non-fixing ones using attention-based graph neural network models to extract structural features. We conducted experiments to evaluate VFFINDER on a dataset of 36K+ fixing and non-fixing commits in 507 real-world C/C++ projects. Our results show that VFFINDER significantly improves on the state-of-the-art methods by 39-83% in Precision, 19-148% in Recall, and 30-109% in F1. In particular, VFFINDER speeds up the silent fix identification process by up to 47% with the same review effort of 5% compared to the existing approaches.

2. Parsing Fortran-77 with proprietary extensions

Authors: Younoussa Sow, Larisa Safina, Léandre Brault, Papa Ibou Diouf, Stéphane Ducasse, Nicolas Anquetil

Abstract: Far from the latest innovations in software development, many organizations still rely on old code written in "obsolete" programming languages. Because this source code is old and proven, it often contributes significantly to the continuing success of these organizations. Yet to keep the applications relevant and running in an evolving environment, they sometimes need to be updated or migrated to new languages or new platforms. One difficulty of working with these "veteran languages" is being able to parse the source code to build a representation of it. Parsing can also allow modern software development tools and IDEs to offer better support for these veteran languages. We initiated a project between our group and the Framatome company to help migrate old Fortran-77 with proprietary extensions (called Esope) into more modern Fortran. In this paper, we explain how we parsed the Esope language with a combination of an island grammar and a regular parser to build an abstract syntax tree of the code.

3. Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection

Authors: Muslim Chochlov (Department of Computer Science and Information Systems, University of Limerick, Ireland); Gul Aftab Ahmed (Department of Computer Science, Trinity College Dublin, Ireland); James Vincent Patten (Department of Computer Science and Information Systems, University of Limerick, Ireland); Guoxian Lu (WN Digital IPD and Trustworthiness Enabling, Huawei Technologies Co., Ltd., Shanghai, China); Wei Hou (Huawei Vulnerability Management Center, Huawei Technologies Co., Ltd., Shenzhen, Guangdong, China); David Gregg (Department of Computer Science, Trinity College Dublin, Ireland); Jim Buckley (Department of Computer Science and Information Systems, University of Limerick, Ireland)

Abstract: Code clones can detrimentally impact software maintenance, and manually detecting them in very large codebases is impractical. Additionally, automated approaches find detection of Type 3 and Type 4 (inexact) clones very challenging. While the most recent deep neural networks (for example, BERT-based networks) seem to be highly effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We therefore introduce SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale (in line with our industrial partner's requirements). It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other neural network approaches while also using parallel, GPU-accelerated search to tackle scalability. This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting. The configuration analysis suggests that shorter input lengths and text-only neural network models demonstrate better efficiency in SSCD, while only slightly decreasing effectiveness. The evaluation results suggest that SSCD is more effective than state-of-the-art approaches like SAGA and SourcererCC. It is also highly efficient: in its optimal setting, SSCD effectively locates clones in the entire 320 million LOC BigCloneBench (a standard clone detection benchmark) in just under three hours.
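
The embed-then-search idea described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional vectors are hand-written stand-ins for BERT embeddings, the function names and the 0.9 threshold are hypothetical, and the search is a brute-force CPU scan rather than the parallel, GPU-accelerated search the abstract mentions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(embeddings, query_id, k=1, threshold=0.9):
    """Return up to k (fragment_id, similarity) pairs closest to query_id.

    Only candidates whose similarity reaches the (hypothetical) threshold
    are reported as likely clones.
    """
    query = embeddings[query_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(fid, sim) for fid, sim in scored[:k] if sim >= threshold]


# Toy embeddings: one vector per code fragment (illustrative values only).
embeddings = {
    "f1": [1.0, 0.0, 0.0],
    "f2": [0.9, 0.1, 0.0],
    "f3": [0.0, 1.0, 0.0],
}

clones_of_f1 = nearest_neighbours(embeddings, "f1", k=1)
clones_of_f3 = nearest_neighbours(embeddings, "f3", k=2)
```

Each fragment is embedded once and then looked up against an index, so the expensive model runs once per fragment instead of once per pair of fragments.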

4. Improving students' code correctness and test completeness by informal specifications

Authors: Arno Broeders, Ruud Hermans, Sylvia Stuurman, Lex Bijlsma, Harrie Passier

Abstract: The quality of software produced by students is often poor. How to teach students to develop good quality software has long been a topic in computer science education and research. We must conclude that we still do not have a good answer to this question. Specifications are necessary to determine the correctness of software, to develop error-free software and to write complete tests. Several attempts have been made to teach students to write specifications before writing code. So far, that has not proven to be very successful: students do not like to write a specification and do not see the benefits of writing specifications. In this paper we focus on the use of informal specifications. Instead of teaching students how to write specifications, we teach them how to use informal specifications to develop correct software. The results were surprising: the number of errors in software and the completeness of tests both improved considerably and, most importantly, students really appreciate the specifications. We think that if students appreciate specifications, we have a key to teaching them how to specify and to appreciate the value of doing so.

5. A study on the impact of pre-trained model on Just-In-Time defect prediction

Authors: Yuxiang Guo, Xiaopeng Gao, Zhenyu Zhang, W. K. Chan, Bo Jiang

Abstract: Previous researchers conducting Just-In-Time (JIT) defect prediction tasks have primarily focused on the performance of individual pre-trained models, without exploring the relationship between different pre-trained models as backbones. In this study, we build six models: RoBERTaJIT, CodeBERTJIT, BARTJIT, PLBARTJIT, GPT2JIT, and CodeGPTJIT, each with a distinct pre-trained model as its backbone. We systematically explore the differences and connections between these models. Specifically, we investigate the performance of the models when using Commit code and Commit message as inputs, as well as the relationship between training efficiency and model distribution among these six models. Additionally, we conduct an ablation experiment to explore the sensitivity of each model to inputs. Furthermore, we investigate how the models perform in zero-shot and few-shot scenarios. Our findings indicate that each model based on a different backbone shows improvements, and that when the backbones' pre-trained models are similar, the training resources consumed are much closer. We also observe that Commit code plays a significant role in defect detection, and different pre-trained models demonstrate better defect detection ability with a balanced dataset under few-shot scenarios. These results provide new insights for optimizing JIT defect prediction tasks using pre-trained models and highlight the factors that require more attention when constructing such models. Additionally, CodeGPTJIT and GPT2JIT achieved better performance than DeepJIT and CC2Vec on the two datasets, respectively, with 2000 training samples. These findings emphasize the effectiveness of transformer-based pre-trained models in JIT defect prediction tasks, especially in scenarios with limited training data.

6. Revisiting File Context for Source Code Summarization

Authors: Aakash Bansal, Chia-Yi Su, Collin McMillan

Abstract: Source code summarization is the task of writing natural language descriptions of source code. A typical use case is generating short summaries of subroutines for use in API documentation. The heart of almost all current research into code summarization is the encoder-decoder neural architecture, and the encoder input is almost always a single subroutine or other short code snippet. The problem with this setup is that the information needed to describe the code is often not present in the code itself; that information often resides in other nearby code. In this paper, we revisit the idea of "file context" for code summarization. File context is the idea of encoding select information from other subroutines in the same file. We propose a novel modification of the Transformer architecture that is purpose-built to encode file context and demonstrate its improvement over several baselines. We find that file context helps on a subset of challenging examples where traditional approaches struggle.

7. Contextual Predictive Mutation Testing

Authors: Kush Jain, Uri Alon, Alex Groce, Claire Le Goues

Abstract: Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect the synthetic bugs, without actually running those tests. However, existing predictive mutation testing approaches still misclassify 33% of detection outcomes on a randomly sampled set of mutant-test suite pairs. We introduce MutationBERT, an approach for predictive mutation testing that simultaneously encodes the source method mutation and test method, capturing key context in the input representation. Thanks to its higher precision, MutationBERT saves 33% of the time spent by a prior approach on checking/verifying live mutants. MutationBERT also outperforms the state-of-the-art in both same-project and cross-project settings, with meaningful improvements in precision, recall, and F1 score. We validate our input representation and our aggregation approaches for lifting predictions from the test matrix level to the test suite level, finding similar improvements in performance. MutationBERT not only enhances the state-of-the-art in predictive mutation testing, but also presents practical benefits for real-world applications, both in saving developer time and in finding hard-to-detect mutants.

8. Mind the Gap: The Difference Between Coverage and Mutation Score Can Guide Testing Efforts

Authors: Kush Jain, Goutamkumar Tulajappa Kalburgi, Claire Le Goues, Alex Groce

Abstract: An "adequate" test suite should effectively find all inconsistencies between a system's requirements/specifications and its implementation. Practitioners frequently use code coverage to approximate adequacy, while academics argue that mutation score may better approximate true (oracular) adequacy coverage. High code coverage is increasingly attainable even on large systems via automatic test generation, including fuzzing. In light of all of these options for measuring and improving testing effort, how should a QA engineer spend their time? We propose a new framework for reasoning about the extent, limits, and nature of a given testing effort based on an idea we call the oracle gap, or the difference between source code coverage and mutation score for a given software element. We conduct (1) a large-scale observational study of the oracle gap across popular Maven projects, (2) a study that varies testing and oracle quality across several of those projects, and (3) a small-scale observational study of highly critical, well-tested code across comparable blockchain projects. We show that the oracle gap surfaces important information about the extent and quality of a test effort beyond either adequacy metric alone. In particular, it provides a way for practitioners to identify source files where a weak oracle is likely testing important code.
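
The oracle gap as defined in this abstract (coverage minus mutation score for a software element) is simple enough to sketch directly. The file names and numbers below are invented for illustration, and ranking files by descending gap is an assumed way of using the metric, not a procedure taken from the paper.

```python
def oracle_gap(coverage, mutation_score):
    """Oracle gap: code coverage minus mutation score, both in [0, 1]."""
    return coverage - mutation_score


def rank_by_gap(files):
    """Sort per-file records so the largest oracle gap comes first."""
    return sorted(
        files,
        key=lambda f: oracle_gap(f["coverage"], f["mutation_score"]),
        reverse=True,
    )


# Hypothetical per-file measurements (illustrative values only).
files = [
    {"name": "Parser.java", "coverage": 0.80, "mutation_score": 0.75},
    {"name": "Wallet.java", "coverage": 0.95, "mutation_score": 0.40},
    {"name": "Utils.java", "coverage": 0.30, "mutation_score": 0.10},
]

ranked_names = [f["name"] for f in rank_by_gap(files)]
```

A file such as the hypothetical `Wallet.java`, with high coverage but a low mutation score, is executed by the tests while many injected faults go unnoticed, which is the weak-oracle situation the abstract describes.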