1. ToxiSpanSE: An Explainable Toxicity Detection in Code Review Comments

Authors: Jaydeb Sarker, Sayma Sultana, Steven R. Wilson, Amiangshu Bosu

Abstract: Background: The existence of toxic conversations in open-source platforms can degrade relationships among software developers and may negatively impact software product quality. To help mitigate this, some initial work has been done to detect toxic comments in the Software Engineering (SE) domain. Aims: Since automatically classifying an entire text as toxic or non-toxic does not help human moderators understand the specific reason(s) for the toxicity, we worked to develop an explainable toxicity detector for the SE domain. Method: Our explainable toxicity detector identifies specific spans of toxic content in SE texts, which can help human moderators by automatically highlighting those spans. This toxic span detection model, ToxiSpanSE, is trained on 19,651 code review (CR) comments with labeled toxic spans. Our annotators labeled the toxic spans within 3,757 toxic CR samples. We explored several types of models, including one lexicon-based approach and five different transformer-based encoders. Results: After an extensive evaluation of all models, we found that our fine-tuned RoBERTa model achieved the best score, with 0.88 $F1$, 0.87 precision, and 0.93 recall for toxic-class tokens, providing an explainable toxicity classifier for the SE domain. Conclusion: Since ToxiSpanSE is the first tool to detect toxic spans in the SE domain, it paves the way toward combating toxicity in the SE community.
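
A minimal sketch of the kind of token-level span detection the abstract describes, using a HuggingFace-style token classifier. The `roberta-base` checkpoint, the binary toxic/non-toxic label scheme, and the example comment are illustrative assumptions, not ToxiSpanSE's released artifacts.

```python
# Hedged sketch: transformer-based toxic span detection via token
# classification. Labels are assumed binary (0 = non-toxic, 1 = toxic);
# the head below is untrained, so real use requires fine-tuned weights.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=2)

comment = "This patch is garbage, did you even test it?"
enc = tokenizer(comment, return_offsets_mapping=True, return_tensors="pt")
offsets = enc.pop("offset_mapping")[0]

with torch.no_grad():
    logits = model(**enc).logits          # shape: (1, seq_len, 2)
pred = logits.argmax(dim=-1)[0]

# Map toxic token predictions back to character spans so a moderator sees
# highlighted substrings rather than a single comment-level label.
spans = [(s, e) for (s, e), p in zip(offsets.tolist(), pred.tolist())
         if p == 1 and e > s]
print([comment[s:e] for s, e in spans])
```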

2. Systematic Review on Privacy Categorization

Authors: Paola Inverardi, Patrizio Migliarini, Massimiliano Palmiero

Abstract: In the modern digital world, users need to make privacy and security choices that have far-reaching consequences. Researchers are increasingly studying people's decisions when facing privacy and security trade-offs, the pressing and time-consuming disincentives that influence those decisions, and methods to mitigate them. This work presents a systematic review of the literature on privacy categorization, which has been defined in terms of profiles, profiling, segmentation, clustering, and personae. Privacy categorization involves classifying users according to specific prerequisites, such as their ability to manage privacy issues, or in terms of which types of personal information, and how much of it, they decide to disclose or withhold. Privacy categorization has been defined and used for different purposes. The systematic review focuses on three main research questions, investigating: the study contexts, i.e. the motivations and research goals, behind proposed privacy categorizations; the methodologies and results of those categorizations; and the evolution of privacy categorization over time. Ultimately, it seeks to answer whether privacy categorization is still a meaningful research endeavor and whether it has a future.
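
One common operationalization of privacy categorization surveyed here is segmenting users by their privacy attitudes. The sketch below clusters hypothetical Likert-scale survey answers; the survey items, sample size, and three-segment split are invented for illustration and are not taken from the review.

```python
# Hedged sketch: privacy "segmentation" via k-means over survey responses.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Rows: users; columns: hypothetical 1-5 Likert items, e.g. comfort sharing
# location, willingness to trade data for convenience, permission-review habit.
answers = rng.integers(1, 6, size=(200, 3)).astype(float)

X = StandardScaler().fit_transform(answers)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for seg in range(3):
    members = answers[labels == seg]
    print(f"segment {seg}: {len(members)} users, "
          f"mean answers {members.mean(axis=0).round(2)}")
```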

3. Specification, Validation and Verification of Social, Legal, Ethical, Empathetic and Cultural Requirements for Autonomous Agents

Authors: Sinem Getir Yaman, Ana Cavalcanti, Radu Calinescu, Colin Paterson, Pedro Ribeiro, Beverley Townsend

Abstract: Autonomous agents are increasingly being proposed for use in healthcare, assistive care, education, and other applications governed by complex human-centric norms. To ensure compliance with these norms, the rules they induce need to be unambiguously defined, checked for consistency, and used to verify the agent. In this paper, we introduce a framework for formal specification, validation and verification of social, legal, ethical, empathetic and cultural (SLEEC) rules for autonomous agents. Our framework comprises: (i) a language for specifying SLEEC rules and rule defeaters (that is, circumstances in which a rule does not apply or an alternative form of the rule is required); (ii) a formal semantics (defined in the process algebra tock-CSP) for the language; and (iii) methods for detecting conflicts and redundancy within a set of rules, and for verifying the compliance of an autonomous agent with such rules. We show the applicability of our framework for two autonomous agents from different domains: a firefighter UAV and an assistive-dressing robot.
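
The abstract's notion of rules with defeaters can be made concrete with a toy encoding. The sketch below is not the paper's SLEEC language or its tock-CSP semantics; the rule structure, the example assistive-care rules, and the direct-conflict check are simplified assumptions.

```python
# Hedged sketch: SLEEC-style rules with defeaters and a naive conflict check.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    trigger: str           # event that activates the rule
    response: str          # required response
    defeaters: tuple = ()  # circumstances under which the rule yields

RULES = [
    Rule("UserFallDetected", "SoundAlarm"),
    Rule("UserFallDetected", "StaySilent", defeaters=("UserAskedForQuiet",)),
]

# Two active rules conflict if the same trigger demands contradictory responses.
CONTRADICTORY = {frozenset({"SoundAlarm", "StaySilent"})}

def conflicts(rules, context):
    active = [r for r in rules if not any(d in context for d in r.defeaters)]
    return [(a.response, b.response)
            for i, a in enumerate(active) for b in active[i + 1:]
            if a.trigger == b.trigger
            and frozenset({a.response, b.response}) in CONTRADICTORY]

print(conflicts(RULES, context=set()))                   # conflict detected
print(conflicts(RULES, context={"UserAskedForQuiet"}))   # defeater resolves it
```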

4. Compositionality in Model-Based Testing

Authors: Gijs van Cuyck, Lars van Arragon, Jan Tretmans

Abstract: Model-based testing (MBT) promises a scalable solution to testing large systems, if a model is available. Creating these models for large systems, however, has proven to be difficult. Composing larger models from smaller ones could solve this, but our current MBT conformance relation $\textbf{uioco}$ is not compositional, i.e., correctly tested components, when composed into a system, can still lead to a faulty system. To catch these integration problems, we introduce a new relation over component models called $\textbf{mutual acceptance}$. Mutually accepting components are guaranteed to communicate correctly, which makes MBT compositional. In addition to providing compositionality, mutual acceptance has benefits when retesting systems with updated components and when diagnosing component-based systems.
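
The integration problem that mutual acceptance targets can be illustrated with two toy components: every output one component can produce in a reachable composed state should be an input its peer is ready to accept. The transition-system encoding and the check below are simplified assumptions, not the paper's formal $\textbf{uioco}$ or mutual-acceptance definitions.

```python
# Hedged sketch: flag outputs one component can emit that its peer cannot
# accept in the reachable states of their synchronous composition.
# Components: {state: {label: next_state}}, "!x" = output, "?x" = input.
producer = {"p0": {"!req": "p1"}, "p1": {"?ack": "p0", "!data": "p1"}}
consumer = {"c0": {"?req": "c1"}, "c1": {"!ack": "c0"}}  # never accepts ?data

def unaccepted_outputs(a, b, start):
    """Explore reachable composed states; collect outputs the peer rejects."""
    seen, stack, bad = set(), [start], []
    while stack:
        sa, sb = stack.pop()
        if (sa, sb) in seen:
            continue
        seen.add((sa, sb))
        for label, na in a[sa].items():          # outputs of component a
            if label.startswith("!"):
                if "?" + label[1:] in b[sb]:
                    stack.append((na, b[sb]["?" + label[1:]]))
                else:
                    bad.append((sa, sb, label))
        for label, nb in b[sb].items():          # outputs of component b
            if label.startswith("!"):
                if "?" + label[1:] in a[sa]:
                    stack.append((a[sa]["?" + label[1:]], nb))
                else:
                    bad.append((sa, sb, label))
    return bad

print(unaccepted_outputs(producer, consumer, ("p0", "c0")))
# -> [('p1', 'c1', '!data')]: the consumer cannot accept !data there
```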

5. ConStaBL -- A Fresh Look at Software Engineering with State Machines

Authors: Karthika Venkatesan, Sujit Kumar Chakrabarti

Abstract: The statechart is a visual modelling language for systems. In this paper, we extend our earlier work on modular statecharts with local variables and present an updated operational semantics for statecharts with concurrency. Our variant of the statechart has local variables, which interact significantly with the remainder of the language semantics. Our semantics does not allow transition conflicts in simulations and is, in that sense, stricter than most other available statechart semantics. It allows arbitrary interleaving of concurrently executing action code, which enables more precise modelling of systems and upstream analysis of those models. We present the operational semantics in the form of a simulation algorithm. We also establish criteria, based on our semantics, for defining conflicting transitions and valid simulations. Our semantics is executable and can be used to simulate statechart models and verify their correctness. We present a preliminary setup for fuzz testing of statechart models, an idea that does not seem to have a precedent in the literature. We have used our simulator in conjunction with a well-known fuzzer to fuzz-test statechart models of non-trivial sizes and have found issues in them that would have been hard to find through inspection.
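
The fuzzing idea is easy to picture on a toy model: a flat statechart with guarded transitions, a strict step function that raises an error on conflicting enabled transitions, and a fuzzer feeding random event sequences. The model, the conflict rule, and the seeded defect below are invented for illustration; the paper's semantics (local variables, concurrency, interleaving) is far richer.

```python
# Hedged sketch: fuzz testing a toy statechart for transition conflicts.
import random

# Transitions: (source_state, event, guard, target_state).
TRANSITIONS = [
    ("idle",    "start", lambda v: True,           "running"),
    ("running", "tick",  lambda v: v["count"] < 3,  "running"),
    ("running", "tick",  lambda v: v["count"] >= 3, "done"),
    ("running", "tick",  lambda v: v["count"] >= 3, "idle"),  # seeded conflict
]

def step(state, event, variables):
    enabled = [t for t in TRANSITIONS
               if t[0] == state and t[1] == event and t[2](variables)]
    if len(enabled) > 1:  # strict semantics: conflicting transitions are errors
        raise RuntimeError(f"transition conflict in {state} on {event}")
    if enabled:
        variables["count"] += 1
        return enabled[0][3]
    return state  # no enabled transition: event is ignored

random.seed(1)
for trial in range(100):            # fuzz with random event sequences
    state, variables = "idle", {"count": 0}
    events = random.choices(["start", "tick"], k=10)
    try:
        for e in events:
            state = step(state, e, variables)
    except RuntimeError as err:
        print(f"trial {trial}: {err} after {events}")
        break
```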

6. Exploring and Characterizing Large Language Models for Embedded System Development and Debugging

Authors: Zachary Englhardt, Richard Li, Dilini Nissanka, Zhihan Zhang, Girish Narayanswamy, Joseph Breda, Xin Liu, Shwetak Patel, Vikram Iyer

Abstract: Large language models (LLMs) have shown remarkable abilities to generate code; however, their ability to develop software for embedded systems, which requires cross-domain knowledge of hardware and software, has not been studied. In this paper we systematically evaluate leading LLMs (GPT-3.5, GPT-4, PaLM 2) to assess their performance for embedded system development, study how human programmers interact with these tools, and develop an AI-based software engineering workflow for building embedded systems. We develop an end-to-end hardware-in-the-loop evaluation platform for verifying LLM-generated programs using sensor-actuator pairs. We compare all three models across N=450 experiments and find, surprisingly, that GPT-4 in particular shows an exceptional level of cross-domain understanding and reasoning, in some cases generating fully correct programs from a single prompt. In N=50 trials, GPT-4 produces functional I2C interfaces 66% of the time. GPT-4 also produces register-level drivers, code for LoRa communication, and context-specific power optimizations for an nRF52 program, resulting in an over 740x current reduction, down to 12.2 uA. We also characterize the models' limitations in order to develop a generalizable workflow for using LLMs in embedded system development. We evaluate the workflow with 15 users, including novice and expert programmers. We find that our workflow improves productivity for all users and increases the success rate for building a LoRa environmental sensor from 25% to 100%, including for users with zero hardware or C/C++ experience.
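
The hardware-in-the-loop idea can be sketched as follows: a monitor board with a paired sensor observes what the LLM-generated firmware drives on an actuator pin, and the harness checks the observation against the prompt's intent. The serial port name, the one-line "TOGGLE" protocol, and the tolerance are hypothetical; the paper's actual platform differs.

```python
# Hedged sketch: verifying LLM-generated firmware with a sensor-actuator pair.
import time
import serial  # pyserial

MONITOR_PORT = "/dev/ttyACM0"  # hypothetical monitor board streaming sensor events

def verify_blink(expected_edges, window_s=5.0):
    """Count logic-level edges reported by the monitor over a time window and
    compare against what the generated program was prompted to produce."""
    with serial.Serial(MONITOR_PORT, 115200, timeout=1) as mon:
        edges, deadline = 0, time.time() + window_s
        while time.time() < deadline:
            line = mon.readline().decode(errors="ignore").strip()
            if line == "TOGGLE":
                edges += 1
    return abs(edges - expected_edges) <= 1  # allow boundary jitter

# e.g. firmware prompted to blink an LED at 1 Hz should yield ~10 edges in 5 s
print("PASS" if verify_blink(expected_edges=10) else "FAIL")
```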

7. Towards Automated Classification of Code Review Feedback to Support Analytics

Authors: Asif Kamal Turzo, Fahim Faysal, Ovi Poddar, Jaydeb Sarker, Anindya Iqbal, Amiangshu Bosu

Abstract: Background: As improving code review (CR) effectiveness is a priority for many software development organizations, projects have deployed CR analytics platforms to identify potential improvement areas. The number of issues identified, which is a crucial metric to measure CR effectiveness, can be misleading if all issues are placed in the same bin. Therefore, a finer-grained classification of issues identified during CRs can provide actionable insights to improve CR effectiveness. Although a recent work by Fregnan et al. proposed automated models to classify CR-induced changes, we have noticed two potential improvement areas -- i) classifying comments that do not induce changes and ii) using deep neural networks (DNNs) in conjunction with code context to improve performance. Aims: This study aims to develop an automated CR comment classifier that leverages DNN models to achieve more reliable performance than Fregnan et al. Method: Using a manually labeled dataset of 1,828 CR comments, we trained and evaluated supervised learning-based DNN models leveraging code context, comment text, and a set of code metrics to classify CR comments into one of the five high-level categories proposed by Turzo and Bosu. Results: Based on our 10-fold cross-validation evaluations of multiple combinations of tokenization approaches, we found that a model using CodeBERT achieved the best accuracy, 59.3%. Our approach outperforms that of Fregnan et al., achieving 18.7% higher accuracy. Conclusion: Besides facilitating improved CR analytics, our proposed model can help developers prioritize code review feedback and select reviewers.
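
A minimal sketch of the classification setup the abstract describes: a CodeBERT encoder fed the comment and its code context as a sentence pair, with a five-way classification head. The placeholder label names, example inputs, and untrained head are assumptions; the actual Turzo and Bosu categories and the paper's trained model are not reproduced here.

```python
# Hedged sketch: five-way CR comment classification with CodeBERT.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["category_1", "category_2", "category_3", "category_4", "category_5"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=len(LABELS))

comment = "Consider renaming this variable; 'tmp' hides its purpose."
code_context = "int tmp = computeRetryBudget(cfg);"

# Encode comment plus code context as a sentence pair, the usual way to give
# a BERT-style encoder two related segments.
enc = tokenizer(comment, code_context, truncation=True, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1).item()
print(LABELS[pred])
```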