IEEE Transactions on Software Engineering - New TOC Alert for Publication #32
- 2023 Reviewers List (February 12, 2024 at 1:23 pm)
- Federated Learning for Software Engineering: A Case Study of Code Clone Detection and Defect Prediction (January 3, 2024 at 1:19 pm)
In various research domains, artificial intelligence (AI) has gained significant prominence, leading to the development of numerous learning-based models in research laboratories, which are evaluated using benchmark datasets. While the models proposed in previous studies may demonstrate satisfactory performance on benchmark datasets, translating academic findings into practical applications for industry practitioners presents challenges. This can entail either the direct adoption of trained academic models into industrial applications, leading to a performance decrease, or retraining models with industrial data, a task often hindered by insufficient data instances or skewed data distributions. Real-world industrial data is typically far more intricate than benchmark datasets, frequently exhibiting data-skewing issues such as label distribution skews and quantity skews. Furthermore, accessing industrial data, particularly source code, can prove challenging for Software Engineering (SE) researchers due to privacy policies. This limitation hinders SE researchers' ability to gain insights into industry developers' concerns and subsequently enhance their proposed models. To bridge the divide between academic models and industrial applications, we introduce a federated learning (FL)-based framework called Almity. Our aim is to simplify the process of putting research findings into practical use for both SE researchers and industry developers. Almity enhances model performance on sensitive, skewed data distributions while ensuring data privacy and security. It introduces an innovative aggregation strategy that takes into account three key attributes: data scale, data balance, and minority class learnability. This strategy is employed to refine model parameters, thereby enhancing model performance on sensitive, skewed datasets. In our evaluation, we employ two well-established SE tasks, code clone detection and defect prediction.
We compare the performance of Almity on both machine learning (ML) and deep learning (DL) models against two mainstream training methods, specifically the Centralized Training Method (CTM) and Vanilla Federated Learning (VFL), to validate the effectiveness and generalizability of Almity. Our experimental results demonstrate that our framework is not only feasible but also practical in real-world scenarios. Almity consistently enhances the performance of learning-based models, outperforming baseline training methods across all types of data distributions.
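The abstract describes an aggregation strategy that weights client contributions by data scale, data balance, and minority-class learnability. The sketch below is a hypothetical illustration of such a skew-aware variant of FedAvg-style weighted averaging, not Almity's actual formula; the function name `aggregate` and the specific proxies (sample count for scale, a balance score in [0, 1], and minority-class recall for learnability) are all assumptions for illustration.

```python
import numpy as np

def aggregate(client_params, n_samples, balance, minority_recall):
    """Hypothetical skew-aware weighted aggregation of client parameters.

    Per-client attributes (all illustrative stand-ins):
      n_samples       - data scale (larger clients weigh more)
      balance         - label balance in [0, 1] (1 = perfectly balanced)
      minority_recall - learnability proxy: recall on the minority class
    """
    scale = np.asarray(n_samples, dtype=float)
    scale /= scale.sum()  # normalize data scale across clients
    # Combine the three attributes into one weight per client.
    w = scale * np.asarray(balance, dtype=float) * np.asarray(minority_recall, dtype=float)
    w /= w.sum()
    # Element-wise weighted average of each client's parameter vector.
    stacked = np.stack([np.asarray(p, dtype=float) for p in client_params])
    return (w[:, None] * stacked).sum(axis=0)
```

With equal scale, balance, and learnability this reduces to plain FedAvg; a client with a badly skewed label distribution or poor minority-class recall is down-weighted.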
- Code Review Automation: Strengths and Weaknesses of the State of the Art (January 1, 2024 at 1:19 pm)
The automation of code review has been tackled by several researchers with the goal of reducing its cost. The adoption of deep learning in software engineering pushed the automation to new boundaries, with techniques imitating developers in generative tasks, such as commenting on a code change as a reviewer would do or addressing a reviewer's comment by modifying code. The performance of these techniques is usually assessed through quantitative metrics, e.g., the percentage of instances in the test set for which correct predictions are generated, leaving many open questions on the techniques' capabilities. For example, knowing that an approach is able to correctly address a reviewer's comment in 10% of cases is of little value without knowing what was asked by the reviewer: What if in all successful cases the code change required to address the comment was just the removal of an empty line? In this paper we aim at characterizing the cases in which three code review automation techniques tend to succeed or fail in the two above-described tasks. The study has a strong qualitative focus, with ~105 man-hours invested in manually analyzing correct and wrong predictions generated by the three techniques, for a total of 2,291 inspected predictions. The output of this analysis is two taxonomies reporting, for each of the two tasks, the types of code changes on which the experimented techniques tend to succeed or to fail, pointing to areas for future work. Our manual analysis also identified several issues in the datasets used to train and test the experimented techniques. Finally, we assess the importance of research into techniques specialized for code review automation by comparing their performance with ChatGPT, a general-purpose large language model, finding that ChatGPT struggles to comment on code as a human reviewer would.
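The abstract's headline metric, the percentage of test instances for which a correct prediction is generated, is an exact-match rate. A minimal sketch of how such a metric is typically computed follows; the function name `exact_match_rate` and the whitespace normalization are assumptions, not the paper's exact evaluation script.

```python
def exact_match_rate(predictions, references):
    """Fraction of test instances whose prediction exactly matches the
    reference, after trivial whitespace normalization (an assumption;
    papers differ on how strictly they normalize)."""
    def norm(s):
        return " ".join(s.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

This is exactly the kind of metric the paper argues is insufficient on its own: a hit counts the same whether the required change was a one-line deletion or a substantial refactoring.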
- On Effectiveness and Efficiency of Gamified Exploratory GUI Testing (December 28, 2023 at 1:19 pm)
Context: Gamification appears to improve enjoyment and quality of execution of software engineering activities, including software testing. Though commonly employed in industry, manual exploratory testing of web application GUIs has been shown to be mundane and expensive. Gamification applied to that kind of testing activity has the potential to overcome its limitations, though no empirical research has explored this area yet. Goal: Collect preliminary insights on how gamification, when performed by novice testers, affects the effectiveness, efficiency, test case realism, and user experience in exploratory testing of web applications. Method: Common gamification features augment an existing exploratory testing tool: Final Score with Leaderboard, Injected Bugs, Progress Bar, and Exploration Highlights. The original tool and the gamified version are then compared in an experiment involving 144 participants. User experience is elicited using the Technology Acceptance Model (TAM) questionnaire instrument. Results: Statistical analysis identified several significant differences in metrics representing test effectiveness and efficiency, showing improved coverage when tests were developed with gamification. Additionally, user experience is improved with gamification. Conclusions: Gamification of exploratory testing has a tangible effect on how testers create test cases for web applications. While the results are mixed, the observed effects are largely beneficial and warrant more research. Future work should aim to confirm the presented results in the context of state-of-the-art testing tools and real-world development environments.
- Answering Uncertain, Under-Specified API Queries Assisted by Knowledge-Aware Human-AI Dialogue (December 25, 2023 at 1:22 pm)
Developers' API needs should be more pragmatic, such as seeking suggestive, explainable, and extensible APIs rather than the so-called best result. Existing API search research cannot meet these pragmatic needs because it is solely concerned with query-API relevance. This necessitates enhancing the entire query process: from query definition, through query refinement via intent clarification, to query results that promote divergent thinking about the results. This paper designs a novel Knowledge-Aware Human-AI Dialogue agent (KAHAID) which guides the developer to clarify an uncertain, under-specified query through multi-round question answering and recommends APIs for the clarified query with relevance explanations and extended suggestions (e.g., alternative, collaborating, or opposite-function APIs). We systematically evaluate KAHAID. In terms of the human-AI dialogue process, it achieves a high diversity of question options (the average diversity between any two options is 74.9%) and the ability to guide developers to find APIs using fewer dialogue rounds (no more than 3 rounds on average). For API recommendation, KAHAID achieves an MRR and MAP of 0.769 and 0.794, outperforming the state-of-the-art API search approaches BIKER and CLEAR by at least 47% in MRR and 226.7% in MAP. For knowledge extension, KAHAID obtains an MRR and MAP of 0.815 and 0.864, surpassing state-of-the-art query clarification approaches by at least 42% in MRR and 45.2% in MAP. As the first of its kind, KAHAID opens the door to integrating the immediate response capability of API search with the interaction, clarification, explanation, and extensibility capabilities of socio-technical information seeking.
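The recommendation results above are reported in Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP). A minimal sketch of the standard definitions of these two ranking metrics follows; the function names and input conventions (a ranked list plus a set of relevant items per query) are illustrative assumptions, not KAHAID's evaluation code.

```python
def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant
    item per query (contributes 0 when nothing relevant is retrieved)."""
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for i, item in enumerate(ranking, start=1):
            if item in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def mean_ap(ranked_lists, relevant):
    """Mean Average Precision: mean over queries of the average of
    precision@k taken at each rank k holding a relevant item."""
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        hits, precision_sum = 0, 0.0
        for i, item in enumerate(ranking, start=1):
            if item in rel:
                hits += 1
                precision_sum += hits / i
        total += precision_sum / max(len(rel), 1)
    return total / len(ranked_lists)
```

For example, two queries whose single relevant API appears at ranks 1 and 2 respectively yield MRR = (1 + 0.5) / 2 = 0.75.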