590N: SE Reading Group
Spring 2019 — Monday, 3:30pm — CSE 203
Subscribe to the calendar: iCal or Google Calendar.
We'll be reading and discussing exciting recent papers from the software engineering community. Participants should subscribe to the 590n mailing list; note that the list also includes many current and former department members interested in software engineering.
Some paper links may point into the ACM Digital Library or the Springer online collection. Using a UW IP address, or the UW Libraries' off-campus access, should provide access.
Paper Suggestions
- BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes (ICSE 2019)
- Should computer scientists experiment more? (IEEE Computer 1998)
- Large-Scale Analysis of Framework-Specific Exceptions in Android Apps (ICSE 2018)
- Automatically Generating Precise Oracles from Structured Natural Language Specifications (ICSE 2019)
- Training Binary Classifiers as Data Structure Invariants (ICSE 2019)
- Automatic Patch Generation by Learning Correct Code (POPL 2016)
- Are deep neural networks the best choice for modeling source code? (FSE 2017)
- Interactive Production Performance Feedback in the IDE (ICSE 2019)
- Oreo: Detection of Clones in the Twilight Zone (ESEC/FSE 2018)
- DetReduce
- Static Automated Program Repair for Heap Properties (ICSE 2018)
- Towards a Theory of Software Development Expertise (ESEC/FSE 2018)
- Do Android taint analysis tools keep their promises? (ESEC/FSE 2018)
- Graph Embedding based Familial Analysis of Android Malware using Unsupervised Learning
- Test Equivalence Analysis (Mike)
- Scalable (Mike)
- Reasonably Most General Clients (Mike)
- Predictive Test Selection
- Assessing Transition-based Test Selection Algorithms at Google
- Chaff from the Wheat: Characterizing and Determining Valid Bug Reports
- VFix: Value-Flow-Guided Precise Program Repair for Null Pointer Dereferences
- Detecting Incorrect Build Rules (ICSE 2019)
Paper Abstracts
- BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes (ICSE 2019)
- Abstract:
Fault-detection, localization, and repair methods are vital to software quality; but it is difficult to evaluate their generality, applicability, and current effectiveness. Large, diverse, realistic datasets of durably-reproducible faults and fixes are vital to good experimental evaluation of approaches to software quality, but they are difficult and expensive to assemble and keep current. Modern continuous-integration (CI) approaches, like TRAVIS-CI, which are widely used, fully configurable, and executed within custom-built containers, promise a path toward much larger defect datasets. If we can identify and archive failing and subsequent passing runs, the containers will provide a substantial assurance of durable future reproducibility of build and test. Several obstacles, however, must be overcome to make this a practical reality. We describe BUGSWARM, a toolset that navigates these obstacles to enable the creation of a scalable, diverse, realistic, continuously growing set of durably reproducible failing and passing versions of real-world, open-source systems. The BUGSWARM toolkit has already gathered 3,091 fail-pass pairs, in Java and Python, all packaged within fully reproducible containers. Furthermore, the toolkit can be run periodically to detect fail-pass activities, thus growing the dataset continually.
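To make the mining step concrete: the heart of the approach is scanning a project's CI history for a failing build immediately followed by a passing one. Here is a minimal sketch of that scan, not the BugSwarm toolset itself; the Build record format is invented for the example.

```python
# A minimal sketch (not the BugSwarm toolset): scan a project's CI build
# history for fail-pass pairs, i.e., a failing build immediately followed
# by a passing build on the same branch, which is the raw material BugSwarm
# packages into reproducible containers. The Build record is hypothetical.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Build:
    commit: str
    branch: str
    passed: bool

def fail_pass_pairs(builds: list[Build]) -> Iterator[tuple[Build, Build]]:
    """Yield (failing, fixing) pairs of consecutive builds on one branch."""
    for prev, curr in zip(builds, builds[1:]):
        if prev.branch == curr.branch and not prev.passed and curr.passed:
            yield prev, curr

history = [Build("a1", "master", True),
           Build("b2", "master", False),  # regression introduced
           Build("c3", "master", True)]   # fix lands
for fail, fix in fail_pass_pairs(history):
    print(f"candidate pair: {fail.commit} -> {fix.commit}")
```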
- Should computer scientists experiment more? (IEEE Computer 1998)
- Abstract:
Computer scientists and practitioners defend their lack of experimentation with a wide range of arguments. Some arguments suggest that experimentation is inappropriate, too difficult, useless, and even harmful. This article discusses several such arguments to illustrate the importance of experimentation for computer science. It considers how the software industry is beginning to value experiments, because results may give a company a three- to five-year lead over the competition.
- Large-Scale Analysis of Framework-Specific Exceptions in Android Apps (ICSE 2018)
- Abstract:
Mobile apps have become ubiquitous. For app developers, it is a key priority to ensure their apps' correctness and reliability. However, many apps still suffer from occasional to frequent crashes, weakening their competitive edge. Large-scale, deep analyses of the characteristics of real-world app crashes can provide useful insights to guide developers, or help improve testing and analysis tools. However, such studies do not exist; this paper fills this gap. Over a four-month-long effort, we have collected 16,245 unique exception traces from 2,486 open-source Android apps, and observed that framework-specific exceptions account for the majority of these crashes. We then extensively investigated the 8,243 framework-specific exceptions (which took six person-months): (1) identifying their characteristics (e.g., manifestation locations, common fault categories), (2) evaluating their manifestation via state-of-the-art bug detection techniques, and (3) reviewing their fixes. Besides the insights they provide, these findings motivate and enable follow-up research on mobile apps, such as bug detection, fault localization, and patch generation. In addition, to demonstrate the utility of our findings, we have optimized Stoat, a dynamic testing tool, and implemented ExLocator, an exception localization tool, for Android apps. Stoat is able to quickly uncover three previously unknown, confirmed/fixed crashes in Gmail and Google+; ExLocator is capable of precisely locating the root causes of identified exceptions in real-world apps. Our substantial dataset is made publicly available to share with and benefit the community.
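One ingredient of such a study is deciding whether a crash is framework-specific at all. A hedged, minimal version of that bucketing step might look like the sketch below; the trace format and package-prefix list are simplifications, not the paper's exact criterion.

```python
# Simplified sketch of one bucketing step from the study: call a crash
# framework-specific if the exception was signaled inside framework code.
# The trace format and prefix list are simplifications for illustration.
FRAMEWORK_PREFIXES = ("android.", "com.android.", "java.", "javax.")

def is_framework_specific(frames: list[str]) -> bool:
    """frames[0] is the signaling frame, e.g. 'android.view.View.measure'."""
    return frames[0].startswith(FRAMEWORK_PREFIXES)

trace = ["android.view.ViewGroup.addView",         # where the exception arose
         "com.example.app.MainActivity.onCreate"]  # app code further down
print(is_framework_specific(trace))  # True
```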
- Automatically Generating Precise Oracles from Structured Natural Language Specifications (ICSE 2019)
- Abstract:
Software specifications often use natural language to describe the desired behavior, but such specifications are difficult to verify automatically. We present Swami, an automated technique that extracts test oracles and generates executable tests from structured natural language specifications. Swami focuses on exceptional behavior and boundary conditions that often cause field failures but that developers often fail to manually write tests for. Evaluated on the official JavaScript specification (ECMA- 262), 98.4% of the tests Swami generated were precise to the specification. Using Swami to augment developer-written test suites improved coverage and identified 1 previously unknown defect and 15 missing JavaScript features in Rhino, 1 previously unknown defect in Node.js, and 18 semantic ambiguities in the ECMA-262 specification.
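The core idea of extracting an oracle from a structured sentence can be pictured as template matching plus code generation. In the sketch below, the rule pattern, target method, and argument are invented for illustration; Swami's real templates follow ECMA-262's editorial conventions.

```python
# Hedged sketch of the Swami idea: match a structured specification sentence
# against a template and emit an executable test for the exceptional
# behavior it describes. The rule pattern, target method, and argument are
# invented for this example.
import re

RULE = re.compile(r"If Type\((\w+)\) is not Object, throw a (\w+) exception")

def oracle_to_test(method: str, sentence: str) -> str | None:
    m = RULE.search(sentence)
    if m is None:
        return None
    _arg, exc = m.groups()
    # Call the method with a non-object value and demand the specified error.
    return (f"var threw = null;\n"
            f"try {{ {method}(42); }} catch (e) {{ threw = e; }}\n"
            f"if (!(threw instanceof {exc})) throw 'expected {exc}';")

spec = "If Type(proto) is not Object, throw a TypeError exception."
print(oracle_to_test("Object.setPrototypeOf", spec))
```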
- Training Binary Classifiers as Data Structure Invariants (ICSE 2019)
- Abstract:
We present a technique that enables us to distinguish valid from invalid data structure objects. The technique is based on building an artificial neural network, more precisely a binary classifier, and training it to identify valid and invalid instances of a data structure. The obtained classifier can then be used in place of the data structure’s invariant, in order to attempt to identify (in)correct behaviors in programs manipulating the structure. In order to produce the valid objects to train the network, an assumed-correct set of object building routines is randomly executed. Invalid instances are produced by generating values for object fields that “break” the collected valid values, i.e., that assign values to object fields that have not been observed as feasible in the assumed-correct program executions that led to the collected valid instances. We experimentally assess this approach, over a benchmark of data structures. We show that this learning technique produces classifiers that achieve significantly better accuracy in classifying valid/invalid objects compared to a technique for dynamic invariant detection, and leads to improved bug finding.
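As a toy version of the setup, the sketch below encodes bounded singly linked lists as fixed-width vectors, labels builder-produced instances valid and field-perturbed ones invalid, and trains a classifier to stand in for the invariant (repOK). The encoding is invented for the example, and scikit-learn's MLPClassifier replaces the authors' network.

```python
# Toy sketch of the paper's setup: builder routines produce valid instances,
# perturbed field values that the builders never produce yield invalid ones,
# and a binary classifier is trained to play the role of the invariant.
import random
from sklearn.neural_network import MLPClassifier

N = 6  # nodes; vector[i] = index of node i's successor (N means null)

def valid_list() -> list[int]:
    """Builder routine: a random-length acyclic chain 0 -> 1 -> ... -> k."""
    k = random.randint(1, N - 1)
    return [i + 1 if i < k else N for i in range(N)]

def perturb(v: list[int]) -> list[int]:
    """Break one field with a value the builders never produce for it."""
    w = list(v)
    j = random.randrange(N)
    w[j] = random.choice([x for x in range(N + 1) if x not in (j + 1, N)])
    return w

valid = [valid_list() for _ in range(500)]
invalid = [perturb(v) for v in valid]
X, y = valid + invalid, [1] * len(valid) + [0] * len(invalid)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X, y)
print(clf.predict([valid_list(), perturb(valid_list())]))  # ideally [1, 0]
```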
- Oreo: Detection of Clones in the Twilight Zone (ESEC/FSE 2018)
- Abstract:
Source code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. In between Type-3 and Type-4, however, there lies a spectrum of clones that, although still exhibiting some syntactic similarities, are extremely hard to detect: the Twilight Zone. Most clone detectors reported in the literature fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench, and perform manual evaluation for precision. Oreo has both high recall and precision. More importantly, it pushes the boundary in detection of clones with moderate to weak syntactic similarity in a scalable manner.
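One of Oreo's scalability tricks is cheap candidate filtering before any expensive pairwise classification. The sketch below imitates that stage with a tiny, invented metric set and threshold; Oreo itself combines many method-level software metrics with a learned similarity model.

```python
# Illustrative sketch of metrics-based candidate filtering: summarize each
# method as a small vector of software metrics and only send pairs with
# similar metric profiles on to the (learned) clone classifier. The metric
# set and tolerance here are invented for the example.
def metrics(src: str) -> list[int]:
    tokens = src.split()
    return [len(tokens),                                   # size
            sum(t in ("if", "while", "for") for t in tokens),  # branching
            sum("(" in t for t in tokens)]                 # crude call proxy

def candidate_pair(m1: list[int], m2: list[int], tol: float = 0.3) -> bool:
    """Keep the pair only if every metric agrees within a relative tolerance."""
    return all(abs(a - b) <= tol * max(a, b, 1) for a, b in zip(m1, m2))

f = "int sum ( int [] a ) { int s = 0 ; for ( int x : a ) s += x ; return s ; }"
g = "int total ( int [] v ) { int t = 0 ; for ( int y : v ) t += y ; return t ; }"
print(candidate_pair(metrics(f), metrics(g)))  # True: worth classifying
```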
Also see the suggestions from last quarter.