590N: SE Reading Group
Winter 2019 — Monday, 3:30pm — CSE 203
Subscribe to the calendar: iCal or Google Calendar.

We’ll be reading and discussing exciting recent papers from the software engineering community. Participants should subscribe to the 590n mailing list. Note that the list also includes many current and former department members interested in software engineering.
Some paper links may point into the ACM Digital Library or the Springer online collection. Using a UW IP address, or the UW Libraries' off-campus access, should provide access.
Date | Who | What |
---|---|---|
Jan 7 | Mike, Rene | Introduction |
Jan 14 | Martin Kellogg | Efficient static checking of library updates (FSE 2018) by Darius Foo, Hendy Chua, Jason Yeo, Ming Yi Ang, and Asankhaya Sharma |
Jan 21 | No one | No meeting – MLK Day |
Jan 28 | Yim Register and Zhen Zhang | Darwinian Data Structure Selection (FSE 2018) by Michail Basios, Lingbo Li, Fan Wu, Leslie Kanthan, Earl Barr |
Feb 4 | No one | Snow day |
Feb 11 | Doug Woos | Snow day |
Feb 18 | No one | No meeting – Presidents’ Day |
Feb 25 | Yufeng | Testing Probabilistic Programming Systems (FSE 2018) by Saikat Dutta, Owolabi Legunsen, Zixin Huang, Sasa Misailovic |
Mar 4 | Remy | Aroma: Code Recommendation via Structural Code Search (arXiv 2018) by Sifei Luan, Di Yang, Koushik Sen, Satish Chandra |
Mar 11 | Rashmi | Randomized Testing of Distributed Systems with Probabilistic Guarantees (OOPSLA 2018) by Burcu Kulahcioglu Ozkan, Rupak Majumdar, Filip Niksic, Mitra Tabaei Befrouei, Georg Weissenbacher |
Mar 18 | Sam Kaufman | A practical guide for using statistical tests to assess randomized algorithms in software engineering (ICSE 2011) by Andrea Arcuri, Lionel Briand; and skim Statistical Errors in Software Engineering Experiments: A Preliminary Literature Review (ICSE 2018) by Rolando Reyes, Oscar Dieste, Efrain Fonseca, Natalia Juristo |
Paper Suggestions
- Analyzing the Analyzers: FlowDroid/IccTA, AmanDroid, and DroidSafe (ISSTA 2018) by Lina Qiu, Yingying Wang, Julia Rubin
- Do Android Taint Analysis Tools Keep their Promises? (FSE 2018) by Felix Pauck, Eric Bodden, Heike Wehrheim
- Aroma: Code Recommendation via Structural Code Search (arXiv 2018) by Sifei Luan, Di Yang, Koushik Sen, Satish Chandra
- code2vec: Learning Distributed Representations of Code (POPL 2019) by Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav
- Randomized Testing of Distributed Systems with Probabilistic Guarantees (OOPSLA 2018) by Burcu Kulahcioglu Ozkan, Rupak Majumdar, Filip Niksic, Mitra Tabaei Befrouei, Georg Weissenbacher
- One Test to Rule Them All (ISSTA 2017) by Alex Groce, Josie Holmes, Kevin Kellar
- Repositioning of Static Analysis Alarms (ISSTA 2018) by Tukaram Muske, Rohith Talluri, Alexander Serebrenik
- Automated patch extraction via syntax- and semantics-aware Delta debugging on source code changes (FSE 2018) by Masatomo Hashimoto, Akira Mori, Tomonori Izumida
- On automatically proving the correctness of math.h implementations (OOPSLA 2018) by Wonyeol Lee, Rahul Sharma, and Alex Aiken
- Towards a theory of software development expertise (FSE 2018) by Sebastian Baltes and Stephan Diehl
- What makes a code change easier to review: an empirical investigation on code change reviewability (FSE 2018) by Achyudh Ram, Anand Ashok Sawant, Marco Castelluccio, and Alberto Bacchelli
- How should compilers explain problems to developers? (FSE 2018) by Titus Barik, Denae Ford, Emerson Murphy-Hill, and Chris Parnin
- The Impact of Regular Expression Denial of Service (ReDoS) in Practice: an Empirical Study at the Ecosystem Scale (ESEC/FSE 2018) by James Davis, Christy Coghlan, Francisco Servant and Dongyoon Lee
- Adversarial Symbolic Execution for Detecting Concurrency-related Cache Timing Leaks (ESEC/FSE 2018) by Shengjian Guo, Meng Wu and Chao Wang
- Data Race Detection on Compressed Traces (ESEC/FSE 2018) by Dileep Kini, Umang Mathur and Mahesh Viswanathan
- An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems (ESEC/FSE 2018) by Yu Gao, Wensheng Dou, Feng Qin, Chushu Gao, Dong Wang, Jun Wei, Ruirui Huang, Li Zhou and Yongming Wu
- Oreo: Detection of Clones in the Twilight Zone (ESEC/FSE 2018) by Vaibhav Saini, Farima Farmahini Farahani, Yadong Lu, Pierre Baldi and Cristina Lopes
Paper Abstracts
- Analyzing the Analyzers: FlowDroid/IccTA, AmanDroid, and DroidSafe (ISSTA 2018)
- Abstract:
Numerous static analysis techniques have recently been proposed for identifying information flows in mobile applications. These techniques are compared to each other, usually on a set of syntactic benchmarks. Yet, configurations used for such comparisons are rarely described. Our experience shows that tools are often compared under different setup, rendering the comparisons irreproducible and largely inaccurate. In this paper, we provide a large, controlled, and independent comparison of the three most prominent static analysis tools: FlowDroid combined with IccTA, Amandroid, and DroidSafe. We evaluate all tools using common configuration setup and on the same set of benchmark applications. We compare the results of our analysis to the results reported in previous studies, identify main reasons for inaccuracy in existing tools, and provide suggestions for future research.
- DeepBugs: A Learning Approach to Name-based Bug Detection (OOPSLA 2018)
- Abstract:
Natural language elements in source code, e.g., the names of variables and functions, convey useful information. However, most existing bug detection tools ignore this information and therefore miss some classes of bugs. The few existing name-based bug detection approaches reason about names on a syntactic level and rely on manually designed and tuned algorithms to detect bugs. This paper presents DeepBugs, a learning approach to name-based bug detection, which reasons about names based on a semantic representation and which automatically learns bug detectors instead of manually writing them. We formulate bug detection as a binary classification problem and train a classifier that distinguishes correct from incorrect code. To address the challenge that effectively learning a bug detector requires examples of both correct and incorrect code, we create likely incorrect code examples from an existing corpus of code through simple code transformations. A novel insight learned from our work is that learning from artificially seeded bugs yields bug detectors that are effective at finding bugs in real-world code. We implement our idea into a framework for learning-based and name-based bug detection. Three bug detectors built on top of the framework detect accidentally swapped function arguments, incorrect binary operators, and incorrect operands in binary operations. Applying the approach to a corpus of 150,000 JavaScript files yields bug detectors that have a high accuracy (between 89% and 95%), are very efficient (less than 20 milliseconds per analyzed file), and reveal 102 programming mistakes (with 68% true positive rate) in real-world code.
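As a concrete, hypothetical illustration of the training-data seeding idea, the sketch below uses Python's `ast` module to create a "likely incorrect" variant of correct code by swapping the first two arguments of a call. DeepBugs itself targets JavaScript and pairs such seeded bugs with learned embeddings of identifier names; this only shows the seeding step.

```python
import ast

# Hypothetical sketch of the bug-seeding step: derive a "likely incorrect"
# example from correct code by swapping the first two arguments of a call.
# (DeepBugs targets JavaScript and combines this with learned name embeddings.)
class SwapFirstTwoArgs(ast.NodeTransformer):
    def visit_Call(self, node):
        self.generic_visit(node)
        if len(node.args) >= 2:
            node.args[0], node.args[1] = node.args[1], node.args[0]
        return node

correct = "average = divide(total, count)"
buggy = ast.unparse(SwapFirstTwoArgs().visit(ast.parse(correct)))
print(buggy)  # average = divide(count, total)  -> a negative training example
```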
- A practical guide for using statistical tests to assess randomized algorithms in software engineering (ICSE 2011)
- Abstract:
Randomized algorithms have been used to successfully address many different types of software engineering problems. This type of algorithms employ a degree of randomness as part of their logic. Randomized algorithms are useful for difficult problems where a precise solution cannot be derived in a deterministic way within reasonable time. However, randomized algorithms produce different results on every run when applied to the same problem instance. It is hence important to assess the effectiveness of randomized algorithms by collecting data from a large enough number of runs. The use of rigorous statistical tests is then essential to provide support to the conclusions derived by analyzing such data. In this paper, we provide a systematic review of the use of randomized algorithms in selected software engineering venues in 2009. Its goal is not to perform a complete survey but to get a representative snapshot of current practice in software engineering research. We show that randomized algorithms are used in a significant percentage of papers but that, in most cases, randomness is not properly accounted for. This casts doubts on the validity of most empirical results assessing randomized algorithms. There are numerous statistical tests, based on different assumptions, and it is not always clear when and how to use these tests. We hence provide practical guidelines to support empirical research on randomized algorithms in software engineering.
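For a sense of the style of analysis the guidelines advocate, here is a minimal sketch (with made-up data, assuming SciPy is available) that compares repeated runs of two randomized tools using a Mann-Whitney U test and a Vargha-Delaney A12 effect size, two techniques commonly associated with this paper's recommendations.

```python
# Minimal sketch: compare two randomized-algorithm configurations over
# repeated runs with a Mann-Whitney U test plus a Vargha-Delaney A12 effect
# size. The run data below is made up for illustration.
from scipy.stats import mannwhitneyu

runs_a = [71, 74, 68, 80, 77, 73, 75, 79, 72, 76]  # e.g. branch coverage, tool A
runs_b = [69, 70, 66, 72, 71, 68, 73, 70, 67, 71]  # e.g. branch coverage, tool B

stat, p_value = mannwhitneyu(runs_a, runs_b, alternative="two-sided")

# A12: probability that a random run of A beats a random run of B (ties count half).
wins = sum((a > b) + 0.5 * (a == b) for a in runs_a for b in runs_b)
a12 = wins / (len(runs_a) * len(runs_b))

print(f"U={stat:.1f}, p={p_value:.4f}, A12={a12:.2f}")
```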
- Aroma: Code Recommendation via Structural Code Search (arXiv 2018)
- Abstract:
Programmers often write code which have similarity to existing code written somewhere. A tool that could help programmers to search such similar code would be immensely useful. Such a tool could help programmers to extend partially written code snippets to completely implement necessary functionality, help to discover extensions to the partial code which are commonly done by other programmers, help to crosscheck against similar code written by other programmers, or help to add extra code which would avoid common mistakes and errors. We propose Aroma, a tool and technique for code recommendation via structural code search. Aroma indexes a huge code corpus including thousands of open-source projects, takes a partial code snippet as input, searches the indexed method bodies which contain the partial code snippet, clusters and intersects the results of search to recommend a small set of succinct code snippets which contain the query snippet and which appears as part of several programs in the corpus. We evaluated Aroma on several randomly selected queries created from the corpus and as well as those derived from the code snippets obtained from Stack Overflow, a popular website for discussing code. We found that Aroma was able to retrieve and recommend most relevant code snippets efficiently.
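The sketch below is a deliberately simplified stand-in for the search step, assuming a plain bag-of-identifiers similarity; Aroma's actual pipeline uses structural features from parse trees followed by pruning, clustering, and intersection of the top results.

```python
import re

# Simplified stand-in for structural code search: rank indexed method bodies
# by how many of the query snippet's identifiers they contain. Aroma uses
# parse-tree features and then clusters/intersects the top-ranked results.
def identifiers(code):
    return set(re.findall(r"[A-Za-z_]\w*", code))

def search(query, corpus, k=2):
    q = identifiers(query)
    ranked = sorted(corpus, key=lambda body: len(q & identifiers(body)), reverse=True)
    return ranked[:k]

corpus = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def count_words(text):\n    return len(text.split())",
]
print(search("with open(path) as f:", corpus, k=1)[0])
```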
- code2vec: Learning Distributed Representations of Code (POPL 2019)
- Abstract:
We present a neural model for representing snippets of code as continuous distributed vectors (“code embeddings”). The main idea is to represent a code snippet as a single fixed-length code vector, which can be used to predict semantic properties of the snippet. This is performed by decomposing code to a collection of paths in its abstract syntax tree, and learning the atomic representation of each path simultaneously with learning how to aggregate a set of them. We demonstrate the effectiveness of our approach by using it to predict a method’s name from the vector representation of its body. We evaluate our approach by training a model on a dataset of 14M methods. We show that code vectors trained on this dataset can predict method names from files that were completely unobserved during training. Furthermore, we show that our model learns useful method name vectors that capture semantic similarities, combinations, and analogies. Comparing previous techniques over the same data set, our approach obtains a relative improvement of over 75%, being the first to successfully predict method names based on a large, cross-project, corpus. Our trained model, visualizations and vector similarities are available as an interactive online demo at http://code2vec.org. The code, data and trained models are available at https://github.com/tech-srl/code2vec.
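To make the "collection of AST paths" idea concrete, here is a small sketch that extracts leaf-to-leaf path contexts from Python code using the standard `ast` module. It only illustrates the raw representation; code2vec works on Java methods and then learns embeddings and attention weights over such path contexts.

```python
import ast
import itertools

# Illustrative extraction of (leaf, AST-path, leaf) contexts, the raw input
# representation that code2vec embeds and aggregates with attention.
def leaves(tree):
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            yield node, node.id
        elif isinstance(node, ast.Constant):
            yield node, repr(node.value)

def path_contexts(source):
    tree = ast.parse(source)
    parents = {child: parent
               for parent in ast.walk(tree)
               for child in ast.iter_child_nodes(parent)}

    def ancestors(node):
        chain = []
        while node is not None:
            chain.append(node)
            node = parents.get(node)
        return chain

    for (n1, name1), (n2, name2) in itertools.combinations(list(leaves(tree)), 2):
        up, down = ancestors(n1), ancestors(n2)
        lca = next(a for a in up if a in down)  # lowest common ancestor
        path = up[:up.index(lca) + 1] + list(reversed(down[:down.index(lca)]))
        yield name1, "^".join(type(n).__name__ for n in path), name2

for context in path_contexts("def area(r): return 3.14159 * r * r"):
    print(context)
```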
- Randomized Testing of Distributed Systems with Probabilistic Guarantees (OOPSLA 2018)
- Abstract:
Several recently proposed randomized testing tools for concurrent and distributed systems come with theoretical guarantees on their success. The key to these guarantees is a notion of bug depth—the minimum length of a sequence of events sufficient to expose the bug—and a characterization of d-hitting families of schedules—a set of schedules guaranteed to cover every bug of given depth d. Previous results show that in certain cases the size of a d-hitting family can be significantly smaller than the total number of possible schedules. However, these results either assume shared-memory multithreading, or that the underlying partial ordering of events is known statically and has special structure. These assumptions are not met by distributed message-passing applications. In this paper, we present a randomized scheduling algorithm for testing distributed systems. In contrast to previous approaches, our algorithm works for arbitrary partially ordered sets of events revealed online as the program is being executed. We show that for partial orders of width at most w and size at most n (both statically unknown), our algorithm is guaranteed to sample from at most w^2 · n^(d−1) schedules, for every fixed bug depth d. Thus, our algorithm discovers a bug of depth d with probability at least 1 / (w^2 · n^(d−1)). As a special case, our algorithm recovers a previous randomized testing algorithm for multithreaded programs. Our algorithm is simple to implement, but the correctness arguments depend on difficult combinatorial results about online dimension and online chain partitioning of partially ordered sets. We have implemented our algorithm in a randomized testing tool for distributed message-passing programs. We show that our algorithm can find bugs in distributed systems such as Zookeeper and Cassandra, and empirically outperforms naive random exploration while providing theoretical guarantees.
- Verified Three-Way Program Merge (OOPSLA 2018)
- Abstract:
Even though many programmers rely on 3-way merge tools to integrate changes from different branches, such tools can introduce subtle bugs in the integration process. This paper aims to mitigate this problem by defining a semantic notion of conflict-freedom, which ensures that the merged program does not introduce new unwanted behaviors. We also show how to verify this property using a novel, compositional algorithm that combines lightweight summarization for shared program fragments with precise relational reasoning for the modifications. Towards this goal, our method uses a 4-way differencing algorithm on abstract syntax trees to represent different program versions as edits applied to a shared program with holes. This representation allows our verification algorithm to reason about different edits in isolation and compose them to obtain an overall proof of conflict freedom. We have implemented the proposed technique in a new tool called SafeMerge for Java and evaluate it on 52 real-world merge scenarios obtained from Github. The experimental results demonstrate the benefits of our approach over syntactic conflict-freedom and indicate that SafeMerge is both precise and practical.
- Statistical Errors in Software Engineering Experiments: A Preliminary Literature Review (ICSE 2018)
- Abstract:
Background: Statistical concepts and techniques are often applied incorrectly, even in mature disciplines such as medicine or psychology. Surprisingly, there are very few works that study statistical problems in software engineering (SE). Aim: Assess the existence of statistical errors in SE experiments. Method: Compile the most common statistical errors in experimental disciplines. Survey experiments published in ICSE to assess whether errors occur in high quality SE publications. Results: The same errors as identified in others disciplines were found in ICSE experiments, where 30% of the reviewed papers included several error types such as: a) missing statistical hypotheses, b) missing sample size calculation, c) failure to assess statistical test assumptions, and d) uncorrected multiple testing. This rather large error rate is greater for research papers where experiments are confined to the validation section. The origin of the errors can be traced back to: a) researchers not having sufficient statistical training, and, b) a profusion of exploratory research. Conclusions: This paper provides preliminary evidence that SE research suffers from the same statistical problems as other experimental disciplines. However, the SE community appears to be unaware of any shortcomings in its experiments, whereas other disciplines work hard to avoid these threats. Further research is necessary to find the underlying causes and set up corrective measures, but there are some potentially effective actions and are a priori easy to implement: a) improve the statistical training of SE researchers, and b) enforce quality assessment and reporting guidelines in SE publications.
- One Test to Rule Them All (ISSTA 2017)
- Abstract:
Test reduction has long been seen as critical for automated testing. However, traditional test reduction simply reduces the length of a test, but does not attempt to reduce semantic complexity. This paper extends previous efforts with algorithms for normalizing and generalizing tests. Rewriting tests into a normal form can reduce semantic complexity and even remove steps from an already deltadebugged test. Moreover, normalization dramatically reduces the number of tests that a reader must examine, partially addressing the “fuzzer taming” problem of discovering distinct faults in a set of failing tests. Generalization, in contrast, takes a test and reports what aspects of the test could have been changed while preserving the property that the test fails. Normalization plus generalization aids understanding of tests, including tests for complex and widely used APIs such as the NumPy numeric computation library and the ArcPy GIS scripting package. Normalization frequently reduces the number of tests to be examined by well over an order of magnitude, and often to just one test per fault. Together, ideally, normalization and generalization allow a user to replace reading a large set of tests that vary in unimportant ways with reading one annotated summary test.
- Repositioning of Static Analysis Alarms (ISSTA 2018)
- Abstract:
The large number of alarms reported by static analysis tools is often recognized as one of the major obstacles to industrial adoption of such tools. We present repositioning of alarms, a novel automatic postprocessing technique intended to reduce the number of reported alarms without affecting the errors uncovered by them. The reduction in the number of alarms is achieved by moving groups of related alarms along the control flow to a program point where they can be replaced by a single alarm. In the repositioning technique, as the locations of repositioned alarms are different than locations of the errors uncovered by them, we also maintain traceability links between a repositioned alarm and its corresponding original alarm(s). The presented technique is tool-agnostic and orthogonal to many other techniques available for postprocessing alarms. To evaluate the technique, we applied it as a postprocessing step to alarms generated for 4 verification properties on 16 open source and 4 industry applications. The results indicate that the alarms repositioning technique reduces the alarms count by up to 20% over the state-of-the-art alarms grouping techniques with a median reduction of 7.25%.
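A tiny made-up example of the intuition (not the paper's algorithm): two related per-branch alarms can be replaced by a single alarm at the earlier point where the suspicious value is computed, with traceability links back to the originals.

```python
# Made-up illustration of repositioning: a tool reporting on this function
# could replace the two per-branch division-by-zero alarms with one alarm at
# the point where d is computed, keeping links back to the original alarms.
def rate(total, count, mode):
    d = count - 1                 # repositioned alarm: d may be zero here
    if mode == "strict":
        return total / d          # original alarm 1: possible division by zero
    else:
        return (total + 1) / d    # original alarm 2: possible division by zero
```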
- Automated patch extraction via syntax- and semantics-aware Delta debugging on source code changes (FSE 2018)
- Abstract:
Delta debugging (DD) is an approach to automating the debugging activities based on systematic testing. DD algorithms find the cause of a regression of a program by minimizing the changes applied between a working version and a faulty version of the program. However, it is still an open problem to minimize a huge set of changes while avoiding any invalid subsets that do not result in testable programs, especially in case that no software configuration management system is available. In this paper, we propose a rule-based approach to syntactic and semantic decomposition of changes into independent components to facilitate DD on source code changes, and hence to extract patches automatically. For analyzing changes, we make use of tree differencing on abstract syntax trees instead of common differencing on plain texts. We have developed an experimental implementation for Java programs and applied it to 194 bug fixes from Defects4J and 8 real-life regression bugs from 6 open source Java projects. Compared to a DD tool based on plain text differencing, it extracted patches whose size is reduced by 50% at the cost of 5% more test executions for the former dataset and by 73% at the cost of 40% more test executions for the latter, both on average.
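For readers unfamiliar with DD, here is a minimal sketch of a simplified ddmin-style loop over a list of source-code changes. It only tests complements of chunks and assumes a caller-supplied `fails(subset)` that applies a subset of changes and reruns the failing test; the paper's contribution is the syntax- and semantics-aware decomposition that keeps such subsets buildable and testable.

```python
# Simplified ddmin-style reduction (complements only). fails(changes) is
# assumed to apply the given subset of changes and return True if the test
# still fails with that subset applied.
def ddmin(changes, fails):
    n = 2
    while len(changes) >= 2:
        chunk = max(1, len(changes) // n)
        chunks = [changes[i:i + chunk] for i in range(0, len(changes), chunk)]
        for i in range(len(chunks)):
            complement = [c for j, ch in enumerate(chunks) if j != i for c in ch]
            if complement and fails(complement):
                changes, n = complement, max(n - 1, 2)   # keep the smaller failing set
                break
        else:
            if n >= len(changes):
                return changes                           # minimal at this granularity
            n = min(len(changes), n * 2)                 # refine granularity
    return changes

# Toy usage: the test fails iff both change 3 and change 7 are applied.
print(ddmin(list(range(10)), lambda subset: {3, 7} <= set(subset)))  # [3, 7]
```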
- Darwinian Data Structure Selection (FSE 2018)
- Abstract:
Data structure selection and tuning is laborious but can vastly improve an application’s performance and memory footprint. Some data structures share a common interface and enjoy multiple implementations. We call them Darwinian Data Structures (DDS), since we can subject their implementations to survival of the fittest. We introduce artemis a multi-objective, cloud-based search-based optimisation framework that automatically finds optimal, tuned DDS modulo a test suite, then changes an application to use that DDS. artemis achieves substantial performance improvements for every project in 5 Java projects from DaCapo benchmark, 8 popular projects and 30 uniformly sampled projects from GitHub. For execution time, CPU usage, and memory consumption, artemis finds at least one solution that improves all measures for 86% (37/43) of the projects. The median improvement across the best solutions is 4.8%, 10.1%, 5.1% for runtime, memory and CPU usage. These aggregate results understate artemis’s potential impact. Some of the benchmarks it improves are libraries or utility functions. Two examples are gson, a ubiquitous Java serialization framework, and xalan, Apache’s XML transformation tool. artemis improves gson by 16.5%, 1% and 2.2% for memory, runtime, and CPU; artemis improves xalan’s memory consumption by 23.5%. Every client of these projects will benefit from these performance improvements.
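As a toy illustration of the optimisation target (and not of artemis itself, which searches and tunes Java implementations automatically against a test suite), the sketch below times the same front-insertion workload under two interchangeable Python data structures; a selector would keep whichever is faster.

```python
import timeit
from collections import deque

# Toy stand-in for data structure selection: time one workload under two
# interchangeable implementations and keep the faster one. artemis performs
# this search and tuning automatically for Java projects.
def workload(factory):
    xs = factory()
    for i in range(5_000):
        xs.insert(0, i)  # front insertion: O(n) per call for list, cheap for deque
    return xs

for name, factory in [("list", list), ("deque", deque)]:
    seconds = timeit.timeit(lambda f=factory: workload(f), number=5)
    print(f"{name:5s}: {seconds:.3f} s")
```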
- Do Android Taint Analysis Tools Keep their Promises? (FSE 2018)
- Abstract:
In recent years, researchers have developed a number of tools to conduct taint analysis of Android applications. While all the respective papers aim at providing a thorough empirical evaluation, comparability is hindered by varying or unclear evaluation targets. Sometimes, the apps used for evaluation are not precisely described. In other cases, authors use an established benchmark but cover it only partially. In yet other cases, the evaluations differ in terms of the data leaks searched for, or lack a ground truth to compare against. All those limitations make it impossible to truly compare the tools based on those published evaluations. We thus present ReproDroid, a framework allowing the accurate comparison of Android taint analysis tools. ReproDroid supports researchers in inferring the ground truth for data leaks in apps, in automatically applying tools to benchmarks, and in evaluating the obtained results. We use ReproDroid to comparatively evaluate on equal grounds the six prominent taint analysis tools Amandroid, DIALDroid, DidFail, DroidSafe, FlowDroid and IccTA. The results are largely positive although four tools violate some promises concerning features and accuracy. Finally, we contribute to the area of unbiased benchmarking with a new and improved version of the open test suite DroidBench.
- Testing Probabilistic Programming Systems (FSE 2018)
- Abstract:
Probabilistic programming systems (PP systems) allow developers to model stochastic phenomena and perform efficient inference on the models. The number and adoption of probabilistic programming systems is growing significantly. However, there is no prior study of bugs in these systems and no methodology for systematically testing PP systems. Yet, testing PP systems is highly non-trivial, especially when they perform approximate inference. In this paper, we characterize 118 previously reported bugs in three open-source PP systems—Edward, Pyro and Stan—and propose ProbFuzz, an extensible system for testing PP systems. ProbFuzz allows a developer to specify templates of probabilistic models, from which it generates concrete probabilistic programs and data for testing. ProbFuzz uses language-specific translators to generate these concrete programs, which use the APIs of each PP system. ProbFuzz finds potential bugs by checking the output from running the generated programs against several oracles, including an accuracy checker. Using ProbFuzz, we found 67 previously unknown bugs in recent versions of these PP systems. Developers already accepted 51 bug fixes that we submitted to the three PP systems, and their underlying systems, PyTorch and TensorFlow.
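A toy, hypothetical version of the template idea: fill a Stan-like model skeleton with a randomly chosen prior, producing a concrete program that a testing harness would run and check against an oracle. ProbFuzz's real templates, per-system translators, and oracles are considerably richer than this.

```python
import random

# Hypothetical toy template in the spirit of ProbFuzz: the {prior} hole is
# filled with a randomly chosen distribution, yielding a concrete Stan-like
# program to hand to the system under test.
TEMPLATE = """\
data {{
  int<lower=0> N;
  vector[N] y;
}}
parameters {{
  real theta;
}}
model {{
  theta ~ {prior};
  y ~ normal(theta, 1);
}}
"""

def random_prior():
    a, b = sorted(round(random.uniform(0.1, 10.0), 2) for _ in range(2))
    return random.choice([f"normal({a}, {b})", f"uniform({a}, {b})", f"exponential({a})"])

print(TEMPLATE.format(prior=random_prior()))
```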
- On automatically proving the correctness of math.h implementations (OOPSLA 2018)
- Abstract:
Industry standard implementations of math.h claim (often without formal proof) tight bounds on floating-point errors. We demonstrate a novel static analysis that proves these bounds and verifies the correctness of these implementations. Our key insight is a reduction of this verification task to a set of mathematical optimization problems that can be solved by off-the-shelf computer algebra systems. We use this analysis to prove the correctness of implementations in Intel's math library automatically. Prior to this work, these implementations could only be verified with significant manual effort.
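To illustrate what such an error bound means, the sketch below takes a toy polynomial approximation of sine (standing in for a library implementation) and samples its error in ulps against Python's `math.sin` as the reference. The paper's point is to prove such bounds statically for real math.h implementations rather than estimate them by sampling.

```python
import math

# Toy stand-in for a library implementation: a degree-7 Taylor polynomial for
# sine, accurate only near zero. The paper proves error bounds statically;
# this sketch merely samples the error in ulps for illustration.
def sin_poly(x):
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040

worst_ulps = 0.0
steps = 100_000
for i in range(steps + 1):
    x = -math.pi / 4 + i * (math.pi / 2) / steps
    reference = math.sin(x)
    error = abs(sin_poly(x) - reference)
    if reference != 0.0:
        worst_ulps = max(worst_ulps, error / math.ulp(abs(reference)))
print(f"max sampled error on [-pi/4, pi/4]: {worst_ulps:.2f} ulps")
```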
- Towards a theory of software development expertise (FSE 2018)
- Abstract:
Software development includes diverse tasks such as implementing new features, analyzing requirements, and fixing bugs. Being an expert in those tasks requires a certain set of skills, knowledge, and experience. Several studies investigated individual aspects of software development expertise, but what is missing is a comprehensive theory. We present a first conceptual theory of software development expertise that is grounded in data from a mixed-methods survey with 335 software developers and in literature on expertise and expert performance. Our theory currently focuses on programming, but already provides valuable insights for researchers, developers, and employers. The theory describes important properties of software development expertise and which factors foster or hinder its formation, including how developers' performance may decline over time. Moreover, our quantitative results show that developers' expertise self-assessments are context-dependent and that experience is not necessarily related to expertise.
- What makes a code change easier to review: an empirical investigation on code change reviewability (FSE 2018)
- Abstract:
Peer code review is a practice widely adopted in software projects to improve the quality of code. In current code review practices, code changes are manually inspected by developers other than the author before these changes are integrated into a project or put into production. We conducted a study to obtain an empirical understanding of what makes a code change easier to review. To this end, we surveyed published academic literature and sources from gray literature (blogs and white papers), we interviewed ten professional developers, and we designed and deployed a reviewability evaluation tool that professional developers used to rate the reviewability of 98 changes. We find that reviewability is defined through several factors, such as the change description, size, and coherent commit history. We provide recommendations for practitioners and researchers.
- How should compilers explain problems to developers? (FSE 2018)
- Abstract:
Compilers primarily give feedback about problems to developers through the use of error messages. Unfortunately, developers routinely find these messages to be confusing and unhelpful. In this paper, we postulate that because error messages present poor explanations, theories of explanation---such as Toulmin's model of argument---can be applied to improve their quality. To understand how compilers should present explanations to developers, we conducted a comparative evaluation with 68 professional software developers and an empirical study of compiler error messages found in Stack Overflow questions across seven different programming languages. Our findings suggest that, given a pair of error messages, developers significantly prefer the error message that employs proper argument structure over a deficient argument structure when neither offers a resolution---but will accept a deficient argument structure if it provides a resolution to the problem. Human-authored explanations on Stack Overflow converge to one of the three argument structures: those that provide a resolution to the error, simple arguments, and extended arguments that provide additional evidence for the problem. Finally, we contribute three practical design principles to inform the design and evaluation of compiler error messages.
- Efficient static checking of library updates (FSE 2018)
- Abstract:
Software engineering practices have evolved to the point where a developer writing a new application today doesn’t start from scratch, but reuses a number of open source libraries and components. These third-party libraries evolve independently of the applications in which they are used, and may not maintain stable interfaces as bugs and vulnerabilities in them are fixed. This in turn causes API incompatibilities in downstream applications which must be manually resolved. Oversight here may manifest in many ways, from test failures to crashes at runtime. To address this problem, we present a static analysis for automatically and efficiently checking if a library upgrade introduces an API incompatibility. Our analysis does not rely on reported version information from library developers, and instead computes the actual differences between methods in libraries across different versions. The analysis is scalable, enabling real-time diff queries involving arbitrary pairs of library versions. It supports a vulnerability remediation product which suggests library upgrades automatically and is lightweight enough to be part of a continuous integration/delivery (CI/CD) pipeline. To evaluate the effectiveness of our approach, we determine semantic versioning adherence of a corpus of open source libraries taken from Maven Central, PyPI, and RubyGems. We find that on average, 26% of library versions are in violation of semantic versioning. We also analyze a collection of popular open source projects from GitHub to determine if we can automatically update libraries in them without causing API incompatibilities. Our results indicate that we can suggest upgrades automatically for 10% of the libraries.
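A small Python sketch of the underlying idea, assuming one records the public API of each installed version of a library and diffs the records: removed functions or changed signatures flag a potentially breaking upgrade. The paper's analysis works at a different level (method-level diffs across Maven, PyPI, and RubyGems libraries), so this is only an analogy.

```python
import inspect

# Sketch: record a module's public functions (once per installed version) and
# diff two records to flag potentially API-breaking changes.
def public_api(module):
    api = {}
    for name, obj in vars(module).items():
        if name.startswith("_") or not callable(obj):
            continue
        try:
            api[name] = str(inspect.signature(obj))
        except (TypeError, ValueError):
            api[name] = "<signature unavailable>"
    return api

def breaking_changes(old_api, new_api):
    removed = sorted(set(old_api) - set(new_api))
    changed = sorted(n for n in set(old_api) & set(new_api) if old_api[n] != new_api[n])
    return removed, changed

# Usage idea: run public_api() with version A installed, store the result,
# repeat with version B, then call breaking_changes(api_a, api_b).
```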
Also see the suggestions from last quarter.