Symbolic execution is a popular program analysis technique capable of exploring multiple execution paths in parallel, allowing for automatically exercising unexpected program behaviors and possibly revealing bugs.
Over the last two decades, symbolic execution techniques have been extensively used in both academic and industrial projects. For instance, symbolic execution tools have been used by Microsoft since 2008 in the testing process of its applications.
One fundamental issue in symbolic execution is the path explosion problem. Real-world applications may contain a very large number of program behaviors, leaving symbolic execution with up to an exponential number of possible execution paths to explore. One of the sources behind path explosion lies in symbolic memory pointers: a memory access to a symbolic address may potentially reference the entire address space, leading to the generation of a huge number of execution states. Since this can be quite challenging to do in practice, most symbolic execution frameworks are forced to trade soundness for performance. Unfortunately, this approach makes reasoning over symbolic pointers inaccurate, possibly leading symbolic execution to miss several program paths.
Lately, a new symbolic execution memory model has been proposed: MemSight. It allows for accurate pointer reasoning while keeping performance at a reasonable level. While it is an improvement in the accurate modelling of symbolic memory, it still does not scale in presence of memory intensive operations such as 'memset' and 'memcpy' of the standard C library.
The main goal of this project is to devise a new symbolic memory model able to handle memory intensive operation (e.g., 'memset' and 'memcpy') in an efficient and scalable way. Due to the large use of such operations in real-world software, handling them efficiently and already in the memory model design is crucial in order to increase the depth and the coverage of symbolic execution.
Static binary analysis, and in particular symbolic execution, is growing in popularity in the context of software testing, application security and vulnerability detection. The upsurge in the number of cyber-attacks and the increasingly complex applications they target call for automatic tools to help developers discover software vulnerabilities in the earliest stages of the development process.
Although memory-related errors are one of the main sources of software vulnerabilities in binary programs, modern program analysis techniques may still fail to pinpoint these errors in real-world complex applications. Furthermore, even when bugs are discovered and fixed by developers, the lack of automatic tools able to evaluate the reliability of the proposed software patches is a major concern.
From a theoretical perspective, exhaustive symbolic execution provides a sound and complete methodology for any decidable analysis; thus, it could be used to discover bugs and to reason about reliability of software patches. In spite of the flexibility of the technique, which stems from maintaining a detailed view of the machine state across multi-path executions and using satisfiability modulo theory (SMT) solvers to reason over symbolic inputs, symbolic execution is fraught with many pitfalls that limit its practical applicability. Most prominently, it can be severely affected by the combinatorial explosion of the execution space deriving, e.g., from symbolic memory operations. Indeed, even a single memory operation on a symbolic address can span the entire address space, possibly leading to the generation of a huge number of execution states.
To make symbolic execution scale, state-of-the-art symbolic execution engines trade soundness for performance by reducing the possible targets of symbolic addresses. Unfortunately, in this way many program behaviors will be lost, reducing the possibility of discovering non-trivial memory related vulnerabilities. Lately, MemSight [9] has been proposed as one of the first attempts for modelling symbolic pointers in a scalable way. Its main idea is to model the memory by maintaining a mapping between symbolic expressions and values. Unfortunately, it does not handle memory-intensive operations in a scalable way.
Motivated by the lack of automatic tools that are able to reason about memory operations, the main objective of this research project will be to develop a memory model capable of reasoning on symbolic pointers in a sound and scalable way even in presence of memory-intensive operations such as 'memset' and 'memcpy'. We plan to integrate this model, as a plugin, in a well-known symbolic execution engine developed by Santa Barbara University: angr [6]. We think that extending angr with such plugin may be a relevant contribution for the research community and practitioners alike. Indeed, since fine-grained and precise modeling of the memory without resorting to massive concretization remains an elusive goal in extant symbolic execution engines, a tool with such capabilities will be able to identify bugs in real-world software that cannot be detected by other state-of-the-art techniques. The development of new methodologies for the automatic generation of exploits is crucial in order to detect non-trivial software vulnerabilities. Indeed, hidden vulnerabilities may remain silent for months or even years, striking then back as harmful zero-day exploits that can severely affect both users and developers.