Parallel construction of interprocedural memory SSA form

Yulei Sui$^{a,*}$, Hua Yan$^{b}$, Zheng Zheng$^{c}$, Yunpeng Zhang$^{d}$, Jingling Xue$^{b}$

$^{a}$ University of Technology Sydney, Australia
$^{b}$ The University of New South Wales, Australia
$^{c}$ Beihang University, China
$^{d}$ University of Houston, TX, United States

ARTICLE INFO

Article history:
Received 22 March 2018
Revised 3 August 2018
Accepted 12 September 2018
Available online 13 September 2018

ABSTRACT

Interprocedural memory SSA form, which provides a sparse data-flow representation for indirect memory operations, paves the way for many advanced program analyses. Any performance improvement for memory SSA construction benefits a wide range of clients (e.g., bug detection and compiler optimisations). However, its construction is much more expensive than that for scalar-based SSA form. The memory objects distinguished at a pointer dereference significantly increase the number of variables that need to be put in SSA form, resulting in considerable analysis overhead when analysing large programs (e.g., millions of lines of code).

This paper presents ParSSA, a fully parameterised approach for parallel construction of interprocedural memory SSA form by utilising multi-core computing resources. ParSSA partitions whole-program memory objects into uniquely identified memory regions. The indirect memory accesses in a function are fully parameterised using partitioned memory regions, so that the memory SSA construction of a parameterised function is readily parallelised. We implemented ParSSA in LLVM using Intel Threading Building Blocks (TBB) for creating parallel tasks. We evaluated ParSSA using 15 large applications. ParSSA achieves up to $6.9 \times$ speedup against the sequential version on an 8-core machine.

© 2018 Elsevier Inc. All rights reserved.

1. Introduction

Static Single Assignment (SSA) form (Rosen et al., 1988) is the mainstream intermediate representation used to perform analyses and optimizations of scalars in modern compilers (e.g., LLVM (Lattner and Adve, 2014), GCC (Novillo and Canada, 2007), and Java Hotspot (Kotzmann et al., 2008)). It provides a sparse data-flow representation in which every variable is defined exactly once. To enable the sparsity of both scalars and indirect memory operations, various memory SSA forms (e.g., factored SSA (Choi et al., 1994), HSSA (Chow et al., 1996), and Tree SSA (Novillo and Canada, 2007)) have been proposed to support aggressive compiler optimizations. To reduce compile-time overhead, the majority of their construction algorithms are intraprocedural, i.e., a pair of pointer dereferences in a function $f$ are conservatively treated as aliases if both may access memory objects defined outside $f$.

Compared to light-weight intraprocedural memory SSA form, its interprocedural counterpart provides fine-grained memory dependence by considering global alias information across functions. Due to improved precision, the resulting SSA representation is useful for many client applications, such as flow-sensitive pointer analysis (Hardekopf and Lin, 2011; Sui and Xue, 2016a), static memory error detection (Livshits and Lam, 2003; Sui et al., 2012), change impact analysis (Guo et al., 2016; Cai et al., 2016) and identifying redundant instrumentations to accelerate dynamic analysis (Ye et al., 2014).

Constructing interprocedural memory SSA form is expensive. Because of the undecidability of aliases (Landi, 1992), a memory operation (load or store) may access many different memory objects at a pointer dereference due to over-approximation. Unlike intraprocedural memory SSA forms in Open64 (Chow et al., 1996) and GCC (Novillo and Canada, 2007), which use a single virtual symbol (Chow et al., 1996) to represent all memory objects defined outside a function, an interprocedural SSA form distinguishes every object at a memory access with a unique name for SSA renaming, resulting in precise dependences between two memory operations. However, distinguishing objects accessed at pointer dereferences significantly increases the number of variables that need to be put in SSA form. A memory SSA construction algorithm, which involves a non-trivial data-flow analysis (Novillo and Canada, 2007), takes substantial time when analysing large programs.

Nowadays, multi-core platforms are ubiquitous. It becomes imperative to exploit parallelism to accelerate memory SSA construction algorithms. However, such algorithms are often not directly amenable to parallelisation. For example, the algorithm in Hardekopf and Lin (2011) works on the interprocedural control-flow graph (ICFG) (Landi and Ryder, 1992) of a program. The approach is designed to be entirely sequential, treating the whole program as a single graph, which hinders its parallelisation.

In this paper, we present ParSSA, a simple yet effective parallel approach to accelerating the construction of interprocedural memory SSA. ParSSA enables pre-analysis to partition the whole-program abstract memory objects into memory regions. Every region is uniquely identified in a program to represent a set of objects that are accessed equivalently using the results from an existing pointer analysis. The interprocedural memory dependences of a function are then fully parameterised using memory regions, so that the memory SSA construction of individual functions can be readily parallelised. The key contributions of this paper are:

- We propose ParSSA, the first parallel approach to constructing interprocedural memory SSA form for large-scale programs with millions of lines of code.
- We present a new approach to constructing fully parameterised interprocedural memory SSA form using memory region partitioning. The source code is available at https://github.com/SVF-tools/SVF/tree/master/lib/MSSA.
- We have evaluated ParSSA using a set of 15 large applications. ParSSA achieves up to 6.9x speedup against the sequential version on an 8-core machine.

The rest of this paper is organised as follows. Section 2 presents our ParSSA approach, including an overview of the ParSSA framework (Section 2.1), examples of intraprocedural and interprocedural memory SSA forms (Section 2.2.1), the side-effect analysis (Section 2.2.2), memory region generation (Section 2.2.3) and parallel construction (Section 2.2.4). Section 3 evaluates ParSSA, covering the implementation (Section 3.1), the methodology (Section 3.2), and the results and analysis (Section 3.3). Section 4 describes the related work. Finally, Section 5 concludes the paper and discusses some future work.

2. ParSSA approach

The key idea of our ParSSA approach is to parameterise every function of a program through partitioned memory regions, so that the indirect memory accesses in a function are fully parameterised through the side-effect analysis using these memory regions. Therefore, memory dependences across functions are decoupled, making memory SSA form construction readily parallelised.

2.1. Overview of ParSSA

The workflow of ParSSA is depicted in Fig. 1. The source code of a program is first compiled by the clang compiler front-end (FE) into bit-code files, which are merged by LLVM Gold Plugin (llv) at the link-time stage to produce a whole-program bit-code file. Then the “Pointer Analysis” module is invoked. Based on the points-to information obtained, we first perform a lightweight side-effect analysis to capture the interprocedural reference and modification of each abstract memory object. Thus, the (alias) set of indirect defs (uses) at a statement \( \ell \) (i.e., a store, load or callsite) in each function is obtained and denoted as \( D_{\ell} \) (\( U_{\ell} \)).

The “Mem Region Partitioning” module partitions all the abstract memory objects of a program into a set of disjoint regions \( R_1, \ldots, R_n \). Then every statement \( \ell \) is annotated with each \( R_i \) such that \( D_{\ell} \cap R_i \neq \emptyset \) (\( U_{\ell} \cap R_i \neq \emptyset \)), to make explicit the memory objects that may be defined (used) indirectly at \( \ell \). Once indirect uses and defs are identified, the interprocedural dependences are fully parameterised for every function by using uniquely named regions, so that we can achieve function-level parallelism to produce a whole-program memory SSA form that has the same precision as the one built by a sequential algorithm. Our algorithm for constructing fully parameterised SSA form has been implemented in the open-source tool SVF (Sui and Xue, 2016b) (https://github.com/SVF-tools/SVF) based on the LLVM compiler.

ParSSA uses Intel Threading Building Blocks (TBB) to fork multiple threads for building SSA form for each parameterised function. The concurrent queue data structure is used to store all the parameterised functions of a program after memory region partitioning. ParSSA performs parallel construction of memory SSA for every program function by allocating parallel tasks (“MSSA constructor”) using task groups in TBB, so that every allocated task constructs memory SSA modularly by choosing the next available parameterised function from the concurrent queue. Finally, the whole-program memory SSA form is available when all the parallel tasks finish.

2.2. Parameterised memory SSA form

This section details our ParSSA approach. Section 2.2.1 describes the background knowledge and examples of memory SSA forms. Section 2.2.2 introduces whole-program side-effect analysis to discover interprocedural dependences across functions using results from a pointer analysis. Based on the side-effect analysis, Section 2.2.3 discusses memory region generation to parameterise program functions to enable parallel memory SSA construction.

2.2.1. Intraprocedural and interprocedural Memory SSA Form Examples

Without loss of generality, we follow the LLVM convention (Hardekopf and Lin, 2011; Sui and Xue, 2016a; Lhoták and Chung, 2011) of separating all variables in a program into two disjoint sets: \( A \), containing all possible targets of a pointer (i.e., address-taken variables), and \( T \), containing all top-level variables.

A program is represented by five types of statements: \( p = \&a \) (ADDR), \( p = q \) (COPY), \( p = *q \) (LOAD), \( *p = q \) (STORE), and \( p = \phi(q, r) \) (PHI).

Fig. 2. Intraprocedural memory SSA with $\mu/\chi$ at stores/loads: (a) annotated $\mu/\chi$ using pre-computed points-to information; (b) after SSA conversion.

Fig. 3. Interprocedural SSA with $\mu/\chi$ at callsite and function entry/exit: (a) annotated $\mu/\chi$; (b) after SSA conversion.

2.2.2. Whole-program side-effect analysis

This section introduces our side-effect analysis to discover the interprocedural program dependences of a function using the results of Andersen’s pointer analysis (Andersen, 1994). Given a function $f$, the side-effect analysis determines the set $U$ ($D$) of nonlocal memory objects (Definition 1) in $f$ that may be indirectly read (modified) when $f$ is executed, denoted as $f: U, D$. The side-effect of each statement $\ell \in L_f$ in a function $f$, denoted as $\ell: U, D$, is analysed individually.

Fig. 4 gives the rules of our side-effect analysis. The root causes of interprocedural side-effects are loads and stores. For a load $p = *q$, the points-to set $pt(q)$ of $q$ may contain nonlocal objects read in $f$ ([LOAD]). Similarly, for a store $*p = q$, [STORE] collects the nonlocal objects in $pt(p)$ that may be modified. In contrast, address and copy statements do not contribute any side-effect according to [ALLOC] and [COPY]. Rule [PROC] simply collects the side-effect of a function $f$ by accumulating the computed side-effects of its statements.

For a callsite $\ell: \_ = f(\_)$ with its callee function $f$, the most conservative side-effect analysis is to assume that all variables passed into this callsite may be read or modified by the callees invoked directly or indirectly. This naive approach is imprecise because it creates a large number of unrealisable def-use chains across functions. Therefore, we only collect the objects $E_{\ell \rightarrow f}$ (Definition 2) that escape from callsite $\ell$ to its callee $f$, computed according to lines 6-10 in Algorithm 1. In the presence of recursion, [CALL] and [PROC] are applied repeatedly until a fixed point is reached.
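As a rough illustration of how the [LOAD], [STORE], [CALL] and [PROC] rules interact, the following Python sketch computes per-function side-effects by fixed-point iteration. The tuple encoding of statements, the name `side_effects`, and the `escaped` map (standing in for the callsite escaped objects of Definition 2) are assumptions made for this example and are not ParSSA's actual data structures.

```python
def side_effects(funcs, pt, nonlocal_objs, escaped):
    """Sketch of the side-effect rules (hypothetical encoding).

    funcs:         {function: [stmt]}, where stmt is ("load", p, q) for p = *q,
                   ("store", p, q) for *p = q, or ("call", site, callee);
                   any other statement kind contributes nothing.
    pt:            {pointer: set of objects} from a pre-computed pointer analysis.
    nonlocal_objs: {function: set of objects that are nonlocal in it}.
    escaped:       {(site, callee): set of objects escaping at that callsite}.
    """
    U = {f: set() for f in funcs}  # objects possibly read by f
    D = {f: set() for f in funcs}  # objects possibly modified by f
    changed = True
    while changed:  # iterate until a fixed point, which also handles recursion
        changed = False
        for f, stmts in funcs.items():
            uses, defs = set(), set()
            for s in stmts:
                if s[0] == "load":        # [LOAD]: p = *q reads pt(q)
                    uses |= pt.get(s[2], set())
                elif s[0] == "store":     # [STORE]: *p = q modifies pt(p)
                    defs |= pt.get(s[1], set())
                elif s[0] == "call":      # [CALL]: callee effects on escaped objects
                    esc = escaped.get((s[1], s[2]), set())
                    uses |= U[s[2]] & esc
                    defs |= D[s[2]] & esc
            # [PROC]: keep only the objects that are nonlocal in f
            uses &= nonlocal_objs[f]
            defs &= nonlocal_objs[f]
            if uses != U[f] or defs != D[f]:
                U[f], D[f] = uses, defs
                changed = True
    return U, D
```

Because the sets only grow from one round to the next, the loop terminates once no function's summary changes.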

Definition 1 (Nonlocal Objects). Consider a memory object $o \in A$ that is not a global object but accessed in a function $f$. We say that $o$ represents a local object if (1) $o$ is locally declared in $f$ and (2) $f$ does not appear in any recursion cycle, and a nonlocal object otherwise. We write $\text{Local}_f (\text{NonLocal}_f)$ to represent the set of all local (nonlocal) objects accessed in $f$.

Definition 2 (Callsite Escaped Objects). For a callsite $\ell: \_ = f(\_)$ with its callee function $f$, the set of escaped objects $E_{\ell \rightarrow f}$ represents all nonlocal objects passed into callsite $\ell$ that may be used or modified inside callee function $f$. $E_{\ell \rightarrow f}$ is pre-computed using Andersen’s points-to results according to Algorithm 1.

Theorem 1 (Soundness). Our side-effect analysis is sound. Proof sketch: (1) the side-effect $\ell: U, D$ of a statement ([LOAD] and [CALL]) is over-approximated due to the underlying sound pointer analysis, and (2) $f: U, D$ records all the nonlocal locations in $\text{NonLocal}_f$ read and modified by $f$ ([PROC]).

2.2.3. Memory region generation

After side-effect analysis, we generate a set of memory regions, denoted $\gamma$. Every memory region represents a set of memory objects. Any two memory regions $R, R' \in \gamma$ are disjoint, i.e., $R \cap R' = \emptyset$. Algorithm 2 describes the region generation for a function $f$. Initially, memory regions are collected from pointer dereferences based on points-to information (lines 2-3) and from callsites based on the side-effect analysis (lines 4-5). Then the regions are gradually refined until all regions are disjoint (lines 7-10) using a standard worklist algorithm.

Note that memory region generation is not limited to the results of a particular pointer analysis. More precise points-to information can help generate regions that have more precise dependence relations for memory SSA construction.
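Once disjoint regions are available, annotating a statement reduces to an overlap test between each region and the statement's use and def sets. The sketch below illustrates this step in Python; the function name `annotate` and the triple encoding `(label, U, D)` are hypothetical choices for the example.

```python
def annotate(stmts, regions):
    """Attach mu/chi annotations to statements (illustrative sketch).

    stmts:   list of (label, U, D), where U and D are the sets of objects
             that may be indirectly read and modified at the statement.
    regions: list of pairwise disjoint frozensets of objects.
    Returns a list of (label, mus, chis): the regions needing mu(R) before
    the statement and R = chi(R) after it.
    """
    annotated = []
    for label, uses, defs in stmts:
        mus = [r for r in regions if r & uses]    # region overlaps the use set
        chis = [r for r in regions if r & defs]   # region overlaps the def set
        annotated.append((label, mus, chis))
    return annotated
```

A statement whose use and def sets are both empty receives no annotation, so direct (top-level) operations are left untouched.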

2.2.4. Parallel construction

Algorithm 3 describes the sequential version of constructing the memory SSA form for a parameterised function. There are three steps: annotating statements with $\mu$ and $\chi$ functions, inserting $\phi$-functions, and SSA renaming.

Fig. 4. Rules to determine side-effect of a function including loads/stores and callsites.

**Algorithm 1.** Computing escaped objects

```
begin
  W ← Ø;  O ← Ø;
  foreach parameter p at callsite ℓ : _ = f(p, ...) do
    W ← W ∪ pt(p);
  foreach global pointer g do
    W ← W ∪ pt(g);
  while W ≠ Ø do
    o ← W.pop();
    if o ∉ O then
      O ← O ∪ {o};  W ← W ∪ pt(o);
  Ret ← {o | o ∈ O ∧ o is a nonlocal object};
  return Ret;
end
```
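The reachability computation of Algorithm 1 can be sketched in Python as follows. The function name `escaped_objects`, the dictionary encoding of points-to sets, and the `is_nonlocal` predicate are assumptions for illustration only.

```python
def escaped_objects(args, global_ptrs, pt, is_nonlocal):
    """Objects reachable from callsite arguments and global pointers (sketch).

    args:        pointers passed as arguments at the callsite.
    global_ptrs: global pointers of the program.
    pt:          {pointer or object: set of pointed-to objects}.
    is_nonlocal: predicate deciding whether an object is nonlocal.
    """
    worklist = []
    for p in list(args) + list(global_ptrs):
        worklist.extend(pt.get(p, ()))
    reached = set()
    while worklist:
        o = worklist.pop()
        if o not in reached:                 # visit each object once
            reached.add(o)
            worklist.extend(pt.get(o, ()))   # follow pointers stored in o
    return {o for o in reached if is_nonlocal(o)}
```

Each object is expanded at most once, so the loop terminates even when the points-to graph contains cycles.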

**Algorithm 2.** Memory region generation

```
begin
  W ← Ø;
  foreach pointer dereference *p in function f do
    R ← pt(p);  W ← W ∪ {R};
  foreach callsite ℓ : _ = g(...) : U, D in f do
    R ← U;  R' ← D;  W ← W ∪ {R} ∪ {R'};
  changed ← true;
  while changed do
    if ∃ R₁, R₂ ∈ W : R₁ ≠ R₂ ∧ R₁ ∩ R₂ ≠ Ø then
      R₁' ← R₁ ∩ R₂;  R₂' ← R₁ \ R₂;  R₃ ← R₂ \ R₁;
      W ← (W \ {R₁, R₂}) ∪ ({R₁', R₂', R₃} \ {Ø});
    else changed ← false;
  return W;
end
```
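The refinement step that makes regions disjoint can be illustrated with a small Python sketch. It assumes that two overlapping regions are replaced by their intersection and their two set differences; the name `make_disjoint` is hypothetical.

```python
from itertools import combinations

def make_disjoint(initial_sets):
    """Split overlapping object sets until all regions are pairwise disjoint."""
    regions = {frozenset(s) for s in initial_sets if s}
    while True:
        # Look for any pair of distinct regions that share an object.
        overlap = next(
            ((a, b) for a, b in combinations(regions, 2) if a & b), None
        )
        if overlap is None:
            return regions
        a, b = overlap
        regions -= {a, b}
        # Replace the pair by the common part and the two remainders.
        regions |= {part for part in (a & b, a - b, b - a) if part}
```

For instance, the sets `{a, b}` and `{b}` are refined into the two regions `{a}` and `{b}`, so that a dereference touching only `b` no longer depends on definitions of `a`.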

3. Evaluation

The objective of our evaluation is to show that our parallel memory SSA construction algorithm is significantly faster than the sequential one in analysing large-scale real-world applications with millions of lines of code.

3.1. Implementation

We have fully implemented ParSSA in LLVM-4.0.0. The source files of each benchmark are compiled into bit-code files using clang and then merged using LLVM Gold Plugin (llv) at the link-time stage to produce a whole-program bit-code file. The compiler flag mem2reg is applied to promote memory into registers.

We use flow-insensitive and field-sensitive Andersen’s analysis (Sui and Xue, 2016b) as the pre-analysis to generate memory regions. The call graph of a program is constructed on-the-fly during points-to resolution. Our handling of field-sensitivity is ANSI-compliant (ISO90, 1990). The fields of a struct object are distinguished by their unique indices. ParSSA adopts a field-index-based approach to field-sensitivity similar to Pearce et al. (2007).

Algorithm 3. Memory SSA construction for a parameterised function with $\mu/\chi$ annotations, based on Cooper and Torczon (2011)

Function $\text{ConstructMemSSA}(f)$
begin
  $\text{CreateMuChi}(f)$
  $\text{InsertPhi}(f)$
  $\text{SSARename}(f)$
end

Function $\text{CreateMuChi}(f)$
begin
  foreach $bb \in f$ do
    foreach $(\ell : U, D) \in bb$ do
      foreach $R \in \gamma \wedge (R \cap U \neq \emptyset)$ do
        add $\mu(R)$ before $\ell$
      end
      foreach $R \in \gamma \wedge (R \cap D \neq \emptyset)$ do
        add $R = \chi(R)$ after $\ell$
      end
    end
  end
end

Function $\text{InsertPhi}(f)$
begin
  let $DF(bb)$ be the dominance frontier of $bb$.
  To reduce the number of inserted $\phi$-functions, we compute $\text{GlobalNames}$, the set of regions that are live across multiple basic blocks; $\text{Blocks}(R)$ denotes the set of basic blocks containing a definition of $R$. A $\phi$-function without users is pruned according to Section 9.3.3 in Cooper and Torczon (2011).
  foreach $R \in \text{GlobalNames}$ do
    $W \leftarrow \text{Blocks}(R)$
    while $W \neq \emptyset$ do
      $bb \leftarrow W.pop()$
      foreach $bb' \in DF(bb)$ do
        if $bb'$ has no $\phi$-function for $R$ then
          insert $\phi$-function for $R$ in $bb'$;
          $W \leftarrow W \cup \{bb'\}$
        end
      end
    end
  end
end

Function $\text{SSARename}(f)$
begin
  let $bb_{\text{entry}}$ be the basic block at the entry of function $f$
  Initialise the stacks (push version 0 of each memory region onto its stack)
  $\text{RenameBB}(bb_{\text{entry}})$;
end

Function $\text{RenameBB}(bb)$
begin
  foreach $(R = \phi(\_)) \in bb$ do
    rewrite $R$ as $\text{NewSSANAME}(R)$;
  end
  foreach $(\mu(R)) \in bb$ or $(R' = \chi(R)) \in bb$, following the execution order in $bb$ do
    rewrite $R$ as $\text{top}(\text{stack}[R])$;
    rewrite $R'$ as $\text{NewSSANAME}(R)$;
  end
  foreach successor of $bb$ in the $\text{CFG}$ do
    fill in $\phi$-function parameters;
  end
  foreach successor $s$ of $bb$ in the dominator tree do
    $\text{RenameBB}(s)$
  end
  foreach $(R = \phi(\_))$ and $(R' = \chi(R)) \in bb$ do
    $\text{Pop}(\text{stack}[R])$
  end
end

Function $\text{NewSSANAME}(R)$
begin
  $i \leftarrow \text{counter}[R]$;
  $\text{counter}[R] \leftarrow \text{counter}[R] + 1$;
  push $R_i$ onto $\text{stack}[R]$;
  return $R_i$;
end
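The worklist used for $\phi$-function placement can be sketched in Python for a single region. The sketch assumes that dominance frontiers are already computed and supplied as a dictionary; the name `place_phis` and the block labels in the usage note are hypothetical.

```python
def place_phis(def_blocks, df):
    """Return the blocks that need a phi-function for one memory region.

    def_blocks: blocks containing a definition (a chi) of the region.
    df:         {block: set of blocks in its dominance frontier}.
    """
    has_phi = set()
    worklist = list(def_blocks)
    while worklist:
        bb = worklist.pop()
        for frontier in df.get(bb, ()):
            if frontier not in has_phi:
                has_phi.add(frontier)
                # A phi is itself a definition, so its block is processed too.
                worklist.append(frontier)
    return has_phi
```

For a diamond-shaped control-flow graph in which both branches have the join block in their dominance frontier, a definition in either branch yields exactly one $\phi$-function, placed at the join.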

Algorithm 4. Parallel memory SSA construction

Function $\text{Main}(\text{ThreadCount})$
begin
  $\text{ConcurrentQueue}$ is initialised with all the functions of a program
  $\text{task\_group}\ g(\text{ThreadCount})$;
  foreach $\text{task} \in g$ do
    $\text{task}.\text{run}(\text{ParallelConstruct})$; // parallel execution
  end
  wait for all tasks in $g$ to finish
end

Function $\text{ParallelConstruct}()$
begin
  while $\text{ConcurrentQueue} \neq \emptyset$ do
    $f \leftarrow \text{ConcurrentQueue}.\text{pop}()$
    $\text{ConstructMemSSA}(f)$
  end
end

For a struct allocation $p = \&o$, a field-insensitive object $o$ is created to represent the entire struct object. A field object $o_{fld}$ is derived from $o$ when analysing a field access $q = \&p.fld$, where $fld$ is a constant. Thus, different fields (including index 0) are modelled using distinct (sub)objects. Two pointer dereferences are aliased if one refers to $o$ and the other refers to one of its fields, e.g., $o_{fld}$, since it is a sub-component of $o$. However, dereferences that refer to different fields of $o$ are distinguished and not aliased.

For a pointer arithmetic operation $q = p + i$, if $p$ points to a struct object, we conservatively assume that $q$ may point to any field of this struct object. This is based on the ANSI-compliant assumption that $i$ does not cross the boundary of the object. Pointer arithmetic that accesses an aggregate object beyond its boundary may cause unsoundness. Arrays are treated monolithically, i.e., accessing any element of an array is treated as accessing the entire array object.
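The field-sensitive object model described above can be illustrated with a short Python sketch. It encodes an abstract object as a pair `(base, field)`, where `field` is `None` for the object representing the entire struct; this encoding and the names `may_overlap` and `pointer_arith_targets` are assumptions for the example rather than the representation used in the implementation.

```python
def may_overlap(x, y):
    """Decide whether two abstract objects may denote overlapping memory.

    Each object is (base, field); field None stands for the whole struct.
    """
    (base_x, fld_x), (base_y, fld_y) = x, y
    if base_x != base_y:
        return False                       # different allocations never overlap
    if fld_x is None or fld_y is None:
        return True                        # the whole object overlaps its fields
    return fld_x == fld_y                  # distinct fields are kept apart

def pointer_arith_targets(base, field_count):
    """Targets assumed for q = p + i when p points to a struct `base`:
    conservatively, every field of that struct."""
    return {(base, i) for i in range(field_count)}
```

Under this encoding, two dereferences may alias when some object of one points-to set overlaps some object of the other.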

The parallelisation scheme can be summarised as a thread pool pattern (Pool, 2018). We use Intel’s Threading Building Blocks library (TBB) to allocate multiple threads for the parallel construction of memory SSA as in Algorithm 4. The concurrent_queue data structure is used to store all parameterised functions. We use task_group to allocate parallel tasks, each of which constructs per-function memory SSA for functions taken from the concurrent_queue.
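The thread pool pattern can be mimicked with Python's standard library, as in the sketch below. It is only a structural analogue of the TBB-based implementation: `queue.Queue` plays the role of the concurrent queue, plain threads play the role of the task group, and the names `parallel_construct` and `construct_mem_ssa` are hypothetical. Threads in standard CPython builds do not run CPU-bound work in parallel, so the sketch shows the coordination pattern rather than the speedups reported in this paper.

```python
import queue
import threading

def parallel_construct(functions, construct_mem_ssa, thread_count):
    """Run construct_mem_ssa on every function using a pool of worker threads."""
    work = queue.Queue()
    for f in functions:
        work.put(f)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                f = work.get_nowait()      # take the next available function
            except queue.Empty:
                return                     # no work left for this worker
            ssa = construct_mem_ssa(f)     # per-function work, no shared state
            with lock:
                results[f] = ssa

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                           # results are complete after all joins
    return results
```

Because each function is fully parameterised by its memory regions, the workers only synchronise when taking a function from the queue and when storing a result.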

3.2. Methodology

We evaluate ParSSA using 15 real-world applications including 10 large open-source C programs and 5 large C++ programs as listed in Table 1: make (a build automation tool), a2ps (a PostScript filter), bison (a parser generator), tar (an archiving tool), bash (a Unix shell and command language), sendmail (an email server and client), python (a scripting language), vim (a text editor), emacs (a text editor), gdb (a debugger), dealII (finite element analysis), omnetpp (discrete event simulation), povray (image ray-tracing), soplex (linear programming) and xalan (XML processing).

Fig. 6. Memory SSA construction time under different thread configurations.

There are altogether over 4.7 million lines of code (LOC), with the largest program, gdb, comprising over 1.8 million LOC. Experiments are conducted on a computer with a 3.7 GHz Intel Xeon 8-core CPU and 16 GB of memory, running Ubuntu Linux (kernel version 3.11.0). For each benchmark, we evaluate the performance advantage of our parallel implementation with two, four and eight threads (TBB tasks) enabled over the sequential one for constructing the memory SSA of the same program in Table 1.

A scheduling strategy is implemented to optimise parallel task allocation. In order to avoid idle threads and workload imbalance, the goal of our strategy is to evenly assign workload to each thread, so that a better performance can be achieved. However, optimal scheduling as a classic partition problem (Gent and Walsh, 1998) is NP-complete. We have implemented a greedy algorithm introduced in Korf (2009) to produce results close to optimal in polynomial time (O(n log n)). Fig. 5 shows an example adapted from a2ps, which illustrates how the optimised scheduling can reduce the execution time. In ParSSA, each task (workload) corresponds to a function in the program. The size of each workload for a function f is estimated by the total number of the annotated $\mu$, $\chi$ and $\phi$ functions in f.
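A minimal sketch of such a greedy strategy is given below, assuming the largest-first variant in which each function is assigned to the currently least-loaded thread. The function name `greedy_schedule` and the workload sizes in the usage note are made up for illustration and are not taken from a2ps.

```python
import heapq

def greedy_schedule(workloads, thread_count):
    """Assign workloads to threads, largest first, to the least-loaded thread.

    workloads: {function name: estimated size, e.g. number of annotations}.
    Returns (assignment, loads): the functions and total size per thread.
    """
    heap = [(0, t) for t in range(thread_count)]   # (current load, thread id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(thread_count)]
    loads = [0] * thread_count
    by_size = sorted(workloads.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        load, t = heapq.heappop(heap)              # least-loaded thread so far
        assignment[t].append(name)
        loads[t] = load + size
        heapq.heappush(heap, (loads[t], t))
    return assignment, loads
```

With sizes 8, 7, 6, 5 and 4 on two threads, the sketch produces loads of 17 and 13. The optimal split is 15 and 15, which shows that the greedy result is close to, but not always, optimal. Sorting dominates the cost, giving the O(n log n) bound mentioned above.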

3.3. Results and analysis

Table 2 shows the numbers of annotated $\mu$, $\chi$ and $\phi$ functions for memory regions in each program. For the 15 programs evaluated, 5,219,036 annotations are added in total, with 1,151,238 added in the largest program gdb.

Fig. 6 gives memory SSA construction times under four different configurations (with thread counts being 1, 2, 4 and 8). For each program, we run every configuration five times and report the average time.

The blue lines in Fig. 6 represent the SSA construction time (seconds) using the optimised scheduling strategy (described in Section 3.2). The average speedups gained with two, four and eight threads are 1.79X, 3.15X and 4.71X respectively. The grey lines represent the construction time without the scheduling scheme. The performance results are worse when disabling the scheduling strategy, resulting in the speedups of 1.60X, 3.01X, 4.57X under the three configurations. This is caused by the imbalanced workloads for different parallel tasks, especially for some programs, e.g., emacs, a2ps, vim and gdb whose function sizes vary significantly. The function sizes of most C++ programs (e.g., soplex, dealII, povray and xalan) tend to be more balanced due to the object-oriented design patterns.

The total construction time of all the benchmarks by a sequential algorithm is significantly reduced from 3486.16 to 561.34 seconds using 8 threads under the scheduling strategy. The average speedup for eight threads is 4.7X. The maximum speedup observed is 6.9X (gdb). These results are promising, showing that our approach has the potential to be deployed in optimising compilers.

For the six small-size programs, a2ps, bash, bison, make, tar and soplex, the maximum speedups achieved are under 4X (even with 8 threads). For the medium-size programs, dealII, emacs, omnetpp, povray, python and sendmail, which have a relatively large number of pointers and annotations, greater speedups are observed, ranging from 4X to 5.6X, as shown in Fig. 6(b). For the most complex benchmarks, gdb, vim and xalan, all speedups are above 5.6X with 8 threads. In particular, the analysis time for the largest benchmark gdb has been cut from 2030.51 seconds to 290.53 seconds.
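The reported times can be turned into speedup ratios directly, as the following Python sketch shows. The helper names `speedup` and `efficiency` are introduced here for illustration; parallel efficiency (speedup divided by thread count) is a standard derived metric and is not reported in the evaluation itself.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Ratio of sequential to parallel running time."""
    return sequential_seconds / parallel_seconds

def efficiency(sequential_seconds, parallel_seconds, threads):
    """Speedup per thread; 1.0 would be ideal linear scaling."""
    return speedup(sequential_seconds, parallel_seconds) / threads

# Times quoted in the text, for 8 threads with the scheduling strategy.
total_speedup = speedup(3486.16, 561.34)        # all benchmarks together
gdb_speedup = speedup(2030.51, 290.53)          # largest benchmark
gdb_efficiency = efficiency(2030.51, 290.53, 8)
```

The ratio of the total times is about 6.2, which differs from the 4.7X average because the total is dominated by the largest benchmarks, whereas the average weights every benchmark equally. The gdb ratio is just under 7, in line with the maximum speedup reported.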

Fig. 7 compares further the speedups achieved under three different thread configurations. Compared to sequential execution, ParSSA with the scheduling strategy has achieved noticeable speedups for all the benchmarks evaluated, with the best reaching 6.9X (in gdb). This demonstrates that ParSSA is effective in accelerating memory SSA construction for large programs.

In general, better speedups are obtained when more threads are used. On average, the speedups gained with 4 threads are 1.74X higher than the speedups gained with 2 threads, while the speedups gained with 8 threads are 1.47X higher than the speedups gained with 4 threads. However, it is worth noting that for small applications (e.g., a2ps, make and soplex), using 8 threads does not guarantee better performance than using 4 threads. There are two reasons behind this phenomenon. First, more threads lead to higher synchronisation overheads in accessing the shared data, offsetting the speedups gained from parallelism. Second, for small programs, the overhead of initiating threads is not negligible. In addition, some programs have better speedups than others. The reason is that different programs have different inherent complexities in terms of memory SSA construction, resulting in different synchronisation overheads.

4. Related Work

Static Single Assignment (SSA)

SSA form is the mainstream representation in modern optimising compilers and program analysis tools. Memory SSA advances scalar-based SSA by providing a sparse data-flow representation for both top-level pointers and address-taken variables. Intraprocedural memory SSA forms (Chow et al., 1996; Novillo and Canada, 2007), which conservatively approximate the dependences across functions, are cheaper to compute than their interprocedural counterparts. Recently, the idea of staged analysis (Hardekopf and Lin, 2011; Sui and Xue, 2016a) has provided an effective way of using pre-computed points-to information to bootstrap an interprocedural memory SSA. However, the algorithm is still costly for large programs with millions of lines of code.

Parallel Program Analysis

Méndez-Lojo et al. (2010) introduce a parallel implementation of Andersen’s pointer analysis for C programs based on graph rewriting. Their parallel analysis is context- and flow-insensitive, achieving a speedup of up to 3× on 8 CPU cores. Recently, the whole-program sparse flow-sensitive pointer analysis (Hardekopf and Lin, 2011) is parallelised on multi-core CPUs (Nagaraj and Govindarajan, 2013) and GPUs (Nasre, 2013). The speedups are up to 2.6× on 8 CPU cores. In their report, Singer and Ward describe a parallel scalar SSA form for Java programs by considering top-level pointers only. To the best of our knowledge, this paper proposes the first approach to parallelising interprocedural memory SSA construction that achieves an average speedup of 4.7× (up to 6.9×) on 8 CPU cores.

5. Conclusion

This paper presented ParSSA, the first parallel memory SSA construction approach, which partitions the whole-program memory objects into uniquely identified memory regions to fully parameterise indirect memory accesses in a function. Thus, the memory dependences across functions are decoupled for parallelising memory SSA construction. Our results show that ParSSA can achieve an average speedup of 4.7× on an 8-core machine, making it deployable in optimising compilers and program analysis tools.

There are several interesting directions for future work. One is to extend ParSSA to support finer-grained parallelism than the function level (e.g., basic block and code region level) for constructing memory SSA form, for example by applying region-based analysis (Ye et al., 2014) to parameterise selected program parts for parallel SSA construction. Another is to apply the proposed parameterised approach to parallelising precise pointer analysis (e.g., whole-program flow-sensitive analysis (Sui et al., 2016) and/or demand-driven analysis (Sui and Xue, 2016a)).

References

ISO90, 1990. ISO/IEC. international standard ISO/IEC 9899, programming languages - C.

Yulei Sui is a Lecturer (Assistant Professor) and an ARC DECRA at Faculty of Engineering and Information Technology, University of Technology Sydney (UTS). He obtained his Ph.D. degree from University of New South Wales (UNSW), where he also holds an Adjunct Lecturer position. He is broadly interested in the research field of software engineering and programming languages, particularly interested in static and dynamic program analysis for software bug detection and compiler optimizations. He worked as a software engineer in Program Analysis Group for Memory Safe C project in Oracle Lab Australia. He was an Australian IPRS scholarship holder, a keynote speaker at EuroLLVM and a Best Paper Award winner at CGO, and has been awarded an Australian Discovery Early Career Researcher Award (DECRA) 2017–2019.

Hua Yan is currently a research associate of School of Computer Science and Engineering at University of New South Wales, Sydney, Australia. He obtained his Ph.D. degree from University of New South Wales, Australia, in January 2018. Hua Yan’s research interests include software engineering, software security, program analysis, testing and formal verification.

Zheng Zheng is currently an associate professor and the Vice Dean of Department of Automatic Control, School of Automation Science and Electrical Engineering, Beihang University. He received his Ph.D. degree in Computer Software and Theory from Institute of Computing Technology, Chinese Academy of Sciences. He is an Area Editor of International Journal of Computational Intelligence Systems. He has served as an investigator on Beijing Youth Talent Project of China, and was honoured by The Dean Scholarship of Chinese Academy of Sciences.

Yunpeng Zhang received his Ph.D. degree in computer science from Northwestern Polytechnical University, China, in 2009. He is currently working as an Assistant Professor at University of Houston. Dr. Zhang has worked at Boise State University and Dakota State University (U.S.), University of Melbourne (Australia), Imperial College London (U.K.) and Northwestern Polytechnical University (China) as a Cybersecurity Expert for more than 15 years. He has published 49 papers in peer-reviewed journals and conferences and served as a program committee member for many conferences in his area.

Jingling Xue received his B.Sc. and M.Sc. degrees in Computer Science and Engineering from Tsinghua University in 1984 and 1987, respectively, and his Ph.D. degree in Computer Science and Engineering from Edinburgh University in 1992. He is currently a Professor in the School of Computer Science and Engineering, University of New South Wales, Australia, where he heads the Programming Languages and Compilers Group. His main research interest has been programming languages and compilers for about 20 years. He is currently supervising a group of postdocs and Ph.D. students on a number of topics including programming and compiler techniques for multicore processors and embedded systems, concurrent programming models, and program analysis for detecting bugs and security vulnerabilities. He is presently an Associate Editor of IEEE Transactions on Computers, Software: Practice and Experience, International Journal of Parallel, Emergent and Distributed Systems, and Journal of Computer Science and Technology. He has served in various capacities on the Program Committees of many conferences in his field.