HRAM0 | Model and Toolchain for Research in Memory Safety

Selective Screeners: Complexity and Optimizations

Selective Screeners

Recall that a screener is a transformation that is both an instrumentor and strengthening. The screeners we have seen so far are universal in that they check every memory access for safety. A screener that eliminates one or more checks it determines to be redundant, and therefore safe to remove, is termed a selective screener.

1. The check count function $\mathsf{ch}_{T, P} : \mathbb{N} \to \mathbb{N}$ measures the number of checks made by the output of screener $T : \mathbb{P} \to \mathbb{P}$ on program $P \in \mathbb{P}$, namely $T(P)$, as a function of the input length $n$. The screener $T$ is universal if and only if $\mathsf{ch}_{T, P}(n) = m_P(n)$ for all $P \in \mathbb{P}$ and all $n \in \mathbb{N}$. Otherwise, the screener $T$ is selective.

Recall that $m_P(n)$ is the memory complexity, i.e. the number of loads/stores in program $P$ as a function of $n$. Two components determine a screener's runtime complexity: (1) the check count; and (2) the runtime cost of each check. The latter depends on the underlying address manager implementation and, more specifically, on the data structure adopted for the bookkeeping. Let $b(n)$ denote the bookkeeping cost (as a function of $n$) in terms of the number of memory accesses incurred by a single safety check. Consider a selective screener that aims to achieve a slowdown factor of at most $k$, where the runtime efficiency metric we use is the total number of memory accesses made by $T(P)$ for some $P \in \tilde{\mathbb{P}}$. Fix some $n$ and let $r = \frac{\mathsf{ch}_{T, P}(n)}{m_P(n)}$, the fraction of memory accesses that are checked. Also, let $b = b(n)$ be the cost per check and let $m = m_P(n)$. Then we have \[m + rmb \leq km\] where $k > 1$. It follows that $r \leq (k - 1)/b$. If pointer tagging is employed, the $b$ term is a constant. When constant factors apply, it is important to remember that the limitations of HRAM0 make such constants considerably larger than in practical architectures (HRAM0 has no native support for bitwise operations, for example), and it is most relevant to explore the performance of certain types of screeners on these architectures; we return to this in a later article. On previous pages, the goal was to optimize $b(n)$, whereas our goal here is to minimize $r \in [0, 1]$; that is, to eliminate as many checks as we can.
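To make the budget $r \leq (k - 1)/b$ concrete, here is a minimal Python sketch (our own illustration, not part of the HRAM0 toolchain; the function name and the sample values of $k$ and $b$ are ours):

def max_checked_fraction(k: float, b: float) -> float:
    """Largest fraction r of memory accesses that can carry a check while
    keeping the total accesses of T(P) within k times those of P, assuming
    each check costs b extra accesses: m + r*m*b <= k*m  =>  r <= (k - 1)/b."""
    if k <= 1 or b <= 0:
        raise ValueError("require k > 1 and b > 0")
    return min(1.0, (k - 1) / b)

# Illustrative numbers only: with an assumed per-check cost of b = 4 accesses
# and a slowdown budget of k = 2, at most a quarter of all accesses may be checked.
print(max_checked_fraction(k=2.0, b=4.0))  # 0.25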

Eliminating Checks

Unlike a universal screener, which checks every memory access for safety, a selective screener aims to eliminate checks that it determines are unnecessary, without compromising detection of unsafe accesses. A selective screener performs static analysis to build up useful information about pointers, which informs its ability to accurately eliminate checks. Consider the following example program, selection sort, implemented below. We will use this running example to describe a series of optimizations.

Program $P_{\mathsf{SelSort}}$

###############################################################################
# Selection Sort 
#
# description: sorts array pointed to by r4 in-place
#             (function dirties r5-r12 (i.e. not saved/restored from stack))
# arguments: r4 - address of array of integers to sort of length n (reg n)
###############################################################################
selsort:
        put 0, r5 # r5 is outer index
        add r2, n, r12 # r12 = n - 1
        sub r12, r5, r9 # r9 = r5 - (n - 1)
        brn r9, outerloop # enter loop if r5 < n - 1 (i.e. n > 1)
        brn r2, end_outerloop # otherwise skip loop (r2 = -1, so branch always taken)

outerloop:
        add r5, r4, r6 # r6 = r4 + r5 = address of element at outer index
        lod r6, r7 # r7 is min value
        put 0, r8
        add r8, r5, r8 # r8 = r5 = index of current minimum
        sub r2, r5, r9 # r9 = r5 + 1 = inner index

        sub n, r9, r10 # r10 = r9 - n
        brn r10, innerloop # enter inner loop if r9 < n
        brn r2, end_innerloop # otherwise skip inner loop

innerloop:
        add r9, r4, r10 # r10 = r4 + r9 = address of element at inner index
        lod r10, r10 # r10 = value at inner index
        sub r7, r10, r11 # r11 = current value minus current minimum
        brn r11, newmin # new minimum if value < current minimum
        brn r2, nextiter
newmin:
        put 0, r11
        add r11, r9, r8 # r8 = r9 = new index of minimum (r11 = 0)
        add r11, r10, r7 # r7 = r10 = new minimum value
nextiter:
        sub r2, r9, r9 # r9 = r9 + 1
        sub n, r9, r10 # r10 = r9 - n
        brn r10, innerloop # continue while r9 < n

end_innerloop:
        add r8, r4, r8 # r8 = r4 + minimum index = address of minimum element
        lod r6, r10 # r10 = value at outer index
        sto r7, r6 # store minimum value at outer index
        sto r10, r8 # store displaced value at old position of minimum

        sub r2, r5, r5 # r5 = r5 + 1
        sub r12, r5, r9 # r9 = r5 - (n - 1)
        brn r9, outerloop # continue while r5 < n - 1

end_outerloop:
        ret

It is straightforward to calculate the number of memory accesses. We derive $m_{P_{\mathsf{SelSort}}}(n) = n^2/2 + 5n/2 - 3$. With a universal screener such as $\mathsf{UScreener}$, which we developed previously, it naturally holds that $\mathsf{ch}_{\mathsf{UScreener}, P_{\mathsf{SelSort}}}(n) = n^2/2 + 5n/2 - 3$. We now consider a series of optimizations that will help us eliminate some of these checks.

Optimizations

Literature abounds on control flow analysis, and that is what is needed here first. Essentially, we start by building a control flow graph: we assemble basic blocks (a basic block is a straight-line sequence of instructions that can be entered only at the beginning and exited only at the end) and add edges between the blocks based on branches and fall-through. Loops are detected by recognizing back edges in the graph. Such loops may take a number of forms, each distinguishable by its characteristics.
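To make this step concrete, here is a minimal Python sketch (our own illustration, not the HRAM0 toolchain). The CFG below is hand-built at label granularity for $P_{\mathsf{SelSort}}$; a real implementation would derive the blocks and edges from the instructions. It computes dominators with the standard iterative data-flow algorithm and reports back edges, i.e. edges $u \to v$ where $v$ dominates $u$; these identify the two loops.

def dominators(cfg, entry):
    """Iterative dominator computation: dom(entry) = {entry};
    dom(b) = {b} union the intersection of dom(p) over predecessors p of b."""
    preds = {n: set() for n in cfg}
    for u, succs in cfg.items():
        for v in succs:
            preds[v].add(u)
    dom = {n: set(cfg) for n in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in cfg:
            if n == entry or not preds[n]:
                continue
            new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def back_edges(cfg, entry):
    """Edges u -> v where v dominates u; each one identifies a loop."""
    dom = dominators(cfg, entry)
    return [(u, v) for u, succs in cfg.items() for v in succs if v in dom[u]]

cfg = {
    "selsort":       ["outerloop", "end_outerloop"],
    "outerloop":     ["innerloop", "end_innerloop"],
    "innerloop":     ["newmin", "nextiter"],
    "newmin":        ["nextiter"],
    "nextiter":      ["innerloop", "end_innerloop"],
    "end_innerloop": ["outerloop", "end_outerloop"],
    "end_outerloop": [],
}
print(back_edges(cfg, "selsort"))
# [('nextiter', 'innerloop'), ('end_innerloop', 'outerloop')] -- the two loops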

Repeat Access to the Same Pointer

If a pointer stored in a register $r$ is checked for safety at code location $i$, and at another location $j$ a load or store is made through register $r$, then under certain conditions we can avoid checking it for safety again. Firstly, if the two instructions belong to different basic blocks, the basic block containing $i$ must dominate the basic block containing $j$. Secondly, with data flow analysis, we verify that the register $r$ is not modified on any execution path between $i$ and $j$. Finally, we must ascertain that no call to free occurs on any execution path between $i$ and $j$. More thorough analysis can exclude certain calls to free, but as a starting point, we require that no call to free occurs whatsoever. A sketch of this test appears below.
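The following Python fragment is a sketch of the decision only (not the actual implementation): the predicates passed in stand for the results of the control-flow and data-flow analyses described above and are assumed to be available.

def check_is_redundant(i, j, reg, precedes_in_block, dominates,
                       reg_unmodified_between, no_free_between):
    """True if the safety check on register `reg` before location j can be
    dropped because an earlier check at location i still covers it."""
    # Condition 1: every path to j passes through i first (same block, i before j,
    # or the block of i dominates the block of j).
    ordered = precedes_in_block(i, j) or dominates(i, j)
    return (ordered
            and reg_unmodified_between(i, j, reg)   # condition 2: reg not modified
            and no_free_between(i, j))              # condition 3: no call to free

# Toy usage for the loads through r6 at lines 9 and 32 of P_SelSort, simply
# asserting the facts established in the text:
print(check_is_redundant(
    9, 32, "r6",
    precedes_in_block=lambda a, b: False,
    dominates=lambda a, b: True,
    reg_unmodified_between=lambda a, b, r: True,
    no_free_between=lambda a, b: True))  # True: the check at line 32 is redundant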

Let $T_0 := \mathsf{UScreener}$. Applying $T_0$ to $P_{\mathsf{SelSort}}$ results in checks for the instructions on lines 9 (lod r6, r7), 18 (lod r10, r10), 32 (lod r6, r10), 33 (sto r7, r6), and 34 (sto r10, r8), counting from the selsort label and excluding empty lines. Consider the load instruction at line 9: lod r6, r7. A check will duly be added by $T_0$ for the address in register r6. Now consider the load instruction on line 32: lod r6, r10. Firstly, we see that in order to reach this instruction, the program must first reach the instruction at line 9. Secondly, it is easy to see that the register r6 is not modified between these two instructions. Finally, we can readily ascertain that no call to free occurs on any execution path between them. Therefore, the check for safety prior to the instruction at line 32 is unnecessary and can be eliminated. Furthermore, we can update our most recent determination as to the safety of the address in r6 to line 32 (we will refer to this location in the code segment as location $i$). The subsequent instruction (at location $i + 3$, i.e. line 33) is another memory access instruction: sto r7, r6. Here, r6 is referenced again, and applying the same reasoning we conclude that no safety check needs to be performed for this instruction either. No further checks can be eliminated by this particular approach, but we have eliminated two checks from within the outer loop, which overall amounts to $2(n - 1)$ fewer checks. We denote by $T_1$ our new screener, which implements the above optimization.
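For reference, using the memory access count derived above and the $2(n - 1)$ checks just eliminated, the check count after this optimization works out to \[\mathsf{ch}_{T_1, P_{\mathsf{SelSort}}}(n) = (n^2/2 + 5n/2 - 3) - 2(n - 1) = n^2/2 + n/2 - 1.\]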

Loop-Invariant Pointers

Another optimization we describe is as follows. Consider the following code:

# assume r4 contains a pointer to an array with the number of elements given by r5
count_elems_equal_to_first:

# loop, index is r6
put 1, r6 # skip first element

# count of number of elements equal to the first element is in r0 (ret value)
put 1, r0 # first element is automatically counted

loop:
lod r4, r7 # r7 = first element; the pointer in r4 is not modified in the loop
add r6, r4, r8 # r8 = r4 + r6 = address of element at index r6
lod r8, r8 # r8 = element at index r6

# check if [r4] == [r4 + r6]
sub r7, r8, r9
brz r9, eq # brz (branch if zero) macro defined in spec
brn r2, iter # recall r2=-1 so here we jump to iter

# Elements are equal so increment count
eq:
sub r2, r0, r0 # r0 = r0 + 1

iter:
sub r2, r6, r6 # r6 = r6 + 1
sub r5, r6, r7 # r7 = r6 - r5
brn r7, loop # continue while r6 < r5

ret # return num elements equal to first element

Observe that the load instruction after the label loop, namely lod r4, r7, always loads from a pointer r4 that is not modified during execution of the loop; we call such a pointer loop-invariant. Our screener $T_1$ will add a check before this instruction, which will be executed on every iteration of the loop. A simple optimization is the following: if the pointer in a load/store is loop-invariant, it is sufficient to move the check outside the loop, where it will be executed once instead of once per iteration of the loop (roughly r5 times in this case). Note that there is no case where this optimization applies in our example program $P_{\mathsf{SelSort}}$. Let us call the derivative of $T_1$ that implements this optimization $T_2$.
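A minimal sketch of the decision $T_2$ has to make, in Python (again our own illustration on a toy IR, not the toolchain's representation; in particular we assume calls show up as a ("cal", target) tuple, which may differ from the real instruction set): a pointer register is treated as loop-invariant if no instruction in the loop body writes to it and the body contains no call to free, in which case its check can be hoisted to a preheader.

def written_registers(instr):
    """Toy IR: an instruction is a tuple (opcode, operands...). As in the
    listings above, add/sub/put/lod write their last operand; sto and the
    branches write no register."""
    if instr[0] in ("add", "sub", "put", "lod"):
        return {instr[-1]}
    return set()

def hoistable_checks(loop_body, checked_pointer_regs):
    """Pointer registers whose safety checks can move to the loop preheader:
    nothing in the body writes them and the body calls no free."""
    written = set()
    calls_free = False
    for instr in loop_body:
        written |= written_registers(instr)
        calls_free = calls_free or (instr[0] == "cal" and instr[1] == "free")
    if calls_free:
        return set()
    return {r for r in checked_pointer_regs if r not in written}

# The loop body of count_elems_equal_to_first (branch targets elided):
body = [("lod", "r4", "r7"), ("add", "r6", "r4", "r8"), ("lod", "r8", "r8"),
        ("sub", "r7", "r8", "r9"), ("brz", "r9"), ("brn", "r2"),
        ("sub", "r2", "r0", "r0"), ("sub", "r2", "r6", "r6"),
        ("sub", "r5", "r6", "r7"), ("brn", "r7")]
print(hoistable_checks(body, {"r4", "r8"}))  # {'r4'}: only the check on r4 hoists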

Consider a transformation $\mathsf{LP}_k : \mathbb{P} \to \mathbb{P}$ that peels every loop in a program by $k$ iterations, where peeling refers to inserting $k$ copies of the loop body, each preceded by the loop guard, before the remainder of the loop. This is a relatively straightforward transformation. Let $T'_2 = T_1 \circ \mathsf{LP}_1$, i.e. peel first, then apply $T_1$. It follows that \[\mathsf{ch}_{T_2, P} = \mathsf{ch}_{T'_2, P} \; \; \forall P \in \tilde{\mathbb{P}}.\] The reason is that, in the case of $T'_2$, a check for a loop-invariant pointer will be added in the peeled iteration, and the optimization implemented in $T_1$ will detect this check as dominating the use of the pointer in the remainder of the loop and therefore remove the check inside the loop. This has the same effect as directly removing checks for loop-invariant pointers, as $T_2$ does.
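To see why the counts coincide, consider the following toy model in Python (our own simplification, under the assumptions that every checked pointer in the loop is either loop-invariant or re-checked on every iteration and that the loop executes iters times); it is only meant to make the equality plausible for this idealized setting, not to prove it.

def checks_T2(iters, invariant_ptrs, variant_ptrs):
    """T_2: one hoisted check per loop-invariant pointer, plus a per-iteration
    check for every other checked pointer."""
    return len(invariant_ptrs) + iters * len(variant_ptrs)

def checks_T2_prime(iters, invariant_ptrs, variant_ptrs):
    """T'_2 = T_1 after peeling one iteration: every pointer is checked once in
    the peeled copy; in the remaining iters - 1 iterations, checks on invariant
    pointers are removed as dominated while the rest stay."""
    peeled = len(invariant_ptrs) + len(variant_ptrs)
    return peeled + (iters - 1) * len(variant_ptrs)

# Equal for any trip count and any mix of pointers, e.g. the loop above with
# the loop-invariant r4 and the per-iteration r8:
print(checks_T2(10, {"r4"}, {"r8"}), checks_T2_prime(10, {"r4"}, {"r8"}))  # 11 11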

Reducing Checks to Sequential Pointers

It is important to conduct data flow analysis to better understand where savings can be made. Most useful is assigning symbols to unknown values and algebraic expressions to registers at different points of the program, which allows more accurate inferences to be made about the program's behavior. Sequential pointers are addresses of arrays whose elements are indexed in sequence (say in a loop). As mentioned above, identifying loops is particularly important. Returning to $P_{\mathsf{SelSort}}$, we can identify two loops, one nested inside the other. The index of the outer loop is register $\mathsf{r5}$: it is the only register that is incremented or decremented by a constant from one outer iteration to the next. On line 9, there is a load instruction with respect to the address contained in register $\mathsf{r6}$. In the previous line, the value of $\mathsf{r6}$ is computed as the sum of $\mathsf{r4}$ and the index $\mathsf{r5}$. Therefore, we can infer that $\mathsf{r4}$ is some kind of "base" pointer; hence, we have identified a sequential pointer. Similarly, the index of the inner loop is $\mathsf{r9}$, and the same sequential pointer can be identified on line 17, where the address used by the load on line 18 is computed in register $\mathsf{r10}$ as the sum of index $\mathsf{r9}$ and base pointer $\mathsf{r4}$. Additional analysis is needed to determine the range of a sequential pointer, i.e. the lower and upper bounds of an index in a particular loop. This requires us to determine the terminating condition for the loop.

In $P_{\mathsf{SelSort}}$, let us first examine the outer loop. We see on line 2 that the index $\mathsf{r5}$ is initialized to 0. Since the index increments on every iteration, this is a forward sequence pointer and the lower bound is 0. Next we must determine the loop's terminating condition. We see that the characteristic back edge that defines the loop jumps to its beginning conditioned on $\mathsf{r5} - \mathsf{r12} < 0$, i.e. when the index $\mathsf{r5}$ is less than $\mathsf{r12}$. Data flow analysis reveals that $\mathsf{r12}$ holds the value $n - 1$. Therefore, the upper bound is $n - 1$, and it is a strict upper bound. The range we are interested in is thus 0 to $n - 2$ (inclusive). Let us denote the value of the register $\mathsf{r4}$ by $b$. Hence, the sequence pointer accessed in the outer loop is an array at address $b$ such that the element at index $i$ has address $b + i$, and $i$ takes on values in the range $0, \ldots, n - 2$.

Now we examine the inner loop. The index of the inner loop is the register $\mathsf{r9}$. This can be ascertained by identifying the register that increments/decrements by a non-zero constant. Note that, from an algorithmic perspective, our approach only works for certain common cases and will fail for other loop types. For example, a loop whose index increments by a variable amount on each iteration will not be handled; more complex analysis is required for such cases. Repeating the steps from above for the inner loop, we ascertain that the range of the sequence pointer accessed in the inner loop is $i + 1, \ldots, n - 1$ (inclusive). Since the initial value of $i$ is 0, the inner loop accesses indices in $1, \ldots, n - 1$; combined with the outer loop's range $0, \ldots, n - 2$, the accessed indices cover $0, \ldots, n - 1$. Therefore, the putative array has at least $n$ elements.

Overall, we can eliminate the checks inside the loops and place a single check before the outer loop for the address range starting at $\mathsf{r4}$ and ending at $\mathsf{r4} + n$ (exclusive). Let us call $T_3$ the screener that implements these optimizations. The number of checks made by $T_3$ for the program $P_{\mathsf{SelSort}}$ is \[\mathsf{ch}_{T_3, P_{\mathsf{SelSort}}}(n) = 1.\]
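A sketch of the range derivation and the resulting single check, in Python (our own illustration; the arguments encode the facts established above for the two loops: initial index value, per-iteration step, and the strict upper bound taken from the loop guard). Symbolic reasoning over $n$ is beyond this toy sketch, so we instantiate $n = 8$.

def index_range(init, step, strict_upper):
    """Inclusive [lo, hi] index range of a forward sequential pointer whose
    index starts at init, increases by step > 0, and runs while index < strict_upper."""
    assert step > 0 and init < strict_upper
    last = init + ((strict_upper - 1 - init) // step) * step
    return (init, last)

# Outer loop of P_SelSort: index r5 starts at 0, steps by 1, runs while r5 < n - 1.
# Inner loop: index r9 starts at i + 1 (i.e. 1 when the outer index i is 0),
# steps by 1, runs while r9 < n.
n = 8
outer = index_range(0, 1, n - 1)   # (0, 6)  = 0 .. n - 2
inner = index_range(1, 1, n)       # (1, 7)  = 1 .. n - 1

# Merged range: one check over [r4 + lo, r4 + hi] covers every access in both loops.
lo = min(outer[0], inner[0])
hi = max(outer[1], inner[1])
print(lo, hi)  # 0 7: check the address range [r4, r4 + n) once, before the outer loop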

To be continued...