### Destination-passing Style

This is a brief overview of our paper Shaikhha et al, “Destination-passing style for efficient memory management”, SIGPLAN FHPC 2017. ACM, PDF.

Lots of work exists on compilation of high-level functional languages such as Haskell or F# to native code, and in some cases, benchmarks show competitive efficiency with handwritten low-level code. However there remain cases where the compiled code has significant overhead due to dependence on garbage collection and heap memory management. These include important real-world scenarios with numerical workloads, such as computer vision, VR/AR, and AI/machine learning. Such workloads may do lots of matrix/vector math, with lots of temporary array variables. Of course some temporaries can be eliminated by loop fusion, but many must remain (e.g. a temporary is used multiple times, so it would be expensive to recompute it) and they incur the costs of heap allocation.

The goal of DPS is to minimize dependency of the executable code on garbage collection (GC) or heap allocation. The key idea is that when a function returns a large object, such as an array, the space for the returned value is allocated by the _caller_ rather than the called function. This opens many opportunities for stack allocation, leading to much more efficient code. But there's a but: in general we can't compute the space needed for the result without computing the result. We will handle that problem below, but first let's look at a concrete example.

Let's see an example function `main`

, which takes vectors a,b,c and computes ‖(a+b)+c‖.

function vadd(a::Vector,b::Vector)::Vector function norm(a::Vector)::Real function main(a::Vector,b::Vector, c::Vector)::Real norm(vadd(vadd(a, b), c)) end

The function `vadd`

implements vector addition, and for the purposes of our example has a peculiar implementation where if the vectors have different sizes, it pads the shorter one with zeros. Its C++ code looks something like this. (Ignore the trivial optimization opportunities.)

struct Vector { size_t size; double* data }; // Small fixed struct, can return in registers/on stack Vector vadd(Vector a, Vector b) { double* out = gcnew double[max(a.size, b.size)]; // Allocate space for the return value on the GC heap vadd_blas(min(a.size,b.size), a.data, b.data, out); // vadd_blas is some efficient implementation std::fill(out + min(a.size, b.size), out + max(a.size, b.size), 0.0); return Vector {max(a.size,b.size), out}; }

Therefore calling main above involves two calls to gcnew, and incurs garbage collector overhead to clean up the memory. One can use region allocators and lifetime analysis to turn this into simpler heap allocation or bump allocation, but this simply localizes the memory management overhead. Instead DPS is a simple transformation that cleanly yields efficient memory management.

Imagine that the writer of `vadd`

had kindly supplied a function which computed the size of `vadd`

's result:

size_t vadd_size(Vector a, Vector b) { return max(a.size, b.size); }

And another version `vadd_dps`

which takes a pre-sized result buffer, which the caller must ensure is the correct size

Vector vadd_dps(void* buf, Vector a, Vector b) { double* out = (double*)buf; vadd_blas(min(a.size,b.size), a.data, b.data, out); std::fill(out + min(a.size, b.size), out + max(a.size, b.size), 0.0); return Vector {max(a.size,b.size), out}; }

Then we could eliminate all heap allocations in main using alloca:

double main(Vector a, Vector b) { Vector tmp1 = vadd_dps(alloca<double>(vadd_size(a,b)), a, b); Vector tmp2 = vadd_dps(alloca<double>(vadd_size(tmp1,c)), tmp1, c); vadd_dps(&tmp1, tmp2, c); return norm(tmp2); }

Even if we can't use alloca, maybe because stack space is limited, we can eliminate GC in main using explicit heap allocation:

double main(Vector a, Vector b) { Vector tmp1 = vadd_dps(new double[vadd_size(a,b)], a, b); Vector tmp2 = vadd_dps(new double[vadd_size(tmp1,c)], tmp1, c); double ret = norm(tmp2); delete [] tmp2.data; delete [] tmp1.data; return ret; }

Now, there are perennial arguments of the form “GC is faster than “new/delete”, but notice here that the allocator can be a bump allocator, so the arguments for GC/regions do not apply.

So, what a lovely story. There's one problem though. The writer of the `vadd`

function *didn't* give us `vadd_size`

, just `vadd`

. And in general it's impossible to determine the result size for an arbitrary block of 3rd party code. OK, but we can still write `vadd_size`

, just inefficiently:

size_t vadd_size(Vector a, Vector b) { Vector tmp = vadd(a,b); // Call vadd size_t ret = tmp.size; gcdelete tmp.data; // We know we don't need it, so it could have been bump-allocated return ret; }

And of course, we could use new/delete, or a bump allocator. What's more, when the compiler tries to inline `vadd`

in `vadd_size`

, the fact that tmp.data is never accessed allows dead code elimination to remove the call to `vadd_blas`

(remember we're compiling *to* C++, so the semantics are defined by the source language, which has referential transparency, i.e. can make assumptions about side effects).

size_t vadd_size(Vector a, Vector b) { //DCE double* out = gcnew double[max(a.size, b.size)]; // Allocate space for the return value on the GC heap //DCE vadd_blas(min(a.size,b.size), a.data, b.data, out); // vadd_blas is some efficient implementation //DCE std::fill(out + min(a.size, b.size), out + max(a.size, b.size), 0.0); Vector tmp {max(a.size,b.size), /* DCE out */}; size_t ret = tmp.size; // DCE gcdelete tmp.data; // We know we don't need it, so it could have been bump-allocated return ret; }

Similarly `vadd_dps`

has an inefficient implementation

void vadd_dps(void* buf, Vector a, Vector b) { Vector tmp = vadd(a,b); std::copy(tmp.data, tmp.data+tmp.size, buf); gcdelete tmp.data; return Vector(tmp.size, buf); }

which again, can be easily optimized. So, what's the point? We seem to have just made `vadd`

less efficient, and any gains we talk about could have been made by inlining `vadd`

in main.

The point is this: when compiling `vadd`

, we can easily compile and optimize `vadd_size`

and `vadd_dps`

, without cross-module inlining. When a caller sees `vadd`

, it can observe the existence of the `_size`

and `_dps`

variants, and know that stack-discipline allocation will yield more efficient code. This will mean less stuff on the GC heap, and less GC overhead. With careful subsetting of the source language (and with the introduction of an explicit `gcnew`

or `new/delete`

in the source), we can compile many sensible programs from a functional language such as F# to non-garbage-collected C.

#### Examples.

[Here](https://github.com/awf/Coconut/blob/master/Examples/GMM/FSmooth/usecases_gmm.fs) is a piece of F# to compute the log-likelihood of a Gaussian Mixture model, which is a canonical machine learning model. Slightly more old fashioned than a deep neural network, but with arguably more interesting code complexity. And this compiles to non-GC C (see [here](https://github.com/awf/Coconut/blob/master/outputs/C/usecases_gmm_opt.h)), so can be directly incorporated into projects with no runtime, so we get the joy of functional programming while being another step closer to optimal performance.

#### Related work

We didn't invent DPS, we just used it for F#. It is a technique with a long history. From an Aaron Turon blog post: In a 1998 paper, Yasuhiko Minamide studied an idea that has come to be known as “destination-passing style”. There, a common special case of continuation-passing was noticed, where the continuation's only job was to allocated a cons cell.
It was also common in the early windows APIs to have a call `foo_size`

which would report the buffer size needed for `foo`

.

#### Limitations, future work

In the paper linked above, we didn't talk about F#, but a simpler language F~ (pronounced “F smooth”). That was so we could prove that `foo_size`

would compile to an O(1) operation. However, this isn't necessary in general. Even if the function is complicatedly recursive (e.g. an array of Ackerman size), we can indeed compute the size first, and then fill the array. It may cost twice as much, but look at the alternatives: build a list or tree, then convert to an array? In some cases that will be the better way to go, in some cases the 2x cost of precomputing the size will be better.

Also, as written F~ doesn't allow for _any_ heap allocation, so you can't create a linked list. This is easily overcome by adding a `new`

operator to the language which makes explicit when heap allocation occurs. This can be paired with manual memory management, smart pointers, or full GC, just as C++ programmers choose today. The point is that the numerical parts of the program can be fast, while the remainder is no slower than today.