Tapir - Life, the Universe, and Everything: The adventures of William (Billy) Moses

Tapir is a new framework for compilers that offers the ability to do a variety of optimizations on parallel code, created by myself, TB Schardl, and Charles Leiserson [1].

Since we’re getting a number of questions about it, I thought I’d write this blog post to talk about some of its features!

The problem

The reason for Tapir’s existence starts with a problem developers may encounter when writing parallel programs.

I’m also a physics major, so suppose that I wanted to write a program to normalize some vectors as follows:

__attribute__((const)) double norm(const double *A, int n);

void normalize(double *restrict out, const double *restrict in, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / norm(in, n);
}

C code to normalize a vector.

This code runs decently fast on my test machine (with n=64M), completing in 0.312s.

Yet, I have a lot of vectors and this time simply won’t do! Since my beefy test machine has multiple cores (18 to be exact), let’s see if we can get it to run faster by using parallelism! Now I know I probably won’t get the full 18× speedup because of some overhead of the parallel runtime, but let’s try to get at least a little speedup.

Noticing that none of the iterations of my loop depend on each other, I rewrite my serial for loop to be a parallel loop. For now, I choose OpenMP [2].

__attribute__((const)) double norm(const double *A, int n);

void normalize(double *restrict out, const double *restrict in, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / norm(in, n);
}

Parallel version of normalize code using OpenMP.

Very excited to see how much faster my code will run, I wait for the results and find… it runs ~1000× slower!

Serial Runtime	18-core Runtime
0.3 seconds	3 minutes

Comparison of parallel vs serial runtime for normalize code

Why did this happen?

The short answer is that your compiler is clever. For the serial code, it is able to perform an optimization known as Loop Invariant Code motion or LICM [3]. In effect, it realized that the call to norm is constant across all iterations of the loop and it could transform it so it only calls it once as shown below.

__attribute__((const)) double norm(const double *A, int n);

void normalize(double *restrict out, const double *restrict in, int n) {
  double temp = norm(in, n);
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / temp;
}

Serial version of normalize code after LICM

However, your standard off-the-shelf compiler isn’t able to do the same for the parallel code – even though it seems like it should. The reason for this is that the modern compilers represent parallel code as “syntactic sugar” and effectively perform source-to-source transformations. For example, the parallel program above is transformed into something that roughly looks like this:

struct norm_data {
  double* out;
  double* in;
  int n;
};

void __openmp_parallel_for(void(*body)(norm_data closure, int i), int size);

void normalize_closure(norm_data closure, int i) {
  closure.out[i] = closure.in[i] = norm(i, closure.n);
}

void normalize(double *restrict out, const double *restrict in, int n) {
  __openmp_parallel_for((struct norm_data){out, in, n}, n);
}

C-like representation of normalization code after syntactic sugar is removed.

From the compiler’s perspective, normalize_closure is just some random function – it has no idea that the call to norm could be constant or that it’s even being called inside of a loop! ¹

That’s where Tapir comes in. It is a representation for parallel programs inside of the compiler that allows existing serial optimizations to run on parallel code. It also allows for programmers to start implementing parallel-specific optimizations – which if this interests you, you should email me at [email protected].

Happily, Tapir allows this optimization to happen, and when compiling with Tapir we find a nice speedup as expected.

Serial Runtime	18-core Runtime (No Tapir)	18-core Runtime (Tapir)
0.3 seconds	3 minutes	0.081 seconds

Comparison of parallel vs serial runtime for normalize code

I won’t go into the details of how Tapir works, but if you’re interested you should read the full paper available here. Likewise the (incredibly messy) code for Tapir is available on my Github.

Installation

If you want to try Tapir out, you have a few options at your disposal.

If you want to install it on your personal computer, you’re going to have to build it from source.

Happily, we’ve build a nice script that should do most of the heavy lifting for you. However, be warned that building a compiler will

Take forever (~hour on a decent laptop)
Use a ton of memory (take ~12GB of RAM)

If your machine doesn’t meet the specs necessary to build, you should consider asking a friend with a similar OS to build and send you the binaries, or perhaps one day we’ll build them automatically and have links for you to download.

To install from source, run the following commands:

git clone --recursive https://github.com/wsmoses/Tapir-Meta.git
cd Tapir-Meta
bash build.sh

If all goes well, you should see a “Installation successful” message. Now, you need to add this new compiler to your path so when you run clang, you’ll actually be using this.

To do this, simply run

source ./setup-env.sh

Tapir Usage Guide

Currently, our implementation of Tapir in LLVM [4] allows you to write Cilk code, that will be optimized with Tapir. There’s no reason that Tapir can’t work with OpenMP or other frameworks – but we simply haven’t written the frontend to parse those constructs (which if you’re interested – email me).

As a brief refresher, Cilk has three constructs to enable parallelism: cilk_spawn, cilk_sync, and cilk_for. cilk_spawn says that a call to a function can run in parallel with anything else running in the function. cilk_sync acts as a boundary that forces all cilk_spawn’s in the current scope to complete before continuing. Lastly, cilk_for specifies iterations of a for loop that can run in parallel (and you can think of as a cilk_spawn inside of the body, with a cilk_sync after the loop is done) [5].

For ease of our own testing (and hopefully this will be useful to anyone using Tapir), we extended Cilk slightly, such that you can spawn arbitrary blocks of code rather than just functions.

For example, you can do so in the following (racy) code:

#include <cilk/cilk.h>
#include <stdio.h>

int main() {
  int sum = 0;
  cilk_spawn { sum+=1; }
  sum+=2;
  cilk_sync;
  printf("The sum is %d\n", sum);
  return 0;
}

Block-like cilk_spawn syntax

Now let’s try compiling some code!

To compile a cilk code you have a few options for how you want it compiled:

You compile the program just like you would any regular Cilk program, except now Tapir optimizations are going to be enabled.

clang mycode.c -fcilkplus

Sometimes, however, you may not want to use the any tapir optimizations. For example, this may be useful when debugging. To do so, add the -fdetach flag.

clang mycode.c -fcilkplus -fdetach

Something important to note: if you compile cilk code and don’t use one of these flags it will happily compile it, but make it serial!

Race Detection

[Note that since Tapir is an ongoing project, it’s possible the optimizer doesn’t work the same when you try it. However, the following instructions should still work, and it’s still important to be cognizant of the following “gotcha”.]

Now that you have your compiler working, let’s use it to let’s to do race-detection! Tapir currently uses cilksan for race-detection.

In the above example, there’s a race on the variable sum.

To perform race detection, we need only add the -fsanitize=cilk flag.

clang mycode.c -fcilkplus -fsanitize=cilk -O2

Now let’s run the code!

You should get the following output:

The sum is 3

Race detector detected total of 0 races.
Race detector suppressed 0 duplicate error messages

max sync block size seen: 1    (from user input: 0, average: 0.333333)
max continuation depth seen: 1

It said there’s no races, what’s going on here?

It turns out that tapir actually optimized away the race in this case and just prints 3! To force it to not do this optimization we can set -O0.

clang mycode.c -fcilkplus -fsanitize=cilk -O0

We should now see the following:

Race detected at address 0x7ffdd8527390
  write access at 0x401633: (racetest.c:0)
  write access at 0x4011b0: (??:0)

steal points: 0, 0, 0
curr sync block size: 1
frame id: 1
The sum is 3

Race detector detected total of 1 races.
Race detector suppressed 0 duplicate error messages

max sync block size seen: 1    (from user input: 0, average: 0.333333)
max continuation depth seen: 1

Of course, Tapir can’t always optimize away the races and the race detector will indeed work without adding the -O0 flag in those cases.

Using Tapir

If you decide to use or reference Tapir in academic work, please cite the following. If you’re interested in using Tapir in an industrial setting, please email me.

@inproceedings{tapir,
  title = {Tapir: Embedding Fork-Join Parallelism into {LLVM}'s Intermediate Representation},
  author = {Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.},
  booktitle = {Proc.\ 22nd Symposium on Principles and Practice of Parallel Programming},
  year = {2017}
}

References

T. B. Schardl, W. S. Moses, and C. E. Leiserson, “Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation,” in Proc. 22nd Symposium on Principles and Practice of Parallel Programming, 2017.
OpenMP Architecture Review Board, OpenMP Application Program Interface, Version 4.0. 2013.
S. S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
C. Lattner and V. Adve, “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation,” in CGO, 2004, pp. 75–87.
M. Frigo, C. E. Leiserson, and K. H. Randall, “The Implementation of the Cilk-5 Multithreaded Language,” in ACM SIGPLAN Conference on Programming Language Design and Implementation, 1998, pp. 212–223.

Sometimes OpenMP tries to be slightly clever and use something known as strip mining. Doing so means that instead of running one iteration inside the extracted function, it runs a few iterations. Let’s suppose that it runs B iterations in the extracted function. Now it can successfully run LICM in the extracted function. This does improve performance, but is still worse than being able to perform LICM before the extraction. The total work in in the model where B iterations are in the extracted function is O(N^2/B), instead of O(N) in the serial case. ↩