I'm a PhD candidate at MIT studying computer science and getting up to all sorts of mischief. Previously, I attended MIT as an undergraduate and M.Eng in computer science and physics. Before that, I attended Thomas Jefferson High School for Science and Technology (TJHSST) in Northern Virginia.
While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model.We propose an alternative approach that automatically translates programs written in one programming model (CUDA), into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer making use of transpiled CUDA PyTorch kernels outperforms the PyTorch CPU native backend by 2.7\texttimes.
@inproceedings{moses2022high, author = {Moses, William S. and Ivanov, Ivan R. and Domke, Jens and Endo, Toshio and Doerfert, Johannes and Zinenko, Oleksandr}, title = {High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs}, year = {2023}, isbn = {9798400700156}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3572848.3577475}, doi = {10.1145/3572848.3577475}, booktitle = {Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming}, pages = {119–134}, numpages = {16}, keywords = {CUDA, MLIR, barrier synchronization, polygeist}, location = {Montreal, QC, Canada}, series = {PPoPP '23}, shortname = {PPoPP '23}, papertype = {conference}, pdf = {https://dl.acm.org/doi/pdf/10.1145/3572848.3577475} }
@inproceedings{enzymePar, title = {Scalable Automatic Differentiation of Multiple Parallel Paradigms through Compiler Augmentation}, author = {Moses, William S and Hari Krishna Narayanan, Sri and Paehler, Ludger and Churavy, Valentinand H{\"u}ckelheim, Jan and Schanen, Michel and Doerfert, Johannes and Hovland, Paul}, booktitle = { {SC} '22: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, year = {2022}, location = {St. Louis, Missouri}, conference = { {SC} '22: The International Conference for High Performance Computing, Networking, Storage and Analysis}, award = {Best Student Paper Award and Best Paper Finalist}, shortname = {SC '22}, papertype = {conference}, overleaf = {https://www.overleaf.com/project/60eb1022cd54f70251e1294b}, pdf = {https://c.wsmoses.com/papers/enzymePar.pdf} }
@inproceedings{transformersLLVM, title = {Enabling Transformers to Understand Low-Level Programs}, author = {Guo, Zifan and Moses, William S.}, booktitle = {2022 IEEE High Performance Extreme Computing Conference (HPEC)}, year = {2022}, organization = {IEEE}, shortname = {HPEC '22}, papertype = {conference}, overleaf = {https://www.overleaf.com/project/62d1dd7a6097679455a75368}, pdf = {https://c.wsmoses.com/papers/hpectransformers.pdf} }
@article{brown2022performance, title = {Performance Portable Solid Mechanics via Matrix-Free $ p $-Multigrid}, author = {Brown, Jed and Barra, Valeria and Beams, Natalie and Ghaffari, Leila and Knepley, Matthew and Moses, William and Shakeri, Rezgar and Stengel, Karen and Thompson, Jeremy L and Zhang, Junchao}, journal = {arXiv preprint arXiv:2204.01722}, papertype = {preprint}, shortname = {arXiv}, year = {2022}, pdf = {https://arxiv.org/pdf/2204.01722.pdf} }
Computing derivatives is key to many algorithms in scientific computing and machine learning such as optimization, uncertainty quantification, and stability analysis. Enzyme is a LLVM compiler plugin that performs reverse-mode automatic differentiation (AD) and thus generates high performance gradients of programs in languages including C/C++, Fortran, Julia, and Rust. Prior to this work, Enzyme and other AD tools were not capable of generating gradients of GPU kernels. Our paper presents a combination of novel techniques that make Enzyme the first fully automatic reversemode AD tool to generate gradients of GPU kernels. Since unlike other tools Enzyme performs automatic differentiation within a general-purpose compiler, we are able to introduce several novel GPU and AD-specific optimizations. To show the generality and efficiency of our approach, we compute gradients of five GPU-based HPC applications, executed on NVIDIA and AMD GPUs. All benchmarks run within an order of magnitude of the original program’s execution time. Without GPU and AD-specific optimizations, gradients of GPU kernels either fail to run from a lack of resources or have infeasible overhead. Finally, we demonstrate that increasing the problem size by either increasing the number of threads or increasing the work per thread, does not substantially impact the overhead from differentiation.
@inproceedings{enzymeGPU, title = {Reverse-Mode Automatic Differentiation and Optimization of {GPU} Kernels via Enzyme}, author = {Moses, William S and Churavy, Valentin and Paehler, Ludger and H{\"u}ckelheim, Jan and Hari Krishna Narayanan, Sri and Schanen, Michel and Doerfert, Johannes}, booktitle = { {SC} '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, year = {2021}, location = {St. Louis, Missouri}, conference = { {SC} '21: The International Conference for High Performance Computing, Networking, Storage and Analysis}, award = {Best Student Paper Finalist and Best Reproducibility Advancement Finalist}, shortname = {SC '21}, pdf = {https://c.wsmoses.com/papers/EnzymeGPU.pdf}, papertype = {conference}, tex = {https://github.com/wsmoses/Paper-EnzymeSC21}, overleaf = {https://www.overleaf.com/project/60395b62c1de7024e5b878fc} }
We present Polygeist, a new tool that reroutes polyhedral compilation flows to use the representation available in the recent MLIR compilation infrastructure. It consists of two parts: a C and C++ frontend capable of converting a wide variety of existing codes into MLIR suitable for polyhedral trans- formation, and a bi-directional conversion between MLIR’s polyhedral representation and existing polyhedral exchange formats. We demonstrate Polygeist’s flow by converting the entire Polybench/C benchmark suite into MLIR, and by per- forming an IR-to-IR optimization leveraging an existing polyhedral compiler (Pluto). Our flow produces results within 1.25% of the state-of-the-art Clang compiler, enabling direct comparison of source-to-source and IR-to-binary compilers. We believe Polygeist can improve the interoperation between MLIR and the existing polyhedral tooling, benefiting both the research and the production compiler communities.
@inproceedings{polygeistIMPACT, title = {Polygeist: Affine {C} in {MLIR}}, author = {Moses, William S. and Chelini, Lorenzo and Zhao, Ruizhe and Zinenko, Oleksandr}, booktitle = {IMPACT 2021-11th International Workshop on Polyhedral Compilation Techniques}, year = {2021}, pdf = {https://acohen.gitlabpages.inria.fr/impact/impact2021/papers/IMPACT_2021_paper_1.pdf}, shortname = {IMPACT '21}, papertype = {workshop}, tex = {https://github.com/wsmoses/Paper-PolygeistIMPACT21}, overleaf = {https://braintex.goog/project/5fc12a5b595bff00806782b7} }
We present Polygeist, a new compilation flow that connects the MLIR compiler infrastructure to cutting edge polyhedral optimization tools. It consists of a C and C++ frontend capable of converting a broad range of existing codes into MLIR suitable for polyhedral transformation and a bi-directional conversion between MLIR and OpenScop exchange format. The Polygeist/MLIR intermediate representation featuring high-level (affine) loop constructs and n-D arrays embedded into a single static assignment (SSA) substrate enables an unprecedented combination of SSA-based and polyhedral optimizations. We illustrate this by proposing and implementing two extra transformations: statement splitting and reduction parallelization. Our evaluation demonstrates that Polygeist outperforms on average both an LLVM IR-level optimizer (Polly) and a source-to-source state-of-the-art polyhedral compiler (Pluto) when exercised on the Polybench/C benchmark suite in sequential (2.53x vs 1.41x, 2.34x)and parallel mode (9.47x vs 3.26x, 7.54x) thanks to the new representation and transformations.
@inproceedings{polygeistPACT, title = {Polygeist: Raising {C} to Polyhedral {MLIR}}, author = {Moses, William S. and Chelini, Lorenzo and Zhao, Ruizhe and Zinenko, Oleksandr}, booktitle = {Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques}, numpages = {12}, location = {Virtual Event}, shortname = {PACT '21}, publisher = {Association for Computing Machinery}, year = {2021}, address = {New York, NY, USA}, keywords = {Polygeist, MLIR, Polyhedral, LLVM, Compiler, C++, Pluto, Polly, OpenScop, Parallel, OpenMP, Affine, Raising, Transformation, Splitting, Automatic-Parallelization, Reduction, Polybench}, pdf = {https://c.wsmoses.com/papers/Polygeist_PACT.pdf}, papertype = {conference}, tex = {https://github.com/wsmoses/Paper-PolygeistPACT21}, overleaf = {https://braintex.goog/project/6113bf60410be30098c30301} }
@article{haj2020protuner, title = {ProTuner: tuning programs with Monte Carlo tree search}, author = {Haj-Ali, Ameer and Genc, Hasan and Huang, Qijing and Moses, William and Wawrzynek, John and Asanovi{\'c}, Krste and Stoica, Ion}, journal = {arXiv preprint arXiv:2005.13685}, year = {2020}, shortname = {arXiv}, papertype = {preprint}, tex = {https://github.com/wsmoses/Paper-Protuner}, overleaf = {https://www.overleaf.com/project/5e4f77f9a690c5000145ac3a}, pdf = {https://arxiv.org/pdf/2005.13685.pdf} }
The performance of the code a compiler generates depends on the order in which it applies the optimization passes. Choosing a good order–often referred to as the \em phase-ordering problem–is an NP-hard problem. As a result, existing solutions rely on a variety of heuristics. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. To this end, we implement a framework that takes a program and finds a sequence of passes that optimize the performance of the generated circuit. Without loss of generality, we instantiate this framework in the context of an LLVM compiler and target high-level synthesis programs. We use random forests to quantify the correlation between the effectiveness of a given pass and the program’s features. This helps us reduce the search space by avoiding orderings that are unlikely to improve the performance of a given program. We compare the performance of deep reinforcement learning to state-of-the-art algorithms that address the phase-ordering problem. In our evaluation, we show that reinforcement learning improves circuit performance by 28% when compared to using the -O3 compiler flag, and it achieves competitive results compared to the state-of-the-art solutions, while requiring fewer samples. More importantly, unlike existing state-of-the-art solutions, our reinforcement learning solution can generalize to more than 12,000 different programs after training on as few as a hundred programs for less than ten minutes.
@article{haj2020autophase, title = {AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning}, author = {Haj-Ali, Ameer and Huang, Qijing Jenny and Xiang, John and Moses, William and Asanovic, Krste and Wawrzynek, John and Stoica, Ion}, booktitle = {Proceedings of Machine Learning and Systems}, editor = {Dhillon, I. and Papailiopoulos, D. and Sze, V.}, pages = {70--81}, url = {https://proceedings.mlsys.org/paper/2020/file/4e732ced3463d06de0ca9a15b6153677-Paper.pdf}, volume = {2}, year = {2020}, papertype = {conference}, tex = {https://github.com/wsmoses/Paper-AutophaseMLSys}, overleaf = {https://www.overleaf.com/project/5c9ad45dd502c2597ba2c3b1} }
@misc{govindarajan2020syfer, title = { {SyFER-MLIR}: Integrating Fully Homomorphic Encryption Into the {MLIR} Compiler Framework}, author = {Govindarajan, Sanath and Moses, William S}, pdf = {https://math.mit.edu/research/highschool/primes/materials/2020/Govindarajan-Moses.pdf}, papertype = {preprint}, year = {2020} }
Applying differentiable programming techniques and machine learning algorithms to foreign programs requires developers to either rewrite their code in a machine learning framework, or otherwise provide derivatives of the foreign code. This paper presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme can synthesize gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. On a machine-learning focused benchmark suite including Microsoft’s ADBench, AD on optimized IR achieves a geometric mean speedup of 4.5x over AD on IR before optimization allowing Enzyme to achieve state-of-the-art performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the art performance, enabling foreign code to be directly incorporated into existing machine learning workflows.
@inproceedings{enzymeNeurips, author = {Moses, William and Churavy, Valentin}, booktitle = {Advances in Neural Information Processing Systems}, editor = {Larochelle, H. and Ranzato, M. and Hadsell, R. and Balcan, M. F. and Lin, H.}, pages = {12472--12485}, publisher = {Curran Associates, Inc.}, title = {Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients}, url = {https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b682e9347822c2e457ac-Paper.pdf}, volume = {33}, year = {2020}, award = {Spotlight Presentation}, shortname = {NeurIPS '20}, pdf = {https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b682e9347822c2e457ac-Paper.pdf}, hackernews = {https://news.ycombinator.com/item?id=26012289}, reddit = {https://www.reddit.com/r/cpp/comments/j7fb4a/enzyme_highperformance_automatic_differentiation}, papertype = {conference}, tex = {https://github.com/wsmoses/Paper-EnzymeNeurips20}, overleaf = {https://www.overleaf.com/project/5ec2b7bc4e59f40001a9be13} }
@inproceedings{huang2019autophase, title = {Autophase: Compiler phase-ordering for {HLS} with deep reinforcement learning}, author = {Huang, Qijing and Haj-Ali, Ameer and Moses, William and Xiang, John and Stoica, Ion and Asanovic, Krste and Wawrzynek, John}, booktitle = {2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)}, pages = {308--308}, year = {2019}, organization = {IEEE}, shortname = {FCCM '19}, papertype = {workshop}, tex = {https://github.com/wsmoses/Paper-AutophaseFCCM}, overleaf = {https://www.overleaf.com/project/5c1325e30800c53868650613}, pdf = {https://ieeexplore.ieee.org/abstract/document/8735549} }
Deterministic software transactional memory (STM) is a useful programming model for writing parallel codes, as it improves programmability (by supporting transactions) and debuggability (by supporting determinism). This paper presents LiTM, a new deterministic STM system that achieves both simplicity and efficiency at the same time. LiTM implements the deterministic reservations framework of Blelloch et al., but without requiring the programmer to understand the internals of the algorithm. Instead, the programmer writes the program in a transactional fashion and LiTM manages all data conflicts and automatically achieves deterministic parallelism. Our experiments on six benchmarks show that LiTM outperforms the state-of-the-art framework Galois by up to 5.8x on a 40-core machine
@inproceedings{xia2019litm, title = { {LiTM}: A Lightweight Deterministic Software Transactional Memory System}, author = {Xia, Yu and Yu, Xiangyao and Moses, William and Shun, Julian and Devadas, Srinivas}, booktitle = {Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores}, shortname = {PPoPP PMAMM '19}, pages = {1--10}, year = {2019}, organization = {ACM}, pdf = {https://c.wsmoses.com/papers/litm.pdf}, papertype = {workshop} }
Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (1) a domain-specific language with a tensor notation close to the mathematics of deep learning; (2) a Just-InTime optimizing compiler based on the polyhedral framework; (3) carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels; (4) the transparent integration of our flow into PyTorch and Caffe2, providing the fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. The performance is comparable to, and often exceeds the performance of, highly tuned libraries.
@article{Vasilache:2019:NAL:3366460.3355606, author = {Vasilache, Nicolas and Zinenko, Oleksandr and Theodoridis, Theodoros and Goyal, Priya and Devito, Zachary and Moses, William S. and Verdoolaege, Sven and Adams, Andrew and Cohen, Albert}, booktitle = {Architecture and Code Optimization (TACO)}, shortname = {TACO}, title = {The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated {GPU} Kernels, Automatically}, journal = {ACM Trans. Archit. Code Optim.}, issue_date = {October 2019}, volume = {16}, number = {4}, month = oct, year = {2019}, issn = {1544-3566}, pages = {38:1--38:26}, articleno = {38}, numpages = {26}, url = {http://doi.acm.org/10.1145/3355606}, doi = {10.1145/3355606}, acmid = {3355606}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {Deep learning layers, GPU acceleration, polyhedral compilation}, pdf = {https://c.wsmoses.com/papers/tc-taco.pdf}, papertype = {journal}, tex = {https://github.com/wsmoses/Paper-TCTACO} }
Tapir (pronounced TAY-per) is a compiler intermediate representation (IR) that embeds recursive fork-join parallelism, as supported by task-parallel programming platforms such as Cilk and OpenMP, into a mainstream compiler’s IR. Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations on and across parallel control constructs. Remedying this situation has generally been thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir leverages the “serial-projection property,” which is commonly satisfied by task-parallel programs, to handle the semantics of these programs without an extensive rework of the compiler.For recursive fork-join programs that satisfy the serial-projection property, Tapir enables effective compiler optimization of parallel programs with only minor changes to existing compiler analyses and code transformations. Tapir uses the serial-projection property to order logically parallel fine-grained tasks in the program’s control-flow graph. This ordered representation of parallel tasks allows the compiler to optimize parallel codes effectively with only minor modifications. For example, to implement Tapir/LLVM, a prototype of Tapir in the LLVM compiler, we added or modified less than 3,000 lines of LLVM’s half-million-line core middle-end functionality.These changes sufficed to enable LLVM’s existing compiler optimizations for serial code—including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination—to work with parallel control constructs such as parallel loops and Cilk’s Cilk_Spawn keyword. Tapir also supports parallel optimizations, such as loop scheduling, which restructure the parallel control flow of the program. By making use of existing LLVM optimizations and new parallel optimizations, Tapir/LLVM can optimize recursive fork-join programs more effectively than traditional compilation methods. On a suite of 35 Cilk application benchmarks, Tapir/LLVM produces more efficient executables for 30 benchmarks, with faster 18-core running times for 26 of them, compared to a nearly identical compiler that compiles parallel linguistic constructs the traditional way.
@article{tapirTOPC, author = {Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.}, title = {Tapir: Embedding Recursive Fork-Join Parallelism into {LLVM}'s Intermediate Representation}, year = {2019}, pdf = {https://dl.acm.org/doi/10.1145/3365655}, shortname = {TOPC}, issue_date = {December 2019}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {6}, number = {4}, issn = {2329-4949}, url = {https://doi.org/10.1145/3365655}, doi = {10.1145/3365655}, journal = {ACM Trans. Parallel Comput.}, month = dec, articleno = {19}, numpages = {33}, keywords = {parallel computing, control-flow graph, OpenMP, fork-join parallelism, optimization, compiling, LLVM, multicore, serial-projection property, Tapir, Cilk}, papertype = {journal} }
An algorithmic decision-maker incentivizes people to act in certain ways to receive better decisions. These incentives can dramatically influence subjects’ behaviors and lives, and it is important that both decision-makers and decision-recipients have clarity on which actions are incentivized by the chosen model. While for linear functions, the changes a subject is incentivized to make may be clear, we prove that for many non-linear functions (e.g. neural networks, random forests), classical methods for interpreting the behavior of models (e.g. input gradients) provide poor advice to individuals on which actions they should take. In this work, we propose a mathematical framework for understanding algorithmic incentives as the challenge of solving a Markov Decision Process, where the state includes the set of input features, and the reward is a function of the model’s output. We can then leverage the many toolkits for solving MDPs (e.g. tree-based planning, reinforcement learning) to identify the optimal actions each individual is incentivized to take to improve their decision under a given model. We demonstrate the utility of our method by estimating the maximally-incentivized actions in two real-world settings: a recidivism risk predictor we train using ProPublica’s COMPAS dataset, and an online credit scoring tool published by the Fair Isaac Corporation (FICO)
@inproceedings{shavit2019extracting, title = {Extracting Incentives from Black-Box Decisions}, shortname = {NeurIPS AI in FS}, author = {Shavit, Yonadav and Moses, William S.}, year = {2019}, booktitle = {2019 NeurIPS Workshop on AI in Financial Services}, pdf = {https://arxiv.org/pdf/1910.05664.pdf}, papertype = {workshop}, tex = {https://github.com/wsmoses/Paper-IncentiveNeuripsFinancial}, overleaf = {https://www.overleaf.com/project/5d76a1cf13500f0001f9999d} }
Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often does not offer optimal performance for a user’s particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data. Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner. In particular, we demonstrate the suitability of the polyhedral framework to construct a domain-specific optimizer effective on state-of-the-art deep learning models on GPUs. Our flow reaches up to 4× speedup over NVIDIA libraries on kernels relevant to the Machine Learning Community, and on an actual model used in production at Facebook. It is integrated with mainstream frameworks Caffe2 (production-oriented), PyTorch (research-oriented), through the ATen asynchronous tensor library.
@article{tc, title = {Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions}, booktitle = {arXiv preprint}, shortname = {arXiv}, author = {Vasilache, Nicolas and Zinenko, Oleksandr and Theodoridis, Theodoros and Goyal, Priya and DeVito, Zachary and Moses, William S and Verdoolaege, Sven and Adams, Andrew and Cohen, Albert}, journal = {arXiv preprint arXiv:1802.04730}, year = {2018}, reddit = {http://www.reddit.com/r/MachineLearning/comments/7xjqq9/r_announcing_tensor_comprehensions/}, hackernews = {http://news.ycombinator.com/item?id=16377389}, pdf = {https://arxiv.org/pdf/1802.04730.pdf}, papertype = {preprint} }
Optimizing compilers for task-level parallelism are still in their infancy. This work explores a compiler front end that translates OpenMP tasking semantics to Tapir, an extension to LLVM IR that represents fork-join parallelism. This enables analyses and optimizations that were previously inaccessible to OpenMP codes, as well as the ability to target additional runtimes at code generation. Using a Cilk runtime back end, we compare results to existing OpenMP implementations. Initial performance results for the Barcelona OpenMP task suite show performance improvements over existing implementations.
@inproceedings{openmpir, author = {Stelle, George and Moses, William S. and Olivier, Stephen L. and McCormick, Patrick}, title = { {OpenMPIR}: Implementing OpenMP Tasks with Tapir}, booktitle = {Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC}, shortname = {LLVM-HPC'17}, year = {2017}, isbn = {978-1-4503-5565-0}, location = {Denver, CO, USA}, pages = {3:1--3:12}, articleno = {3}, numpages = {12}, url = {http://doi.acm.org/10.1145/3148173.3148186}, doi = {10.1145/3148173.3148186}, acmid = {3148186}, publisher = {ACM}, address = {New York, NY, USA}, pdf = {https://c.wsmoses.com/papers/openmpir.pdf}, papertype = {workshop}, tex = {https://github.com/lanl/openmpir-llvm2017} }
This paper explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler’s intermediate representation (IR). Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations across parallel control constructs. Remedying this situation is generally thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir is a compiler IR that represents logically parallel tasks asymmetrically in the program’s control flow graph. Tapir allows the compiler to optimize across parallel control constructs with only minor changes to its existing analyses and code transformations. To prototype Tapir in the LLVM compiler, for example, we added or modified about 6000 lines of LLVM’s 4-million-line codebase. Tapir enables LLVM’s existing compiler optimizations for serial code – including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination – to work with parallel control constructs such as spawning and parallel loops. Tapir also supports parallel optimizations such as loop scheduling.
@inproceedings{tapir, author = {Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.}, title = {Tapir: Embedding Fork-Join Parallelism into {LLVM}'s Intermediate Representation}, booktitle = {Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, shortname = {PPoPP '17}, month = jan, year = {2017}, isbn = {978-1-4503-4493-7}, location = {Austin, Texas, USA}, pages = {249--265}, numpages = {17}, pdf = {https://c.wsmoses.com/papers/tapir.pdf}, doi = {10.1145/3018743.3018758}, acmid = {3018758}, publisher = {ACM}, address = {New York, NY, USA}, award = {Best Paper Award}, reddit = {http://www.reddit.com/r/programming/comments/5ra59l/mit_says_their_modified_llvm_compiler_optimizes/}, mitnews = {http://news.mit.edu/2017/optimizing-code-compiler-parallel-programs-0130}, hackernews = {http://news.ycombinator.com/item?id=13568585}, blog = {/tapir}, papertype = {conference} }
This thesis explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler’s intermediate representation (IR). Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations across parallel control constructs. Remedying this situation is generally thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir is a compiler IR that represents logically parallel tasks asymmetrically in the program’s control flow graph. Tapir allows the compiler to optimize across parallel control constructs with only minor changes to its existing analyses and code transformations. To prototype Tapir in the LLVM compiler, for example, the Tapir team added or modi- fied about 6000 lines of LLVM’s 4-million-line codebase. Tapir enables LLVM’s existing compiler optimizations for serial code — including loop-invariant-code motion, commonsubexpression elimination, and tail-recursion elimination — to work with parallel control constructs such as spawning and parallel loops. Tapir also supports parallel optimizations such as loop scheduling.
@mastersthesis{wmoses-meng, title = {How {S}hould {C}ompilers {R}epresent {F}ork-{J}oin {P}arallelism?}, author = {Moses, William S.}, booktitle = {Master's Thesis}, shortname = {Thesis '17}, school = {Massachusetts Institute of Technology}, year = {2017}, month = may, pdf = {https://c.wsmoses.com/papers/wmoses-meng.pdf}, blog = {/tapir}, papertype = {thesis}, overleaf = {https://www.overleaf.com/project/58fceac4b55260d42f75e8c3}, tex = {https://github.com/wsmoses/Paper-MEng} }
@inproceedings{tapirCPC, title = {Embedding Fork-Join Parallelism into LLVM IR}, author = {Moses, William S. and Schardl, Tao B. and Leiserson, Charles E.}, booktitle = {19th Workshop on Compilers for Parallel Computing}, year = {2016}, shortname = {CQC '16}, pdf = {https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf}, papertype = {workshop} }
@inproceedings{spacex15, title = {Extreme Multi-Resolution Visualization: A Challenge on Many Levels}, author = {Balme, Joanna and Brown-Dymkoski, Eric and Guerrero, Victor and Jones, Stephen and Kessler, Andre and Lichtl, Adam and Lung, Kevin and Moses, William and Museth, Ken and Roberson, Nathan and others}, booktitle = {SuperComputing Visualization Contest 2015}, shortname = {SCVis '15}, year = {2015}, pdf = {https://c.wsmoses.com/papers/spacex15.pdf}, papertype = {workshop} }
Music has long been an interesting subject of analysis for mathematicians and has led to many interesting questions in music theory and other fields. For the most part, computer scientists have looked into applying artificial intelligence to music and finding algorithms and data structures to solve various problems in music. Prior work on these algorithms often involves computing various properties of music such as the edit distance between two songs or the optimal fingering. These problems tend to be solvable in polynomial time using dynamic programming and have various application such as the music identification service Shazam or operations on RISM, an online music database. This paper takes an additional step in this direction, asking what sorts of problems in music cannot be efficiently computed. Specifically, this paper asks how various constraints affect the computational complexity of arranging music originally written for one set of instruments down to a single instrument. The paper then applies these results to other domains including musical choreography (such as ice skating and ballet) as well as creating levels for rhythm games (such as Rock Band). We prove that all of the problems are NP-complete, meaning that there is no efficient algorithm to solve them (assuming the standard conjecture that P != NP).
@incollection{moves15, author = {Demaine, Erik D. and Moses, William S.}, title = {Computational Complexity of Arranging Music}, booktitle = {Revised Papers from MOVES 2015: Mathematics of Various Entertaining Subjects}, shortname = {MOVES '15}, publisher = {Princeton University Press}, year = {2015}, pdf = {https://c.wsmoses.com/papers/moves15.pdf}, papertype = {book}, overleaf = {https://www.overleaf.com/project/54f60a71728bff850ed36c73}, tex = {https://github.com/wsmoses/Paper-MOVES15} }
Adaptive frequency hopping is one way to maximize the utilization of the wireless spectrum. Yet, when the environment itself is changing, the frequency at which the radio senses can become increasingly less optimal. By having the radio create a model of the environment based off of the sensing data, it is possible to achieve high data rates when the spectrum is not being heavily utilized and maintain a low level of interference at times when it is. The radio was modeled both mathematically and run in simulations. The outcomes of these tests were compared with existing standards such as Bluetooth (random frequency hopping) and IEEE 802.22 (fixed sensing rate). In order to evaluate data rate and interference simultaneously, a metric was created that combined them by taking the product of data rate and ( 1 - interference ). Overall, the online adaptive frequency hopper had a 35% increase in the combined metric over the random frequency hopper and 25% increase over the fixed sensing rate radio.
@misc{oafh, title = {Online Adaptive Frequency Hopping}, author = {Moses, William and Robertson, Andrew and Dell, John}, booktitle = {TJHSST Teknos 2014}, shortname = {TJHSST '14}, year = {2014}, pdf = {https://c.wsmoses.com/papers/oafh.pdf}, papertype = {report} }
6.828: Operating Systems
A graduate-level class on the design and implementation of operating systems. Topics included: locking, virtual memory, file systems, threads, context switches, kernel interrupts, system calls, IPC, and MMIO devices. Over the semester, implemented a functioning multicore operating system called jos, designed to be a simpler version of xv6.
|
6.867: Machine Learning
A graduate-level class in machine learning. Topics included support vector machines, neural networks (including convolutional and deep networks), decision tree, boosting algorithms, Bayesian methods, and Markov models.
|
6.UAR: Seminar in Undergraduate Research
The first semester in a year-long class in methods for advanced research.
|
21M.359: Interactive Music Systems
A combination humanties and computer science subject focusing on developping interactive systems involving audio. Labs included making games using input devices such as the Microsoft Kinect, audio analysis, and developing a simple guitar hero game.
|
11.011: The Art and Science of Negotiation
An advanced class on negotiation skills with an emphasis on methods in mutual gains.
|
6.115: Microcomputer Project Laboratory
An advanced laboratory subject on analysis and design of embedded systems from capacitors to assembly and MMIO. Emphasizes construction of complete systems, including labs involving a five-axis robot arm, a fluorescent lamp ballast, a tomographic imaging station (e.g., a CAT scan), and a simple calculator.
|
CMS.335: Short Documentary
A humanities subject exploring the world of short documentaries.
|
6.THM: Thesis
A class denoting work on my master's thesis on Tapir and related advances in performance engineering parallel programs
|
6.172: Performance Engineering
An advanced undergraduate course in performance engineering covering both the theory and practice behind technologies in parallelism, cache-behavior, among others.
|
6.002: Circuits
Introductory class in circuit design.
|
8.370: Quantum Computation
A graduate-level class in quantum computation and quantum algorithms.
|
8.07: Advanced Electrodynamics
An advanced undergraduate class in electrodynamics focusing on moving systems and the behavior of materials.
|
24.241: Logic I
An undergraduate class on formal logic.
|
6.854: Advanced Algorithms
A graduate-level course in algorithms. Topics include universal/consistent hashing, dimensionality reduction, multi-commodity flow and other generalizations of max-flow, linear programming and duality, gradient descent, interior points, multiplicative weights, semidefinited programming, and compressed sensing.
|
6.035: Computer Language Engineering
An advanced class on computer languages and compilers.
|
8.962: General Relativity
A graduate-level class on general relativity
|
8.06: Quantum Mechanics III
The most advanced undergraduate course in quantum mechanics
|
4.352: Advanced Video Making
An advanced course in video production.
|
6.886: Performance Engineering for Multicore Systems
A seminar class in performance engineering with a focus on performing research in the field
|
6.046: Design and Analysis Algorithms
The most advanced undergraduate course in algorithms spanning subjects from Van Emde Boas trees to network flow.
|
8.S05: Quantum Mechanics II
An intermediate course in Quantum Mechanics, using linear algebra as a basis for describing subjects
from the quantum harmonic oscillator, to squeeze states, to the fine structure of Hydrogen.
|
8.044: Statistical Mechanics
An class in statistical mechanics using quantum mechanics to build up
more macroscopic effects.
|
21W.789: Communicating With Mobile Technology
A class on how to design, build, and market new computer science start-ups made
for mobile platforms.
|
18.075: Advanced Calculus for Engineers (Complex Analysis)
Complex analysis and differential equations taught from the perspective
of engineering applications.
|
8.223: Classical Mechanics II
An advanced course in classical mechanics covering approximations, Lagrangian mechanics, Hamiltonian Mechanics, among other topics.
|
18.031: System Functions and the Laplace Transform
A supplementary class in differential equations covering system functions and the Laplace transform as methods of solving differential equations and modeling systems.
|
6.179: Introduction to C/C++ (Instructor)
An introductory class to C/C++ that I taught with William Qian with over 200 students. Geared towards those already with a programming background, this provided a fast-paced introduction to C/C++ covering more advacned topics such as template metaprogramming. There were weekly algorithmic PSETS (USACO style) and a capstone final project.
|
6.890: Algorithmic Lower Bounds, Fun With Hardness Proofs
A graduate-level computer science class on hardness (NP,PSPACE,EXPSPACE,etc). Covered material ranging from 3SAT to APX and constraint logic. Produced final paper with hopes of publishing in the near future. Final Paper Taught by Erik Demaine.
|
8.04: Quantum Mechanics I
An introductory course in quantum mechanics covering the basic experiments such as the double slit experiment through interference, quantum tunneling, operators, and the hydrogen atom.
|
5.111: Chemistry
An introductory course in chemistry covering topics such as equilibrium, acid-base, orbital shapes, crystal field theory, d-orbitals, thermodynamics, and kinetics.
|
7.012: Biology
An introductory course in biology focusing on genetics and neurobiology. Taught by Eric Lander.
|
21W.035: Science Writing for the Public
A writing class focusing on making science accessible to the public. Projects included: a news article and a researched science article.
|
12.000: Solving Complex Problems (Terrascope)
A seminar class focusing on solving the world's energy problem in the next 100 years. Redulted in heavy research in climate change, statistical modeling and predictions of the future, a
website
with articles on current and future energy issues and solutions and an hour-long presentation followed by a two-hour question and answer session with experts and executives from oil companies.
More information on the class including a recording of the presenation can be found here.
|
Advanced Mathematical Techniques for Scientists and Engineers
Post-mutivariable calculus math course focusing on mathematical methods with important applications to physics and engineering such as Fourier series, linear algebra, introductory complex analysis, the Gamma function, Bessel functions, Legendre polynomials, and the Riemann Zeta function. Includes a brief introduction to quantum mechanics and statistical mechanics.
|
Quantum Physics and Electrodynamics
An introductory course in quantum mechanics covering Stern-Gerlach machines, entanglement, tunneling as well as a more rigorous treatment of E&M through the use of fourier transforms and Greens functions as well as an introduction to relativity.
|
Quantum and Modern Physics Research Lab Mentorship
A capstone class focused on research in physics. Researched wireless communication at the Naval Research Laboratory. Research was named a semifinalist in
Intel's Science and Talent Search and patent.
|
Computational Physics
Introduction to computational techniques used to study and analyze physical phenomena.
|
Advanced Communication Data Streams
A laboratory-based class focusing on how data is transmitted. Covered signal analysis topics such as convolutions, filters, and fourier transforms.
|
Artificial Intelligence
Class focusing on artificial intelligence, machine learning, and applications thereof. Subjects included genetic algorithms, neural networks, and image processing.
|
Parallel Computing
Introduction to parallel algorithms, and techniques. Covers CUDA, MPI, OpenMP, and OpenGL. Labs included n-body simulation, ray tracer, parallel sorts, and huffman encoding.
|