William S. Moses
Assistant Professor, University of Illinois Urbana-Champaign (UIUC)

I'm an assistant professor at UIUC in the Computer Science and Electrical and Computer Engineering departments, and a researcher at Google DeepMind/Cloud. Previously, I was a J. Tinsley Oden Faculty Fellow at the University of Texas at Austin. I received my PhD, MEng, and SB from MIT in computer science and physics. Before that, I attended Thomas Jefferson High School for Science and Technology (TJHSST) in Northern Virginia. My group has multiple PhD positions available for next fall. If you are interested in compilers or high-performance computing, including applications to machine learning, climate science, databases, security, or biology, please reach out and apply to the Computer Science department at UIUC (I can also advise students from other departments). If you are an undergraduate interested in working with my group, please email me with a coding sample, a writing sample, and a list of topics you are interested in.

	[email protected]
	703-638-2387
	UIUC Siebel Center for Computer Science 201 N Goodwin Ave Room 4128, Urbana, IL 61801-2302

Papers

Thinking Fast and Correct: Automated Rewriting of Numerical Code through Compiler Augmentation, Distinguished Paper Award Qian, Siyuan Brant and Sathia, Vimarsh and Ivanov, Ivan R and Hückelheim, Jan and Hovland, Paul and Moses, William S. CGO’26.

@inproceedings{poseidon,
  title = {Thinking Fast and Correct: Automated Rewriting of Numerical Code through Compiler Augmentation},
  author = {Qian, Siyuan Brant and Sathia, Vimarsh and Ivanov, Ivan R and H{\"u}ckelheim, Jan and Hovland, Paul and Moses, William S},
  year = {2026},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  location = {Sydney, NSW, Australia},
  series = {CGO '26},
  shortname = {CGO'26},
  papertype = {conference},
  award = {Distinguished Paper Award},
  pdf = {https://c.wsmoses.com/papers/poseidon.pdf}
}

Differentiable lagrangian shock hydrodynamics with application to stable shock acceleration of density interfaces Korner, Kevin and Talamini, Brandon and Andrej, Julian and Tupek, Michael and Moses, William and Tortorelli, Daniel and Rieben, Robert and Kolev, Tzanio and Bramwell, Jamie and White, Daniel and Belof, Jonathan and Schill, William. CMAME.

We develop a gradient based optimization approach for the equations of compressible, Lagrangian hydrodynamics and demonstrate how it can be employed to automatically uncover strategies to control hydrodynamic instabilities arising from shock acceleration of density interfaces. Strategies for controlling the Richtmyer-Meshkov instability (RMI) are of great benefit for inertial confinement fusion (ICF) where shock interactions with many small imperfections in the density interface lead to instabilities which rapidly grow over time. These instabilities lead to mixing which, in the case of laser driven ICF, quenches the runaway fusion process ruining the potential for positive energy return. We demonstrate that control of these instabilities can be achieved by optimization of initial conditions with more than 100 parameters. Optimizing over a large parameter space like this is not possible with gradient-free optimization strategies. This requires computation of the gradient of the outputs of a numerical solution to the equations of Lagrangian hydrodynamics with respect to the inputs. We show that the efficient computation of these gradients is made possible via a judicious application of (i) adjoint methods, the exact formal representation of sensitivities involving partial differential equations, and (ii) automatic differentiation (AD), the algorithmic calculation of derivatives of functions. Careful regularization of multiple operators including artificial viscosity and timestep control is required. We perform design optimization of a >100-parameter energy field driving the Richtmyer-Meshkov instability, showing significant suppression while simultaneously enhancing the acceleration of the interface relative to a nominal baseline case.

@article{KORNER2026118663,
  title = {Differentiable lagrangian shock hydrodynamics with application to stable shock acceleration of density interfaces},
  journal = {Computer Methods in Applied Mechanics and Engineering},
  volume = {451},
  pages = {118663},
  year = {2026},
  issn = {0045-7825},
  doi = {https://doi.org/10.1016/j.cma.2025.118663},
  url = {https://www.sciencedirect.com/science/article/pii/S0045782525009351},
  author = {Korner, Kevin and Talamini, Brandon and Andrej, Julian and Tupek, Michael and Moses, William and Tortorelli, Daniel and Rieben, Robert and Kolev, Tzanio and Bramwell, Jamie and White, Daniel and Belof, Jonathan and Schill, William},
  keywords = {Topology optimization, Hydrophysics, Richtmyer-Meshkov instability, Shock shaping, Interfaces},
  shortname = {CMAME},
  papertype = {journal},
  pdf = {https://c.wsmoses.com/papers/diff_lagrangian.pdf}
}

Hierarchical Interferometric Bayesian Imaging Tiede, Paul and Moses, William and Churavy, Valentin and Johnson, Michael D. and Pesce, Dominic W. and Blackburn, Lindy and Galison, Peter. Astrophysical Journal.

Very long baseline interferometry (VLBI) achieves the highest angular resolution in astronomy. VLBI measures corrupted Fourier components, known as visibilities. Reconstructing on-sky images from these visibilities is a challenging inverse problem, particularly for sparse arrays such as the Event Horizon Telescope (EHT) and the Very Long Baseline Array, where incomplete sampling and severe calibration errors introduce significant uncertainty in the image. To help guide convergence and control the uncertainty in image reconstructions, regularization on the space of images is utilized, such as enforcing smoothness or similarity to a fiducial image. Coupled with this regularization is the introduction of a new set of parameters that modulate its strength. We present a hierarchical Bayesian imaging approach (hierarchical interferometric Bayesian Imaging, HIBI) that enables the quantification of uncertainty for all parameters. Incorporating instrumental effects within HIBI is straightforward, allowing for simultaneous imaging and calibration of data. To showcase HIBI’s effectiveness and flexibility, we build a simple imaging model based on Markov random fields and demonstrate how different physical components can be included, e.g., black hole shadow size, and their uncertainties can be inferred. For example, while the original EHT publications were unable to constrain the ring width of M87, HIBI measures a width of 9.3 ± 1.3 μas. We apply HIBI to image and calibrate EHT synthetic data, real EHT observations of M87, and multifrequency observations of OJ 287. Across these tests, HIBI accurately recovers a wide variety of image structures and quantifies their uncertainties. HIBI is publicly available in the Comrade VLBI software repository.

@article{comrade,
  doi = {10.3847/1538-4357/ae2749},
  url = {https://doi.org/10.3847/1538-4357/ae2749},
  year = {2026},
  month = jan,
  publisher = {The American Astronomical Society},
  volume = {997},
  number = {2},
  pages = {262},
  author = {Tiede, Paul and Moses, William and Churavy, Valentin and Johnson, Michael D. and Pesce, Dominic W. and Blackburn, Lindy and Galison, Peter},
  title = {Hierarchical Interferometric Bayesian Imaging},
  journal = {The Astrophysical Journal},
  shortname = {Astrophysical Journal},
  papertype = {journal},
  pdf = {https://c.wsmoses.com/papers/comrade.pdf}
}

Mind the Abstraction Gap: Bringing Equality Saturation to Real-World ML Compilers Vohra, Arya and Lee, Leo Seojun and Bachurski, Jakub and Zinenko, Oleksandr and Phothilimthana, Phitchaya Mangpo and Cohen, Albert and Moses, William S.. OOPSLA ’25.

Machine learning (ML) compilers rely on graph-level transformations to enhance the runtime performance of ML models. However, performing local transformations on individual operations can create effects far beyond the location of the rewrite. In particular, a local rewrite can change the profitability or legality of hard-to-predict downstream transformations, particularly regarding data layout, parallelization, fine-grained scheduling, and memory management. As a result, program transformations are often driven by manually-tuned compiler heuristics, which are quickly rendered obsolete by new hardware and model architectures. Instead of hand-written local heuristics, we propose the use of equality saturation. We replace such heuristics with a more robust global performance model, which accounts for downstream transformations. Equality saturation addresses the challenge of local optimizations inadvertently constraining or negating the benefits of subsequent transformations, thereby providing a solution that is inherently adaptable to newer workloads. While this approach still requires a global performance model to evaluate the profitability of transformations, it holds significant promise for increased automation and adaptability. This paper addresses challenges in applying equality saturation on real-world ML compute graphs and state-of-the-art hardware. By doing so, we present an improved method for discovering effective compositions of graph optimizations. We study different cost modeling approaches to deal with fusion and layout optimization, and tackle scalability issues that arise from considering a very wide range of algebraic optimizations. We design an equality saturation pass for the XLA compiler, with an implementation in C++ and Rust. We demonstrate an average speedup of 3.45% over XLA’s optimization flow across our benchmark suite on various CPU and GPU platforms, with a maximum speedup of 56.26% for NasRNN on CPU.

@article{constable,
  author = {Vohra, Arya and Lee, Leo Seojun and Bachurski, Jakub and Zinenko, Oleksandr and Phothilimthana, Phitchaya Mangpo and Cohen, Albert and Moses, William S.},
  title = {Mind the Abstraction Gap: Bringing Equality Saturation to Real-World ML Compilers},
  year = {2025},
  issue_date = {October 2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {9},
  number = {OOPSLA2},
  url = {https://doi.org/10.1145/3763062},
  pdf = {https://dl.acm.org/doi/pdf/10.1145/3763062},
  doi = {10.1145/3763062},
  journal = {Proc. ACM Program. Lang.},
  month = oct,
  articleno = {284},
  numpages = {28},
  keywords = {XLA, e-graphs, equality saturation, optimization},
  shortname = {OOPSLA '25},
  papertype = {conference}
}

Sound and Modular Activity Analysis for Automatic Differentiation in MLIR Peng, Mai Jacob and Moses, William S. and Zinenko, Oleksandr and Dubach, Christophe. OOPSLA ’25.

Computing derivatives is paramount for multiple domains ranging from training neural networks to precise climate simulations. While derivatives can be generated by Automatic Differentiation (AD) tools, they often require aggressive optimization to avoid compromising program performance. One of the central optimizations consists of identifying inactive operations that do not contribute to the partial derivatives of interest. Multiple tools provide activity analyses for a variety of input languages, though often with only informal correctness guarantees. This paper formally defines activity analysis for AD as an abstract interpretation, proves its soundness, and implements it within the MLIR compiler infrastructure. To account for MLIR’s genericity, a subset of MLIR’s internal representation amenable to AD is formalized for the first time. Furthermore, the paper proposes a sound intraprocedural approximation of the whole-program activity analysis via function summaries along with a mechanism to automatically derive these summaries from function definitions. The implementation is evaluated on a differentiation-specific benchmark suite. It achieves a 1.24 geometric mean speedup on CPU and a 1.7 geometric mean speedup on GPU in the runtime of generated programs, when compared to a baseline that does not use activity analysis. The evaluation also demonstrates that the intraprocedural analysis with function summaries proves inactive 100% of instructions proven inactive by the whole-program analysis.
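As a rough illustration of what "identifying inactive operations" means, here is a minimal Python sketch of the classic varied/useful formulation of activity analysis on straight-line code. It is not the paper's abstract interpretation and does not use MLIR; the program representation, the function name `active_values`, and the example operations are all hypothetical.

```python
def active_values(ops, inputs, outputs):
    """Return the values that are active for d(outputs)/d(inputs).

    ops: list of (result_name, operand_names) in program order
         (hypothetical representation of straight-line code).
    A value is "varied" if it depends on a differentiated input and
    "useful" if a differentiated output depends on it. Only values
    that are both can carry a nonzero derivative of interest.
    """
    # Forward pass: propagate dependence on the inputs.
    varied = set(inputs)
    for result, operands in ops:
        if any(o in varied for o in operands):
            varied.add(result)

    # Backward pass: propagate influence on the outputs.
    useful = set(outputs)
    for result, operands in reversed(ops):
        if result in useful:
            useful.update(operands)

    return varied & useful


# Hypothetical program: x is the differentiated input, c is a constant.
ops = [
    ("a", ["x", "c"]),  # a = x * c
    ("b", ["c", "c"]),  # b = c + c   (depends only on the constant)
    ("d", ["a", "b"]),  # d = a + b
    ("e", ["x", "x"]),  # e = x * x   (never reaches y)
    ("y", ["d"]),       # y = sin(d)
]
active = active_values(ops, inputs={"x"}, outputs={"y"})
inactive = {result for result, _ in ops} - active
```

In this toy program `b` and `e` are inactive, so an AD tool would not need to generate derivative code for them.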

@article{mlir_activity,
  author = {Peng, Mai Jacob and Moses, William S. and Zinenko, Oleksandr and Dubach, Christophe},
  title = {Sound and Modular Activity Analysis for Automatic Differentiation in MLIR},
  year = {2025},
  issue_date = {October 2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {9},
  number = {OOPSLA2},
  url = {https://doi.org/10.1145/3763125},
  pdf = {https://dl.acm.org/doi/pdf/10.1145/3763125},
  doi = {10.1145/3763125},
  journal = {Proc. ACM Program. Lang.},
  month = oct,
  articleno = {347},
  numpages = {28},
  keywords = {Activity analysis, Automatic differentiation, Data flow analysis, MLIR},
  shortname = {OOPSLA '25},
  papertype = {conference}
}

DJ4Earth: Differentiable, and Performance-portable Earth System Modeling via Program Transformations Moses, William S and Cheng, Gong and Churavy, Valentin and Gelbrecht, Maximilian and Klöwer, Milan and Kump, Joseph and Morlighem, Mathieu and Williamson, Sarah and Apte, Dhruv and Berg, Paul and others. Authorea.

@article{moses2025dj4earth,
  title = {DJ4Earth: Differentiable, and Performance-portable Earth System Modeling via Program Transformations},
  author = {Moses, William S and Cheng, Gong and Churavy, Valentin and Gelbrecht, Maximilian and Kl{\"o}wer, Milan and Kump, Joseph and Morlighem, Mathieu and Williamson, Sarah and Apte, Dhruv and Berg, Paul and others},
  journal = {Authorea Preprints},
  year = {2025},
  publisher = {Authorea},
  shortname = {Authorea},
  papertype = {preprint},
  url = {https://essopenarchive.org/doi/full/10.22541/essoar.176314951.18114616},
  pdf = {https://c.wsmoses.com/papers/dj4earth_preprint.pdf}
}

RAPTOR: Practical Numerical Profiling of Scientific Applications, Best Reproducibility Advancement Hoerold, Faveo and Ivanov, Ivan R and Dhruv, Akash and Moses, William S and Dubey, Anshu and Wahib, Mohamed and Domke, Jens. SC ’25.

@inproceedings{raptor,
  title = {RAPTOR: Practical Numerical Profiling of Scientific Applications},
  author = {Hoerold, Faveo and Ivanov, Ivan R and Dhruv, Akash and Moses, William S and Dubey, Anshu and Wahib, Mohamed and Domke, Jens},
  booktitle = { {SC} '25: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {ACM},
  address = {New York, NY, USA},
  year = {2025},
  location = {St. Louis, Missouri},
  conference = { {SC} '25: The International Conference for High Performance Computing, Networking, Storage and Analysis},
  shortname = {SC '25},
  papertype = {conference},
  pdf = {https://dl.acm.org/doi/pdf/10.1145/3712285.3759810},
  award = {Best Reproducibility Advancement}
}

The Quantum Tortoise and the Classical Hare: When Will Quantum Computers Outpace Classical Ones and When Will They Be Left Behind? Choi, Sukwoong and Moses, William S. and Thompson, Neil. IEEE.

@article{11045206,
  author = {Choi, Sukwoong and Moses, William S. and Thompson, Neil},
  journal = {Proceedings of the IEEE},
  title = {The Quantum Tortoise and the Classical Hare: When Will Quantum Computers Outpace Classical Ones and When Will They Be Left Behind?},
  year = {2025},
  volume = {113},
  number = {2},
  pages = {113-124},
  keywords = {Quantum computing;Qubit;Hardware;Quantum algorithm;Quantum advantage;Computer science;Approximation algorithms;Encryption;Costs;Performance analysis},
  doi = {10.1109/JPROC.2025.3574102},
  shortname = {IEEE},
  papertype = {journal},
  pdf = {https://ieeexplore.ieee.org/abstract/document/11045206}
}

Optimizing ML Training with Metagradient Descent Engstrom, Logan and Ilyas, Andrew and Chen, Benjamin and Feldmann, Axel and Moses, William and Madry, Aleksander. arXiv.

A major challenge in training large-scale machine learning models is configuring the training process to maximize model performance, i.e., finding the best training setup from a vast design space. In this work, we unlock a gradient-based approach to this problem. We first introduce an algorithm for efficiently calculating metagradients – gradients through model training – at scale. We then introduce a “smooth model training” framework that enables effective optimization using metagradients. With metagradient descent (MGD), we greatly improve on existing dataset selection methods, outperform accuracy-degrading data poisoning attacks by an order of magnitude, and automatically find competitive learning rate schedules.

@misc{engstrom2025optimizingmltrainingmetagradient,
  title = {Optimizing {ML} Training with Metagradient Descent},
  author = {Engstrom, Logan and Ilyas, Andrew and Chen, Benjamin and Feldmann, Axel and Moses, William and Madry, Aleksander},
  year = {2025},
  eprint = {2503.13751},
  archiveprefix = {arXiv},
  primaryclass = {stat.ML},
  url = {https://arxiv.org/abs/2503.13751},
  pdf = {https://arxiv.org/pdf/2503.13751},
  shortname = {arXiv},
  papertype = {preprint}
}

The MLIR Transform Dialect: Your Compiler Is More Powerful Than You Think Lücke, Martin Paul and Zinenko, Oleksandr and Moses, William S. and Steuwer, Michel and Cohen, Albert. CGO’25.

To take full advantage of a specific hardware target, performance engineers need to gain control on compilers in order to leverage their domain knowledge about the program and hardware. Yet, modern compilers are poorly controlled, usually by configuring a sequence of coarse-grained monolithic black-box passes, or by means of predefined compiler annotations/pragmas. These can be effective, but often do not let users precisely optimize their varying compute loads. As a consequence, performance engineers have to resort to implementing custom passes for a specific optimization heuristic, requiring compiler engineering expert knowledge. In this paper, we present a technique that provides fine-grained control of general-purpose compilers by introducing the Transform dialect, a controllable IR-based transformation system implemented in MLIR. The Transform dialect empowers performance engineers to optimize their various compute loads by composing and reusing existing—but currently hidden—compiler features without the need to implement new passes or even rebuilding the compiler. We demonstrate in five case studies that the Transform dialect enables precise, safe composition of compiler transformations and allows for straightforward integration with state-of-the-art search methods.

@inproceedings{transformcgo,
  author = {L\"{u}cke, Martin Paul and Zinenko, Oleksandr and Moses, William S. and Steuwer, Michel and Cohen, Albert},
  title = {The MLIR Transform Dialect: Your Compiler Is More Powerful Than You Think},
  year = {2025},
  isbn = {9798400712753},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3696443.3708922},
  doi = {10.1145/3696443.3708922},
  booktitle = {Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization},
  pages = {241–254},
  numpages = {14},
  keywords = {Controllable Compiler, MLIR, Transform Dialect, Transform Scripts},
  location = {Las Vegas, NV, USA},
  series = {CGO '25},
  pdf = {https://dl.acm.org/doi/pdf/10.1145/3696443.3708922},
  shortname = {CGO'25},
  papertype = {conference}
}

A taxonomy of automatic differentiation pitfalls Hückelheim, Jan and Menon, Harshitha and Moses, William and Christianson, Bruce and Hovland, Paul and Hascoët, Laurent. WIREs.

Automatic differentiation is a popular technique for computing derivatives of computer programs. While automatic differentiation has been successfully used in countless engineering, science, and machine learning applications, it can sometimes nevertheless produce surprising results. In this paper, we categorize problematic usages of automatic differentiation, and illustrate each category with examples such as chaos, time-averages, discretizations, fixed-point loops, lookup tables, linear solvers, and probabilistic programs, in the hope that readers may more easily avoid or detect such pitfalls. We also review debugging techniques and their effectiveness in these situations.

@article{https://doi.org/10.1002/widm.1555,
  author = {Hückelheim, Jan and Menon, Harshitha and Moses, William and Christianson, Bruce and Hovland, Paul and Hascoët, Laurent},
  title = {A taxonomy of automatic differentiation pitfalls},
  journal = {WIREs Data Mining and Knowledge Discovery},
  volume = {14},
  number = {6},
  pages = {e1555},
  keywords = {autodiff, automatic differentiation, backpropagation},
  doi = {https://doi.org/10.1002/widm.1555},
  url = {https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1555},
  eprint = {https://wires.onlinelibrary.wiley.com/doi/pdf/10.1002/widm.1555},
  year = {2024},
  shortname = {WIREs},
  papertype = {journal},
  pdf = {https://wires.onlinelibrary.wiley.com/doi/pdf/10.1002/widm.1555}
}

Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training Ivanov, Ivan R. and Meyer, Joachim and Grossman, Aiden and Moses, William S. and Doerfert, Johannes. arXiv.

The size and complexity of software applications is increasing at an accelerating pace. Source code repositories (along with their dependencies) require vast amounts of labor to keep them tested, maintained, and up to date. As the discipline now begins to also incorporate automatically generated programs, automation in testing and tuning is required to keep up with the pace – let alone reduce the present level of complexity. While machine learning has been used to understand and generate code in various contexts, machine learning models themselves are trained almost exclusively on static code without inputs, traces, or other execution time information. This lack of training data limits the ability of these models to understand real-world problems in software.

In this work we show that inputs, like code, can be generated automatically at scale. Our generated inputs are stateful, and appear to faithfully reproduce the arbitrary data structures and system calls required to rerun a program function. By building our tool within the compiler, it both can be applied to arbitrary programming languages and architectures and can leverage static analysis and transformations for improved performance. Our approach is able to produce valid inputs, including initial memory states, for 90% of the ComPile dataset modules we explored, for a total of 21.4 million executable functions. Further, we find that a single generated input results in an average block coverage of 37%, whereas guided generation of five inputs improves it to 45%.

@misc{ivanov2024inputgenguidedgenerationstateful,
  title = {Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training},
  author = {Ivanov, Ivan R. and Meyer, Joachim and Grossman, Aiden and Moses, William S. and Doerfert, Johannes},
  year = {2024},
  eprint = {2406.08843},
  archiveprefix = {arXiv},
  primaryclass = {cs.SE},
  url = {https://arxiv.org/abs/2406.08843},
  pdf = {https://arxiv.org/pdf/2406.08843},
  shortname = {arXiv},
  papertype = {preprint}
}

Retargeting and Respecializing GPU Workloads for Performance Portability Ivanov, Ivan R. and Zinenko, Oleksandr and Domke, Jens and Endo, Toshio and Moses, William S.. CGO’24.

@inproceedings{10444828,
  author = {Ivanov, Ivan R. and Zinenko, Oleksandr and Domke, Jens and Endo, Toshio and Moses, William S.},
  booktitle = {2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)},
  title = {Retargeting and Respecializing GPU Workloads for Performance Portability},
  year = {2024},
  volume = {},
  issn = {},
  pages = {119-132},
  doi = {10.1109/CGO57630.2024.10444828},
  url = {https://doi.ieeecomputersociety.org/10.1109/CGO57630.2024.10444828},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
  month = mar,
  shortname = {CGO'24},
  pdf = {https://c.wsmoses.com/papers/polygeist24.pdf},
  papertype = {conference}
}

ComPile: A Large IR Dataset from Production Sources Grossman, Aiden and Paehler, Ludger and Parasyris, Konstantinos and Ben-Nun, Tal and Hegna, Jacob and Moses, William and Diaz, Jose M Monsalve and Trofin, Mircea and Doerfert, Johannes. arXiv.

Code is increasingly becoming a core data modality of modern machine learning research impacting not only the way we write code with conversational agents like OpenAI’s ChatGPT, Google’s Bard, or Anthropic’s Claude, the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Relying solely on the ability of modern models to extract information from unstructured code does not take advantage of 70 years of programming language and compiler development by not utilizing the structure inherent to programs in the data collection. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To work towards the first intermediate representation (IR) based models, we fully utilize the LLVM compiler infrastructure, shared by a number of languages, to generate a 182B token dataset of LLVM IR. We generated this dataset from programming languages built on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++, by hooking into LLVM code generation either through the language’s package manager or the compiler directly to extract the dataset of intermediate representations from production grade programs. Statistical analysis proves the utility of our dataset not only for large language model training, but also for the introspection into the code generation process itself with the dataset showing great promise for machine-learned compiler components.

@article{grossman2023compile,
  title = {ComPile: A Large IR Dataset from Production Sources},
  author = {Grossman, Aiden and Paehler, Ludger and Parasyris, Konstantinos and Ben-Nun, Tal and Hegna, Jacob and Moses, William and Diaz, Jose M Monsalve and Trofin, Mircea and Doerfert, Johannes},
  journal = {arXiv preprint arXiv:2309.15432},
  year = {2023},
  shortname = {arXiv},
  pdf = {https://arxiv.org/pdf/2309.15432.pdf}
}

The Quantum Tortoise and the Classical Hare: A simple framework for understanding which problems quantum computing will accelerate (and which it will not) Choi, Sukwoong and Moses, William S and Thompson, Neil. arXiv.

Quantum computing promises transformational gains for solving some problems, but little to none for others. For anyone hoping to use quantum computers now or in the future, it is important to know which problems will benefit. In this paper, we introduce a framework for answering this question both intuitively and quantitatively. The underlying structure of the framework is a race between quantum and classical computers, where their relative strengths determine when each wins. While classical computers operate faster, quantum computers can sometimes run more efficient algorithms. Whether the speed advantage or the algorithmic advantage dominates determines whether a problem will benefit from quantum computing or not. Our analysis reveals that many problems, particularly those of small to moderate size that can be important for typical businesses, will not benefit from quantum computing. Conversely, larger problems or those with particularly big algorithmic gains will benefit from near-term quantum computing. Since very large algorithmic gains are rare in practice and theorized to be rare even in principle, our analysis suggests that the benefits from quantum computing will flow either to users of these rare cases, or practitioners processing very large data.
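To make the "race" concrete, here is a small worked sketch in Python. It assumes a hypothetical problem where the classical algorithm takes n steps and the quantum algorithm takes sqrt(n) steps (a Grover-like quadratic speedup), and it assumes illustrative operation rates; none of these numbers come from the paper.

```python
import math


def classical_time(n, ops_per_sec):
    # Hypothetical classical algorithm: n steps.
    return n / ops_per_sec


def quantum_time(n, ops_per_sec):
    # Hypothetical quantum algorithm: sqrt(n) steps (quadratic speedup).
    return math.sqrt(n) / ops_per_sec


def crossover_size(classical_ops_per_sec, quantum_ops_per_sec):
    # Solve n / c = sqrt(n) / q for n, giving n* = (c / q)^2.
    return (classical_ops_per_sec / quantum_ops_per_sec) ** 2


# Assumed rates, for illustration only: the classical machine is 1000x faster per step.
CLASSICAL_OPS = 1e9
QUANTUM_OPS = 1e6

n_star = crossover_size(CLASSICAL_OPS, QUANTUM_OPS)

# Below the crossover the speed advantage wins; above it the algorithmic advantage wins.
small, large = 1e4, 1e8
classical_wins_small = classical_time(small, CLASSICAL_OPS) < quantum_time(small, QUANTUM_OPS)
quantum_wins_large = quantum_time(large, QUANTUM_OPS) < classical_time(large, CLASSICAL_OPS)
```

Under these assumptions the crossover sits at a problem size of about one million: smaller problems favor the faster classical machine, larger ones favor the better-scaling quantum algorithm. A smaller algorithmic gain or a larger speed gap pushes the crossover further out.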

@article{choi2023quantum,
  title = {The Quantum Tortoise and the Classical Hare: A simple framework for understanding which problems quantum computing will accelerate (and which it will not)},
  author = {Choi, Sukwoong and Moses, William S and Thompson, Neil},
  journal = {arXiv preprint arXiv:2310.15505},
  year = {2023},
  series = {arXiv},
  shortname = {arXiv},
  pdf = {https://arxiv.org/pdf/2310.15505.pdf},
  papertype = {preprint}
}

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs Moses, William S. and Ivanov, Ivan R. and Domke, Jens and Endo, Toshio and Doerfert, Johannes and Zinenko, Oleksandr. PPoPP ’23.

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA), into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer making use of transpiled CUDA PyTorch kernels outperforms the PyTorch CPU native backend by 2.7×.

@inproceedings{moses2022high,
  author = {Moses, William S. and Ivanov, Ivan R. and Domke, Jens and Endo, Toshio and Doerfert, Johannes and Zinenko, Oleksandr},
  title = {High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs},
  year = {2023},
  isbn = {9798400700156},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3572848.3577475},
  doi = {10.1145/3572848.3577475},
  booktitle = {Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming},
  pages = {119–134},
  numpages = {16},
  keywords = {CUDA, MLIR, barrier synchronization, polygeist},
  location = {Montreal, QC, Canada},
  series = {PPoPP '23},
  shortname = {PPoPP '23},
  papertype = {conference},
  pdf = {https://dl.acm.org/doi/pdf/10.1145/3572848.3577475}
}

Transparent Checkpointing for Automatic Differentiation of Program Loops Through Expression Transformations Schanen, Michel and Narayanan, Sri Hari Krishna and Williamson, Sarah and Churavy, Valentin and Moses, William S. and Paehler, Ludger. ICCS ’23.

Automatic differentiation (AutoDiff) in machine learning is largely restricted to expressions used for neural networks (NN), with the depth rarely exceeding a few tens of layers. Compared to NN, numerical simulations typically involve iterative algorithms like time steppers that lead to millions of iterations. Even for modest-sized models, this may yield infeasible memory requirements when applying the adjoint method, also called backpropagation, to time-dependent problems. In this situation, checkpointing algorithms provide a trade-off between recomputation and storage. This paper presents the package Checkpointing.jl that leverages expression transformations in the programming language Julia and the package ChainRules.jl to automatically and transparently transform loop iterations into differentiated loops. The user may choose between various checkpointing algorithm schemes and storage devices. We describe the unique design of Checkpointing.jl and demonstrate its features on an automatically differentiated MPI implementation of Burgers’ equation on the Polaris cluster at the Argonne Leadership Computing Facility.
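The trade-off between recomputation and storage can be sketched in a few lines of Python. This is a minimal illustration of uniform-stride checkpointing for a scalar time-stepping loop, not the Checkpointing.jl API or any of its schemes; the update rule `step`, its derivative `step_grad`, and the stride are assumptions chosen for the example.

```python
def step(x, dt=0.01):
    # Hypothetical time-stepper: one explicit update x <- x - dt * x^2.
    return x - dt * x * x


def step_grad(x, dt=0.01):
    # Derivative of step with respect to x.
    return 1.0 - 2.0 * dt * x


def grad_store_all(x0, n_steps):
    """d x_N / d x_0, storing every intermediate state."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(step(xs[-1]))
    adj = 1.0
    for k in reversed(range(n_steps)):
        adj *= step_grad(xs[k])
    return adj, len(xs)  # gradient, number of states held


def grad_checkpointed(x0, n_steps, stride):
    """Same gradient, keeping only every stride-th state and recomputing the rest."""
    checkpoints = {0: x0}
    x = x0
    for k in range(n_steps):
        x = step(x)
        if (k + 1) % stride == 0 and (k + 1) < n_steps:
            checkpoints[k + 1] = x

    adj = 1.0
    peak = len(checkpoints)
    # Reverse sweep: handle segments from last to first.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + stride, n_steps)
        if end == start:
            continue
        # Recompute the states x_start .. x_{end-1} from the checkpoint.
        segment = [checkpoints[start]]
        for _ in range(end - start - 1):
            segment.append(step(segment[-1]))
        peak = max(peak, len(checkpoints) + len(segment))
        for xk in reversed(segment):
            adj *= step_grad(xk)
    return adj, peak  # gradient, peak number of states held


g_full, stored_full = grad_store_all(0.5, 1000)
g_ckpt, stored_ckpt = grad_checkpointed(0.5, 1000, stride=32)
```

Both functions return the same derivative, but the checkpointed version holds far fewer states at once in exchange for roughly one extra forward pass of recomputation.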

@inproceedings{10.1007/978-3-031-36024-4_37,
  author = {Schanen, Michel and Narayanan, Sri Hari Krishna and Williamson, Sarah and Churavy, Valentin and Moses, William S. and Paehler, Ludger},
  editor = {Miky{\v{s}}ka, Ji{\v{r}}{\'i} and de Mulatier, Cl{\'e}lia and Paszynski, Maciej and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M.A.},
  title = {Transparent Checkpointing for Automatic Differentiation of Program Loops Through Expression Transformations},
  booktitle = {Computational Science -- ICCS 2023},
  year = {2023},
  publisher = {Springer Nature Switzerland},
  address = {Cham},
  pages = {483--497},
  isbn = {978-3-031-36024-4},
  shortname = {ICCS '23},
  pdf = {https://c.wsmoses.com/papers/checkpointing_jl.pdf},
  papertype = {conference}
}

Understanding Automatic Differentiation Pitfalls Hückelheim, Jan and Menon, Harshitha and Moses, William and Christianson, Bruce and Hovland, Paul and Hascoët, Laurent. arXiv.

Automatic differentiation, also known as backpropagation, AD, autodiff, or algorithmic differentiation, is a popular technique for computing derivatives of computer programs accurately and efficiently. Sometimes, however, the derivatives computed by AD could be interpreted as incorrect. These pitfalls occur systematically across tools and approaches. In this paper we broadly categorize problematic usages of AD and illustrate each category with examples such as chaos, time-averaged oscillations, discretizations, fixed-point loops, lookup tables, and linear solvers. We also review debugging techniques and their effectiveness in these situations. With this article we hope to help readers avoid unexpected behavior, detect problems more easily when they occur, and have more realistic expectations from AD tools.
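As a hypothetical illustration of the fixed-point-loop category named above (not an example taken from the paper), the Python sketch below differentiates Heron's square-root iteration with a minimal forward-mode `Dual` class. The class, the function names, and the tolerance are all assumptions for this sketch. Because the stopping test looks only at the primal value, the loop exits immediately when the initial guess is already exact, and AD then reports a derivative of zero.

```python
class Dual:
    """Minimal forward-mode AD value: val + dot * eps."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    @staticmethod
    def lift(v):
        return v if isinstance(v, Dual) else Dual(v)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Dual.lift(other)
        return Dual(self.val / other.val,
                    (self.dot * other.val - self.val * other.dot)
                    / (other.val * other.val))


def heron_sqrt(a, tol=1e-12):
    """Fixed-point iteration for sqrt(a); convergence is tested on the primal value only."""
    x = Dual(1.0)  # initial guess is a constant, so its derivative w.r.t. a is 0
    while abs(x.val * x.val - a.val) > tol:
        x = 0.5 * (x + a / x)
    return x


def d_sqrt(a0):
    """Derivative of heron_sqrt with respect to a, as computed by forward-mode AD."""
    return heron_sqrt(Dual(a0, 1.0)).dot
```

Here `d_sqrt(4.0)` is close to the analytic value 0.25, while `d_sqrt(1.0)` returns 0.0 instead of 0.5 because no iteration is ever executed.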

@article{huckelheim2023understanding,
  title = {Understanding Automatic Differentiation Pitfalls},
  author = {H{\"u}ckelheim, Jan and Menon, Harshitha and Moses, William and Christianson, Bruce and Hovland, Paul and Hasco{\"e}t, Laurent},
  journal = {arXiv preprint arXiv:2305.07546},
  year = {2023},
  series = {arXiv},
  shortname = {arXiv},
  pdf = {https://arxiv.org/pdf/2305.07546.pdf},
  papertype = {preprint}
}

Scalable Automatic Differentiation of Multiple Parallel Paradigms through Compiler Augmentation, Best Student Paper Award and Best Paper Finalist Moses, William S and Hari Krishna Narayanan, Sri and Paehler, Ludger and Churavy, Valentin and Hückelheim, Jan and Schanen, Michel and Doerfert, Johannes and Hovland, Paul. SC ’22.

@inproceedings{enzymePar,
  title = {Scalable Automatic Differentiation of Multiple Parallel Paradigms through Compiler Augmentation},
  author = {Moses, William S and Hari Krishna Narayanan, Sri and Paehler, Ludger and Churavy, Valentin and H{\"u}ckelheim, Jan and Schanen, Michel and Doerfert, Johannes and Hovland, Paul},
  booktitle = { {SC} '22: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  year = {2022},
  location = {Dallas, Texas},
  conference = { {SC} '22: The International Conference for High Performance Computing, Networking, Storage and Analysis},
  award = {Best Student Paper Award and Best Paper Finalist},
  shortname = {SC '22},
  papertype = {conference},
  overleaf = {https://www.overleaf.com/project/60eb1022cd54f70251e1294b},
  pdf = {https://c.wsmoses.com/papers/enzymePar.pdf}
}

Enabling Transformers to Understand Low-Level Programs Guo, Zifan and Moses, William S.. HPEC ’22.

@inproceedings{transformersLLVM,
  title = {Enabling Transformers to Understand Low-Level Programs},
  author = {Guo, Zifan and Moses, William S.},
  booktitle = {2022 IEEE High Performance Extreme Computing Conference (HPEC)},
  year = {2022},
  organization = {IEEE},
  shortname = {HPEC '22},
  papertype = {conference},
  overleaf = {https://www.overleaf.com/project/62d1dd7a6097679455a75368},
  pdf = {https://c.wsmoses.com/papers/hpectransformers.pdf}
}

Performance Portable Solid Mechanics via Matrix-Free p-Multigrid Brown, Jed and Barra, Valeria and Beams, Natalie and Ghaffari, Leila and Knepley, Matthew and Moses, William and Shakeri, Rezgar and Stengel, Karen and Thompson, Jeremy L and Zhang, Junchao. arXiv.

@article{brown2022performance,
  title = {Performance Portable Solid Mechanics via Matrix-Free $ p $-Multigrid},
  author = {Brown, Jed and Barra, Valeria and Beams, Natalie and Ghaffari, Leila and Knepley, Matthew and Moses, William and Shakeri, Rezgar and Stengel, Karen and Thompson, Jeremy L and Zhang, Junchao},
  journal = {arXiv preprint arXiv:2204.01722},
  papertype = {preprint},
  shortname = {arXiv},
  year = {2022},
  pdf = {https://arxiv.org/pdf/2204.01722.pdf}
}

Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme, Best Student Paper Finalist and Best Reproducibility Advancement Finalist Moses, William S and Churavy, Valentin and Paehler, Ludger and Hückelheim, Jan and Hari Krishna Narayanan, Sri and Schanen, Michel and Doerfert, Johannes. SC ’21.

Computing derivatives is key to many algorithms in scientific computing and machine learning such as optimization, uncertainty quantification, and stability analysis. Enzyme is an LLVM compiler plugin that performs reverse-mode automatic differentiation (AD) and thus generates high performance gradients of programs in languages including C/C++, Fortran, Julia, and Rust. Prior to this work, Enzyme and other AD tools were not capable of generating gradients of GPU kernels. Our paper presents a combination of novel techniques that make Enzyme the first fully automatic reverse-mode AD tool to generate gradients of GPU kernels. Since, unlike other tools, Enzyme performs automatic differentiation within a general-purpose compiler, we are able to introduce several novel GPU and AD-specific optimizations. To show the generality and efficiency of our approach, we compute gradients of five GPU-based HPC applications, executed on NVIDIA and AMD GPUs. All benchmarks run within an order of magnitude of the original program’s execution time. Without GPU and AD-specific optimizations, gradients of GPU kernels either fail to run from a lack of resources or have infeasible overhead. Finally, we demonstrate that increasing the problem size, by either increasing the number of threads or increasing the work per thread, does not substantially impact the overhead from differentiation.

@inproceedings{enzymeGPU,
  title = {Reverse-Mode Automatic Differentiation and Optimization of {GPU} Kernels via Enzyme},
  author = {Moses, William S and Churavy, Valentin and Paehler, Ludger and H{\"u}ckelheim, Jan and Hari Krishna Narayanan, Sri and Schanen, Michel and Doerfert, Johannes},
  booktitle = { {SC} '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  year = {2021},
  location = {St. Louis, Missouri},
  conference = { {SC} '21: The International Conference for High Performance Computing, Networking, Storage and Analysis},
  award = {Best Student Paper Finalist and Best Reproducibility Advancement Finalist},
  shortname = {SC '21},
  pdf = {https://c.wsmoses.com/papers/EnzymeGPU.pdf},
  papertype = {conference},
  tex = {https://github.com/wsmoses/Paper-EnzymeSC21},
  overleaf = {https://www.overleaf.com/project/60395b62c1de7024e5b878fc}
}

Polygeist: Raising C to Polyhedral MLIR Moses, William S. and Chelini, Lorenzo and Zhao, Ruizhe and Zinenko, Oleksandr. PACT ’21.

We present Polygeist, a new compilation flow that connects the MLIR compiler infrastructure to cutting edge polyhedral optimization tools. It consists of a C and C++ frontend capable of converting a broad range of existing codes into MLIR suitable for polyhedral transformation and a bi-directional conversion between MLIR and OpenScop exchange format. The Polygeist/MLIR intermediate representation featuring high-level (affine) loop constructs and n-D arrays embedded into a static single assignment (SSA) substrate enables an unprecedented combination of SSA-based and polyhedral optimizations. We illustrate this by proposing and implementing two extra transformations: statement splitting and reduction parallelization. Our evaluation demonstrates that Polygeist outperforms on average both an LLVM IR-level optimizer (Polly) and a source-to-source state-of-the-art polyhedral compiler (Pluto) when exercised on the Polybench/C benchmark suite in sequential (2.53x vs 1.41x, 2.34x) and parallel mode (9.47x vs 3.26x, 7.54x) thanks to the new representation and transformations.

@inproceedings{polygeistPACT,
  title = {Polygeist: Raising {C} to Polyhedral {MLIR}},
  author = {Moses, William S. and Chelini, Lorenzo and Zhao, Ruizhe and Zinenko, Oleksandr},
  booktitle = {Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques},
  numpages = {12},
  location = {Virtual Event},
  shortname = {PACT '21},
  publisher = {Association for Computing Machinery},
  year = {2021},
  address = {New York, NY, USA},
  keywords = {Polygeist, MLIR, Polyhedral, LLVM, Compiler, C++, Pluto, Polly, OpenScop, Parallel, OpenMP, Affine, Raising, Transformation, Splitting, Automatic-Parallelization, Reduction, Polybench},
  pdf = {https://c.wsmoses.com/papers/Polygeist_PACT.pdf},
  papertype = {conference},
  tex = {https://github.com/wsmoses/Paper-PolygeistPACT21},
  overleaf = {https://braintex.goog/project/6113bf60410be30098c30301}
}

Polygeist: Affine C in MLIR Moses, William S. and Chelini, Lorenzo and Zhao, Ruizhe and Zinenko, Oleksandr. IMPACT ’21.

We present Polygeist, a new tool that reroutes polyhedral compilation flows to use the representation available in the recent MLIR compilation infrastructure. It consists of two parts: a C and C++ frontend capable of converting a wide variety of existing codes into MLIR suitable for polyhedral transformation, and a bi-directional conversion between MLIR’s polyhedral representation and existing polyhedral exchange formats. We demonstrate Polygeist’s flow by converting the entire Polybench/C benchmark suite into MLIR, and by performing an IR-to-IR optimization leveraging an existing polyhedral compiler (Pluto). Our flow produces results within 1.25% of the state-of-the-art Clang compiler, enabling direct comparison of source-to-source and IR-to-binary compilers. We believe Polygeist can improve the interoperation between MLIR and the existing polyhedral tooling, benefiting both the research and the production compiler communities.

@inproceedings{polygeistIMPACT,
  title = {Polygeist: Affine {C} in {MLIR}},
  author = {Moses, William S. and Chelini, Lorenzo and Zhao, Ruizhe and Zinenko, Oleksandr},
  booktitle = {IMPACT 2021-11th International Workshop on Polyhedral Compilation Techniques},
  year = {2021},
  pdf = {https://acohen.gitlabpages.inria.fr/impact/impact2021/papers/IMPACT_2021_paper_1.pdf},
  shortname = {IMPACT '21},
  papertype = {workshop},
  tex = {https://github.com/wsmoses/Paper-PolygeistIMPACT21},
  overleaf = {https://braintex.goog/project/5fc12a5b595bff00806782b7}
}

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients, Spotlight Presentation Moses, William and Churavy, Valentin. NeurIPS ’20.

Applying differentiable programming techniques and machine learning algorithms to foreign programs requires developers to either rewrite their code in a machine learning framework, or otherwise provide derivatives of the foreign code. This paper presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme can synthesize gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. On a machine-learning focused benchmark suite including Microsoft’s ADBench, AD on optimized IR achieves a geometric mean speedup of 4.5x over AD on IR before optimization, allowing Enzyme to achieve state-of-the-art performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the-art performance, enabling foreign code to be directly incorporated into existing machine learning workflows.

@inproceedings{enzymeNeurips,
  author = {Moses, William and Churavy, Valentin},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {Larochelle, H. and Ranzato, M. and Hadsell, R. and Balcan, M. F. and Lin, H.},
  pages = {12472--12485},
  publisher = {Curran Associates, Inc.},
  title = {Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients},
  url = {https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b682e9347822c2e457ac-Paper.pdf},
  volume = {33},
  year = {2020},
  award = {Spotlight Presentation},
  shortname = {NeurIPS '20},
  pdf = {https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b682e9347822c2e457ac-Paper.pdf},
  hackernews = {https://news.ycombinator.com/item?id=26012289},
  reddit = {https://www.reddit.com/r/cpp/comments/j7fb4a/enzyme_highperformance_automatic_differentiation},
  papertype = {conference},
  tex = {https://github.com/wsmoses/Paper-EnzymeNeurips20},
  overleaf = {https://www.overleaf.com/project/5ec2b7bc4e59f40001a9be13}
}

AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning Haj-Ali, Ameer and Huang, Qijing Jenny and Xiang, John and Moses, William and Asanovic, Krste and Wawrzynek, John and Stoica, Ion. MLSys ’20.

The performance of the code a compiler generates depends on the order in which it applies the optimization passes. Choosing a good order, often referred to as the phase-ordering problem, is an NP-hard problem. As a result, existing solutions rely on a variety of heuristics. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. To this end, we implement a framework that takes a program and finds a sequence of passes that optimize the performance of the generated circuit. Without loss of generality, we instantiate this framework in the context of an LLVM compiler and target high-level synthesis programs. We use random forests to quantify the correlation between the effectiveness of a given pass and the program’s features. This helps us reduce the search space by avoiding orderings that are unlikely to improve the performance of a given program. We compare the performance of deep reinforcement learning to state-of-the-art algorithms that address the phase-ordering problem. In our evaluation, we show that reinforcement learning improves circuit performance by 28% when compared to using the -O3 compiler flag, and it achieves competitive results compared to the state-of-the-art solutions, while requiring fewer samples. More importantly, unlike existing state-of-the-art solutions, our reinforcement learning solution can generalize to more than 12,000 different programs after training on as few as a hundred programs for less than ten minutes.

@article{haj2020autophase,
  title = {AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning},
  author = {Haj-Ali, Ameer and Huang, Qijing Jenny and Xiang, John and Moses, William and Asanovic, Krste and Wawrzynek, John and Stoica, Ion},
  booktitle = {Proceedings of Machine Learning and Systems},
  editor = {Dhillon, I. and Papailiopoulos, D. and Sze, V.},
  pages = {70--81},
  url = {https://proceedings.mlsys.org/paper/2020/file/4e732ced3463d06de0ca9a15b6153677-Paper.pdf},
  volume = {2},
  year = {2020},
  papertype = {conference},
  tex = {https://github.com/wsmoses/Paper-AutophaseMLSys},
  overleaf = {https://www.overleaf.com/project/5c9ad45dd502c2597ba2c3b1}
}

ProTuner: tuning programs with Monte Carlo tree search Haj-Ali, Ameer and Genc, Hasan and Huang, Qijing and Moses, William and Wawrzynek, John and Asanović, Krste and Stoica, Ion. arXiv.

@article{haj2020protuner,
  title = {ProTuner: tuning programs with Monte Carlo tree search},
  author = {Haj-Ali, Ameer and Genc, Hasan and Huang, Qijing and Moses, William and Wawrzynek, John and Asanovi{\'c}, Krste and Stoica, Ion},
  journal = {arXiv preprint arXiv:2005.13685},
  year = {2020},
  shortname = {arXiv},
  papertype = {preprint},
  tex = {https://github.com/wsmoses/Paper-Protuner},
  overleaf = {https://www.overleaf.com/project/5e4f77f9a690c5000145ac3a},
  pdf = {https://arxiv.org/pdf/2005.13685.pdf}
}

SyFER-MLIR: Integrating Fully Homomorphic Encryption Into the MLIR Compiler Framework Govindarajan, Sanath and Moses, William S.

@misc{govindarajan2020syfer,
  title = { {SyFER-MLIR}: Integrating Fully Homomorphic Encryption Into the {MLIR} Compiler Framework},
  author = {Govindarajan, Sanath and Moses, William S},
  pdf = {https://math.mit.edu/research/highschool/primes/materials/2020/Govindarajan-Moses.pdf},
  papertype = {preprint},
  year = {2020}
}

Autophase: Compiler phase-ordering for HLS with deep reinforcement learning Huang, Qijing and Haj-Ali, Ameer and Moses, William and Xiang, John and Stoica, Ion and Asanovic, Krste and Wawrzynek, John. FCCM ’19.

@inproceedings{huang2019autophase,
  title = {Autophase: Compiler phase-ordering for {HLS} with deep reinforcement learning},
  author = {Huang, Qijing and Haj-Ali, Ameer and Moses, William and Xiang, John and Stoica, Ion and Asanovic, Krste and Wawrzynek, John},
  booktitle = {2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
  pages = {308--308},
  year = {2019},
  organization = {IEEE},
  shortname = {FCCM '19},
  papertype = {workshop},
  tex = {https://github.com/wsmoses/Paper-AutophaseFCCM},
  overleaf = {https://www.overleaf.com/project/5c1325e30800c53868650613},
  pdf = {https://ieeexplore.ieee.org/abstract/document/8735549}
}

Extracting Incentives from Black-Box Decisions Shavit, Yonadav and Moses, William S.. NeurIPS AI in FS.

An algorithmic decision-maker incentivizes people to act in certain ways to receive better decisions. These incentives can dramatically influence subjects’ behaviors and lives, and it is important that both decision-makers and decision-recipients have clarity on which actions are incentivized by the chosen model. While for linear functions, the changes a subject is incentivized to make may be clear, we prove that for many non-linear functions (e.g. neural networks, random forests), classical methods for interpreting the behavior of models (e.g. input gradients) provide poor advice to individuals on which actions they should take. In this work, we propose a mathematical framework for understanding algorithmic incentives as the challenge of solving a Markov Decision Process, where the state includes the set of input features, and the reward is a function of the model’s output. We can then leverage the many toolkits for solving MDPs (e.g. tree-based planning, reinforcement learning) to identify the optimal actions each individual is incentivized to take to improve their decision under a given model. We demonstrate the utility of our method by estimating the maximally-incentivized actions in two real-world settings: a recidivism risk predictor we train using ProPublica’s COMPAS dataset, and an online credit scoring tool published by the Fair Isaac Corporation (FICO).
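The MDP framing above can be sketched as a tiny planning problem in Python: the state is a feature dictionary, each action edits one feature, and the reward is the black-box model's output. The model `toy_model`, the two actions, and the exhaustive depth-limited search are assumptions for illustration only; they are not the paper's models, datasets, or solver.

```python
from itertools import product


def toy_model(state):
    """Hypothetical non-linear black box: approves only if both conditions hold."""
    return 1.0 if state["income"] >= 3 and state["debt"] <= 1 else 0.0


def raise_income(state):
    """Hypothetical action: increase the income feature by one unit."""
    return {**state, "income": state["income"] + 1}


def pay_debt(state):
    """Hypothetical action: decrease the debt feature by one unit."""
    return {**state, "debt": state["debt"] - 1}


def best_plan(model, state, actions, horizon):
    """Exhaustive depth-limited search for the action sequence that maximizes the model output."""
    best_score, best_actions = model(state), ()
    for depth in range(1, horizon + 1):
        for plan in product(actions, repeat=depth):
            current = state
            for action in plan:
                current = action(current)
            score = model(current)
            if score > best_score:  # strict: keep the first, shortest best plan
                best_score, best_actions = score, plan
    return best_score, best_actions


score, plan = best_plan(toy_model, {"income": 2, "debt": 2},
                        [raise_income, pay_debt], horizon=2)
plan_names = [action.__name__ for action in plan]
```

In this toy setting no single action changes the output, so a local signal such as an input gradient is uninformative, while the two-step plan in `plan_names` reaches approval.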

@inproceedings{shavit2019extracting,
  title = {Extracting Incentives from Black-Box Decisions},
  shortname = {NeurIPS AI in FS},
  author = {Shavit, Yonadav and Moses, William S.},
  year = {2019},
  booktitle = {2019 NeurIPS Workshop on AI in Financial Services},
  pdf = {https://arxiv.org/pdf/1910.05664.pdf},
  papertype = {workshop},
  tex = {https://github.com/wsmoses/Paper-IncentiveNeuripsFinancial},
  overleaf = {https://www.overleaf.com/project/5d76a1cf13500f0001f9999d}
}

The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically Vasilache, Nicolas and Zinenko, Oleksandr and Theodoridis, Theodoros and Goyal, Priya and Devito, Zachary and Moses, William S. and Verdoolaege, Sven and Adams, Andrew and Cohen, Albert. TACO.

Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (1) a domain-specific language with a tensor notation close to the mathematics of deep learning; (2) a Just-In-Time optimizing compiler based on the polyhedral framework; (3) carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels; (4) the transparent integration of our flow into PyTorch and Caffe2, providing the fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. The performance is comparable to, and often exceeds the performance of, highly tuned libraries.

@article{Vasilache:2019:NAL:3366460.3355606,
  author = {Vasilache, Nicolas and Zinenko, Oleksandr and Theodoridis, Theodoros and Goyal, Priya and Devito, Zachary and Moses, William S. and Verdoolaege, Sven and Adams, Andrew and Cohen, Albert},
  booktitle = {Architecture and Code Optimization (TACO)},
  shortname = {TACO},
  title = {The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated {GPU} Kernels, Automatically},
  journal = {ACM Trans. Archit. Code Optim.},
  issue_date = {October 2019},
  volume = {16},
  number = {4},
  month = oct,
  year = {2019},
  issn = {1544-3566},
  pages = {38:1--38:26},
  articleno = {38},
  numpages = {26},
  url = {http://doi.acm.org/10.1145/3355606},
  doi = {10.1145/3355606},
  acmid = {3355606},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {Deep learning layers, GPU acceleration, polyhedral compilation},
  pdf = {https://c.wsmoses.com/papers/tc-taco.pdf},
  papertype = {journal},
  tex = {https://github.com/wsmoses/Paper-TCTACO}
}

LiTM: A Lightweight Deterministic Software Transactional Memory System Xia, Yu and Yu, Xiangyao and Moses, William and Shun, Julian and Devadas, Srinivas. PPoPP PMAMM ’19.

Deterministic software transactional memory (STM) is a useful programming model for writing parallel codes, as it improves programmability (by supporting transactions) and debuggability (by supporting determinism). This paper presents LiTM, a new deterministic STM system that achieves both simplicity and efficiency at the same time. LiTM implements the deterministic reservations framework of Blelloch et al., but without requiring the programmer to understand the internals of the algorithm. Instead, the programmer writes the program in a transactional fashion and LiTM manages all data conflicts and automatically achieves deterministic parallelism. Our experiments on six benchmarks show that LiTM outperforms the state-of-the-art framework Galois by up to 5.8x on a 40-core machine.
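The reserve-then-commit idea behind deterministic reservations can be sketched as a sequential Python simulation. This is not LiTM's implementation or API: the `transfer` transaction, the explicit key sets, and the round structure are assumptions chosen to show why the outcome does not depend on scheduling. In each round every pending transaction reserves the keys it touches, the lowest index wins each key, and only transactions holding all of their keys commit.

```python
def transfer(src, dst, amount):
    """Hypothetical transaction: move `amount` if funds allow; touches only src and dst."""
    def body(mem):
        if mem[src] >= amount:
            mem[src] -= amount
            mem[dst] += amount
    return ({src, dst}, body)


def run_serial(transactions, memory):
    """Reference semantics: execute the transactions one by one in index order."""
    for _keys, body in transactions:
        body(memory)
    return memory


def run_deterministic(transactions, memory):
    """Simulate rounds of reserve/commit; returns the memory and the number of rounds used."""
    pending = list(range(len(transactions)))
    rounds = 0
    while pending:
        rounds += 1
        # Reserve phase: the lowest transaction index wins each key.
        owner = {}
        for tid in pending:
            for key in transactions[tid][0]:
                owner[key] = min(owner.get(key, tid), tid)
        # Commit phase: winners touch disjoint keys, so their relative order is irrelevant.
        retry = []
        for tid in pending:
            keys, body = transactions[tid]
            if all(owner[key] == tid for key in keys):
                body(memory)
            else:
                retry.append(tid)
        pending = retry
    return memory, rounds


txs = [transfer("a", "b", 70), transfer("b", "c", 120),
       transfer("a", "c", 50), transfer("c", "a", 10)]
start = {"a": 100, "b": 60, "c": 0}
final, rounds = run_deterministic(txs, dict(start))
```

Because a transaction commits only once every lower-indexed conflicting transaction has committed, the result of this sketch matches `run_serial` on the same input; transactions with disjoint key sets commit in the same round, which is where a parallel runtime would gain its speedup.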

@inproceedings{xia2019litm,
  title = { {LiTM}: A Lightweight Deterministic Software Transactional Memory System},
  author = {Xia, Yu and Yu, Xiangyao and Moses, William and Shun, Julian and Devadas, Srinivas},
  booktitle = {Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores},
  shortname = {PPoPP PMAMM '19},
  pages = {1--10},
  year = {2019},
  organization = {ACM},
  pdf = {https://c.wsmoses.com/papers/litm.pdf},
  papertype = {workshop}
}

Tapir: Embedding Recursive Fork-Join Parallelism into LLVM’s Intermediate Representation Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.. TOPC.

Tapir (pronounced TAY-per) is a compiler intermediate representation (IR) that embeds recursive fork-join parallelism, as supported by task-parallel programming platforms such as Cilk and OpenMP, into a mainstream compiler’s IR. Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations on and across parallel control constructs. Remedying this situation has generally been thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir leverages the “serial-projection property,” which is commonly satisfied by task-parallel programs, to handle the semantics of these programs without an extensive rework of the compiler. For recursive fork-join programs that satisfy the serial-projection property, Tapir enables effective compiler optimization of parallel programs with only minor changes to existing compiler analyses and code transformations. Tapir uses the serial-projection property to order logically parallel fine-grained tasks in the program’s control-flow graph. This ordered representation of parallel tasks allows the compiler to optimize parallel codes effectively with only minor modifications. For example, to implement Tapir/LLVM, a prototype of Tapir in the LLVM compiler, we added or modified less than 3,000 lines of LLVM’s half-million-line core middle-end functionality. These changes sufficed to enable LLVM’s existing compiler optimizations for serial code, including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination, to work with parallel control constructs such as parallel loops and Cilk’s cilk_spawn keyword. Tapir also supports parallel optimizations, such as loop scheduling, which restructure the parallel control flow of the program.
By making use of existing LLVM optimizations and new parallel optimizations, Tapir/LLVM can optimize recursive fork-join programs more effectively than traditional compilation methods. On a suite of 35 Cilk application benchmarks, Tapir/LLVM produces more efficient executables for 30 benchmarks, with faster 18-core running times for 26 of them, compared to a nearly identical compiler that compiles parallel linguistic constructs the traditional way.

@article{tapirTOPC,
  author = {Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.},
  title = {Tapir: Embedding Recursive Fork-Join Parallelism into {LLVM}'s Intermediate Representation},
  year = {2019},
  pdf = {https://dl.acm.org/doi/10.1145/3365655},
  shortname = {TOPC},
  issue_date = {December 2019},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {6},
  number = {4},
  issn = {2329-4949},
  url = {https://doi.org/10.1145/3365655},
  doi = {10.1145/3365655},
  journal = {ACM Trans. Parallel Comput.},
  month = dec,
  articleno = {19},
  numpages = {33},
  keywords = {parallel computing, control-flow graph, OpenMP, fork-join parallelism, optimization, compiling, LLVM, multicore, serial-projection property, Tapir, Cilk},
  papertype = {journal}
}

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions Vasilache, Nicolas and Zinenko, Oleksandr and Theodoridis, Theodoros and Goyal, Priya and DeVito, Zachary and Moses, William S and Verdoolaege, Sven and Adams, Andrew and Cohen, Albert. arXiv.

Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often does not offer optimal performance for a user’s particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data. Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner. In particular, we demonstrate the suitability of the polyhedral framework to construct a domain-specific optimizer effective on state-of-the-art deep learning models on GPUs. 
Our flow reaches up to 4× speedup over NVIDIA libraries on kernels relevant to the Machine Learning Community, and on an actual model used in production at Facebook. It is integrated with mainstream frameworks Caffe2 (production-oriented), PyTorch (research-oriented), through the ATen asynchronous tensor library.

@article{tc,
  title = {Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions},
  booktitle = {arXiv preprint},
  shortname = {arXiv},
  author = {Vasilache, Nicolas and Zinenko, Oleksandr and Theodoridis, Theodoros and Goyal, Priya and DeVito, Zachary and Moses, William S and Verdoolaege, Sven and Adams, Andrew and Cohen, Albert},
  journal = {arXiv preprint arXiv:1802.04730},
  year = {2018},
  reddit = {http://www.reddit.com/r/MachineLearning/comments/7xjqq9/r_announcing_tensor_comprehensions/},
  hackernews = {http://news.ycombinator.com/item?id=16377389},
  pdf = {https://arxiv.org/pdf/1802.04730.pdf},
  papertype = {preprint}
}

OpenMPIR: Implementing OpenMP Tasks with Tapir Stelle, George and Moses, William S. and Olivier, Stephen L. and McCormick, Patrick. LLVM-HPC’17.

Optimizing compilers for task-level parallelism are still in their infancy. This work explores a compiler front end that translates OpenMP tasking semantics to Tapir, an extension to LLVM IR that represents fork-join parallelism. This enables analyses and optimizations that were previously inaccessible to OpenMP codes, as well as the ability to target additional runtimes at code generation. Using a Cilk runtime back end, we compare results to existing OpenMP implementations. Initial performance results for the Barcelona OpenMP task suite show performance improvements over existing implementations.

@inproceedings{openmpir,
  author = {Stelle, George and Moses, William S. and Olivier, Stephen L. and McCormick, Patrick},
  title = { {OpenMPIR}: Implementing OpenMP Tasks with Tapir},
  booktitle = {Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC},
  shortname = {LLVM-HPC'17},
  year = {2017},
  isbn = {978-1-4503-5565-0},
  location = {Denver, CO, USA},
  pages = {3:1--3:12},
  articleno = {3},
  numpages = {12},
  url = {http://doi.acm.org/10.1145/3148173.3148186},
  doi = {10.1145/3148173.3148186},
  acmid = {3148186},
  publisher = {ACM},
  address = {New York, NY, USA},
  pdf = {https://c.wsmoses.com/papers/openmpir.pdf},
  papertype = {workshop},
  tex = {https://github.com/lanl/openmpir-llvm2017}
}

Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation, Best Paper Award Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.. PPoPP ’17.

This paper explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler’s intermediate representation (IR). Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations across parallel control constructs. Remedying this situation is generally thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir is a compiler IR that represents logically parallel tasks asymmetrically in the program’s control flow graph. Tapir allows the compiler to optimize across parallel control constructs with only minor changes to its existing analyses and code transformations. To prototype Tapir in the LLVM compiler, for example, we added or modified about 6000 lines of LLVM’s 4-million-line codebase. Tapir enables LLVM’s existing compiler optimizations for serial code – including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination – to work with parallel control constructs such as spawning and parallel loops. Tapir also supports parallel optimizations such as loop scheduling.

@inproceedings{tapir,
  author = {Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.},
  title = {Tapir: Embedding Fork-Join Parallelism into {LLVM}'s Intermediate Representation},
  booktitle = {Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  shortname = {PPoPP '17},
  month = jan,
  year = {2017},
  isbn = {978-1-4503-4493-7},
  location = {Austin, Texas, USA},
  pages = {249--265},
  numpages = {17},
  pdf = {https://c.wsmoses.com/papers/tapir.pdf},
  doi = {10.1145/3018743.3018758},
  acmid = {3018758},
  publisher = {ACM},
  address = {New York, NY, USA},
  award = {Best Paper Award},
  reddit = {http://www.reddit.com/r/programming/comments/5ra59l/mit_says_their_modified_llvm_compiler_optimizes/},
  mitnews = {http://news.mit.edu/2017/optimizing-code-compiler-parallel-programs-0130},
  hackernews = {http://news.ycombinator.com/item?id=13568585},
  blog = {/tapir},
  papertype = {conference}
}

How Should Compilers Represent Fork-Join Parallelism? Moses, William S.. Thesis ’17.

This thesis explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler’s intermediate representation (IR). Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations across parallel control constructs. Remedying this situation is generally thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir is a compiler IR that represents logically parallel tasks asymmetrically in the program’s control flow graph. Tapir allows the compiler to optimize across parallel control constructs with only minor changes to its existing analyses and code transformations. To prototype Tapir in the LLVM compiler, for example, the Tapir team added or modified about 6000 lines of LLVM’s 4-million-line codebase. Tapir enables LLVM’s existing compiler optimizations for serial code – including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination – to work with parallel control constructs such as spawning and parallel loops. Tapir also supports parallel optimizations such as loop scheduling.

@mastersthesis{wmoses-meng,
  title = {How {S}hould {C}ompilers {R}epresent {F}ork-{J}oin {P}arallelism?},
  author = {Moses, William S.},
  booktitle = {Master's Thesis},
  shortname = {Thesis '17},
  school = {Massachusetts Institute of Technology},
  year = {2017},
  month = may,
  pdf = {https://c.wsmoses.com/papers/wmoses-meng.pdf},
  blog = {/tapir},
  papertype = {thesis},
  overleaf = {https://www.overleaf.com/project/58fceac4b55260d42f75e8c3},
  tex = {https://github.com/wsmoses/Paper-MEng}
}

Embedding Fork-Join Parallelism into LLVM IR Moses, William S. and Schardl, Tao B. and Leiserson, Charles E.. CPC ’16.

@inproceedings{tapirCPC,
  title = {Embedding Fork-Join Parallelism into LLVM IR},
  author = {Moses, William S. and Schardl, Tao B. and Leiserson, Charles E.},
  booktitle = {19th Workshop on Compilers for Parallel Computing},
  year = {2016},
  shortname = {CPC '16},
  pdf = {https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf},
  papertype = {workshop}
}

Extreme Multi-Resolution Visualization: A Challenge on Many Levels Balme, Joanna and Brown-Dymkoski, Eric and Guerrero, Victor and Jones, Stephen and Kessler, Andre and Lichtl, Adam and Lung, Kevin and Moses, William and Museth, Ken and Roberson, Nathan and others. SCVis ’15.

@inproceedings{spacex15,
  title = {Extreme Multi-Resolution Visualization: A Challenge on Many Levels},
  author = {Balme, Joanna and Brown-Dymkoski, Eric and Guerrero, Victor and Jones, Stephen and Kessler, Andre and Lichtl, Adam and Lung, Kevin and Moses, William and Museth, Ken and Roberson, Nathan and others},
  booktitle = {SuperComputing Visualization Contest 2015},
  shortname = {SCVis '15},
  year = {2015},
  pdf = {https://c.wsmoses.com/papers/spacex15.pdf},
  papertype = {workshop}
}

Computational Complexity of Arranging Music Demaine, Erik D. and Moses, William S.. MOVES ’15.

Music has long been an interesting subject of analysis for mathematicians and has led to many interesting questions in music theory and other fields. For the most part, computer scientists have looked into applying artificial intelligence to music and finding algorithms and data structures to solve various problems in music. Prior work on these algorithms often involves computing various properties of music such as the edit distance between two songs or the optimal fingering. These problems tend to be solvable in polynomial time using dynamic programming and have various applications such as the music identification service Shazam or operations on RISM, an online music database. This paper takes an additional step in this direction, asking what sorts of problems in music cannot be efficiently computed. Specifically, this paper asks how various constraints affect the computational complexity of arranging music originally written for one set of instruments down to a single instrument. The paper then applies these results to other domains including musical choreography (such as ice skating and ballet) as well as creating levels for rhythm games (such as Rock Band). We prove that all of the problems are NP-complete, meaning that there is no efficient algorithm to solve them (assuming the standard conjecture that P != NP).

@incollection{moves15,
  author = {Demaine, Erik D. and Moses, William S.},
  title = {Computational Complexity of Arranging Music},
  booktitle = {Revised Papers from MOVES 2015: Mathematics of Various Entertaining Subjects},
  shortname = {MOVES '15},
  publisher = {Princeton University Press},
  year = {2015},
  pdf = {https://c.wsmoses.com/papers/moves15.pdf},
  papertype = {book},
  overleaf = {https://www.overleaf.com/project/54f60a71728bff850ed36c73},
  tex = {https://github.com/wsmoses/Paper-MOVES15}
}

Online Adaptive Frequency Hopping Moses, William and Robertson, Andrew and Dell, John. TJHSST ’14.

Adaptive frequency hopping is one way to maximize the utilization of the wireless spectrum. Yet, when the environment itself is changing, the frequency at which the radio senses can become increasingly less optimal. By having the radio create a model of the environment based on the sensing data, it is possible to achieve high data rates when the spectrum is not being heavily utilized and maintain a low level of interference at times when it is. The radio was modeled both mathematically and run in simulations. The outcomes of these tests were compared with existing standards such as Bluetooth (random frequency hopping) and IEEE 802.22 (fixed sensing rate). In order to evaluate data rate and interference simultaneously, a metric was created that combined them by taking the product of data rate and (1 - interference). Overall, the online adaptive frequency hopper had a 35% increase in the combined metric over the random frequency hopper and a 25% increase over the fixed sensing rate radio.

@misc{oafh,
  title = {Online Adaptive Frequency Hopping},
  author = {Moses, William and Robertson, Andrew and Dell, John},
  booktitle = {TJHSST Teknos 2014},
  shortname = {TJHSST '14},
  year = {2014},
  pdf = {https://c.wsmoses.com/papers/oafh.pdf},
  papertype = {report}
}

Presentations

Automatic differentiation and performance portability for Oceananigans via Enzyme + Reactant 2026 ECCO Annual Project Meeting, May 29, 2026

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation Sandia Labs Seminar, May 28, 2026

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation Los Alamos Advances in Applied Computer Science Invited Speaker Series, May 27, 2026

Reactant: Mathematical Optimization & Performance Portability for Julia functions with MLIR & XLA BHI Seminar, May 18, 2026

Making Waves in the Cloud: A Paradigm Shift for Scientific Computing through Compiler Technology University of Cambridge, Apr 21, 2026

Progress and challenges in simulating multiphysics at exascale: the case of aero-thermo-chemo-mechanics response of hypersonics thermal protection systems 2026 MICDE Predictive Science Symposium, Apr 15, 2026

SIAG/SC Prize Presentations and 2026 SIAM Activity Group on Supercomputing Early Career Prize Lecture, SIAM Early Career Prize Lecture SIAM CSE '26, Mar 5, 2026

Automating Bayesian Inference of Millimeter Source Association SkAI Seminar, Feb 11, 2026

Thinking Fast and Correct: Automated Rewriting of Numerical Code through Compiler Augmentation, Distinguished Paper Award CGO 2026, Feb 3, 2026

Multi-Accelerator Automatic Differentiation PPoPP Workshop on Differentiable Parallel Programming 2026, Feb 1, 2026

Making Waves in the Cloud: A Paradigm Shift for Scientific Computing through Compiler Technology Dagstuhl Seminar 25341: Software Performance Engineering, Aug 21, 2025

Making Waves in the Cloud: A Paradigm Shift for Scientific Computing through Compiler Technology Google LLVM Summit, Aug 14, 2025

Reactant: Optimize Julia functions with MLIR & XLA JuliaCon 2025, Jul 25, 2025

EnzymeMLIR: High-Performance Automatic Differentiation of Tensor Code ICCOPT 2025, Jul 23, 2025

Making Waves in the Cloud: A Paradigm Shift for Scientific Computing through Compiler Technology Sustainable Computational Science and Engineering 2025, Mar 19, 2025

EnzymeMLIR: Combining Differentiation with High-Level Optimization PPoPP Workshop on Differentiable Parallel Programming 2025, Mar 2, 2025

Automatic Differentiation in MLIR MLIR Winter School, Mar 2, 2025

Polyhedral and Parallel Optimization through High-Level Constructs in MLIR UIUC Compiler Seminar, Sep 23, 2024

Differentiable and Portable Programming for Science CASS Community BOF Days, Jun 12, 2024

Differentiable Programming in Julia with Enzyme SIAM Conference on Mathematics of Planet Earth (MPE24), Jun 12, 2024

Exploring the Landscape of AI and ML in Compiler Development: Pros and Cons CASS Community BOF Days, Jun 11, 2024

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation CSCS, Jun 5, 2024

Enzyme.jl: High-Performance, Cross-Language, and Parallel Automatic Differentiation in Julia PASC 2024, Jun 4, 2024

Automated Derivative Sparsity via Dead Code Elimination Winter EuroAD 2023, Dec 4, 2023

LLVM in the age of LLMs: Machine Learning for IR, Optimization, & More, Keynote AI4Dev @ SC 2023, Nov 13, 2023

Enzyme-MLIR: Early Experiments on multi-level automatic differentiation MLIR Summit @ LLVM Dev Meeting, Oct 10, 2023

CloudCompiler Amazon Research Awards Tech Talk Series, Aug 30, 2023

Enzyme: Fast and Effective Automatic Differentiation for Academia and Industry 2023 International Congress on Industrial and Applied Mathematics, Aug 23, 2023

An Introduction to Enzyme and Some Fun Recent Results Differentiable and Probabilistic Programming for Fundamental Physics, Jun 13, 2023

Recent Compiler-Based AD Results and Open Questions EuroAD 2023, Jun 13, 2023

Supercharging Programming Through Compiler Technology MIT Thesis Defense, May 1, 2023

Back Propagation and Automatic Differentiation MIT 18.335 Lecture, Apr 3, 2023

HTO: “Header”-Time Optimization, ACM Gold Award (1st place) CGO SRC 2023, Feb 28, 2023

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs PPoPP 2023, Feb 27, 2023

Enzyme Tutorial Enzyme Conference 2023, Feb 22, 2023

High-Performance GPU-to-CPU Transpilation and Optimization Mathworks Code Generation Seminar, Jan 12, 2023

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation UT Austin Oden Institute Seminar, Dec 13, 2022

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation BU Systems Seminar, Dec 9, 2022

Scalable Automatic Differentiation of Multiple Parallel Paradigms through Compiler Augmentation, Best Student Paper SC 2022, Nov 16, 2022

Polygeist C++ frontend for MLIR, Keynote Talk LLVM HPC @ SC 2022, Nov 13, 2022

Polygeist C++ frontend for MLIR MLIR Summit @ 2022 US LLVM Dev Meeting, Nov 10, 2022

Synthesization of Fast Gradients with Enzyme Second MODE Workshop on Differentiable Programming for Experiment Design, Sep 13, 2022

Enzyme: Automatic Differentiation for Parallel Programs LLPP '22, Aug 29, 2022

Enzyme.jl JuliaCon ESM MiniSymposium, Jul 25, 2022

Automatic Differentiation of Black Box Code with Enzyme RSS '22 Workshop on Differential Simulation, Jul 1, 2022

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation Columbia DSI Seminar, Jun 30, 2022

Updates on Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation ExaSGD Seminar, Jun 14, 2022

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation TUM Seminar, Jun 3, 2022

MLIR-In-The-Middle: compiling C++ and extensions via the new extensible infrastructure ISC LLVM Performance Workshop, Jun 2, 2022

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation CESMIX TST '22, May 25, 2022

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation Google/INRIA/ONERA AD Meeting, May 19, 2022

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation Imperial College London Seminar, May 13, 2022

A brief introduction to Enzyme.jl Cambridge Area Julia Users Network (CAJUN), May 4, 2022

Back Propagation and Automatic Differentiation MIT 18.065 Lecture, May 2, 2022

[Tutorial] A Guide to Performance Debugging LLVM-based Programs LLVM Performance Workshop at CGO '22, Apr 3, 2022

Enzyme and Enzyme.jl Updates DJ4Earth, Mar 21, 2022

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation NVIDIA Seminar, Feb 23, 2022

Reverse-Mode Automatic Differentiation and Optimization of GPU and Heterogeneous Parallel Programs via Enzyme SIAM PP22 GPU MiniSymposium, Jan 13, 2022

Enzyme: High-Performance Automatic Differentiation of LLVM LLNL Invited Seminar, Dec 14, 2021

How to Use Enzyme to Automatically Differentiate Any LLVM-based Language for CPU, GPU, and More Virtual LLVM Developer Meeting, Fall 2021, Nov 19, 2021

Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme, Best Student Paper Finalist and Best Reproducibility Advancement Finalist SC '21 (The International Conference for High Performance Computing, Networking, Storage, and Analysis), Nov 17, 2021

Enzyme: Fast, Language Agnostic, Differentiation of Parallel Programs in LLVM 7th Annual Workshop on the LLVM Compiler Infrastructure in HPC, Nov 14, 2021

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation Washington University in St. Louis Colloquium, Nov 12, 2021

Language-Independent Automatic Differentiation and Optimization of GPU Programs with Enzyme European Workshop on Automatic Differentiation 2021, Nov 4, 2021

Enzyme: High-Performance, Cross-Language, and Parallel Automatic Differentiation CU Boulder CS Colloquium, Oct 28, 2021

Differentiable Programming in C++ CPPCon 2021, Oct 26, 2021

Polygeist: Raising C to Polyhedral MLIR PACT Conference 2021, Sep 27, 2021

Polygeist: Raising C to Polyhedral MLIR Tobias Grosser Group Meeting (Edinburgh), Aug 9, 2021

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients! 2021 DOE CSGF Program Review, Jul 21, 2021

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients! Legion Group Meeting (Stanford), Jun 23, 2021

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients! Jiantao Jiao Group Meeting (Berkeley), Jun 16, 2021

Cymbl: To -jInfinity & Beyond CaaS Monthly Meeting (Princeton), Jun 3, 2021

Enzyme: High-Performance Automatic Differentiation CESMIX Group Meeting (MIT), May 18, 2021

Post-Optimization Automatic Differentiation by Synthesizing LLVM NVIDIA GTC 2021, Apr 12, 2021

Post-Optimization Automatic Differentiation by Synthesizing LLVM Differentiable Programming Workshop, Apr 7, 2021

Polygeist: Affine C in MLIR MLIR Open Design Meeting, Feb 11, 2021

Polygeist: Affine C in MLIR IMPACT 2021, Jan 20, 2021

Enzyme: High-Performance Automatic Differentiation of LLVM Languages For Inference (LAFI) 2021, Jan 17, 2021

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients, Spotlight NeurIPS 2020, Spotlight Talk, Dec 9, 2020

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients NeurIPS 2020, Poster "Talk", Dec 9, 2020

Cymbl: To -jInfinity & Beyond Apple Ted-K Talk, Nov 19, 2020

Frontend and Compiler Representations of Reducers for Clarity and Optimization OpenCilk Group Meeting, Oct 9, 2020

Enzyme: High-Performance Automatic Differentiation of LLVM, Best Student Presentation US LLVM Developer Meeting, Fall 2020, Oct 8, 2020

Post-Optimization Automatic Differentiation by Synthesizing LLVM European Workshop on Automatic Differentiation 2020, Aug 11, 2020

Making ML Fast for Arbitrary Code (Enzyme) Secure AI Labs Seminar Series, Jul 28, 2020

Post-Optimization Automatic Differentiation by Synthesizing LLVM Argonne National Laboratories Seminar, Jul 1, 2020

Header Time Optimization: Cross-Translation Unit Optimization via Annotated Headers, Keynote Talk Fourth LLVM Performance Workshop at CGO, Feb 23, 2020

Automated Bayesian Estimation of Quantum Error Models 3rd International Workshop on Quantum Compilation, Nov 7, 2019

“Header Time Optimization”: Cross-Translation Unit Optimization via Annotated Headers, Best Student Presentation (Tie) US LLVM Developer Meeting, Fall 2019, Oct 22, 2019

Enzyme: Efficient Cross-Platform AD by Synthesizing LLVM European Workshop on Automatic Differentiation 2019, Jul 2, 2019

Bayesian Estimation of Error Models for Improving Circuit Compilation LBL Internal Group Meeting, Aug 1, 2019

Efficient Cross-Platform Automatic Differentiation Supertech Group, May 20, 2019

How to Use LLVM To Optimize Parallel Programs US LLVM Developer Meeting, Fall 2018, Oct 18, 2018

Adaptive Value Iteration 6.832 Presentations, May 17, 2018

Quantum Computing for the Common Man 8.371 Presentations, May 7, 2018

Tensor Comprehensions Rework Deep Learning Summit Boston 2018, May 24, 2018

Tensor Comprehensions LLVM Workshop at CGO 2018, Feb 24, 2018

Leveraging LLVM to Optimize Parallel Programs US LLVM Developer Meeting, Fall 2017, Oct 18, 2017

Leveraging LLVM to Optimize Parallel Programs NSF Parlay Meeting, Fall 2016, Sep 29, 2017

Tapir: Embedding Fork-Join Parallelism into LLVM IR MIT Masterworks Poster Symposium, Apr 18, 2017

Tapir: Embedding Fork-Join Parallelism into LLVM IR MIT 6.S898 Lecture, Apr 2, 2017

Tapir: Embedding Fork-Join Parallelism into LLVM IR, 2nd place speaker MIT EECSCon 2017, Apr 18, 2017

Tapir: Embedding Fork-Join Parallelism into LLVM IR IBM PL Day 2016, Dec 5, 2016

Embedding Fork-Join Parallelism into LLVM IR Compilers for Parallel Computing 2016, Jul 6, 2016

Computational Complexity of Arranging Music Mathematics of Various Entertaining Subjects (MOVES) 2015, Aug 3, 2015

Syntactic Simplifications for Reducer Hyperobjects Intel Corporation, Jan 22, 2015

Posters

NSF CSSI 2103942: Convergence of Bayesian inverse methods and scientific machine learning in Earth system models through universal differentiable programming NSF PI Meeting, Feb 26, 2022

HTO: “Header”-Time Optimization, ACM Gold Award (1st place) CGO SRC 2023, Feb 26, 2022

High-Performance GPU-to-CPU Transpilation and Optimization via Polygeist/MLIR 2022 US LLVM Dev Meeting, Nov 13, 2022

Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme ICML 2022 Beyond Bayes Workshop, Jul 22, 2022

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients NeurIPS 2020, Poster, Dec 9, 2020

Enzyme: High-Performance Automatic Differentiation of LLVM US LLVM Developer Meeting, Fall 2020, Oct 6, 2020

Cymbl: To -jInfinity & Beyond US LLVM Developer Meeting, Fall 2020, Oct 6, 2020

Learning Quantum Error Models American Physical Society March Meeting 2020, Mar 2, 2020

Extracting Incentives From Black-Box Decisions NeurIPS 2019 Workshop on Robust AI in Financial Services: Data, Fairness, Explainability, Trustworthiness, and Privacy, Dec 13, 2019

Automated Bayesian Estimation of Quantum Error Models 3rd International Workshop on Quantum Compilation, Nov 7, 2019

“Header Time Optimization”: Cross-Translation Unit Optimization via Annotated Headers US LLVM Developer Meeting, Fall 2019, Oct 23, 2019

Optimizing Nondeterminacy European LLVM Developer Meeting, Spring 2019, Apr 9, 2019
