PLSE Blog

Parallelism in ML Training

Modern ML training requires splitting work across clusters of GPUs to process data faster and fit model state into GPU memory. In ML systems, “parallelism” can mean several different things. Data and model parallelism distribute work across devices. Getting good performance also depends on intra-device parallelism: overlapping compute and communication operators on each individual GPU. This post gives an overview of both kinds of parallelism, then describes my work on abstractions for tuning inter- and intra-device parallelism strategies. This work is currently under submission to NSDI.
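To make the data-parallel case concrete, here is a minimal pure-Python simulation. It assumes a toy one-parameter linear model trained with mean squared error, and it is not the abstraction described in the post. The "devices" are just list shards, and the hypothetical `all_reduce_mean` helper stands in for a real collective such as an NCCL all-reduce. The sketch shows why data parallelism preserves the math: with equal-sized shards, averaging the per-device gradients gives the same update as a single device processing the whole batch.

```python
from typing import List, Tuple

# (x, y) pair for a toy 1-D linear model y ~ w * x (assumption for illustration)
Example = Tuple[float, float]


def local_gradient(w: float, shard: List[Example]) -> float:
    """Gradient of the mean squared error over one device's shard, w.r.t. w."""
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Hypothetical stand-in for a collective all-reduce: every replica gets the mean."""
    return sum(values) / len(values)


def data_parallel_step(w: float, batch: List[Example], num_devices: int, lr: float) -> float:
    # Split the global batch into equal-sized shards, one per simulated device.
    assert len(batch) % num_devices == 0, "this sketch assumes equal shard sizes"
    size = len(batch) // num_devices
    shards = [batch[i * size:(i + 1) * size] for i in range(num_devices)]
    # Compute: each device computes its gradient independently
    # (sequentially here, in parallel on real hardware).
    grads = [local_gradient(w, shard) for shard in shards]
    # Communication: average gradients so every replica applies the same update.
    g = all_reduce_mean(grads)
    return w - lr * g
```

For example, with `batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]`, `data_parallel_step(0.0, batch, 2, 0.01)` returns the same weight as `num_devices=1`. On real hardware the all-reduce is a communication operator, which is the kind of operator that intra-device parallelism tries to overlap with compute.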

Read More: Parallelism in ML Training

Being a Long Prompter

Large language models and coding agents are getting more intelligent, but what does that intelligence mean? Some people take it to mean the ability to implement an entire piece of software from a short prompt. I take the opposite view: to me, the real surprise is their ability to implement long prompts precisely.

Read More: Being a Long Prompter

One-hole contexts

A one-hole context is a simple data structure for term rewriting. Given a term \(t\) and a subterm \(s\) inside \(t\) that we want to rewrite, we can create a one-hole context that represents “\(t\) with a hole at \(s\)”. After separately rewriting \(s\) into \(s'\), we can plug \(s'\) into that hole, reconstructing \(t\) with \(s\) replaced by \(s'\).
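Here is a minimal Python sketch of the idea. It assumes terms are trees made of an operator name and a tuple of arguments, and that a subterm is addressed by a path of argument indices. The names `Term`, `Frame`, `split`, and `plug` are hypothetical, not taken from the post. The context is stored as a list of frames from the root down to the hole (essentially Huet's zipper), and plugging rebuilds only the nodes along that path.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Term:
    op: str
    args: Tuple["Term", ...] = ()


@dataclass(frozen=True)
class Frame:
    """One layer of a context: a parent node with a hole at one argument position."""
    op: str
    left: Tuple[Term, ...]   # arguments before the hole
    right: Tuple[Term, ...]  # arguments after the hole


# A one-hole context: the frames from the root down to the hole.
Context = List[Frame]


def split(t: Term, path: List[int]) -> Tuple[Context, Term]:
    """Walk down `path` (argument indices); return (t with a hole at s, s)."""
    ctx: Context = []
    for i in path:
        ctx.append(Frame(t.op, t.args[:i], t.args[i + 1:]))
        t = t.args[i]
    return ctx, t


def plug(ctx: Context, s: Term) -> Term:
    """Fill the hole with s, rebuilding parents from the innermost frame outward."""
    for frame in reversed(ctx):
        s = Term(frame.op, frame.left + (s,) + frame.right)
    return s
```

For \(t = (x * 1) + y\) and the path `[0]`, `split` returns the context `+(hole, y)` together with \(s = x * 1\); rewriting \(s\) to \(x\) and plugging it back in gives \(x + y\).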

Read More: One-hole contexts

Design Hardware with Coroutines

Hardware is typically designed using a hardware description language (HDL) like Verilog (see Gus’s blog post to learn about that!). Verilog has many well-documented problems, and one more is that it offers little abstraction. There is a reasonable argument that abstractions are often not free and that low-level control is a high priority for hardware designers: every hardware designer is a performance engineer. With that in mind, one goal of my research is figuring out how to offer hardware designers additional zero-cost abstraction mechanisms to improve their productivity, and this blog post is about one way I have been thinking about doing that. My collaborators and I recently published an article at the LATTE workshop at ASPLOS about implementing cache-coherence protocols using coroutines. This blog post discusses why coroutines are a good choice for implementing hardware protocols, in particular finite-state machines, without sacrificing low-level control.
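As a software analogy only (Python generators, not the hardware coroutines or the language from the article), here is a heavily simplified, hypothetical cache-line controller. The message names `GetS` and `InvAck` are borrowed loosely from textbook coherence protocols and are assumptions for illustration. Each `send()` models one clock cycle: the coroutine consumes one input event and yields the output it drives that cycle. The point is that the FSM's states (Invalid, waiting for data, Shared) become positions in straight-line code rather than an explicit state enum and transition table.

```python
from typing import Generator, Optional

# Yields the output driven this cycle (None = idle); receives one input event per cycle.
Controller = Generator[Optional[str], Optional[str], None]


def cache_line() -> Controller:
    """Hypothetical, simplified cache-line controller written as a coroutine."""
    event = yield None  # priming step; the first real event arrives here
    while True:
        # Invalid: idle until the core requests a read.
        while event != "load":
            event = yield None
        event = yield "GetS"      # request a shared copy of the line
        # Pending: stall until the data response arrives.
        while event != "data":
            event = yield None
        event = yield "hit"       # line is now Shared; answer the core
        # Shared: serve loads until another cache invalidates this line.
        while event != "inv":
            event = yield ("hit" if event == "load" else None)
        event = yield "InvAck"    # acknowledge, then fall back to Invalid
```

Priming the coroutine with `next(...)` and then feeding the events `load, None, data, load, inv, load` produces `GetS, None, hit, hit, InvAck, GetS`. Written as an explicit FSM, the same controller would need a state register and a case per state; the coroutine keeps the protocol's sequence readable from top to bottom.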

Read More: Design Hardware with Coroutines

How To Work With LLMs Without Losing Your Mind

LLMs are increasingly used in software engineering research as researchers explore ways to incorporate them into tools and traditional development workflows such as testing and verification. My research for the past year or so has involved LLMs in one way or another, and I wanted to share some of the tips and tricks that have helped me keep what little remaining sanity I have.

Read More: How To Work With LLMs Without Losing Your Mind

Custom Data Structures in E-Graphs

E-graphs are a data structure used to reason about program equivalence. Combined with specialized algorithms, they can be used to build optimizers or compilers. However, their performance can suffer as the number of equivalent expressions explodes when we include algebraic identities such as associativity, commutativity, and distributivity (A/C/D).
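To see how quickly the number of equivalent expressions grows, the following Python sketch enumerates every way to write a sum of distinct leaves using associativity and commutativity alone. With \(n\) leaves there are \(n!\) orderings times \(C_{n-1}\) bracketings (a Catalan number), so 4 leaves already give 120 distinct trees. The `flatten` function shows one common workaround, canonicalizing an A/C sum as the multiset of its operands, which collapses the whole class to a single representative. This illustrates the general idea only, not necessarily the data structure the post proposes, and all names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import List, Set, Tuple, Union

# A leaf name, or ("+", lhs, rhs)
Expr = Union[str, Tuple[str, "Expr", "Expr"]]


def bracketings(leaves: Tuple[str, ...]) -> List[Expr]:
    """All binary + trees over the leaves in this fixed left-to-right order."""
    if len(leaves) == 1:
        return [leaves[0]]
    out: List[Expr] = []
    for i in range(1, len(leaves)):
        for lhs in bracketings(leaves[:i]):
            for rhs in bracketings(leaves[i:]):
                out.append(("+", lhs, rhs))
    return out


def ac_class(leaves: List[str]) -> Set[Expr]:
    """Every tree equal to leaves[0] + ... + leaves[n-1] under A/C."""
    return {e for order in permutations(leaves) for e in bracketings(order)}


def flatten(e: Expr) -> Counter:
    """Canonical A/C form: the multiset of leaves under a tree of + nodes."""
    if isinstance(e, str):
        return Counter([e])
    return flatten(e[1]) + flatten(e[2])
```

This counts whole terms rather than e-graph nodes; an e-graph shares common subterms, so it stores such a class more compactly than the explicit set built here.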

Read More: Custom Data Structures in E-Graphs

Takeaways from My Experience Moving Computer Science Outreach from the Lab to K-12 Classrooms

This past December, the PLSE lab had some exciting new outreach opportunities where volunteers from our lab visited classrooms for Hour of Code, a nationwide program for promoting computer science education in K-12 schools.

Read More: Takeaways from My Experience Moving Computer Science Outreach from the Lab to K-12 Classrooms